Repository: PromtEngineer/agentic-file-search Branch: main Commit: 83c5b4231f44 Files: 59 Total size: 458.6 KB Directory structure: gitextract_mqv4xk8i/ ├── .github/ │ └── workflows/ │ ├── build.yaml │ ├── lint.yaml │ ├── test.yaml │ └── typecheck.yaml ├── .gitignore ├── .pre-commit-config.yaml ├── .python-version ├── ARCHITECTURE.md ├── CLAUDE.md ├── IMPLEMENTATION_PLAN.md ├── Makefile ├── README.md ├── YOUTUBE_DEMO_TESTS.md ├── data/ │ ├── large_acquisition/ │ │ └── TEST_QUESTIONS.md │ ├── test_acquisition/ │ │ └── TEST_QUESTIONS.md │ └── testfile.txt ├── docker/ │ └── docker-compose.yml ├── pyproject.toml ├── scripts/ │ ├── generate_large_docs.py │ └── generate_test_docs.py ├── src/ │ └── fs_explorer/ │ ├── __init__.py │ ├── agent.py │ ├── embeddings.py │ ├── exploration_trace.py │ ├── fs.py │ ├── index_config.py │ ├── indexing/ │ │ ├── __init__.py │ │ ├── chunker.py │ │ ├── metadata.py │ │ ├── pipeline.py │ │ └── schema.py │ ├── main.py │ ├── models.py │ ├── search/ │ │ ├── __init__.py │ │ ├── filters.py │ │ ├── query.py │ │ ├── ranker.py │ │ └── semantic.py │ ├── server.py │ ├── storage/ │ │ ├── __init__.py │ │ ├── base.py │ │ └── duckdb.py │ ├── ui.html │ └── workflow.py └── tests/ ├── __init__.py ├── conftest.py ├── test_agent.py ├── test_cli_indexing.py ├── test_e2e.py ├── test_embeddings.py ├── test_exploration_trace.py ├── test_fs.py ├── test_indexing.py ├── test_models.py ├── test_search.py ├── test_server_search.py └── testfiles/ ├── file1.txt ├── file2.md └── last/ └── lastfile.txt ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/workflows/build.yaml ================================================ name: Build on: pull_request: jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v6 - name: Set up Python run: uv python install 3.13 - name: Build package run: make build ================================================ FILE: .github/workflows/lint.yaml ================================================ name: Linting on: pull_request: jobs: lint: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v6 - name: Set up Python run: uv python install 3.12 - name: Run formatter shell: bash run: make format-check - name: Run linter shell: bash run: make lint ================================================ FILE: .github/workflows/test.yaml ================================================ name: CI Tests - Pull Request on: pull_request: jobs: testing_pr: runs-on: ubuntu-latest strategy: matrix: python-version: ["3.10", "3.11", "3.12", "3.13"] steps: - uses: actions/checkout@v4 with: fetch-depth: 1 - name: Install uv uses: astral-sh/setup-uv@v6 with: python-version: ${{ matrix.python-version }} enable-cache: true - name: Run Tests on Main Package run: make test ================================================ FILE: .github/workflows/typecheck.yaml ================================================ name: Typecheck on: pull_request: jobs: core-typecheck: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 with: fetch-depth: 1 - name: Install uv uses: astral-sh/setup-uv@v6 - name: Set up Python run: uv python install - name: Run Mypy run: make typecheck ================================================ FILE: .gitignore ================================================ # Python-generated files __pycache__/ *.py[oc] build/ dist/ wheels/ *.egg-info # Virtual 
environments .venv # caches *_cache/ # Environment .env # OS files .DS_Store ================================================ FILE: .pre-commit-config.yaml ================================================ --- default_language_version: python: python3 repos: - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.5.0 hooks: - id: check-merge-conflict - id: check-symlinks - id: check-yaml - id: detect-private-key ================================================ FILE: .python-version ================================================ 3.13 ================================================ FILE: ARCHITECTURE.md ================================================ # FsExplorer Architecture Documentation ## Table of Contents 1. [System Overview](#system-overview) 2. [Component Architecture](#component-architecture) 3. [Core Modules](#core-modules) 4. [Workflow Engine](#workflow-engine) 5. [Agent Decision Loop](#agent-decision-loop) 6. [Document Processing Pipeline](#document-processing-pipeline) 7. [Three-Phase Exploration Strategy](#three-phase-exploration-strategy) 8. [Token Tracking & Cost Estimation](#token-tracking--cost-estimation) 9. [CLI Interface](#cli-interface) 10. [Data Flow](#data-flow) 11. [File Structure](#file-structure) 12. [Extension Points](#extension-points) --- ## System Overview FsExplorer is an AI-powered filesystem exploration agent that answers questions about documents by intelligently navigating directories, parsing files, and synthesizing information with source citations. ```mermaid graph TB subgraph "User Interface" CLI[CLI Interface
typer + rich] end subgraph "Orchestration Layer" WF[Workflow Engine
llama-index-workflows] EVT[Event System] end subgraph "Intelligence Layer" AGENT[FsExplorer Agent] LLM[Google Gemini 2.0 Flash
Structured JSON Output] PROMPT[System Prompt
Three-Phase Strategy] end subgraph "Tools Layer" TOOLS[Tool Registry] SCAN[scan_folder
Parallel Scan] PREVIEW[preview_file
Quick Preview] PARSE[parse_file
Deep Read] READ[read
Text Files] GREP[grep
Pattern Search] GLOB[glob
File Search] end subgraph "Document Processing" DOCLING[Docling
Document Converter] CACHE[Document Cache] end subgraph "Filesystem" FS[(Local Filesystem)] PDF[PDF Files] DOCX[DOCX Files] MD[Markdown Files] OTHER[Other Formats] end CLI --> WF WF --> EVT EVT --> AGENT AGENT --> LLM AGENT --> PROMPT AGENT --> TOOLS TOOLS --> SCAN TOOLS --> PREVIEW TOOLS --> PARSE TOOLS --> READ TOOLS --> GREP TOOLS --> GLOB SCAN --> DOCLING PREVIEW --> DOCLING PARSE --> DOCLING DOCLING --> CACHE CACHE --> FS FS --> PDF FS --> DOCX FS --> MD FS --> OTHER style LLM fill:#4285f4,color:#fff style DOCLING fill:#ff6b6b,color:#fff style CACHE fill:#ffd93d,color:#000 style AGENT fill:#6bcb77,color:#fff ``` --- ## Component Architecture ### High-Level Component Diagram ```mermaid graph LR subgraph "Entry Point" MAIN[main.py
CLI Entry] end subgraph "Workflow" WORKFLOW[workflow.py
Event Orchestration] end subgraph "Agent" AGENT_MOD[agent.py
AI Decision Making] end subgraph "Models" MODELS[models.py
Pydantic Schemas] end subgraph "Filesystem" FS_MOD[fs.py
File Operations] end MAIN --> WORKFLOW WORKFLOW --> AGENT_MOD AGENT_MOD --> MODELS AGENT_MOD --> FS_MOD WORKFLOW --> MODELS style MAIN fill:#e1f5fe style WORKFLOW fill:#f3e5f5 style AGENT_MOD fill:#e8f5e9 style MODELS fill:#fff3e0 style FS_MOD fill:#fce4ec ``` ### Module Dependencies ```mermaid graph TD subgraph "fs_explorer package" INIT[__init__.py
Public API Exports] MAIN[main.py] WORKFLOW[workflow.py] AGENT[agent.py] MODELS[models.py] FS[fs.py] end subgraph "External Dependencies" TYPER[typer
CLI Framework] RICH[rich
Terminal UI] WORKFLOWS[llama-index-workflows
Event System] GENAI[google-genai
Gemini API] PYDANTIC[pydantic
Data Validation] DOCLING[docling
Document Parsing] end INIT --> AGENT INIT --> WORKFLOW INIT --> MODELS MAIN --> TYPER MAIN --> RICH MAIN --> WORKFLOW WORKFLOW --> WORKFLOWS WORKFLOW --> AGENT WORKFLOW --> MODELS WORKFLOW --> FS AGENT --> GENAI AGENT --> MODELS AGENT --> FS MODELS --> PYDANTIC FS --> DOCLING style GENAI fill:#4285f4,color:#fff style DOCLING fill:#ff6b6b,color:#fff ``` --- ## Core Modules ### models.py - Data Schemas Defines the structured output format for the AI agent using Pydantic models. ```mermaid classDiagram class Action { +action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction +reason: str +to_action_type() ActionType } class ToolCallAction { +tool_name: Tools +tool_input: list[ToolCallArg] +to_fn_args() dict } class ToolCallArg { +parameter_name: str +parameter_value: Any } class GoDeeperAction { +directory: str } class StopAction { +final_result: str } class AskHumanAction { +question: str } Action --> ToolCallAction Action --> GoDeeperAction Action --> StopAction Action --> AskHumanAction ToolCallAction --> ToolCallArg note for Action "Main container returned by LLM" note for ToolCallAction "Invokes filesystem tools" note for StopAction "Contains final answer with citations" ``` ### agent.py - AI Agent The core intelligence component that interacts with Google Gemini. ```mermaid classDiagram class FsExplorerAgent { -_client: GenAIClient -_chat_history: list[Content] +token_usage: TokenUsage +__init__(api_key: str) +configure_task(task: str) void +take_action() tuple[Action, ActionType] +call_tool(tool_name: Tools, tool_input: dict) void +reset() void } class TokenUsage { +prompt_tokens: int +completion_tokens: int +total_tokens: int +api_calls: int +tool_result_chars: int +documents_parsed: int +documents_scanned: int +add_api_call(prompt_tokens, completion_tokens) void +add_tool_result(result, tool_name) void +summary() str } class TOOLS { <> +read: read_file +grep: grep_file_content +glob: glob_paths +scan_folder: scan_folder +preview_file: preview_file +parse_file: parse_file } FsExplorerAgent --> TokenUsage FsExplorerAgent --> TOOLS ``` ### fs.py - Filesystem Operations All filesystem and document parsing utilities. ```mermaid classDiagram class FilesystemModule { <> +SUPPORTED_EXTENSIONS: frozenset +DEFAULT_PREVIEW_CHARS: int = 3000 +DEFAULT_SCAN_PREVIEW_CHARS: int = 1500 +DEFAULT_MAX_WORKERS: int = 4 } class DocumentCache { <> -_DOCUMENT_CACHE: dict[str, str] +clear_document_cache() void +_get_cached_or_parse(file_path) str } class DirectoryOps { <> +describe_dir_content(directory) str +glob_paths(directory, pattern) str } class FileOps { <> +read_file(file_path) str +grep_file_content(file_path, pattern) str } class DocumentOps { <> +preview_file(file_path, max_chars) str +parse_file(file_path) str +scan_folder(directory, max_workers, preview_chars) str } FilesystemModule --> DocumentCache FilesystemModule --> DirectoryOps FilesystemModule --> FileOps FilesystemModule --> DocumentOps DocumentOps --> DocumentCache ``` --- ## Workflow Engine The workflow engine uses an event-driven architecture based on `llama-index-workflows`. 
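The custom events that flow through this architecture are small typed payloads. As a rough illustration (field names are taken from the Event Types diagram below; the concrete types are assumptions, and the real classes subclass the workflow library's `Event` base rather than plain `BaseModel`):

```python
# Illustrative sketch only: event payloads as implied by the Event Types
# diagram. The actual classes derive from the llama-index-workflows Event base.
from pydantic import BaseModel


class InputEvent(BaseModel):
    task: str


class ToolCallEvent(BaseModel):
    tool_name: str
    tool_input: dict  # assumed shape; the real field may be a typed list
    reason: str


class GoDeeperEvent(BaseModel):
    directory: str
    reason: str


class AskHumanEvent(BaseModel):
    question: str
    reason: str


class HumanAnswerEvent(BaseModel):
    response: str


class ExplorationEndEvent(BaseModel):
    final_result: str | None = None
    error: str | None = None
```

Each workflow step consumes one of these events and emits the next, which is what lets the engine stream progress to the CLI while the exploration runs.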
### Workflow State Machine ```mermaid stateDiagram-v2 [*] --> StartExploration: InputEvent(task) StartExploration --> ToolCall: ToolCallEvent StartExploration --> GoDeeper: GoDeeperEvent StartExploration --> AskHuman: AskHumanEvent StartExploration --> End: StopAction ToolCall --> ToolCall: ToolCallEvent ToolCall --> GoDeeper: GoDeeperEvent ToolCall --> AskHuman: AskHumanEvent ToolCall --> End: StopAction GoDeeper --> ToolCall: ToolCallEvent GoDeeper --> GoDeeper: GoDeeperEvent GoDeeper --> AskHuman: AskHumanEvent GoDeeper --> End: StopAction AskHuman --> WaitForHuman: InputRequiredEvent WaitForHuman --> ProcessHumanResponse: HumanAnswerEvent ProcessHumanResponse --> ToolCall: ToolCallEvent ProcessHumanResponse --> GoDeeper: GoDeeperEvent ProcessHumanResponse --> AskHuman: AskHumanEvent ProcessHumanResponse --> End: StopAction End --> [*]: ExplorationEndEvent note right of StartExploration Initial task processing Describes current directory Asks LLM for first action end note note right of ToolCall Executes filesystem tool Adds result to chat history Asks LLM for next action end note note right of GoDeeper Updates current directory Describes new directory Asks LLM for next action end note ``` ### Event Types ```mermaid graph TB subgraph "Start Events" IE[InputEvent
task: str] end subgraph "Intermediate Events" TCE[ToolCallEvent
tool_name, tool_input, reason] GDE[GoDeeperEvent
directory, reason] AHE[AskHumanEvent
question, reason] HAE[HumanAnswerEvent
response] end subgraph "End Events" EEE[ExplorationEndEvent
final_result, error] end IE --> TCE IE --> GDE IE --> AHE IE --> EEE TCE --> TCE TCE --> GDE TCE --> AHE TCE --> EEE GDE --> TCE GDE --> GDE GDE --> AHE GDE --> EEE AHE --> HAE HAE --> TCE HAE --> GDE HAE --> AHE HAE --> EEE style IE fill:#4caf50,color:#fff style EEE fill:#f44336,color:#fff style TCE fill:#2196f3,color:#fff style GDE fill:#9c27b0,color:#fff style AHE fill:#ff9800,color:#fff ``` ### Workflow Steps ```mermaid sequenceDiagram participant CLI as CLI (main.py) participant WF as Workflow participant Agent as FsExplorerAgent participant LLM as Gemini API participant Tools as Tool Registry participant FS as Filesystem CLI->>WF: InputEvent(task) WF->>Agent: configure_task(initial_prompt) Agent->>LLM: generate_content(chat_history) LLM-->>Agent: Action JSON alt ToolCallAction Agent->>Tools: call_tool(name, args) Tools->>FS: execute operation FS-->>Tools: result Tools-->>Agent: tool result Agent->>Agent: add to chat_history WF-->>CLI: ToolCallEvent (stream) WF->>Agent: configure_task("next action?") Note over WF,Agent: Loop continues else GoDeeperAction WF->>WF: update current_directory WF-->>CLI: GoDeeperEvent (stream) WF->>Agent: configure_task("next action?") Note over WF,Agent: Loop continues else AskHumanAction WF-->>CLI: AskHumanEvent (stream) CLI->>CLI: Wait for user input CLI->>WF: HumanAnswerEvent(response) WF->>Agent: configure_task(response) Note over WF,Agent: Loop continues else StopAction WF-->>CLI: ExplorationEndEvent(final_result) end ``` --- ## Agent Decision Loop ### Single Decision Cycle ```mermaid flowchart TB subgraph "Agent.take_action()" START([Start]) --> SEND[Send chat_history to Gemini] SEND --> RECEIVE[Receive JSON response] RECEIVE --> TRACK[Track token usage] TRACK --> PARSE[Parse Action from JSON] PARSE --> CHECK{Action Type?} CHECK -->|toolcall| EXEC[Execute Tool] EXEC --> RESULT[Get tool result] RESULT --> ADD[Add result to chat_history] ADD --> RETURN1[Return Action, ActionType] CHECK -->|godeeper| RETURN2[Return Action, ActionType] CHECK -->|askhuman| RETURN3[Return Action, ActionType] CHECK -->|stop| RETURN4[Return Action, ActionType] RETURN1 --> END([End]) RETURN2 --> END RETURN3 --> END RETURN4 --> END end style START fill:#4caf50,color:#fff style END fill:#f44336,color:#fff style CHECK fill:#ff9800,color:#000 ``` ### Chat History Evolution ```mermaid sequenceDiagram participant User participant Agent participant LLM Note over Agent: chat_history = [] User->>Agent: configure_task("Initial prompt + directory listing") Note over Agent: chat_history = [user: initial_prompt] Agent->>LLM: generate_content(chat_history) LLM-->>Agent: {action: scan_folder, reason: "..."} Note over Agent: chat_history = [user: initial_prompt, model: action1] Agent->>Agent: Execute scan_folder, add result Note over Agent: chat_history = [user: initial_prompt, model: action1, user: tool_result1] User->>Agent: configure_task("What's next?") Note over Agent: chat_history = [..., user: "What's next?"] Agent->>LLM: generate_content(chat_history) LLM-->>Agent: {action: parse_file, reason: "..."} Note over Agent: chat_history = [..., model: action2] Note over Agent: Pattern continues until StopAction ``` --- ## Document Processing Pipeline ### Docling Integration ```mermaid flowchart LR subgraph "Input Formats" PDF[PDF] DOCX[DOCX] PPTX[PPTX] XLSX[XLSX] HTML[HTML] MD[Markdown] end subgraph "Docling" DC[DocumentConverter] DETECT[Format Detection] PIPELINE[Processing Pipeline] EXPORT[Markdown Export] end subgraph "Output" MARKDOWN[Markdown Text] end PDF --> DC DOCX --> DC PPTX --> 
DC XLSX --> DC HTML --> DC MD --> DC DC --> DETECT DETECT --> PIPELINE PIPELINE --> EXPORT EXPORT --> MARKDOWN style DC fill:#ff6b6b,color:#fff ``` ### Caching Strategy ```mermaid flowchart TB subgraph "Cache Key Generation" PATH[file_path] --> ABS[os.path.abspath] ABS --> MTIME[os.path.getmtime] MTIME --> KEY["cache_key = f'{abs_path}:{mtime}'"] end subgraph "Cache Lookup" KEY --> CHECK{Key in cache?} CHECK -->|Yes| HIT[Return cached content] CHECK -->|No| MISS[Parse with Docling] MISS --> STORE[Store in cache] STORE --> RETURN[Return content] end subgraph "_DOCUMENT_CACHE" CACHE[(dict: str → str)] end HIT --> CACHE STORE --> CACHE style CACHE fill:#ffd93d,color:#000 ``` ### Parallel Document Scanning ```mermaid flowchart TB subgraph "scan_folder(directory)" START([Start]) --> LIST[List directory files] LIST --> FILTER[Filter by SUPPORTED_EXTENSIONS] FILTER --> POOL[Create ThreadPoolExecutor
max_workers=4] subgraph "Parallel Processing" POOL --> T1[Thread 1
_preview_single_file] POOL --> T2[Thread 2
_preview_single_file] POOL --> T3[Thread 3
_preview_single_file] POOL --> T4[Thread 4
_preview_single_file] end T1 --> COLLECT[Collect Results] T2 --> COLLECT T3 --> COLLECT T4 --> COLLECT COLLECT --> SORT[Sort by filename] SORT --> FORMAT[Format output report] FORMAT --> END([Return summary]) end style START fill:#4caf50,color:#fff style END fill:#4caf50,color:#fff style POOL fill:#2196f3,color:#fff ``` --- ## Three-Phase Exploration Strategy ### Phase Overview ```mermaid flowchart TB subgraph "PHASE 1: Parallel Scan" P1_START([User Query]) --> P1_SCAN[scan_folder] P1_SCAN --> P1_PREVIEW[Get previews of ALL documents] P1_PREVIEW --> P1_CATEGORIZE[Categorize documents] P1_CATEGORIZE --> REL[RELEVANT
Directly related] P1_CATEGORIZE --> MAYBE[MAYBE
Potentially useful] P1_CATEGORIZE --> SKIP[SKIP
Not relevant] end subgraph "PHASE 2: Deep Dive" REL --> P2_PARSE[parse_file on RELEVANT docs] MAYBE -.->|If needed| P2_PARSE P2_PARSE --> P2_EXTRACT[Extract key information] P2_EXTRACT --> P2_CROSS{Cross-references
found?} end subgraph "PHASE 3: Backtracking" P2_CROSS -->|Yes| P3_CHECK{Referenced doc
was SKIPPED?} P3_CHECK -->|Yes| P3_BACKTRACK[Go back and parse
referenced document] P3_BACKTRACK --> P2_EXTRACT P3_CHECK -->|No| P3_CONTINUE[Continue analysis] P2_CROSS -->|No| P3_CONTINUE end subgraph "Final Answer" P3_CONTINUE --> ANSWER[Generate answer
with citations] ANSWER --> SOURCES[List sources consulted] SOURCES --> END([Return to user]) end style P1_START fill:#4caf50,color:#fff style END fill:#4caf50,color:#fff style REL fill:#4caf50,color:#fff style MAYBE fill:#ff9800,color:#000 style SKIP fill:#9e9e9e,color:#fff style P3_BACKTRACK fill:#e91e63,color:#fff ``` ### Cross-Reference Detection ```mermaid flowchart LR subgraph "Document Content" DOC[Parsed Document] end subgraph "Pattern Matching" DOC --> P1["'See Exhibit A/B/C...'"] DOC --> P2["'As stated in [Document]...'"] DOC --> P3["'Refer to [filename]...'"] DOC --> P4["'per Document: [name]'"] DOC --> P5["'[Doc #XX]'"] end subgraph "Action" P1 --> FOUND[Cross-reference found] P2 --> FOUND P3 --> FOUND P4 --> FOUND P5 --> FOUND FOUND --> CHECK{Was referenced
doc SKIPPED?} CHECK -->|Yes| BACKTRACK[Backtrack and parse] CHECK -->|No| CONTINUE[Continue] end style BACKTRACK fill:#e91e63,color:#fff ``` --- ## Token Tracking & Cost Estimation ### TokenUsage Class ```mermaid flowchart TB subgraph "Input Tracking" API[API Call] --> PROMPT[prompt_token_count] API --> COMPLETION[candidates_token_count] PROMPT --> ADD_API[add_api_call] COMPLETION --> ADD_API end subgraph "Tool Tracking" TOOL[Tool Execution] --> RESULT[result string] RESULT --> ADD_TOOL[add_tool_result] ADD_TOOL --> CHARS[tool_result_chars += len] ADD_TOOL --> PARSED{tool_name?} PARSED -->|parse_file| INC_PARSED[documents_parsed++] PARSED -->|preview_file| INC_PARSED PARSED -->|scan_folder| INC_SCANNED[documents_scanned += count] end subgraph "Cost Calculation" ADD_API --> TOTALS[Update totals] TOTALS --> CALC[_calculate_cost] CALC --> INPUT_COST["input_cost = prompt_tokens × $0.075/1M"] CALC --> OUTPUT_COST["output_cost = completion_tokens × $0.30/1M"] INPUT_COST --> TOTAL_COST[total_cost] OUTPUT_COST --> TOTAL_COST end subgraph "Summary Output" TOTAL_COST --> SUMMARY[summary] CHARS --> SUMMARY INC_PARSED --> SUMMARY INC_SCANNED --> SUMMARY end ``` ### Cost Estimation Formula ```mermaid graph LR subgraph "Gemini 2.0 Flash Pricing" INPUT["Input: $0.075 / 1M tokens"] OUTPUT["Output: $0.30 / 1M tokens"] end subgraph "Calculation" PROMPT[prompt_tokens] --> DIV1[÷ 1,000,000] DIV1 --> MULT1[× $0.075] MULT1 --> INPUT_COST[Input Cost] COMP[completion_tokens] --> DIV2[÷ 1,000,000] DIV2 --> MULT2[× $0.30] MULT2 --> OUTPUT_COST[Output Cost] INPUT_COST --> SUM[+] OUTPUT_COST --> SUM SUM --> TOTAL[Total Estimated Cost] end style TOTAL fill:#4caf50,color:#fff ``` --- ## CLI Interface ### Output Formatting ```mermaid flowchart TB subgraph "Event Handling" EVENT{Event Type} EVENT -->|ToolCallEvent| TOOL_PANEL[format_tool_panel] EVENT -->|GoDeeperEvent| NAV_PANEL[format_navigation_panel] EVENT -->|AskHumanEvent| HUMAN_PANEL[Human Input Panel] EVENT -->|ExplorationEndEvent| FINAL_PANEL[Final Answer Panel] end subgraph "Tool Panel Components" TOOL_PANEL --> ICON[Tool Icon 📂📖👁️🔍] TOOL_PANEL --> STEP[Step Number] TOOL_PANEL --> PHASE[Phase Label] TOOL_PANEL --> TARGET[Target File/Directory] TOOL_PANEL --> REASON[Agent's Reasoning] end subgraph "Final Panel Components" FINAL_PANEL --> ANSWER[Answer with Citations] FINAL_PANEL --> SOURCES[Sources Consulted] end subgraph "Summary Panel" SUMMARY[Workflow Summary] SUMMARY --> STEPS[Total Steps] SUMMARY --> CALLS[API Calls] SUMMARY --> DOCS[Documents Scanned/Parsed] SUMMARY --> TOKENS[Token Usage] SUMMARY --> COST[Estimated Cost] end FINAL_PANEL --> SUMMARY ``` ### Visual Elements ```mermaid graph TB subgraph "Panel Styles" TOOL["📂 Tool Call
border: yellow"] NAV["📁 Navigation
border: magenta"] HUMAN["❓ Human Input
border: red"] FINAL["✅ Final Answer
border: green"] SUMMARY["📊 Summary
border: blue"] end subgraph "Tool Icons" I1["📂 scan_folder"] I2["👁️ preview_file"] I3["📖 parse_file"] I4["📄 read"] I5["🔍 grep"] I6["🔎 glob"] end subgraph "Phase Labels" PH1["Phase 1: Parallel Document Scan"] PH2["Phase 2: Deep Dive"] PH3["Phase 1/2: Quick Preview"] end style TOOL fill:#ffeb3b,color:#000 style NAV fill:#e1bee7,color:#000 style HUMAN fill:#ffcdd2,color:#000 style FINAL fill:#c8e6c9,color:#000 style SUMMARY fill:#bbdefb,color:#000 ``` --- ## Data Flow ### Complete Request Flow ```mermaid sequenceDiagram participant User participant CLI as main.py participant WF as Workflow participant Agent as FsExplorerAgent participant LLM as Gemini API participant Tools as Tool Registry participant Docling participant Cache participant FS as Filesystem User->>CLI: uv run explore --task "..." CLI->>CLI: print_workflow_header() CLI->>WF: workflow.run(InputEvent) loop Until StopAction WF->>Agent: configure_task() Agent->>LLM: generate_content() LLM-->>Agent: Action JSON Agent->>Agent: Track tokens alt ToolCallAction Agent->>Tools: TOOLS[name](**args) alt Document Tool Tools->>Cache: Check cache alt Cache Hit Cache-->>Tools: Cached content else Cache Miss Cache->>Docling: Convert document Docling->>FS: Read file FS-->>Docling: Raw bytes Docling-->>Cache: Markdown content Cache-->>Tools: Content end else Filesystem Tool Tools->>FS: Execute operation FS-->>Tools: Result end Tools-->>Agent: Tool result Agent->>Agent: Track tool metrics WF-->>CLI: ToolCallEvent CLI->>CLI: format_tool_panel() else GoDeeperAction WF->>WF: Update directory state WF-->>CLI: GoDeeperEvent CLI->>CLI: format_navigation_panel() else AskHumanAction WF-->>CLI: AskHumanEvent CLI->>User: Display question User->>CLI: Enter response CLI->>WF: HumanAnswerEvent else StopAction WF-->>CLI: ExplorationEndEvent end end CLI->>CLI: Display final answer CLI->>CLI: print_workflow_summary() CLI-->>User: Complete output ``` --- ## File Structure ``` fs-explorer/ ├── src/ │ └── fs_explorer/ │ ├── __init__.py # Public API exports │ ├── main.py # CLI entry point (typer) │ ├── workflow.py # Event-driven workflow orchestration │ ├── agent.py # AI agent + Gemini integration │ ├── models.py # Pydantic action schemas │ └── fs.py # Filesystem + Docling operations ├── tests/ │ ├── conftest.py # Test fixtures and mocks │ ├── test_agent.py # Agent unit tests │ ├── test_fs.py # Filesystem function tests │ ├── test_models.py # Model tests │ ├── test_e2e.py # End-to-end integration tests │ └── testfiles/ # Test data ├── data/ │ ├── large_acquisition/ # Sample PDF documents │ └── test_acquisition/ # Test document set ├── scripts/ │ ├── generate_test_docs.py │ └── generate_large_docs.py ├── pyproject.toml # Project configuration ├── Makefile # Development commands ├── README.md # User documentation └── ARCHITECTURE.md # This file ``` --- ## Extension Points ### Adding New Tools ```mermaid flowchart LR subgraph "Step 1: Define Function" FUNC[def new_tool(args) -> str] end subgraph "Step 2: Register Tool" TOOLS["TOOLS dict in agent.py"] FUNC --> TOOLS end subgraph "Step 3: Update Types" TYPES["Tools TypeAlias in models.py"] TOOLS --> TYPES end subgraph "Step 4: Update Prompt" PROMPT["SYSTEM_PROMPT in agent.py"] TYPES --> PROMPT end style FUNC fill:#e3f2fd style TOOLS fill:#f3e5f5 style TYPES fill:#fff3e0 style PROMPT fill:#e8f5e9 ``` ### Adding New Document Formats ```mermaid flowchart LR subgraph "Docling Supported" PDF[PDF] --> DOCLING[Docling] DOCX[DOCX] --> DOCLING PPTX[PPTX] --> DOCLING XLSX[XLSX] --> DOCLING HTML[HTML] --> DOCLING MD[Markdown] --> 
DOCLING end subgraph "To Add New Format" NEW[New Format] --> CHECK{Docling
supports?} CHECK -->|Yes| ADD["Add to SUPPORTED_EXTENSIONS"] CHECK -->|No| CUSTOM["Create custom handler
in fs.py"] end DOCLING --> OUTPUT[Markdown] ADD --> OUTPUT CUSTOM --> OUTPUT ``` ### Customizing the System Prompt The system prompt in `agent.py` can be modified to: 1. **Add new exploration strategies** 2. **Change citation format** 3. **Adjust categorization criteria** 4. **Add domain-specific instructions** ```python SYSTEM_PROMPT = """ # Customize this prompt to change agent behavior ## Your custom instructions here ... """ ``` --- ## Performance Characteristics | Metric | Typical Value | Notes | |--------|---------------|-------| | Parallel scan threads | 4 | Configurable via `DEFAULT_MAX_WORKERS` | | Preview size | 1500 chars | ~1 page of content | | Full preview size | 3000 chars | ~2-3 pages | | Document cache | In-memory | Keyed by path + mtime | | Workflow timeout | 300 seconds | 5 minutes for complex queries | | API model | gemini-2.0-flash | Fast, cost-effective | --- ## Security Considerations 1. **API Key**: Stored in environment variable `GOOGLE_API_KEY` 2. **Local Processing**: Documents parsed locally via Docling (no cloud upload) 3. **Filesystem Access**: Limited to current working directory 4. **No Persistent Storage**: Document cache is in-memory only --- *Last updated: 2026-01-03* ================================================ FILE: CLAUDE.md ================================================ # CLAUDE.md This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. ## Project Overview Agentic File Search is an AI-powered document search agent that explores files dynamically rather than using pre-computed embeddings. It uses a three-phase strategy: parallel scan, deep dive, and backtracking for cross-references. There is also an optional DuckDB-backed indexing pipeline for pre-indexed semantic+metadata retrieval. **Tech Stack:** Python 3.10+, Google Gemini 3 Flash, LlamaIndex Workflows, Docling (document parsing), DuckDB (indexing), langextract (optional metadata extraction), FastAPI + WebSocket, Typer + Rich CLI. ## Common Commands ```bash # Install dependencies uv pip install . uv pip install -e ".[dev]" # with dev dependencies # Run CLI (agentic exploration) uv run explore --task "What is the purchase price?" --folder data/test_acquisition/ # Run CLI (indexed query - requires prior indexing) uv run explore index data/test_acquisition/ uv run explore query --task "What is the purchase price?" --folder data/test_acquisition/ # Schema management uv run explore schema discover data/test_acquisition/ uv run explore schema show data/test_acquisition/ # Run web UI uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000 # Run tests uv run pytest # all tests uv run pytest tests/test_fs.py # single file uv run pytest -k "test_name" # single test # Lint, format, typecheck (also available via Makefile) uv run pre-commit run -a # lint (or: make lint) uv run ruff check . # ruff only uv run ruff format # format (or: make format) uv run ty check src/fs_explorer/ # typecheck (or: make typecheck) ``` Entry points defined in `pyproject.toml`: `explore` → `fs_explorer.main:app`, `explore-ui` → `fs_explorer.server:run_server`. ## Architecture ### Core Flow (Agentic Mode) ``` User Query → Workflow (LlamaIndex) → Agent (Gemini) → Tools → Docling → Filesystem ``` ### Core Flow (Indexed Mode) ``` User Query → Workflow → Agent → semantic_search/get_document → DuckDB → Ranked Results ``` ### Key Modules (src/fs_explorer/) - **workflow.py**: Event-driven orchestration using `llama-index-workflows`. 
Defines `FsExplorerWorkflow` with steps: `start_exploration`, `go_deeper_action`, `tool_call_action`, `receive_human_answer`. Uses singleton agent via `get_agent()`. - **agent.py**: `FsExplorerAgent` manages Gemini API interaction. Chat history accumulates in `_chat_history`. `take_action()` sends history to LLM, receives structured JSON `Action`, auto-executes tool calls. `TokenUsage` tracks costs. Also contains the `TOOLS` registry (9 tools), `SYSTEM_PROMPT`, and indexed tool functions (`semantic_search`, `get_document`, `list_indexed_documents`). Index context is managed via module-level `set_index_context()`/`clear_index_context()`. - **models.py**: Pydantic schemas for structured LLM output. `Action` contains one of: `ToolCallAction`, `GoDeeperAction`, `StopAction`, `AskHumanAction`. `Tools` TypeAlias defines all available tool names. - **fs.py**: Filesystem operations. `scan_folder()` uses ThreadPoolExecutor for parallel document processing. `_DOCUMENT_CACHE` (dict) caches parsed documents keyed by `path:mtime`. Docling converts PDF/DOCX/PPTX/XLSX/HTML/MD to markdown. - **main.py**: Typer CLI entry point with subcommands: default (agentic explore), `index`, `query`, `schema discover`, `schema show`. - **server.py**: FastAPI server with WebSocket endpoint `/ws/explore` for real-time streaming. - **exploration_trace.py**: Records tool call paths and extracts cited sources from final answers for the CLI summary. ### Indexing Subsystem (src/fs_explorer/indexing/) - **pipeline.py**: `IndexingPipeline` orchestrates document parsing → chunking → metadata extraction → DuckDB upsert. Walks a folder for supported files, delegates to `SmartChunker` and `extract_metadata()`, handles schema resolution and deleted-file cleanup. - **chunker.py**: `SmartChunker` splits parsed document text into overlapping chunks. - **schema.py**: `SchemaDiscovery` auto-discovers metadata schemas from a corpus folder (file types, heuristic boolean fields like `mentions_currency`/`mentions_dates`). Optionally includes langextract fields. - **metadata.py**: `extract_metadata()` produces per-document metadata dicts. Heuristic fields (filename, extension, document_type, currency/date detection) are always available. Optional langextract integration calls the `langextract` library for entity extraction (organizations, people, deal terms, etc.) via configurable profiles. ### Search Subsystem (src/fs_explorer/search/) - **query.py**: `IndexedQueryEngine` runs parallel semantic (chunk text matching) + metadata (JSON filter) retrieval paths using ThreadPoolExecutor, then merges and ranks via `RankedDocument.combined_score`. - **filters.py**: `parse_metadata_filters()` parses a human-readable filter DSL (`field=value`, `field>=num`, `field in (a, b)`, `field~substring`) into `MetadataFilter` objects. Validates against allowed schema fields. - **ranker.py**: `RankedDocument` dataclass with `combined_score` (semantic * 100 + metadata * 10). `rank_documents()` sorts and limits. ### Storage Subsystem (src/fs_explorer/storage/) - **duckdb.py**: `DuckDBStorage` manages four tables: `corpora`, `documents`, `chunks`, `schemas`. Key operations: `upsert_document`, `search_chunks` (keyword-based scoring), `search_documents_by_metadata` (JSON path filtering via `json_extract_string`), schema CRUD. Corpus/doc/chunk IDs are SHA1-based stable hashes. - **base.py**: `StorageBackend` protocol and shared dataclasses (`DocumentRecord`, `ChunkRecord`, `SchemaRecord`). 
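To make the storage behavior above concrete, here is a hypothetical sketch of two details called out in the bullets: SHA1-based stable IDs and JSON-path metadata filtering. The function name, ID inputs, example paths, and SQL shape are assumptions for illustration, not the actual `DuckDBStorage` code.

```python
# Hypothetical sketch, assuming an already-built index at index.duckdb.
import hashlib

import duckdb


def stable_id(*parts: str) -> str:
    # Deterministic ID: the same corpus/document/chunk always hashes to the
    # same value, so re-indexing upserts rows instead of duplicating them.
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


corpus_id = stable_id("/abs/path/to/corpus")
doc_id = stable_id(corpus_id, "contracts/acquisition_agreement.pdf")

con = duckdb.connect("index.duckdb")
# json_extract_string reads a field out of the documents.metadata JSON column,
# which is how metadata filtering can work without a fixed column schema.
rows = con.execute(
    """
    SELECT id, relative_path
    FROM documents
    WHERE corpus_id = ?
      AND json_extract_string(metadata, '$.document_type') = ?
    """,
    [corpus_id, "contract"],
).fetchall()
```

Stable hashing is also what keeps incremental re-indexing cheap: unchanged files map onto existing rows, so only stale or deleted entries need attention.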
### Index Config - **index_config.py**: `resolve_db_path()` resolves DuckDB path with precedence: CLI `--db-path` > `FS_EXPLORER_DB_PATH` env > `~/.fs_explorer/index.duckdb`. ### Workflow Event Types - `InputEvent` → starts exploration - `ToolCallEvent` → tool execution - `GoDeeperEvent` → directory navigation - `AskHumanEvent`/`HumanAnswerEvent` → human interaction - `ExplorationEndEvent` → completion with `final_result` or `error` ### Adding New Tools 1. Implement function in `fs.py` (filesystem) or `agent.py` (indexed) returning `str` 2. Add to `TOOLS` dict in `agent.py` 3. Add to `Tools` TypeAlias in `models.py` 4. Update `SYSTEM_PROMPT` in `agent.py` 5. Update `TOOL_ICONS` and `PHASE_DESCRIPTIONS` in `main.py` ## Environment - `GOOGLE_API_KEY` (required) — in `.env` file or environment variable - `FS_EXPLORER_DB_PATH` (optional) — override default DuckDB location - `FS_EXPLORER_LANGEXTRACT_MAX_CHARS` (optional) — max chars sent to langextract (default 6000) - `FS_EXPLORER_LANGEXTRACT_MODEL` (optional) — model for langextract (default `gemini-3-flash-preview`) ## Testing Tests mock the Gemini client via `MockGenAIClient` in `conftest.py`. Use `reset_agent()` to clear singleton state between tests. The mock always returns a `StopAction` response. Key test files: - `test_agent.py` / `test_e2e.py` — agent and workflow integration - `test_fs.py` — filesystem tools - `test_indexing.py` / `test_cli_indexing.py` — indexing pipeline and CLI - `test_search.py` — search/filter/ranking - `test_exploration_trace.py` — trace and citation extraction Test documents live in `data/test_acquisition/` and `data/large_acquisition/`. Test fixtures for unit tests are in `tests/testfiles/`. ================================================ FILE: IMPLEMENTATION_PLAN.md ================================================ # Implementation Plan: Hybrid Semantic + Agentic Search (Revised) ## Overview Add semantic search with optional metadata filtering to `agentic-file-search` without regressing the current agentic workflow. The revised approach keeps the current CLI and behavior stable first, introduces indexing as opt-in, and only enables auto-detection after compatibility and quality checks pass. - Storage: DuckDB + `vss` (embedded, local file) - Embeddings: Gemini embeddings (API-backed) - Metadata extraction: `langextract` (optional) - Infrastructure model: no external database service (no Docker/Postgres required) --- ## Goals 1. Preserve existing `explore --task` behavior and UX by default. 2. Add a fast indexed path for large corpora. 3. Support metadata-aware filtering when metadata is available. 4. Keep agentic deep-read and cross-reference behavior available. ## Non-Goals (Initial Release) 1. Replacing the existing agentic strategy entirely. 2. Forcing index usage for all queries. 3. Heuristic/NLP folder extraction from free-form task text. --- ## Current Codebase Constraints to Respect 1. CLI currently has one root command (`explore --task`) and no subcommands. 2. Workflow and server currently use shared/global process state (`os.chdir`, singleton agent). 3. Existing tests assert the current 6-tool model and prompt behavior. These constraints require a staged rollout to avoid breaking current users. 
--- ## High-Level Architecture ```text INDEX TIME ├── Parse documents (Docling) ├── Chunk content (paragraph/sentence-aware) ├── Generate embeddings (provider-configured dimension) ├── [optional] Extract metadata (langextract) └── Persist in DuckDB (corpus-scoped) QUERY TIME ├── Retrieve by semantic search ├── [optional] Retrieve by metadata filter ├── Union + rank results ├── Expand via cross-references where needed └── Agent continues deep exploration using existing tools ``` --- ## Data Model (DuckDB) Use corpus-scoped tables and file freshness fields to prevent collisions and stale indexes. ```sql -- Install and load extension programmatically -- INSTALL vss; LOAD vss; CREATE TABLE IF NOT EXISTS corpora ( id VARCHAR PRIMARY KEY, root_path VARCHAR NOT NULL UNIQUE, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE IF NOT EXISTS documents ( id VARCHAR PRIMARY KEY, corpus_id VARCHAR NOT NULL REFERENCES corpora(id), relative_path VARCHAR NOT NULL, absolute_path VARCHAR NOT NULL, content VARCHAR NOT NULL, metadata JSON NOT NULL DEFAULT '{}', file_mtime DOUBLE NOT NULL, file_size BIGINT NOT NULL, content_sha256 VARCHAR NOT NULL, last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, is_deleted BOOLEAN DEFAULT FALSE, UNIQUE(corpus_id, relative_path) ); -- EMBEDDING_DIM is configured in code at index creation time. CREATE TABLE IF NOT EXISTS chunks ( id VARCHAR PRIMARY KEY, doc_id VARCHAR NOT NULL REFERENCES documents(id), text VARCHAR NOT NULL, embedding FLOAT[${EMBEDDING_DIM}] NOT NULL, embedding_dim INTEGER NOT NULL, position INTEGER NOT NULL, start_char INTEGER NOT NULL, end_char INTEGER NOT NULL ); CREATE TABLE IF NOT EXISTS schemas ( id INTEGER PRIMARY KEY, corpus_id VARCHAR REFERENCES corpora(id), name VARCHAR, schema_def JSON NOT NULL, is_active BOOLEAN DEFAULT FALSE, UNIQUE(corpus_id, name) ); CREATE INDEX IF NOT EXISTS idx_chunks_embedding ON chunks USING HNSW (embedding) WITH (metric = 'cosine'); ``` ### Embedding Dimension Rule `EMBEDDING_DIM` must be a runtime config constant validated at startup. Do not hardcode `1536` across modules. ### DB Location Default: `~/.fs_explorer/index.duckdb` Override via: - `FS_EXPLORER_DB_PATH` - CLI: `--db-path` --- ## CLI Contract and Rollout ### Compatibility Rules (Required) 1. `uv run explore --task "..."` must keep working as-is. 2. Existing non-indexed behavior remains default in initial rollout. 3. New indexed behavior is opt-in first. ### New Commands ```bash # Index management uv run explore index uv run explore index --with-metadata uv run explore index --schema schema.json # Indexed query path uv run explore query --task "..." --folder [--filter "..."] # Schema inspection uv run explore schema --discover uv run explore schema --show --folder # Existing command (backward-compatible) uv run explore --task "..." [--folder ] [--use-index] ``` ### Folder Resolution (Deterministic) For commands that need corpus selection: 1. If `--folder` is provided, use it. 2. Else use current working directory (`.`). 3. Do not parse folder intent from natural language task text in v1. ### Auto-Detection Strategy - v1: explicit `--use-index` only. - v2: optional auto-detect behind feature flag `FS_EXPLORER_AUTO_INDEX=1`. - v3: default auto-detect only after parity tests and quality benchmarks pass. --- ## Server and Concurrency Requirements Before adding indexing/search endpoints: 1. Remove request-level `os.chdir` usage; pass absolute target folder through workflow state. 2. 
Avoid global singleton agent across concurrent requests; instantiate per workflow run/session. 3. Add per-corpus index lock to avoid concurrent write corruption. 4. Keep read queries concurrent-safe. --- ## Module Structure ```text src/fs_explorer/ ├── storage/ │ ├── __init__.py │ ├── base.py │ └── duckdb.py ├── indexing/ │ ├── __init__.py │ ├── pipeline.py │ ├── chunker.py │ ├── metadata.py │ └── schema.py ├── search/ │ ├── __init__.py │ ├── query.py │ ├── semantic.py │ ├── filters.py │ └── ranker.py ├── embeddings.py └── index_config.py ``` --- ## Files to Modify | File | Changes | |------|---------| | `src/fs_explorer/agent.py` | Add indexed tools and prompt guidance while keeping existing tools | | `src/fs_explorer/models.py` | Extend `Tools` type alias | | `src/fs_explorer/main.py` | Add subcommands + `--folder` + `--use-index` while preserving root command | | `src/fs_explorer/workflow.py` | Remove global/shared run-state assumptions | | `src/fs_explorer/fs.py` | Support safe path resolution without cwd mutation | | `src/fs_explorer/server.py` | Add index/search endpoints and remove `os.chdir` coupling | | `pyproject.toml` | Add `duckdb`, `langextract` | --- ## Implementation Phases ### Phase 0: Contracts and Safety (New) 1. Freeze CLI compatibility requirements (`explore --task` must remain stable). 2. Define deterministic folder resolution contract. 3. Define per-request state model for workflow/server. 4. Add failing tests for compatibility and concurrency assumptions. ### Phase 1: Storage + Embeddings 5. Implement `storage/base.py` (backend interface). 6. Implement `storage/duckdb.py` with corpus-scoped schema. 7. Implement `embeddings.py` with configurable embedding dimension. 8. Add storage/embedding tests (including dimension validation). ### Phase 2: Indexing Pipeline 9. Implement `indexing/chunker.py`. 10. Implement optional `indexing/metadata.py`. 11. Implement `indexing/schema.py`. 12. Implement `indexing/pipeline.py` with freshness checks (`mtime`, hash, deleted files). 13. Add indexing tests. ### Phase 3: Search Pipeline 14. Implement `search/filters.py`. 15. Implement `search/ranker.py`. 16. Implement `search/query.py` (parallel retrieval + union). 17. Implement cross-reference expansion hooks. 18. Add search tests. ### Phase 4: Agent Integration (Opt-in) 19. Add tools: `semantic_search`, `get_document`, `list_indexed_documents`. 20. Keep existing 6 filesystem tools available. 21. Add indexed prompt guidance without removing current strategy. 22. Add tool-selection tests for indexed and non-indexed paths. ### Phase 5: CLI + Server Integration 23. Add `explore index/query/schema` commands. 24. Add `--folder` and `--use-index` to root command. 25. Integrate indexed path into workflow when explicitly requested. 26. Add `/api/index` and `/api/search` endpoints. 27. Remove `os.chdir` in server workflow path. ### Phase 6: Auto-Detect Rollout (Guarded) 28. Add feature-flagged auto-detect (`FS_EXPLORER_AUTO_INDEX`). 29. Add parity checks between indexed and baseline runs on test corpora. 30. Keep fallback to legacy behavior on index errors. ### Phase 7: Testing and Docs 31. Full integration tests. 32. Backward compatibility tests. 33. Concurrency tests for WebSocket/API usage. 34. Performance benchmarks and docs updates. --- ## Revised Design Decisions 1. **Opt-in First**: indexed retrieval starts behind `--use-index` to avoid regressions. 2. **Deterministic Corpus Selection**: explicit `--folder` or `.` fallback only. 3. 
**Corpus-Scoped Storage**: avoid global path collisions by namespacing. 4. **Freshness Tracking**: incremental reindex using mtime/hash/deletion markers. 5. **No Global Request State**: remove `os.chdir` and shared singleton pitfalls in server flows. 6. **Configurable Embedding Dimension**: validated at runtime; not hardcoded everywhere. 7. **No External DB Service**: embedded local DB only; APIs are still external dependencies. --- ## Verification Steps ```bash # Baseline safety (must stay green) uv run pytest tests/test_models.py tests/test_fs.py tests/test_agent.py -v # Phase 1-3 uv run pytest tests/test_storage.py tests/test_embeddings.py tests/test_search.py -v # Index build + inspect uv run explore index data/test_acquisition/ uv run python -c "import duckdb; db=duckdb.connect('~/.fs_explorer/index.duckdb'); print(db.execute('SELECT COUNT(*) FROM documents').fetchone())" # Opt-in indexed execution uv run explore --task "Search for acquisition terms" --folder data/test_acquisition --use-index # Compatibility execution (legacy path) uv run explore --task "Look in data/test_acquisition/. Who is the CTO?" # CLI checks uv run explore --help uv run explore index --help uv run explore query --help uv run explore schema --help # Full suite uv run pytest tests/ -v ``` --- ## Dependencies to Add ```toml # pyproject.toml dependencies = [ # ... existing ... "duckdb>=1.0.0", "langextract>=1.0.0", ] ``` --- ## Critical Files Summary | Purpose | Path | |---------|------| | Storage interface | `src/fs_explorer/storage/base.py` | | DuckDB backend | `src/fs_explorer/storage/duckdb.py` | | Embeddings | `src/fs_explorer/embeddings.py` | | Chunking | `src/fs_explorer/indexing/chunker.py` | | Metadata extraction | `src/fs_explorer/indexing/metadata.py` | | Schema discovery | `src/fs_explorer/indexing/schema.py` | | Indexing pipeline | `src/fs_explorer/indexing/pipeline.py` | | Query pipeline | `src/fs_explorer/search/query.py` | | Filter parsing | `src/fs_explorer/search/filters.py` | | Result ranking | `src/fs_explorer/search/ranker.py` | | Agent tools/prompt | `src/fs_explorer/agent.py` | | Tool types | `src/fs_explorer/models.py` | | CLI commands | `src/fs_explorer/main.py` | | Workflow safety | `src/fs_explorer/workflow.py` | | Server safety/endpoints | `src/fs_explorer/server.py` | ================================================ FILE: Makefile ================================================ .PHONY: test lint format format-check typecheck build all: test lint format typecheck test: $(info ****************** running tests ******************) uv run pytest tests lint: $(info ****************** linting ******************) uv run pre-commit run -a format: $(info ****************** formatting ******************) uv run ruff format format-check: $(info ****************** checking formatting ******************) uv run ruff format --check typecheck: $(info ****************** type checking ******************) uv run ty check src/fs_explorer/ build: $(info ****************** building ******************) uv build ================================================ FILE: README.md ================================================ # Agentic File Search > **Based on**: [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) — The original CLI agent for filesystem exploration. An AI-powered document search agent that explores files like a human would — scanning, reasoning, and following cross-references. 
Unlike traditional RAG systems that rely on pre-computed embeddings, this agent dynamically navigates documents to find answers. ## Why Agentic Search? Traditional RAG (Retrieval-Augmented Generation) has limitations: - **Chunks lose context** — Splitting documents destroys relationships between sections - **Cross-references are invisible** — "See Exhibit B" means nothing to embeddings - **Similarity ≠ Relevance** — Semantic matching misses logical connections This system uses a **three-phase strategy**: 1. **Parallel Scan** — Preview all documents in a folder at once 2. **Deep Dive** — Full extraction on relevant documents only 3. **Backtrack** — Follow cross-references to previously skipped documents ## Watch the video This video explains the architecture of the project and how to run it. [![Watch the demo on YouTube](https://img.youtube.com/vi/rMADSuus6jg/maxresdefault.jpg)](https://www.youtube.com/watch?v=rMADSuus6jg) ## Features - 🔍 **6 Tools**: `scan_folder`, `preview_file`, `parse_file`, `read`, `grep`, `glob` - 📄 **Document Support**: PDF, DOCX, PPTX, XLSX, HTML, Markdown (via Docling) - 🤖 **Powered by**: Google Gemini 3 Flash with structured JSON output - 💰 **Cost Efficient**: ~$0.001 per query with token tracking - 🌐 **Web UI**: Real-time WebSocket streaming interface - 📊 **Citations**: Answers include source references ## Installation ```bash # Clone the repository git clone https://github.com/PromtEngineer/agentic-file-search.git cd agentic-file-search # Install with uv (recommended) uv pip install . # Or with pip pip install . ``` ## Configuration Create a `.env` file in the project root: ```bash GOOGLE_API_KEY=your_api_key_here ``` Get your API key from [Google AI Studio](https://aistudio.google.com/apikey). ## Usage ### CLI ```bash # Basic query uv run explore --task "What is the purchase price in data/test_acquisition/?" # Multi-document query uv run explore --task "Look in data/large_acquisition/. What are all the financial terms including adjustments and escrow?" ``` ### Web UI ```bash # Start the server uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000 # Open http://127.0.0.1:8000 in your browser ``` The web UI provides: - Folder browser to select target directory - Real-time step-by-step execution log - Final answer with citations - Token usage and cost statistics ## Architecture ``` User Query ↓ ┌─────────────────┐ │ Workflow Engine │ ←→ LlamaIndex Workflows (event-driven) └────────┬────────┘ ↓ ┌─────────────────┐ │ Agent │ ←→ Gemini 3 Flash (structured JSON) └────────┬────────┘ ↓ ┌─────────────────────────────────────────┐ │ scan_folder │ preview │ parse │ read │ grep │ glob │ └─────────────────────────────────────────┘ ↓ Document Parser (Docling - local) ``` See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed diagrams. ## Test Documents The repo includes test document sets for evaluation: - `data/test_acquisition/` — 10 interconnected legal documents - `data/large_acquisition/` — 25 documents with extensive cross-references Example queries: ```bash # Simple (single doc) uv run explore --task "Look in data/test_acquisition/. Who is the CTO?" # Cross-reference required uv run explore --task "Look in data/test_acquisition/. What is the adjusted purchase price?" # Multi-document synthesis uv run explore --task "Look in data/large_acquisition/. What happens to employees after the acquisition?" 
``` ## Tech Stack | Component | Technology | |-----------|------------| | LLM | Google Gemini 3 Flash | | Document Parsing | Docling (local, open-source) | | Orchestration | LlamaIndex Workflows | | CLI | Typer + Rich | | Web Server | FastAPI + WebSocket | | Package Manager | uv | ## Project Structure ``` src/fs_explorer/ ├── agent.py # Gemini client, token tracking ├── workflow.py # LlamaIndex workflow engine ├── fs.py # File tools: scan, parse, grep ├── models.py # Pydantic models for actions ├── main.py # CLI entry point ├── server.py # FastAPI + WebSocket server └── ui.html # Single-file web interface ``` ## Development ```bash # Install dev dependencies uv pip install -e ".[dev]" # Run tests uv run pytest # Lint uv run ruff check . ``` ## License MIT ## Acknowledgments - Original concept from [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) - Document parsing by [Docling](https://github.com/DS4SD/docling) - Powered by [Google Gemini](https://deepmind.google/technologies/gemini/) ## Star History [![Star History Chart](https://api.star-history.com/svg?repos=PromtEngineer/agentic-file-search&type=Date)](https://star-history.com/#PromtEngineer/agentic-file-search&Date) ================================================ FILE: YOUTUBE_DEMO_TESTS.md ================================================ # YouTube Demo: FS-Explorer Test Results ## System Overview - **25 PDF documents** (~93 pages total) - **63 cross-references** between documents - **Parallel document scanning** using ThreadPoolExecutor - **Three-phase exploration**: Scan → Filter → Deep Dive + Backtracking --- ## Test Results Summary | Test | Question | Docs Parsed | Tokens Used | Cost | Time | |------|----------|-------------|-------------|------|------| | 1 | CTO & retention bonus | 1 | 27,376 | $0.0021 | ~30s | | 2 | Litigation & insurance | 2 | 41,431 | $0.0033 | ~30s | | 3 | Patents owned | 1 | 27,416 | $0.0022 | ~27s | --- ## TEST 1: Simple Fact Lookup ✅ **Question:** ```bash uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?" ``` **Agent Answer:** > Dr. Sarah Chen is the CTO and their retention bonus is $1,200,000. **Verification (from 07_employee_schedule.pdf):** ``` 1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000 ``` | Metric | Value | |--------|-------| | Documents Scanned | 26 (parallel) | | Documents Parsed | 1 | | Tokens Used | 27,376 | | Cost | $0.0021 | | ✅ Accuracy | **100% Correct** | --- ## TEST 2: Cross-Reference Following ✅ **Question:** ```bash uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?" ``` **Agent Answer:** > The total litigation exposure is between $650,000 and $2,350,000. The Smith v. InnovateTech Solutions, Inc. case, with an exposure of $150,000 - $350,000, is covered by insurance. The DataTech LLC v. InnovateTech Solutions, Inc. case, with an exposure of $500,000 - $2,000,000, does not specify if it is covered by insurance. **Verification (from 12_litigation_schedule.pdf):** ``` 1. Smith v. InnovateTech Solutions, Inc. Exposure: $150,000 - $350,000 (covered by insurance) 2. DataTech LLC v. InnovateTech Solutions, Inc. 
Exposure: Preliminary assessment $500,000 - $2,000,000 ``` **Cross-Reference Followed:** - Agent noticed reference to "Legal Opinion Letter" - Parsed `24_legal_opinion.pdf` to get complete picture | Metric | Value | |--------|-------| | Documents Scanned | 26 (parallel) | | Documents Parsed | 2 (litigation + legal opinion) | | Tokens Used | 41,431 | | Cost | $0.0033 | | ✅ Accuracy | **100% Correct** | --- ## TEST 3: Multi-Document IP Analysis ✅ **Question:** ```bash uv run explore --task "Look in data/large_acquisition/. What patents does the company own and are there any pending applications?" ``` **Agent Answer:** > The company owns the following patents: > - US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021 > - US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022 > - US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023 > > There is one pending application: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024. **Verification (from 06_ip_schedule.pdf):** ``` US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021 US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022 US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023 Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024 ``` | Metric | Value | |--------|-------| | Documents Scanned | 26 (parallel) | | Documents Parsed | 1 | | Tokens Used | 27,416 | | Cost | $0.0022 | | ✅ Accuracy | **100% Correct** | --- ## Additional Demo Tests ### Purchase Price & Payment Structure ```bash uv run explore --task "Look in data/large_acquisition/. What is the total purchase price and how is it being paid?" ``` **Expected:** $125M total ($80M cash + $30M stock + $15M escrow) ### Closing Conditions Status ```bash uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?" ``` **Expected:** HSR ✅, State filings ✅, MegaCorp consent ✅, GlobalBank pending, Employee retention ✅, Legal opinion ✅, Good standing ordered ### Key Employee Compensation ```bash uv run explore --task "Look in data/large_acquisition/. List all the key employees and their retention bonuses" ``` **Expected:** 5 employees totaling $3.5M in retention bonuses --- ## Key Architecture Points to Highlight ### 1. Parallel Scanning (scan_folder) - Scans ALL 26 documents simultaneously using ThreadPoolExecutor - Takes ~25 seconds for entire folder - Returns quick preview of each document ### 2. Smart Filtering - LLM reviews all previews at once - Identifies which documents are relevant - Avoids parsing irrelevant documents ### 3. Cross-Reference Discovery - Agent watches for document references like: - "See Document: Legal Opinion Letter" - "Per Document: Risk Assessment Memo" - Automatically follows references (backtracking) ### 4. Document Caching - Documents cached after first parse - Backtracking is free (no re-parsing) --- ## Cost Analysis | Scenario | Tokens | Est. Cost | |----------|--------|-----------| | Simple query (1 doc) | ~27K | $0.002 | | Cross-ref query (2-3 docs) | ~40K | $0.003 | | Complex synthesis (5+ docs) | ~60K | $0.005 | | All 25 documents parsed | ~150K | $0.012 | **Key Insight:** Even with 25 documents, costs are minimal because the system only parses what's needed! 
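The cost column follows directly from the Gemini 2.0 Flash rates quoted in ARCHITECTURE.md ($0.075 per 1M input tokens, $0.30 per 1M output tokens). A quick back-of-the-envelope check, assuming the reported token totals are dominated by input tokens:

```python
# Rough sanity check of the cost figures above using the rates from
# ARCHITECTURE.md; assumes token totals are mostly input (prompt) tokens.
INPUT_RATE = 0.075 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.30 / 1_000_000   # USD per output token


def estimate_cost(prompt_tokens: int, completion_tokens: int = 0) -> float:
    return prompt_tokens * INPUT_RATE + completion_tokens * OUTPUT_RATE


# Test 1 reported ~27,376 tokens -> about $0.0021, matching the table.
print(f"${estimate_cost(27_376):.4f}")
```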
--- ## Commands to Run Demo ```bash # Setup cd /path/to/fs-explorer export GOOGLE_API_KEY="your-key" # Run any test uv run explore --task "Look in data/large_acquisition/. [YOUR QUESTION]" ``` --- ## What to Show in Video 1. **The folder scan** - Watch as 26 documents are scanned in parallel 2. **Smart filtering** - Note which documents the agent CHOOSES to parse 3. **Cross-reference following** - Show agent backtracking to referenced docs 4. **Token usage summary** - Highlight the efficiency stats at the end 5. **Verification** - Show the actual PDF content matches the answer ================================================ FILE: data/large_acquisition/TEST_QUESTIONS.md ================================================ # Test Questions for Large Document Set ## Document Overview - 25 interconnected documents - Each document 3-6 pages - Extensive cross-references between documents - Total content: ~100+ pages ## Test Questions ### Level 1: Single Document (Easy) ```bash uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?" uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?" uv run explore --task "Look in data/large_acquisition/. What patents does the company own?" ``` ### Level 2: Cross-Reference Required (Medium) ```bash uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?" uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?" uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?" ``` ### Level 3: Multi-Document Synthesis (Hard) ```bash uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?" uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks." uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?" ``` ### Level 4: Deep Cross-Reference (Expert) ```bash uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?" uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation." uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents." ``` ================================================ FILE: data/test_acquisition/TEST_QUESTIONS.md ================================================ # Test Questions for Document Exploration These questions are designed to test the two-stage document exploration approach with cross-reference discovery. ## Test Scenario **Context:** TechCorp Industries is acquiring StartupXYZ LLC. There are 10 documents in this folder related to the acquisition. --- ## Question Set 1: Simple (Single Document) These questions can be answered from a single document: ```bash # Q1: What is the purchase price? explore --task "What is the total purchase price for the StartupXYZ acquisition?" # Q2: When did the NDA get signed? 
explore --task "When was the Non-Disclosure Agreement between TechCorp and StartupXYZ signed?" # Q3: How many patents does StartupXYZ have? explore --task "How many patents does StartupXYZ own?" ``` **Expected Behavior:** - Agent should preview documents - Identify the relevant document quickly - Parse only that document for the answer --- ## Question Set 2: Medium (2-3 Documents with Cross-References) These questions require following cross-references: ```bash # Q4: What risks were identified and how were they addressed? explore --task "What are the key risks identified in this acquisition and what mitigation measures were put in place?" # Q5: What's the adjusted purchase price? explore --task "The original purchase price was $45M. Were there any adjustments? What is the final amount?" # Q6: What happened with customer consents? explore --task "Which customers required consent for the acquisition, and was consent obtained from all of them?" ``` **Expected Behavior:** - Agent previews documents - Reads Risk Assessment Memo - Notices references to Financial Adjustments, Customer Consents - Follows cross-references to get complete picture --- ## Question Set 3: Complex (Multiple Documents, Deep Cross-References) These questions require synthesizing information from many documents: ```bash # Q7: Complete IP status explore --task "Give me a complete picture of StartupXYZ's intellectual property - what do they own, is it properly certified, and are there any pending matters or risks?" # Q8: Due diligence findings and resolution explore --task "What did the due diligence process uncover, and how were any issues resolved before closing?" # Q9: Full timeline and status explore --task "Create a timeline of this acquisition from NDA signing to closing. What are the key milestones and their status?" # Q10: Closing readiness explore --task "Is this acquisition ready to close? What items are complete and what's still pending?" ``` **Expected Behavior:** - Agent should preview all documents first - Read the most relevant documents (e.g., Closing Checklist references everything) - Follow cross-references to IP Certification, Due Diligence, Risk Assessment, etc. - Synthesize information from 5+ documents --- ## Question Set 4: Adversarial (Tests Cross-Reference Discovery) These questions specifically test if the agent goes back to previously-skipped documents: ```bash # Q11: Following exhibit references explore --task "The Acquisition Agreement mentions 'Exhibit A - Financial Terms'. What are the detailed financial terms?" # Q12: Understanding document relationships explore --task "How does the Legal Opinion Letter relate to other documents in this acquisition?" # Q13: Hidden connection explore --task "Is there anything about MegaCorp in these documents? Why are they important to this deal?" ``` **Expected Behavior:** - Q11: Agent might initially skip Financial Adjustments, but should go back when Acquisition Agreement references Exhibit A - Q12: Agent should trace all documents referenced BY and FROM the Legal Opinion - Q13: MegaCorp is mentioned in Due Diligence, Risk Assessment, and Customer Consents - agent should connect the dots --- ## Scoring Rubric | Metric | Description | |--------|-------------| | **Preview Usage** | Did the agent use `preview_file` before `parse_file`? | | **Selective Parsing** | Did the agent avoid parsing irrelevant documents? | | **Cross-Reference Discovery** | Did the agent follow document references? 
| | **Backtracking** | Did the agent return to previously-skipped documents when needed? | | **Answer Completeness** | Was the final answer comprehensive and accurate? | --- ## Running a Test ```bash export GOOGLE_API_KEY="your-key" cd /path/to/fs-explorer uv run explore --task "YOUR QUESTION HERE" ``` Watch for: 1. Which documents get previewed 2. Which documents get fully parsed 3. Whether the agent mentions cross-references 4. Whether the agent goes back to read referenced documents ================================================ FILE: data/testfile.txt ================================================ This is a test. ================================================ FILE: docker/docker-compose.yml ================================================ version: '3.8' services: postgres: image: pgvector/pgvector:pg17 container_name: fs-explorer-db environment: POSTGRES_USER: ${POSTGRES_USER:-fs_explorer} POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-devpassword} POSTGRES_DB: ${POSTGRES_DB:-fs_explorer} ports: - "${POSTGRES_PORT:-5432}:5432" volumes: - postgres_data:/var/lib/postgresql/data - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro healthcheck: test: ["CMD-SHELL", "pg_isready -U fs_explorer -d fs_explorer"] interval: 5s timeout: 5s retries: 5 restart: unless-stopped volumes: postgres_data: ================================================ FILE: pyproject.toml ================================================ [build-system] requires = ["uv_build>=0.9.10,<0.10.0"] build-backend = "uv_build" [project] name = "fs-explorer" version = "0.1.0" description = "Explore and understand your filesystem better with AI." readme = "README.md" requires-python = ">=3.10" dependencies = [ "docling>=2.55.0", "duckdb>=1.0.0", "fastapi>=0.115.0", "google-genai>=1.55.0", "langextract>=1.0.0", "llama-index-workflows>=2.11.5", "python-dotenv>=1.0.0", "reportlab>=4.4.7", "rich>=13.0.0", "typer>=0.12.5,<0.20.0", "uvicorn>=0.34.0", "websockets>=14.0", ] [dependency-groups] dev = [ "pre-commit>=4.5.0", "pytest>=9.0.2", "pytest-asyncio>=1.3.0", "ruff>=0.14.9", "ty>=0.0.1a33", ] [project.scripts] explore = "fs_explorer.main:app" explore-ui = "fs_explorer.server:run_server" ================================================ FILE: scripts/generate_large_docs.py ================================================ #!/usr/bin/env python3 """ Generate a large set of interconnected legal documents for testing. Creates 25 documents, each 3-5 pages, with extensive cross-references. 
""" import os from reportlab.lib.pagesizes import letter from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle from reportlab.lib.units import inch OUTPUT_DIR = "data/large_acquisition" # Document metadata with cross-references DOCUMENTS = { "01_master_agreement": { "title": "MASTER ACQUISITION AGREEMENT", "refs": ["02_schedules", "03_exhibits", "04_disclosure_schedules", "05_ancillary_agreements"], "pages": 5 }, "02_schedules": { "title": "SCHEDULES TO ACQUISITION AGREEMENT", "refs": ["01_master_agreement", "06_ip_schedule", "07_employee_schedule", "08_contract_schedule"], "pages": 4 }, "03_exhibits": { "title": "EXHIBITS TO ACQUISITION AGREEMENT", "refs": ["01_master_agreement", "09_escrow_agreement", "10_stock_purchase"], "pages": 3 }, "04_disclosure_schedules": { "title": "SELLER DISCLOSURE SCHEDULES", "refs": ["01_master_agreement", "11_financial_statements", "12_litigation_schedule"], "pages": 5 }, "05_ancillary_agreements": { "title": "ANCILLARY AGREEMENTS INDEX", "refs": ["13_nda", "14_non_compete", "15_consulting_agreement", "16_transition_services"], "pages": 2 }, "06_ip_schedule": { "title": "SCHEDULE 3.12 - INTELLECTUAL PROPERTY", "refs": ["01_master_agreement", "17_patent_assignments", "18_trademark_registrations"], "pages": 4 }, "07_employee_schedule": { "title": "SCHEDULE 3.15 - EMPLOYEE MATTERS", "refs": ["01_master_agreement", "19_retention_agreements", "20_benefit_plans"], "pages": 4 }, "08_contract_schedule": { "title": "SCHEDULE 3.13 - MATERIAL CONTRACTS", "refs": ["01_master_agreement", "21_customer_contracts", "22_vendor_contracts"], "pages": 5 }, "09_escrow_agreement": { "title": "ESCROW AGREEMENT", "refs": ["01_master_agreement", "03_exhibits", "11_financial_statements"], "pages": 4 }, "10_stock_purchase": { "title": "STOCK PURCHASE DETAILS - EXHIBIT B", "refs": ["01_master_agreement", "11_financial_statements"], "pages": 3 }, "11_financial_statements": { "title": "AUDITED FINANCIAL STATEMENTS", "refs": ["04_disclosure_schedules", "23_audit_report"], "pages": 6 }, "12_litigation_schedule": { "title": "SCHEDULE 3.9 - LITIGATION AND CLAIMS", "refs": ["04_disclosure_schedules", "24_legal_opinion"], "pages": 3 }, "13_nda": { "title": "NON-DISCLOSURE AGREEMENT", "refs": ["01_master_agreement"], "pages": 3 }, "14_non_compete": { "title": "NON-COMPETITION AGREEMENT", "refs": ["01_master_agreement", "07_employee_schedule"], "pages": 3 }, "15_consulting_agreement": { "title": "CONSULTING AGREEMENT - FOUNDER", "refs": ["01_master_agreement", "07_employee_schedule", "19_retention_agreements"], "pages": 4 }, "16_transition_services": { "title": "TRANSITION SERVICES AGREEMENT", "refs": ["01_master_agreement", "25_closing_checklist"], "pages": 4 }, "17_patent_assignments": { "title": "PATENT ASSIGNMENT AGREEMENTS", "refs": ["06_ip_schedule", "01_master_agreement"], "pages": 3 }, "18_trademark_registrations": { "title": "TRADEMARK REGISTRATION SCHEDULE", "refs": ["06_ip_schedule"], "pages": 2 }, "19_retention_agreements": { "title": "KEY EMPLOYEE RETENTION AGREEMENTS", "refs": ["07_employee_schedule", "15_consulting_agreement"], "pages": 4 }, "20_benefit_plans": { "title": "EMPLOYEE BENEFIT PLAN SCHEDULE", "refs": ["07_employee_schedule"], "pages": 3 }, "21_customer_contracts": { "title": "MAJOR CUSTOMER CONTRACT SUMMARIES", "refs": ["08_contract_schedule", "01_master_agreement"], "pages": 5 }, "22_vendor_contracts": { "title": "MAJOR VENDOR CONTRACT SUMMARIES", "refs": 
["08_contract_schedule"], "pages": 3 }, "23_audit_report": { "title": "INDEPENDENT AUDITOR'S REPORT", "refs": ["11_financial_statements", "04_disclosure_schedules"], "pages": 4 }, "24_legal_opinion": { "title": "LEGAL OPINION LETTER", "refs": ["01_master_agreement", "12_litigation_schedule", "06_ip_schedule"], "pages": 3 }, "25_closing_checklist": { "title": "CLOSING CHECKLIST AND CONDITIONS", "refs": ["01_master_agreement", "09_escrow_agreement", "16_transition_services", "17_patent_assignments", "21_customer_contracts"], "pages": 4 } } def generate_content(doc_id: str, meta: dict) -> list: """Generate realistic legal document content.""" styles = getSampleStyleSheet() title_style = ParagraphStyle('Title', parent=styles['Heading1'], fontSize=16, spaceAfter=20) heading_style = ParagraphStyle('Heading', parent=styles['Heading2'], fontSize=12, spaceAfter=10) body_style = ParagraphStyle('Body', parent=styles['Normal'], fontSize=10, spaceAfter=8, leading=14) content = [] # Title content.append(Paragraph(meta["title"], title_style)) content.append(Spacer(1, 0.3*inch)) # Document intro with cross-references refs_text = ", ".join([f"Document: {DOCUMENTS[r]['title']}" for r in meta["refs"][:3]]) intro = f""" This document is part of the acquisition transaction between GlobalTech Corporation ("Buyer") and InnovateTech Solutions, Inc. ("Seller") dated as of February 15, 2025. This document should be read in conjunction with {refs_text}, and all other transaction documents. """ content.append(Paragraph(intro.strip(), body_style)) content.append(Spacer(1, 0.2*inch)) # Generate sections based on document type sections = generate_sections(doc_id, meta) for section_title, section_content in sections: content.append(Paragraph(section_title, heading_style)) for para in section_content: content.append(Paragraph(para, body_style)) content.append(Spacer(1, 0.15*inch)) return content def generate_sections(doc_id: str, meta: dict) -> list: """Generate document-specific sections with legal content.""" sections = [] # Add document-specific content if "master_agreement" in doc_id: sections = [ ("ARTICLE I - DEFINITIONS", [ "1.1 'Acquisition' means the purchase by Buyer of all outstanding capital stock of Seller.", "1.2 'Purchase Price' means One Hundred Twenty-Five Million Dollars ($125,000,000), subject to adjustments.", "1.3 'Closing Date' means April 1, 2025, or such other date as mutually agreed.", "1.4 'Material Adverse Effect' means any change that is materially adverse to the business of Seller.", "1.5 'Knowledge of Seller' means the actual knowledge of the officers listed in Schedule 1.5.", ]), ("ARTICLE II - PURCHASE AND SALE", [ "2.1 Subject to the terms hereof, Seller agrees to sell and Buyer agrees to purchase all Shares.", "2.2 The Purchase Price shall be paid as follows: (a) $80,000,000 in cash at Closing; " "(b) $30,000,000 in Buyer common stock per Document: Stock Purchase Details - Exhibit B; " "(c) $15,000,000 in escrow per Document: Escrow Agreement.", "2.3 Purchase Price adjustments are detailed in Document: Audited Financial Statements.", "2.4 Working capital target is $8,500,000 as calculated per Schedule 2.4.", ]), ("ARTICLE III - REPRESENTATIONS AND WARRANTIES", [ "3.1 Organization. Seller is duly organized under Delaware law.", "3.9 Litigation. Except as set forth in Document: Schedule 3.9 - Litigation and Claims, " "there are no pending legal proceedings against Seller.", "3.12 Intellectual Property. All IP is listed in Document: Schedule 3.12 - Intellectual Property. 
" "Patent assignments are documented in Document: Patent Assignment Agreements.", "3.13 Material Contracts. All contracts exceeding $100,000 annually are in Document: Schedule 3.13 - Material Contracts.", "3.15 Employees. Employee matters are disclosed in Document: Schedule 3.15 - Employee Matters.", ]), ("ARTICLE IV - COVENANTS", [ "4.1 Conduct of Business. Prior to Closing, Seller shall operate in ordinary course.", "4.2 Access. Seller shall provide Buyer access to facilities, books, and records.", "4.3 Confidentiality. Parties shall comply with Document: Non-Disclosure Agreement.", "4.4 Non-Competition. Key employees shall execute Document: Non-Competition Agreement.", ]), ("ARTICLE V - CONDITIONS TO CLOSING", [ "5.1 Buyer's conditions: (a) accuracy of representations; (b) material consents obtained; " "(c) no Material Adverse Effect; (d) receipt of Document: Legal Opinion Letter.", "5.2 Regulatory approvals as specified in Document: Closing Checklist and Conditions.", "5.3 Third-party consents from customers in Document: Major Customer Contract Summaries.", ]), ] elif "financial" in doc_id: sections = [ ("BALANCE SHEET", [ "As of December 31, 2024:", "Total Assets: $47,250,000 (Current: $18,500,000; Non-current: $28,750,000)", "Total Liabilities: $12,300,000 (Current: $8,200,000; Long-term: $4,100,000)", "Stockholders' Equity: $34,950,000", "Working Capital: $10,300,000 (above target of $8,500,000 per Document: Master Acquisition Agreement)", ]), ("INCOME STATEMENT", [ "For fiscal year ended December 31, 2024:", "Total Revenue: $52,400,000 (SaaS: $41,920,000; Professional Services: $10,480,000)", "Cost of Revenue: $15,720,000 (Gross Margin: 70%)", "Operating Expenses: $28,600,000 (R&D: $12,100,000; S&M: $11,500,000; G&A: $5,000,000)", "Operating Income: $8,080,000 (EBITDA: $11,200,000)", "Net Income: $6,464,000", ]), ("REVENUE BREAKDOWN BY CUSTOMER", [ "Top 5 customers represent 62% of revenue (see Document: Major Customer Contract Summaries):", "1. MegaCorp Industries: $12,576,000 (24%) - Contract through 2027", "2. GlobalBank Holdings: $8,384,000 (16%) - Renewal pending", "3. HealthFirst Systems: $5,240,000 (10%) - Multi-year agreement", "4. RetailMax Inc.: $3,668,000 (7%) - Expansion discussion ongoing", "5. TechPrime Solutions: $2,620,000 (5%) - New customer 2024", ]), ("NOTES TO FINANCIAL STATEMENTS", [ "Note 1: Significant Accounting Policies - Revenue recognized per ASC 606.", "Note 2: Deferred Revenue of $4,200,000 represents prepaid annual subscriptions.", "Note 3: Contingent liabilities detailed in Document: Schedule 3.9 - Litigation and Claims.", "Note 4: Related party transactions with founder disclosed in Document: Consulting Agreement - Founder.", ]), ] elif "ip_schedule" in doc_id or "patent" in doc_id: sections = [ ("PATENTS", [ "Seller owns or has rights to the following patents:", "US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021", "US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022", "US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023", "Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024", "Assignment agreements in Document: Patent Assignment Agreements.", ]), ("TRADEMARKS", [ "Registered trademarks (see Document: Trademark Registration Schedule):", "INNOVATETECH (word mark) - Reg. No. 5,123,456 - Software services", "INNOVATETECH (logo) - Reg. No. 5,234,567 - Software services", "DATAFLOW PRO - Reg. No. 
5,345,678 - Data analytics software", ]), ("TRADE SECRETS AND KNOW-HOW", [ "Seller maintains trade secrets including proprietary algorithms and processes.", "All employees have executed invention assignment agreements per Document: Schedule 3.15 - Employee Matters.", "Key technical personnel retention addressed in Document: Key Employee Retention Agreements.", ]), ] elif "employee" in doc_id or "retention" in doc_id: sections = [ ("EMPLOYEE CENSUS", [ "Total Employees: 127 (Full-time: 120; Part-time: 7)", "Engineering: 68 employees (Senior: 24; Mid-level: 32; Junior: 12)", "Sales & Marketing: 28 employees", "Customer Success: 18 employees", "G&A: 13 employees", ]), ("KEY EMPLOYEES", [ "The following are Key Employees subject to Document: Key Employee Retention Agreements:", "1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000", "2. Michael Rodriguez - VP Engineering - Leads 45-person team - Retention: $800,000", "3. Jennifer Walsh - VP Sales - $18M quota achievement - Retention: $600,000", "4. David Kim - Principal Architect - Core platform expertise - Retention: $500,000", "5. Amanda Foster - VP Customer Success - 95% retention rate - Retention: $400,000", "Founder consulting terms in Document: Consulting Agreement - Founder.", ]), ("BENEFIT PLANS", [ "Active benefit plans (details in Document: Employee Benefit Plan Schedule):", "401(k) Plan - Company match 4% - $2.1M annual cost", "Health Insurance - PPO and HMO options - $1.8M annual cost", "Stock Option Plan - 2,500,000 shares reserved - 1,800,000 granted", "Treatment of equity awards addressed in Document: Master Acquisition Agreement Section 2.6.", ]), ] elif "customer" in doc_id or "contract_schedule" in doc_id: sections = [ ("MATERIAL CUSTOMER CONTRACTS", [ "Contracts with annual value exceeding $500,000:", "", "1. MEGACORP INDUSTRIES - Master Services Agreement", " Annual Value: $12,576,000 | Term: Through December 2027", " Change of Control: Consent required (OBTAINED February 8, 2025)", " Renewal Terms: Auto-renew with 90-day notice", "", "2. GLOBALBANK HOLDINGS - Enterprise License Agreement", " Annual Value: $8,384,000 | Term: Through June 2025", " Change of Control: 60-day notice required (PROVIDED January 15, 2025)", " Renewal: Currently in negotiation for 3-year extension", "", "3. HEALTHFIRST SYSTEMS - SaaS Subscription Agreement", " Annual Value: $5,240,000 | Term: Through December 2026", " Change of Control: No restrictions", "", "See Document: Closing Checklist and Conditions for consent status.", ]), ("CONSENT REQUIREMENTS", [ "Customer consents required for acquisition (per Document: Master Acquisition Agreement):", "- MegaCorp Industries: OBTAINED (see Exhibit A hereto)", "- GlobalBank Holdings: NOTICE PROVIDED (awaiting acknowledgment)", "- Other customers: No consent required", "Risk assessment in Document: Legal Opinion Letter.", ]), ] elif "litigation" in doc_id: sections = [ ("PENDING LITIGATION", [ "1. Smith v. InnovateTech Solutions, Inc.", " Court: California Superior Court, Santa Clara County", " Claims: Wrongful termination, discrimination", " Status: Discovery phase; trial set for September 2025", " Exposure: $150,000 - $350,000 (covered by insurance)", " Opinion: See Document: Legal Opinion Letter", "", "2. DataTech LLC v. 
InnovateTech Solutions, Inc.", " Court: US District Court, Northern District of California", " Claims: Patent infringement (US Patent 9,876,543)", " Status: Motion to dismiss pending; hearing March 2025", " Exposure: Preliminary assessment $500,000 - $2,000,000", " IP validity analysis in Document: Schedule 3.12 - Intellectual Property", ]), ("THREATENED CLAIMS", [ "Demand letter received from former contractor re: unpaid invoices ($45,000).", "Resolution expected prior to Closing per Document: Closing Checklist and Conditions.", ]), ("INSURANCE COVERAGE", [ "D&O Insurance: $5,000,000 limit | Deductible: $50,000", "E&O Insurance: $3,000,000 limit | Deductible: $25,000", "General Liability: $2,000,000 limit", ]), ] elif "closing" in doc_id: sections = [ ("PRE-CLOSING CONDITIONS", [ "The following conditions must be satisfied prior to Closing:", "", "1. REGULATORY APPROVALS", " [X] HSR Filing - Early termination granted February 1, 2025", " [X] State filings - Completed in all required jurisdictions", "", "2. THIRD-PARTY CONSENTS", " [X] MegaCorp Industries - Obtained February 8, 2025", " [ ] GlobalBank Holdings - Pending (expected by March 15)", " Per Document: Major Customer Contract Summaries", "", "3. EMPLOYEE MATTERS", " [X] Key employee retention agreements executed", " [X] Founder consulting agreement finalized", " Per Document: Key Employee Retention Agreements", "", "4. LEGAL DELIVERABLES", " [X] Legal opinion - See Document: Legal Opinion Letter", " [ ] Good standing certificates - Ordered", ]), ("CLOSING DELIVERABLES", [ "SELLER DELIVERABLES:", "- Stock certificates endorsed in blank", "- Officer's certificate re: representations", "- Secretary's certificate with resolutions", "- IP assignments per Document: Patent Assignment Agreements", "- Third-party consents per above", "", "BUYER DELIVERABLES:", "- Cash payment: $80,000,000 by wire transfer", "- Stock consideration: 1,500,000 shares per Document: Stock Purchase Details - Exhibit B", "- Escrow deposit: $15,000,000 per Document: Escrow Agreement", ]), ("POST-CLOSING OBLIGATIONS", [ "1. Transition services per Document: Transition Services Agreement (6 months)", "2. Earnout payments per Exhibit C to Document: Master Acquisition Agreement", "3. Escrow release schedule per Document: Escrow Agreement", "4. Employee benefit plan merger per Document: Employee Benefit Plan Schedule", ]), ] elif "escrow" in doc_id: sections = [ ("ESCROW TERMS", [ "Escrow Amount: $15,000,000 (12% of Purchase Price)", "Escrow Agent: First National Trust Company", "Term: 18 months from Closing Date", "", "Release Schedule:", "- 6 months: $5,000,000 released (absent claims)", "- 12 months: $5,000,000 released (absent claims)", "- 18 months: Remaining balance released", "", "Claims may be made for breaches of representations in Document: Master Acquisition Agreement.", ]), ("INDEMNIFICATION", [ "Indemnification provisions per Article VII of Document: Master Acquisition Agreement:", "- Basket: $500,000 (1% of escrow)", "- Cap: $15,000,000 (escrow amount) for general reps", "- Fundamental reps: Full Purchase Price cap", "", "Specific indemnities for matters in Document: Schedule 3.9 - Litigation and Claims.", ]), ] elif "legal_opinion" in doc_id: sections = [ ("OPINIONS RENDERED", [ "Wilson & Associates LLP, counsel to Seller, renders the following opinions:", "", "1. Seller is a corporation duly organized under Delaware law.", "2. Seller has corporate power to execute Document: Master Acquisition Agreement.", "3. 
Transaction documents are valid and enforceable obligations.", "4. No conflicts with charter documents or material agreements.", "5. Based on review of Document: Schedule 3.9 - Litigation and Claims, pending " "litigation does not pose material risk to transaction.", "6. IP matters reviewed per Document: Schedule 3.12 - Intellectual Property; " "no infringement claims other than disclosed.", ]), ("QUALIFICATIONS AND ASSUMPTIONS", [ "This opinion is subject to standard qualifications regarding:", "- Bankruptcy and insolvency laws", "- Equitable principles", "- Public policy considerations", "", "We have relied upon certificates from officers of Seller and representations " "in Document: Seller Disclosure Schedules.", ]), ] elif "audit" in doc_id: sections = [ ("INDEPENDENT AUDITOR'S REPORT", [ "To the Board of Directors of InnovateTech Solutions, Inc.:", "", "We have audited the accompanying financial statements, which comprise the " "balance sheet as of December 31, 2024, and the related statements of income, " "comprehensive income, stockholders' equity, and cash flows for the year then ended.", "", "OPINION", "In our opinion, the financial statements present fairly, in all material respects, " "the financial position of InnovateTech Solutions, Inc. as of December 31, 2024, " "in accordance with accounting principles generally accepted in the United States.", ]), ("KEY AUDIT MATTERS", [ "1. REVENUE RECOGNITION", " SaaS revenue recognized ratably over subscription period per ASC 606.", " Deferred revenue of $4,200,000 verified to customer contracts.", "", "2. STOCK-BASED COMPENSATION", " Options valued using Black-Scholes model.", " Expense of $2,100,000 recorded in accordance with ASC 718.", "", "3. CONTINGENCIES", " Litigation matters reviewed with counsel (see Document: Schedule 3.9 - Litigation and Claims).", " Accruals of $350,000 determined to be appropriate.", ]), ] else: # Generic sections for other documents sections = [ ("OVERVIEW", [ f"This {meta['title']} is executed in connection with the acquisition transaction.", f"Reference documents: {', '.join([DOCUMENTS[r]['title'] for r in meta['refs'][:2]])}.", ]), ("TERMS AND CONDITIONS", [ "Standard terms apply as set forth in the Master Acquisition Agreement.", "Amendments require written consent of all parties.", ]), ("MISCELLANEOUS", [ "Governing Law: State of Delaware", "Dispute Resolution: Arbitration in San Francisco, California", "Notices: As specified in Master Acquisition Agreement", ]), ] # Add boilerplate to reach target page count for i in range(meta["pages"] - 2): sections.append((f"SECTION {len(sections) + 1}", [ f"Additional provisions related to {meta['title']}.", "All terms defined in Document: Master Acquisition Agreement apply herein.", f"Cross-reference: See {DOCUMENTS[meta['refs'][i % len(meta['refs'])]]['title']} for related provisions.", "The parties acknowledge receipt of all schedules and exhibits referenced herein.", "This section shall survive the Closing Date as specified in Article VIII of the Master Agreement.", ])) return sections def create_pdf(doc_id: str, meta: dict, output_dir: str): """Create a PDF document.""" filepath = os.path.join(output_dir, f"{doc_id}.pdf") doc = SimpleDocTemplate(filepath, pagesize=letter, topMargin=0.75*inch, bottomMargin=0.75*inch, leftMargin=1*inch, rightMargin=1*inch) content = generate_content(doc_id, meta) doc.build(content) print(f" Created: {filepath}") def main(): os.makedirs(OUTPUT_DIR, exist_ok=True) print(f"\nGenerating {len(DOCUMENTS)} large documents in 
{OUTPUT_DIR}/\n") for doc_id, meta in DOCUMENTS.items(): create_pdf(doc_id, meta, OUTPUT_DIR) # Create test questions questions_path = os.path.join(OUTPUT_DIR, "TEST_QUESTIONS.md") with open(questions_path, "w") as f: f.write("""# Test Questions for Large Document Set ## Document Overview - 25 interconnected documents - Each document 3-6 pages - Extensive cross-references between documents - Total content: ~100+ pages ## Test Questions ### Level 1: Single Document (Easy) ```bash uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?" uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?" uv run explore --task "Look in data/large_acquisition/. What patents does the company own?" ``` ### Level 2: Cross-Reference Required (Medium) ```bash uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?" uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?" uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?" ``` ### Level 3: Multi-Document Synthesis (Hard) ```bash uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?" uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks." uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?" ``` ### Level 4: Deep Cross-Reference (Expert) ```bash uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?" uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation." uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents." ``` """) print(f" Created: {questions_path}") # Summary total_pages = sum(m["pages"] for m in DOCUMENTS.values()) total_refs = sum(len(m["refs"]) for m in DOCUMENTS.values()) print(f"\n{'='*60}") print(f"SUMMARY") print(f"{'='*60}") print(f" Documents created: {len(DOCUMENTS)}") print(f" Total pages: ~{total_pages}") print(f" Cross-references: {total_refs}") print(f" Output directory: {OUTPUT_DIR}/") print(f"{'='*60}\n") if __name__ == "__main__": main() ================================================ FILE: scripts/generate_test_docs.py ================================================ #!/usr/bin/env python3 """ Generate test PDF documents for testing the two-stage document exploration approach. Scenario: TechCorp's acquisition of StartupXYZ Documents have cross-references to test the agent's ability to follow document relationships. """ from reportlab.lib.pagesizes import letter from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle from reportlab.lib.units import inch import os OUTPUT_DIR = "data/test_acquisition" DOCUMENTS = { "01_acquisition_agreement.pdf": { "title": "ACQUISITION AGREEMENT", "content": """ ACQUISITION AGREEMENT

This Acquisition Agreement ("Agreement") is entered into as of January 15, 2025, by and between TechCorp Industries, Inc. ("Buyer") and StartupXYZ LLC ("Seller").

ARTICLE I - DEFINITIONS

1.1 "Acquisition" means the purchase of all outstanding shares of Seller by Buyer.
1.2 "Purchase Price" means $45,000,000 USD as detailed in Exhibit A - Financial Terms.
1.3 "Closing Date" means March 1, 2025, subject to conditions in Article IV.
1.4 "Employee Matters" shall be governed by Schedule 3 - Employee Transition Plan.

ARTICLE II - PURCHASE AND SALE

2.1 Subject to the terms and conditions of this Agreement, Seller agrees to sell, and Buyer agrees to purchase, all of the issued and outstanding shares of Seller.

2.2 The Purchase Price shall be paid as follows:
(a) $30,000,000 in cash at Closing
(b) $10,000,000 in Buyer's common stock (see Exhibit B - Stock Valuation)
(c) $5,000,000 in earnout payments (see Exhibit C - Earnout Terms)

ARTICLE III - REPRESENTATIONS AND WARRANTIES

3.1 Seller represents and warrants that the financial statements provided in Document: Due Diligence Report are accurate and complete.

3.2 Seller represents that all intellectual property is properly documented in Schedule 1 - IP Assets and is free of encumbrances as certified in Document: IP Certification Letter.

3.3 All material contracts are listed in Schedule 2 - Material Contracts.

ARTICLE IV - CONDITIONS TO CLOSING

4.1 Buyer's obligation to close is subject to:
(a) Receipt of regulatory approval as documented in Document: Regulatory Approval Letter
(b) Completion of due diligence per Document: Due Diligence Report
(c) No material adverse change as defined in Section 1.5

4.2 Both parties acknowledge the risks identified in Document: Risk Assessment Memo.

ARTICLE V - CONFIDENTIALITY

5.1 This Agreement is subject to the terms of the Document: Non-Disclosure Agreement executed between the parties on October 1, 2024.

IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first above written.

_________________________
TechCorp Industries, Inc.
By: James Mitchell, CEO

_________________________
StartupXYZ LLC
By: Sarah Chen, Founder & CEO """ }, "02_due_diligence_report.pdf": { "title": "DUE DILIGENCE REPORT", "content": """ CONFIDENTIAL DUE DILIGENCE REPORT

Prepared for: TechCorp Industries, Inc.
Subject: StartupXYZ LLC
Date: December 20, 2024
Prepared by: Morrison & Associates, LLP

EXECUTIVE SUMMARY

This report summarizes our findings from the due diligence investigation of StartupXYZ LLC in connection with the proposed acquisition described in the Document: Acquisition Agreement.

1. FINANCIAL REVIEW

1.1 Revenue for FY2024: $12.3 million (growth of 45% YoY)
1.2 EBITDA: $2.1 million (17% margin)
1.3 Cash position: $3.2 million as of November 30, 2024
1.4 Outstanding debt: $1.5 million (detailed in Exhibit A - Financial Terms of the Acquisition Agreement)

KEY FINDING: Financial statements are materially accurate. Minor adjustments recommended as noted in Document: Financial Adjustments Memo.

2. INTELLECTUAL PROPERTY

2.1 StartupXYZ holds 12 patents related to AI/ML technology
2.2 All patents verified as valid per Document: IP Certification Letter
2.3 No pending litigation affecting IP (confirmed in Document: Legal Opinion Letter)
2.4 Full IP inventory in Schedule 1 - IP Assets of the Acquisition Agreement

3. EMPLOYEE MATTERS

3.1 Total employees: 47 (32 engineering, 8 sales, 7 operations)
3.2 Key employee retention risk: HIGH for 5 senior engineers
3.3 Retention bonuses recommended per Schedule 3 - Employee Transition Plan
3.4 No pending employment disputes

4. MATERIAL CONTRACTS

4.1 23 active customer contracts reviewed (see Schedule 2 - Material Contracts)
4.2 3 contracts contain change-of-control provisions requiring consent
4.3 Largest customer (MegaCorp) accounts for 28% of revenue - concentration risk noted in Document: Risk Assessment Memo

5. REGULATORY COMPLIANCE

5.1 Company is compliant with all applicable regulations
5.2 HSR filing required - timeline in Document: Regulatory Approval Letter

6. RECOMMENDATIONS

Based on our findings, we recommend proceeding with the acquisition subject to:
(a) Obtaining customer consents for change-of-control contracts
(b) Implementing retention packages for key employees
(c) Addressing items in Document: Financial Adjustments Memo

Respectfully submitted,
Morrison & Associates, LLP """ }, "03_ip_certification.pdf": { "title": "IP CERTIFICATION LETTER", "content": """ INTELLECTUAL PROPERTY CERTIFICATION LETTER

Date: December 15, 2024
To: TechCorp Industries, Inc.
From: PatentWatch Legal Services
Re: IP Certification for StartupXYZ LLC Acquisition

Dear Mr. Mitchell,

In connection with the proposed acquisition of StartupXYZ LLC as described in the Document: Acquisition Agreement, we have conducted a comprehensive review of StartupXYZ's intellectual property portfolio.

CERTIFICATION

We hereby certify the following:

1. PATENTS

StartupXYZ owns 12 U.S. patents as listed in Schedule 1 - IP Assets:
- US Patent 10,123,456: "Neural Network Optimization Method"
- US Patent 10,234,567: "Distributed AI Training System"
- US Patent 10,345,678: "Real-time Data Processing Pipeline"
- [9 additional patents listed in Schedule 1]

All patents are valid, enforceable, and free of liens or encumbrances.

2. TRADEMARKS

StartupXYZ owns 3 registered trademarks:
- "StartupXYZ" (word mark)
- StartupXYZ logo (design mark)
- "IntelliFlow" (product name)

3. TRADE SECRETS

We have reviewed StartupXYZ's trade secret protection protocols. All employees have signed appropriate NDAs. See Document: Non-Disclosure Agreement template.

4. THIRD-PARTY IP

StartupXYZ uses 47 open-source libraries. License compliance verified - no copyleft contamination issues identified.

5. PENDING MATTERS

There is one pending patent application (Application No. 17/456,789) for "Advanced Federated Learning System" expected to issue Q2 2025. This is noted in Document: Risk Assessment Memo as a minor risk item.

6. LITIGATION

No IP-related litigation is pending or threatened. This is confirmed in Document: Legal Opinion Letter.

This certification is provided in connection with the due diligence process and may be relied upon by TechCorp Industries, Inc.

Sincerely,
PatentWatch Legal Services
By: Robert Kim, Patent Attorney """ }, "04_risk_assessment.pdf": { "title": "RISK ASSESSMENT MEMO", "content": """ CONFIDENTIAL RISK ASSESSMENT MEMORANDUM

To: TechCorp Board of Directors
From: Corporate Development Team
Date: December 22, 2024
Re: Risk Assessment - StartupXYZ Acquisition

This memo summarizes key risks identified in connection with the proposed acquisition as documented in the Document: Acquisition Agreement.

1. HIGH-PRIORITY RISKS

1.1 Customer Concentration (HIGH)
- MegaCorp represents 28% of StartupXYZ revenue
- MegaCorp contract contains change-of-control clause
- Mitigation: Obtain consent prior to closing (see Document: Customer Consent Letters)
- Impact if materialized: $3.4M annual revenue at risk

1.2 Key Employee Retention (HIGH)
- 5 senior engineers critical to product development
- 2 have expressed interest in leaving post-acquisition
- Mitigation: Retention packages per Schedule 3 - Employee Transition Plan
- Estimated cost: $2.5M in retention bonuses

2. MEDIUM-PRIORITY RISKS

2.1 Earnout Structure (MEDIUM)
- $5M earnout tied to 2025-2026 performance metrics
- Metrics defined in Exhibit C - Earnout Terms of the Acquisition Agreement
- Risk: Disagreement on metric calculation methodology
- Mitigation: Clear definitions in agreement; third-party arbitration clause

2.2 Integration Costs (MEDIUM)
- Estimated integration costs: $4.2M over 18 months
- Systems integration detailed in Document: Integration Plan
- Risk: Cost overruns of 20-30% typical in tech acquisitions

3. LOW-PRIORITY RISKS

3.1 Pending Patent Application (LOW)
- One patent pending as noted in Document: IP Certification Letter
- Low risk of rejection based on patent attorney's assessment

3.2 Regulatory Approval (LOW)
- HSR filing required but expected to clear without issues
- Timeline in Document: Regulatory Approval Letter

4. FINANCIAL IMPACT SUMMARY

Total risk-adjusted impact: $6.2M - $8.7M
This is reflected in purchase price negotiations per Document: Financial Adjustments Memo

5. RECOMMENDATION

Despite identified risks, we recommend proceeding with the acquisition. The strategic value of StartupXYZ's AI technology platform justifies the purchase price when accounting for risk mitigation costs. All findings are consistent with Document: Due Diligence Report.

6. NEXT STEPS

- Finalize customer consent process
- Execute retention agreements
- Complete regulatory filings
- Prepare for closing per Document: Closing Checklist """ }, "05_financial_adjustments.pdf": { "title": "FINANCIAL ADJUSTMENTS MEMO", "content": """ FINANCIAL ADJUSTMENTS MEMORANDUM

To: Deal Team
From: Finance Department
Date: December 23, 2024
Re: Purchase Price Adjustments - StartupXYZ Acquisition

Following our review in connection with the Document: Due Diligence Report, we recommend the following adjustments to the purchase price as set forth in Exhibit A - Financial Terms of the Document: Acquisition Agreement.

1. WORKING CAPITAL ADJUSTMENT

Target working capital: $1,200,000
Estimated closing working capital: $980,000
Adjustment: ($220,000)

2. DEBT ADJUSTMENT

Previously disclosed debt: $1,500,000
Additional identified debt: $175,000 (capital lease obligations)
Adjustment: ($175,000)

3. REVENUE RECOGNITION ADJUSTMENT

Deferred revenue requiring restatement: $340,000
Impact on EBITDA: ($85,000)
Implied value adjustment (at 15x): ($1,275,000)

4. CONTINGENT LIABILITY RESERVE

As noted in Document: Risk Assessment Memo, we recommend establishing reserves for:
- Customer concentration risk: $500,000
- Integration contingency: $800,000
Total reserve: $1,300,000 (to be held in escrow per Exhibit C - Earnout Terms)

5. SUMMARY OF ADJUSTMENTS

Original Purchase Price: $45,000,000
Working Capital Adjustment: ($220,000)
Debt Adjustment: ($175,000)
Revenue Recognition: ($1,275,000)
Adjusted Purchase Price: $43,330,000

Plus escrow reserve: $1,300,000
Total Cash Required at Closing: $44,630,000

6. PAYMENT STRUCTURE

As revised from Document: Acquisition Agreement Section 2.2:
(a) Cash at closing: $28,330,000 (adjusted)
(b) Stock consideration: $10,000,000 (per Exhibit B - Stock Valuation)
(c) Earnout: $5,000,000 (unchanged, per Exhibit C - Earnout Terms)
(d) Escrow: $1,300,000 (18-month release schedule)

These adjustments have been discussed with Seller's representatives and are subject to final negotiation.

Please refer to Document: Closing Checklist for timeline and requirements. """ }, "06_legal_opinion.pdf": { "title": "LEGAL OPINION LETTER", "content": """ LEGAL OPINION LETTER

Date: December 18, 2024

TechCorp Industries, Inc.
500 Technology Drive
San Francisco, CA 94105

Re: Acquisition of StartupXYZ LLC

Ladies and Gentlemen:

We have acted as legal counsel to StartupXYZ LLC ("Company") in connection with the proposed acquisition by TechCorp Industries, Inc. pursuant to the Document: Acquisition Agreement dated January 15, 2025.

DOCUMENTS REVIEWED

In connection with this opinion, we have reviewed:
1. The Acquisition Agreement and all Exhibits and Schedules
2. Document: Due Diligence Report prepared by Morrison & Associates
3. Document: IP Certification Letter from PatentWatch Legal Services
4. All material contracts listed in Schedule 2 - Material Contracts
5. Corporate records and organizational documents of the Company
6. Document: Non-Disclosure Agreement between the parties

OPINIONS

Based on our review, we are of the opinion that:

1. Corporate Status
The Company is a limited liability company duly organized, validly existing, and in good standing under the laws of Delaware.

2. Authority
The Company has full power and authority to execute and deliver the Acquisition Agreement and to consummate the transactions contemplated thereby.

3. No Conflicts
The execution and delivery of the Acquisition Agreement does not violate any provision of the Company's organizational documents or any material contract, except for change-of-control provisions noted in Document: Customer Consent Letters.

4. Litigation
There is no litigation, arbitration, or governmental proceeding pending or, to our knowledge, threatened against the Company that would have a material adverse effect on the Company or the transactions contemplated by the Acquisition Agreement.

This opinion confirms the representations in the Document: IP Certification Letter regarding absence of IP litigation.

5. Regulatory Compliance
The Company is in material compliance with all applicable laws and regulations. The HSR filing requirements are addressed in Document: Regulatory Approval Letter.

QUALIFICATIONS

This opinion is subject to the following qualifications:
1. We express no opinion on tax matters (see separate tax opinion)
2. This opinion is limited to Delaware and federal law
3. Certain contracts require third-party consents as noted above

This opinion is provided solely for your benefit in connection with the transactions contemplated by the Acquisition Agreement.

Very truly yours,
Wilson & Partners LLP
By: Jennifer Walsh, Partner """ }, "07_nda.pdf": { "title": "NON-DISCLOSURE AGREEMENT", "content": """ MUTUAL NON-DISCLOSURE AGREEMENT

This Mutual Non-Disclosure Agreement ("NDA") is entered into as of October 1, 2024, by and between:

TechCorp Industries, Inc. ("TechCorp")
500 Technology Drive, San Francisco, CA 94105

and

StartupXYZ LLC ("StartupXYZ")
123 Innovation Way, Palo Alto, CA 94301

(each a "Party" and collectively the "Parties")

RECITALS

The Parties wish to explore a potential business relationship, including a possible acquisition of StartupXYZ by TechCorp (the "Purpose"), which is now documented in the Document: Acquisition Agreement.

1. DEFINITION OF CONFIDENTIAL INFORMATION

"Confidential Information" means any non-public information disclosed by either Party, including but not limited to:
- Financial information (as contained in Document: Due Diligence Report)
- Technical information (as certified in Document: IP Certification Letter)
- Business strategies and plans
- Customer and supplier information
- Employee information (as detailed in Schedule 3 - Employee Transition Plan)

2. OBLIGATIONS

Each Party agrees to:
(a) Hold Confidential Information in strict confidence
(b) Not disclose Confidential Information to third parties without prior written consent
(c) Use Confidential Information solely for the Purpose
(d) Limit access to Confidential Information to employees with a need to know

3. TERM

This NDA shall remain in effect for three (3) years from the date first written above, or until superseded by the confidentiality provisions in the Document: Acquisition Agreement Article V.

4. EXCLUSIONS

Confidential Information does not include information that:
(a) Is or becomes publicly available through no fault of the receiving Party
(b) Was rightfully in the receiving Party's possession prior to disclosure
(c) Is rightfully obtained from a third party without restriction
(d) Is independently developed without use of Confidential Information

5. RETURN OF MATERIALS

Upon request or termination, each Party shall return or destroy all Confidential Information, except as required for legal or regulatory purposes.

6. NO LICENSE

Nothing in this NDA grants any rights to intellectual property, except as subsequently agreed in the Document: Acquisition Agreement and Schedule 1 - IP Assets.

IN WITNESS WHEREOF, the Parties have executed this NDA as of the date first above written.

TechCorp Industries, Inc.
By: ______________________
Name: James Mitchell
Title: CEO

StartupXYZ LLC
By: ______________________
Name: Sarah Chen
Title: Founder & CEO """ }, "08_regulatory_approval.pdf": { "title": "REGULATORY APPROVAL LETTER", "content": """ FEDERAL TRADE COMMISSION
PREMERGER NOTIFICATION OFFICE

January 28, 2025

TechCorp Industries, Inc.
500 Technology Drive
San Francisco, CA 94105

StartupXYZ LLC
123 Innovation Way
Palo Alto, CA 94301

Re: Early Termination of HSR Waiting Period
Transaction: Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.

Dear Parties:

This letter confirms that the Federal Trade Commission has granted early termination of the waiting period under the Hart-Scott-Rodino Antitrust Improvements Act of 1976 for the above-referenced transaction.

FILING DETAILS

Filing Date: January 10, 2025
Transaction Value: $45,000,000 (as stated in Document: Acquisition Agreement)
HSR Filing Fee: $30,000
Early Termination Granted: January 28, 2025

EFFECT OF EARLY TERMINATION

The parties may now consummate the transaction at any time. This early termination satisfies the condition precedent set forth in Article IV, Section 4.1(a) of the Document: Acquisition Agreement.

Please note that early termination of the waiting period does not preclude the Commission from taking any action it deems necessary to protect competition.

NEXT STEPS

Per the Document: Closing Checklist, you may now proceed with the closing scheduled for March 1, 2025, subject to satisfaction of other conditions in the Document: Acquisition Agreement.

The Document: Risk Assessment Memo correctly identified this as a low-risk item. The market analysis in the Document: Due Diligence Report supported the determination that this transaction does not raise competitive concerns.

Sincerely,
Premerger Notification Office
Federal Trade Commission """ }, "09_customer_consents.pdf": { "title": "CUSTOMER CONSENT LETTERS", "content": """ CUSTOMER CONSENT STATUS REPORT

Date: February 15, 2025
To: Deal Team
From: Legal Department
Re: Change of Control Consent Status

As required by Schedule 2 - Material Contracts of the Document: Acquisition Agreement, this memo summarizes the status of customer consents for contracts containing change-of-control provisions.

CONSENT STATUS SUMMARY

1. MegaCorp Inc. - OBTAINED
Contract Value: $3.4M annual
Consent Received: February 10, 2025
Notes: MegaCorp requested meeting with TechCorp leadership; meeting held 2/8/25. Consent granted with no additional conditions. This addresses the primary concern noted in Document: Risk Assessment Memo Section 1.1.

2. DataFlow Systems - OBTAINED
Contract Value: $1.2M annual
Consent Received: February 5, 2025
Notes: Standard consent process. No concerns raised.

3. CloudTech Partners - PENDING
Contract Value: $890K annual
Status: Consent requested February 1, 2025
Expected: February 20, 2025
Notes: Legal review in progress at CloudTech. Their counsel has reviewed the Document: Acquisition Agreement and has no objections. Verbal confirmation received; written consent expected shortly.

IMPACT ANALYSIS

Per Document: Due Diligence Report Section 4, there were 3 contracts requiring consent:
- 2 obtained (representing $4.6M annual revenue)
- 1 pending (representing $890K annual revenue)

CLOSING IMPLICATIONS

The Document: Acquisition Agreement Article IV requires "material" customer consents as a closing condition. With MegaCorp consent obtained, this condition is substantially satisfied. The pending CloudTech consent is expected before the March 1 closing date per Document: Closing Checklist.

ATTACHMENTS

Attached hereto:
- Exhibit A: MegaCorp Consent Letter (dated February 10, 2025)
- Exhibit B: DataFlow Systems Consent Letter (dated February 5, 2025)
- Exhibit C: CloudTech Partners Draft Consent (pending signature)

RECOMMENDATION

We recommend proceeding with closing preparations. The risk of CloudTech withholding consent is low based on discussions with their counsel. This is consistent with the risk mitigation strategy in Document: Risk Assessment Memo. """ }, "10_closing_checklist.pdf": { "title": "CLOSING CHECKLIST", "content": """ CLOSING CHECKLIST
Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.

Closing Date: March 1, 2025
Closing Location: Wilson & Partners LLP, San Francisco

I. PRE-CLOSING CONDITIONS

A. Regulatory
[X] HSR Filing submitted - Document: Regulatory Approval Letter
[X] Early termination received (January 28, 2025)
[ ] State regulatory filings (if required)

B. Third-Party Consents
[X] MegaCorp consent - Document: Customer Consent Letters
[X] DataFlow consent - Document: Customer Consent Letters
[ ] CloudTech consent (expected February 20) - Document: Customer Consent Letters

C. Due Diligence Completion
[X] Financial due diligence - Document: Due Diligence Report
[X] Legal due diligence - Document: Legal Opinion Letter
[X] IP due diligence - Document: IP Certification Letter
[X] Risk assessment - Document: Risk Assessment Memo

II. CLOSING DOCUMENTS

A. Transaction Documents
[ ] Executed Document: Acquisition Agreement
[ ] Bill of Sale
[ ] Assignment and Assumption Agreement
[ ] IP Assignment Agreement (per Schedule 1 - IP Assets)

B. Corporate Documents
[ ] Seller's Certificate of Good Standing
[ ] Secretary's Certificate (resolutions, incumbency)
[ ] Buyer's Certificate of Good Standing

C. Financial Documents
[ ] Closing Statement per Document: Financial Adjustments Memo
[ ] Wire transfer instructions
[ ] Escrow Agreement (per Exhibit C - Earnout Terms)
[ ] Stock certificates or book entry (per Exhibit B - Stock Valuation)

D. Employment Documents
[ ] Retention agreements per Schedule 3 - Employee Transition Plan
[ ] Offer letters for key employees
[ ] WARN Act compliance (if applicable)

III. CLOSING FUNDS

Per Document: Financial Adjustments Memo:
[ ] Cash payment: $28,330,000
[ ] Escrow deposit: $1,300,000
[ ] Stock issuance: $10,000,000
Total at Closing: $39,630,000

IV. POST-CLOSING

[ ] File UCC termination statements
[ ] Update corporate records
[ ] Integration kickoff per Document: Integration Plan
[ ] Employee communications
[ ] Customer notifications
[ ] Press release

V. RESPONSIBLE PARTIES

Buyer's Counsel: Morrison & Associates LLP
Seller's Counsel: Wilson & Partners LLP
Escrow Agent: First National Trust

VI. KEY CONTACTS

TechCorp: James Mitchell (CEO), (415) 555-0100
StartupXYZ: Sarah Chen (CEO), (650) 555-0200
Legal (Buyer): John Morrison, (415) 555-0300
Legal (Seller): Jennifer Walsh, (415) 555-0400 """ } } def create_pdf(filename: str, title: str, content: str): """Create a PDF document.""" filepath = os.path.join(OUTPUT_DIR, filename) doc = SimpleDocTemplate(filepath, pagesize=letter, topMargin=1*inch, bottomMargin=1*inch, leftMargin=1*inch, rightMargin=1*inch) styles = getSampleStyleSheet() title_style = ParagraphStyle( 'CustomTitle', parent=styles['Heading1'], fontSize=16, spaceAfter=30, alignment=1 # Center ) body_style = ParagraphStyle( 'CustomBody', parent=styles['Normal'], fontSize=11, leading=14, spaceAfter=12 ) story = [] story.append(Paragraph(title, title_style)) story.append(Spacer(1, 0.5*inch)) # Split content into paragraphs and add them paragraphs = content.strip().split('
\n\n') for para in paragraphs: para = para.replace('\n', '\n
') story.append(Paragraph(para, body_style)) doc.build(story) print(f"Created: {filepath}") def main(): # Create output directory os.makedirs(OUTPUT_DIR, exist_ok=True) print(f"\nGenerating {len(DOCUMENTS)} test documents in {OUTPUT_DIR}/\n") for filename, doc_info in DOCUMENTS.items(): create_pdf(filename, doc_info["title"], doc_info["content"]) print(f"\n✅ Generated {len(DOCUMENTS)} documents successfully!") print(f"\nDocument cross-reference map:") print("=" * 60) print(""" Acquisition Agreement (01) ├── references: Exhibit A, B, C, Schedule 1-3 ├── referenced by: ALL other documents │ Due Diligence Report (02) ├── references: Acquisition Agreement, IP Cert, Risk Assessment ├── referenced by: Legal Opinion, Risk Assessment, Regulatory │ IP Certification (03) ├── references: Acquisition Agreement, Schedule 1, NDA ├── referenced by: Due Diligence, Legal Opinion │ Risk Assessment (04) ├── references: Acquisition Agreement, Due Diligence, IP Cert ├── referenced by: Financial Adjustments, Customer Consents │ Financial Adjustments (05) ├── references: Due Diligence, Risk Assessment, Acquisition Agreement ├── referenced by: Closing Checklist │ Legal Opinion (06) ├── references: Acquisition Agreement, Due Diligence, IP Cert, NDA ├── referenced by: Closing Checklist │ NDA (07) ├── references: Acquisition Agreement, Due Diligence, IP Cert ├── referenced by: IP Cert, Legal Opinion │ Regulatory Approval (08) ├── references: Acquisition Agreement, Due Diligence, Risk Assessment ├── referenced by: Closing Checklist │ Customer Consents (09) ├── references: Acquisition Agreement, Risk Assessment, Schedule 2 ├── referenced by: Closing Checklist │ Closing Checklist (10) └── references: ALL documents """) if __name__ == "__main__": main() ================================================ FILE: src/fs_explorer/__init__.py ================================================ """ FsExplorer - AI-powered filesystem exploration agent. This package provides an intelligent agent that can explore filesystems, parse documents, and answer questions about their contents using Google Gemini for decision-making and Docling for document parsing. Example usage: >>> from fs_explorer import FsExplorerAgent, workflow >>> agent = FsExplorerAgent() >>> # Use with the workflow for full exploration >>> result = await workflow.run(start_event=InputEvent(task="Find the purchase price")) """ from .agent import FsExplorerAgent, TokenUsage from .workflow import ( workflow, FsExplorerWorkflow, InputEvent, ExplorationEndEvent, ToolCallEvent, GoDeeperEvent, AskHumanEvent, HumanAnswerEvent, get_agent, reset_agent, ) from .models import Action, ActionType, Tools __all__ = [ # Agent "FsExplorerAgent", "TokenUsage", # Workflow "workflow", "FsExplorerWorkflow", "InputEvent", "ExplorationEndEvent", "ToolCallEvent", "GoDeeperEvent", "AskHumanEvent", "HumanAnswerEvent", "get_agent", "reset_agent", # Models "Action", "ActionType", "Tools", ] ================================================ FILE: src/fs_explorer/agent.py ================================================ """ FsExplorer Agent for filesystem exploration using Google Gemini. This module contains the agent that interacts with the Gemini AI model to make decisions about filesystem exploration actions. 
""" import os import re from pathlib import Path from typing import Callable, Any, cast from dataclasses import dataclass from dotenv import load_dotenv from google.genai.types import Content, HttpOptions, Part from google.genai import Client as GenAIClient from .models import Action, ActionType, ToolCallAction, Tools from .fs import ( read_file, grep_file_content, glob_paths, scan_folder, preview_file, parse_file, ) from .embeddings import EmbeddingProvider from .index_config import resolve_db_path from .search import ( IndexedQueryEngine, MetadataFilterParseError, supported_filter_syntax, ) from .storage import DuckDBStorage # Load .env file from project root _env_path = Path(__file__).parent.parent.parent / ".env" if _env_path.exists(): load_dotenv(_env_path) # ============================================================================= # Token Usage Tracking # ============================================================================= # Gemini Flash pricing (per million tokens) GEMINI_FLASH_INPUT_COST_PER_MILLION = 0.075 GEMINI_FLASH_OUTPUT_COST_PER_MILLION = 0.30 @dataclass class TokenUsage: """ Track token usage and costs across the session. Maintains running totals of API calls, token counts, and provides cost estimates based on Gemini Flash pricing. """ prompt_tokens: int = 0 completion_tokens: int = 0 total_tokens: int = 0 api_calls: int = 0 # Track content sizes tool_result_chars: int = 0 documents_parsed: int = 0 documents_scanned: int = 0 def add_api_call(self, prompt_tokens: int, completion_tokens: int) -> None: """Record token usage from an API call.""" self.prompt_tokens += prompt_tokens self.completion_tokens += completion_tokens self.total_tokens += prompt_tokens + completion_tokens self.api_calls += 1 def add_tool_result(self, result: str, tool_name: str) -> None: """Record metrics from a tool execution.""" self.tool_result_chars += len(result) if tool_name == "parse_file": self.documents_parsed += 1 elif tool_name == "scan_folder": # Count documents in scan result by counting document markers self.documents_scanned += result.count("│ [") elif tool_name == "preview_file": self.documents_parsed += 1 def _calculate_cost(self) -> tuple[float, float, float]: """Calculate estimated costs based on Gemini Flash pricing.""" input_cost = ( self.prompt_tokens / 1_000_000 ) * GEMINI_FLASH_INPUT_COST_PER_MILLION output_cost = ( self.completion_tokens / 1_000_000 ) * GEMINI_FLASH_OUTPUT_COST_PER_MILLION return input_cost, output_cost, input_cost + output_cost def summary(self) -> str: """Generate a formatted summary of token usage and costs.""" input_cost, output_cost, total_cost = self._calculate_cost() return f""" ═══════════════════════════════════════════════════════════════ TOKEN USAGE SUMMARY ═══════════════════════════════════════════════════════════════ API Calls: {self.api_calls} Prompt Tokens: {self.prompt_tokens:,} Completion Tokens: {self.completion_tokens:,} Total Tokens: {self.total_tokens:,} ─────────────────────────────────────────────────────────────── Documents Scanned: {self.documents_scanned} Documents Parsed: {self.documents_parsed} Tool Result Chars: {self.tool_result_chars:,} ─────────────────────────────────────────────────────────────── Est. 
Cost (Gemini Flash): Input: ${input_cost:.4f} Output: ${output_cost:.4f} Total: ${total_cost:.4f} ═══════════════════════════════════════════════════════════════ """ # ============================================================================= # Tool Registry # ============================================================================= @dataclass(frozen=True) class IndexContext: """Execution context for indexed retrieval tools.""" root_folder: str db_path: str _INDEX_CONTEXT: IndexContext | None = None _EMBEDDING_PROVIDER: EmbeddingProvider | None = None _FIELD_CATALOG_SHOWN: bool = False _ENABLE_SEMANTIC: bool = False _ENABLE_METADATA: bool = False def set_search_flags( *, enable_semantic: bool = False, enable_metadata: bool = False ) -> None: """Configure which indexed retrieval paths are active.""" global _ENABLE_SEMANTIC, _ENABLE_METADATA _ENABLE_SEMANTIC = enable_semantic _ENABLE_METADATA = enable_metadata def get_search_flags() -> tuple[bool, bool]: """Return (enable_semantic, enable_metadata).""" return _ENABLE_SEMANTIC, _ENABLE_METADATA def set_embedding_provider(provider: EmbeddingProvider | None) -> None: """Set the embedding provider for vector search in indexed tools.""" global _EMBEDDING_PROVIDER _EMBEDDING_PROVIDER = provider def set_index_context(folder: str, db_path: str | None = None) -> None: """Enable indexed tools for a specific folder corpus.""" global _INDEX_CONTEXT, _EMBEDDING_PROVIDER _INDEX_CONTEXT = IndexContext( root_folder=str(Path(folder).resolve()), db_path=resolve_db_path(db_path), ) # Auto-create embedding provider if API key available if _EMBEDDING_PROVIDER is None: try: _EMBEDDING_PROVIDER = EmbeddingProvider() except ValueError: pass def clear_index_context() -> None: """Disable indexed tools for the current process.""" global _INDEX_CONTEXT, _EMBEDDING_PROVIDER, _FIELD_CATALOG_SHOWN global _ENABLE_SEMANTIC, _ENABLE_METADATA _INDEX_CONTEXT = None _EMBEDDING_PROVIDER = None _FIELD_CATALOG_SHOWN = False _ENABLE_SEMANTIC = False _ENABLE_METADATA = False def _get_index_storage_and_corpus() -> tuple[ DuckDBStorage | None, str | None, str | None ]: if _INDEX_CONTEXT is None: return None, None, "Index context is not configured. Re-run with `--use-index`." storage = DuckDBStorage(_INDEX_CONTEXT.db_path) corpus_id = storage.get_corpus_id(_INDEX_CONTEXT.root_folder) if corpus_id is None: return ( None, None, f"No index found for folder {_INDEX_CONTEXT.root_folder}. " "Run `explore index ` first.", ) return storage, corpus_id, None def _clean_excerpt(text: str, max_chars: int = 320) -> str: squashed = re.sub(r"\s+", " ", text).strip() if len(squashed) <= max_chars: return squashed return f"{squashed[:max_chars]}..." def semantic_search(query: str, filters: str | None = None, limit: int = 5) -> str: """Search indexed chunks and return ranked excerpts.""" storage, corpus_id, error = _get_index_storage_and_corpus() if error: return error assert storage is not None and corpus_id is not None engine = IndexedQueryEngine(storage, embedding_provider=_EMBEDDING_PROVIDER) try: hits = engine.search( corpus_id=corpus_id, query=query, filters=filters, limit=limit, enable_semantic=_ENABLE_SEMANTIC, enable_metadata=_ENABLE_METADATA, ) except MetadataFilterParseError as exc: return f"Invalid metadata filter: {exc}\n{supported_filter_syntax()}" except ValueError as exc: return f"Metadata filter error: {exc}" if not hits: if filters: return f"No indexed matches found for query={query!r} with filters={filters!r}." 
return f"No indexed matches found for query: {query!r}" lines = [ "=== INDEXED SEARCH RESULTS ===", f"Query: {query}", ] if filters: lines.append(f"Filters: {filters}") lines.append("") for idx, hit in enumerate(hits, start=1): position = hit.position if hit.position is not None else "" lines.extend( [ f"[{idx}] doc_id: {hit.doc_id}", f" path: {hit.absolute_path}", f" match: {hit.matched_by}", f" chunk_position: {position}", f" semantic_score: {hit.semantic_score}", f" metadata_score: {hit.metadata_score}", f" score: {hit.score:.2f}", f" excerpt: {_clean_excerpt(hit.text)}", "", ] ) lines.append( "Use get_document(doc_id=...) to read full content for the most relevant documents." ) # Include a rich field catalog on the first search so the agent can # construct effective metadata filters. global _FIELD_CATALOG_SHOWN if not _FIELD_CATALOG_SHOWN: active_schema = storage.get_active_schema(corpus_id=corpus_id) if active_schema is not None: schema_fields = active_schema.schema_def.get("fields") if isinstance(schema_fields, list) and schema_fields: field_names = [ str(f["name"]) for f in schema_fields if isinstance(f, dict) and isinstance(f.get("name"), str) ] field_values = storage.get_metadata_field_values( corpus_id=corpus_id, field_names=field_names, ) field_descs: list[str] = [] for field in schema_fields: if not isinstance(field, dict) or not isinstance( field.get("name"), str ): continue name = field["name"] ftype = field.get("type", "string") desc = field.get("description", "") entry = f"{name} ({ftype})" if desc: entry += f": {desc}" vals = field_values.get(name, []) if ftype == "boolean": entry += " Values: true, false" elif ftype in {"integer", "number"} and vals: nums = [] for v in vals: try: nums.append(float(v)) except (TypeError, ValueError): pass if nums: entry += f" Range: {min(nums):.6g}-{max(nums):.6g}" elif vals: if "enum" in field: entry += f" Values: {field['enum']}" else: entry += f" Values: {', '.join(repr(v) for v in vals)}" elif "enum" in field: entry += f" Values: {field['enum']}" field_descs.append(entry) if field_descs: lines.append("") lines.append( "Available filter fields for semantic_search(filters=...):" ) for desc in field_descs: lines.append(f" - {desc}") _FIELD_CATALOG_SHOWN = True return "\n".join(lines) def get_document(doc_id: str) -> str: """Return full document content by id from the active index context.""" storage, _, error = _get_index_storage_and_corpus() if error: return error assert storage is not None document = storage.get_document(doc_id=doc_id) if document is None: return f"No indexed document found for doc_id={doc_id!r}" if document["is_deleted"]: return f"Document {doc_id} is marked as deleted in the index." return ( f"=== DOCUMENT {doc_id} ===\n" f"Path: {document['absolute_path']}\n\n" f"{document['content']}" ) def list_indexed_documents() -> str: """List indexed documents for the active corpus.""" storage, corpus_id, error = _get_index_storage_and_corpus() if error: return error assert storage is not None and corpus_id is not None documents = storage.list_documents(corpus_id=corpus_id, include_deleted=False) if not documents: return "No indexed documents found for the active corpus." lines = ["=== INDEXED DOCUMENTS ==="] for idx, document in enumerate(documents, start=1): lines.append( f"[{idx}] doc_id={document['id']} path={document['absolute_path']}" ) lines.append("") lines.append("Use semantic_search(...) 
to find relevant doc_ids.") return "\n".join(lines) TOOLS: dict[Tools, Callable[..., str]] = { "read": read_file, "grep": grep_file_content, "glob": glob_paths, "scan_folder": scan_folder, "preview_file": preview_file, "parse_file": parse_file, "semantic_search": semantic_search, "get_document": get_document, "list_indexed_documents": list_indexed_documents, } # ============================================================================= # System Prompt # ============================================================================= SYSTEM_PROMPT = """ You are FsExplorer, an AI agent that explores filesystems to answer user questions about documents. ## Available Tools | Tool | Purpose | Parameters | |------|---------|------------| | `scan_folder` | **PARALLEL SCAN** - Scan ALL documents in a folder at once | `directory` | | `preview_file` | Quick preview of a single document (~first page) | `file_path` | | `parse_file` | **DEEP READ** - Full content of a document | `file_path` | | `read` | Read a plain text file | `file_path` | | `grep` | Search for a pattern in a file | `file_path`, `pattern` | | `glob` | Find files matching a pattern | `directory`, `pattern` | | `semantic_search` | Search indexed chunks and metadata-filtered docs, then union/rank results | `query`, `filters`, `limit` | | `get_document` | Read full indexed document by document id | `doc_id` | | `list_indexed_documents` | List indexed documents for active corpus | none | ## Indexed Retrieval Strategy When indexed tools are available: 1. Start with `semantic_search` to quickly find relevant documents. 2. Use `get_document` for the top candidate doc IDs. 3. If indexed tools report index is unavailable, fall back to filesystem tools (`scan_folder`, `parse_file`, etc.). Filter syntax for `semantic_search(filters=...)`: - `field=value` - `field!=value` - `field>=number`, `field<=number`, `field>number`, `field The total purchase price is $125,000,000 [Source: 01_master_agreement.pdf, Section 2.1], > consisting of $80M cash [Source: 01_master_agreement.pdf, Section 2.1(a)], > $30M in stock [Source: 10_stock_purchase.pdf, Section 1], and > $15M in escrow [Source: 09_escrow_agreement.pdf, Section 2]. ### Citation Rules 1. **Every factual claim needs a citation** - dates, numbers, names, terms, etc. 2. **Be specific** - include section numbers, article numbers, or page references when available 3. **Use the actual filename** - not paraphrased names 4. **Multiple sources** - if information comes from multiple documents, cite all of them ### Final Answer Structure Your final answer should: 1. **Start with a direct answer** to the user's question 2. **Provide details** with inline citations 3. **End with a Sources section** listing all documents consulted: ``` ## Sources Consulted - 01_master_agreement.pdf - Main acquisition terms - 10_stock_purchase.pdf - Stock component details - 09_escrow_agreement.pdf - Escrow terms and release schedule ``` ## Example Workflow ``` User asks: "What is the purchase price?" 1. scan_folder("./documents/") Reason: "Scanned 10 documents. Categorizing: - RELEVANT: purchase_agreement.pdf (mentions 'Purchase Price' in preview) - RELEVANT: financial_terms.pdf (contains pricing tables) - MAYBE: exhibits.pdf (referenced by other docs) - SKIP: employee_handbook.pdf, hr_policies.pdf (unrelated to pricing)" 2. parse_file("purchase_agreement.pdf") Reason: "Found purchase price of $50M in Section 2.1. Document references 'Exhibit B for price adjustments' - need to check exhibits.pdf next." 3. 
parse_file("exhibits.pdf") [BACKTRACKING] Reason: "Backtracking to exhibits.pdf because purchase_agreement.pdf referenced it for adjustment details. Found working capital adjustment formula in Exhibit B." 4. STOP with final answer including citations: "The purchase price is $50,000,000 [Source: purchase_agreement.pdf, Section 2.1], subject to working capital adjustments [Source: exhibits.pdf, Exhibit B]..." ``` """ def _build_system_prompt(enable_semantic: bool, enable_metadata: bool) -> str: """Build a system prompt with retrieval-path guidance appended.""" if enable_semantic and enable_metadata: hint = ( "\n\n## Retrieval: Semantic + Metadata\n" "An index is available. Start with `semantic_search` using optional " "`filters` for best results, then use filesystem tools for deep dives." ) elif enable_semantic: hint = ( "\n\n## Retrieval: Semantic Only\n" "An index is available. Use `semantic_search` WITHOUT the `filters` " "parameter for similarity search, then use filesystem tools for details." ) elif enable_metadata: hint = ( "\n\n## Retrieval: Metadata Only\n" "An index is available. Use `semantic_search` with the `filters=` " "parameter for metadata filtering, then use filesystem tools for details." ) else: return SYSTEM_PROMPT return SYSTEM_PROMPT + hint # ============================================================================= # Agent Implementation # ============================================================================= class FsExplorerAgent: """ AI agent for exploring filesystems using Google Gemini. The agent maintains a conversation history with the LLM and uses structured JSON output to make decisions about which actions to take. Attributes: token_usage: Tracks API call statistics and costs. """ def __init__(self, api_key: str | None = None) -> None: """ Initialize the agent with Google API credentials. Args: api_key: Google API key. If not provided, reads from GOOGLE_API_KEY environment variable. Raises: ValueError: If no API key is available. """ if api_key is None: api_key = os.getenv("GOOGLE_API_KEY") if api_key is None: raise ValueError( "GOOGLE_API_KEY not found within the current environment: " "please export it or provide it to the class constructor." ) self._client = GenAIClient( api_key=api_key, http_options=HttpOptions(api_version="v1beta"), ) self._chat_history: list[Content] = [] self.token_usage = TokenUsage() def configure_task(self, task: str) -> None: """ Add a task message to the conversation history. Args: task: The task or context to add to the conversation. """ self._chat_history.append( Content(role="user", parts=[Part.from_text(text=task)]) ) async def take_action(self) -> tuple[Action, ActionType] | None: """ Request the next action from the AI model. Sends the current conversation history to Gemini and receives a structured JSON response indicating the next action to take. Returns: A tuple of (Action, ActionType) if successful, None otherwise. 
""" response = await self._client.aio.models.generate_content( model="gemini-3-flash-preview", contents=self._chat_history, # type: ignore config={ "system_instruction": _build_system_prompt(_ENABLE_SEMANTIC, _ENABLE_METADATA), "response_mime_type": "application/json", "response_schema": Action, }, ) # Track token usage from response metadata if response.usage_metadata: self.token_usage.add_api_call( prompt_tokens=response.usage_metadata.prompt_token_count or 0, completion_tokens=response.usage_metadata.candidates_token_count or 0, ) if response.candidates is not None: if response.candidates[0].content is not None: self._chat_history.append(response.candidates[0].content) if response.text is not None: action = Action.model_validate_json(response.text) if action.to_action_type() == "toolcall": toolcall = cast(ToolCallAction, action.action) self.call_tool( tool_name=toolcall.tool_name, tool_input=toolcall.to_fn_args(), ) return action, action.to_action_type() return None def call_tool(self, tool_name: Tools, tool_input: dict[str, Any]) -> None: """ Execute a tool and add the result to the conversation history. Args: tool_name: Name of the tool to execute. tool_input: Dictionary of arguments to pass to the tool. """ try: result = TOOLS[tool_name](**tool_input) except Exception as e: result = ( f"An error occurred while calling tool {tool_name} " f"with {tool_input}: {e}" ) # Track tool result sizes self.token_usage.add_tool_result(result, tool_name) self._chat_history.append( Content( role="user", parts=[ Part.from_text(text=f"Tool result for {tool_name}:\n\n{result}") ], ) ) def reset(self) -> None: """Reset the agent's conversation history and token tracking.""" self._chat_history.clear() self.token_usage = TokenUsage() ================================================ FILE: src/fs_explorer/embeddings.py ================================================ """ Embedding provider for vector-based semantic search. Wraps the Google GenAI embedding API for batch and single-query embedding with configurable model, dimensions, and batch size. """ from __future__ import annotations import os from typing import Any from google.genai import Client as GenAIClient _DEFAULT_MODEL = "gemini-embedding-001" _DEFAULT_DIM = 768 _DEFAULT_BATCH_SIZE = 50 class EmbeddingProvider: """Generate text embeddings via Google GenAI.""" def __init__( self, *, api_key: str | None = None, model: str | None = None, dim: int | None = None, batch_size: int | None = None, client: Any | None = None, ) -> None: self.model = model or os.getenv("FS_EXPLORER_EMBEDDING_MODEL", _DEFAULT_MODEL) self.dim = dim or int(os.getenv("FS_EXPLORER_EMBEDDING_DIM", str(_DEFAULT_DIM))) self.batch_size = batch_size or int( os.getenv("FS_EXPLORER_EMBEDDING_BATCH_SIZE", str(_DEFAULT_BATCH_SIZE)) ) if client is not None: self._client = client else: resolved_key = api_key or os.getenv("GOOGLE_API_KEY") if resolved_key is None: raise ValueError( "GOOGLE_API_KEY not found. " "Provide api_key or set the environment variable." ) self._client = GenAIClient(api_key=resolved_key) def embed_texts( self, texts: list[str], *, task_type: str = "RETRIEVAL_DOCUMENT", ) -> list[list[float]]: """Embed a list of texts in batches. Returns a list of embedding vectors in the same order as *texts*. 
""" all_embeddings: list[list[float]] = [] for start in range(0, len(texts), self.batch_size): batch = texts[start : start + self.batch_size] result = self._client.models.embed_content( model=self.model, contents=batch, config={ "task_type": task_type, "output_dimensionality": self.dim, }, ) for emb in result.embeddings: all_embeddings.append(list(emb.values)) return all_embeddings def embed_query(self, query: str) -> list[float]: """Embed a single query text for retrieval.""" result = self._client.models.embed_content( model=self.model, contents=[query], config={ "task_type": "RETRIEVAL_QUERY", "output_dimensionality": self.dim, }, ) return list(result.embeddings[0].values) ================================================ FILE: src/fs_explorer/exploration_trace.py ================================================ """ Helpers for recording exploration path and referenced files. """ from __future__ import annotations import os import re from dataclasses import dataclass, field from typing import Any FILE_TOOLS: frozenset[str] = frozenset({"read", "grep", "preview_file", "parse_file"}) # Matches citations like: [Source: filename.pdf, Section 2.1] SOURCE_CITATION_RE = re.compile(r"\[Source:\s*([^,\]]+)") def normalize_path(path: str, root_directory: str) -> str: """Return an absolute path using root_directory for relative inputs.""" if os.path.isabs(path): return os.path.abspath(path) return os.path.abspath(os.path.join(root_directory, path)) def extract_cited_sources(final_result: str | None) -> list[str]: """Extract source labels from final answer citations while preserving order.""" if not final_result: return [] seen: set[str] = set() ordered_sources: list[str] = [] for raw_source in SOURCE_CITATION_RE.findall(final_result): source = raw_source.strip() if source and source not in seen: seen.add(source) ordered_sources.append(source) return ordered_sources @dataclass class ExplorationTrace: """ Collects a step-by-step path and files referenced by tool calls. Paths are normalized to absolute paths to make replay/debugging easier. """ root_directory: str step_path: list[str] = field(default_factory=list) referenced_documents: set[str] = field(default_factory=set) def record_tool_call( self, *, step_number: int, tool_name: str, tool_input: dict[str, Any], resolved_document_path: str | None = None, ) -> None: """Record a tool call in the exploration path.""" path_entries: list[str] = [] directory = tool_input.get("directory") if isinstance(directory, str) and directory: path_entries.append(f"directory={normalize_path(directory, self.root_directory)}") file_path = tool_input.get("file_path") if isinstance(file_path, str) and file_path: normalized_file_path = normalize_path(file_path, self.root_directory) path_entries.append(f"file={normalized_file_path}") if tool_name in FILE_TOOLS: self.referenced_documents.add(normalized_file_path) if resolved_document_path: normalized_doc_path = normalize_path(resolved_document_path, self.root_directory) path_entries.append(f"document={normalized_doc_path}") self.referenced_documents.add(normalized_doc_path) parameters = ", ".join(path_entries) if path_entries else "no-path-args" self.step_path.append(f"{step_number}. tool:{tool_name} ({parameters})") def record_go_deeper(self, *, step_number: int, directory: str) -> None: """Record a directory navigation event in the exploration path.""" resolved_dir = normalize_path(directory, self.root_directory) self.step_path.append(f"{step_number}. 
godeeper (directory={resolved_dir})") def sorted_documents(self) -> list[str]: """Return a sorted list of referenced documents.""" return sorted(self.referenced_documents) ================================================ FILE: src/fs_explorer/fs.py ================================================ """ Filesystem utilities for the FsExplorer agent. This module provides functions for reading, searching, and parsing files in the filesystem, including support for complex document formats via Docling. """ import os import re import glob as glob_module from concurrent.futures import ThreadPoolExecutor, as_completed from pathlib import Path from docling.document_converter import DocumentConverter # ============================================================================= # Configuration Constants # ============================================================================= # Supported document extensions for parsing SUPPORTED_EXTENSIONS: frozenset[str] = frozenset({ ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".html", ".md" }) # Preview settings DEFAULT_PREVIEW_CHARS = 3000 # Characters for single file preview (~2-3 pages) DEFAULT_SCAN_PREVIEW_CHARS = 1500 # Characters for folder scan preview (~1 page) MAX_PREVIEW_LINES = 30 # Maximum lines to show in scan results # Parallel processing settings DEFAULT_MAX_WORKERS = 4 # Thread pool size for parallel document scanning # ============================================================================= # Document Cache # ============================================================================= # Cache for parsed documents to avoid re-parsing _DOCUMENT_CACHE: dict[str, str] = {} def clear_document_cache() -> None: """Clear the document cache. Useful for testing or memory management.""" _DOCUMENT_CACHE.clear() def _get_cached_or_parse(file_path: str) -> str: """ Get document content from cache or parse it. Uses file modification time in cache key to invalidate stale entries. Args: file_path: Path to the document file. Returns: The document content as markdown. Raises: Exception: If the document cannot be parsed. """ abs_path = os.path.abspath(file_path) cache_key = f"{abs_path}:{os.path.getmtime(abs_path)}" if cache_key not in _DOCUMENT_CACHE: converter = DocumentConverter() result = converter.convert(file_path) _DOCUMENT_CACHE[cache_key] = result.document.export_to_markdown() return _DOCUMENT_CACHE[cache_key] # ============================================================================= # Directory Operations # ============================================================================= def describe_dir_content(directory: str) -> str: """ Describe the contents of a directory. Lists all files and subdirectories in the given directory path. Args: directory: Path to the directory to describe. Returns: A formatted string describing the directory contents, or an error message if the directory doesn't exist. 
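Example return value (illustrative paths):

    Content of ./docs
    FILES:
    - ./docs/overview.md
    SUBFOLDERS:
    - ./docs/archive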
""" if not os.path.exists(directory) or not os.path.isdir(directory): return f"No such directory: {directory}" children = os.listdir(directory) if not children: return f"Directory {directory} is empty" files = [] directories = [] for child in children: fullpath = os.path.join(directory, child) if os.path.isfile(fullpath): files.append(fullpath) else: directories.append(fullpath) description = f"Content of {directory}\n" description += "FILES:\n- " + "\n- ".join(files) if not directories: description += "\nThis folder does not have any sub-folders" else: description += "\nSUBFOLDERS:\n- " + "\n- ".join(directories) return description # ============================================================================= # Basic File Operations # ============================================================================= def read_file(file_path: str) -> str: """ Read the contents of a text file. Args: file_path: Path to the file to read. Returns: The file contents, or an error message if the file doesn't exist. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" with open(file_path, "r") as f: return f.read() def grep_file_content(file_path: str, pattern: str) -> str: """ Search for a regex pattern in a file. Args: file_path: Path to the file to search. pattern: Regular expression pattern to search for. Returns: A formatted string with matches, "No matches found", or an error message if the file doesn't exist. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" with open(file_path, "r") as f: content = f.read() regex = re.compile(pattern=pattern, flags=re.MULTILINE) matches = regex.findall(content) if matches: return f"MATCHES for {pattern} in {file_path}:\n\n- " + "\n- ".join(matches) return "No matches found" def glob_paths(directory: str, pattern: str) -> str: """ Find files matching a glob pattern in a directory. Args: directory: Path to the directory to search in. pattern: Glob pattern to match (e.g., "*.txt", "**/*.pdf"). Returns: A formatted string with matching paths, "No matches found", or an error message if the directory doesn't exist. """ if not os.path.exists(directory) or not os.path.isdir(directory): return f"No such directory: {directory}" # Use pathlib for cleaner path handling search_path = Path(directory) / pattern matches = glob_module.glob(str(search_path)) if matches: return f"MATCHES for {pattern} in {directory}:\n\n- " + "\n- ".join(matches) return "No matches found" # ============================================================================= # Document Parsing Operations # ============================================================================= def preview_file(file_path: str, max_chars: int = DEFAULT_PREVIEW_CHARS) -> str: """ Get a quick preview of a document file. Reads only the first portion of the document content for initial relevance assessment before doing a full parse. Args: file_path: Path to the document file. max_chars: Maximum characters to return (default: 3000, ~2-3 pages). Returns: A preview of the document content, or an error message. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" ext = os.path.splitext(file_path)[1].lower() if ext not in SUPPORTED_EXTENSIONS: return ( f"Unsupported file extension: {ext}. 
" f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}" ) try: full_content = _get_cached_or_parse(file_path) preview = full_content[:max_chars] total_len = len(full_content) if total_len > max_chars: preview += ( f"\n\n[... PREVIEW TRUNCATED. Full document has {total_len:,} " f"characters. Use parse_file() to read the complete document ...]" ) return f"=== PREVIEW of {file_path} ===\n\n{preview}" except Exception as e: return f"Error previewing {file_path}: {e}" def parse_file(file_path: str) -> str: """ Parse and return the complete content of a document file. Use this after preview_file() confirms the document is relevant, or when you need to find cross-references to other documents. Supported formats: PDF, DOCX, DOC, PPTX, XLSX, HTML, MD. Args: file_path: Path to the document file. Returns: The complete document content as markdown, or an error message. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" ext = os.path.splitext(file_path)[1].lower() if ext not in SUPPORTED_EXTENSIONS: return ( f"Unsupported file extension: {ext}. " f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}" ) try: return _get_cached_or_parse(file_path) except Exception as e: return f"Error parsing {file_path}: {e}" # ============================================================================= # Parallel Document Scanning # ============================================================================= def _preview_single_file(file_path: str, preview_chars: int) -> dict: """ Helper to preview a single file for parallel processing. Args: file_path: Path to the document file. preview_chars: Number of characters to include in preview. Returns: A dictionary with file info and preview content. """ filename = os.path.basename(file_path) try: content = _get_cached_or_parse(file_path) preview = content[:preview_chars] return { "file": file_path, "filename": filename, "preview": preview, "total_chars": len(content), "status": "success" } except Exception as e: return { "file": file_path, "filename": filename, "preview": "", "total_chars": 0, "status": f"error: {e}" } def scan_folder( directory: str, max_workers: int = DEFAULT_MAX_WORKERS, preview_chars: int = DEFAULT_SCAN_PREVIEW_CHARS, ) -> str: """ Scan all documents in a folder in parallel and return quick previews. This is the FIRST step when exploring a folder with multiple documents. It efficiently processes all documents at once so you can assess relevance before doing deep dives into specific files. Args: directory: Path to the folder to scan. max_workers: Number of parallel workers (default: 4). preview_chars: Characters to preview per file (default: 1500, ~1 page). Returns: A formatted summary of all documents with their previews. """ if not os.path.exists(directory) or not os.path.isdir(directory): return f"No such directory: {directory}" # Find all supported document files doc_files = [] for item in os.listdir(directory): item_path = os.path.join(directory, item) if os.path.isfile(item_path): ext = os.path.splitext(item)[1].lower() if ext in SUPPORTED_EXTENSIONS: doc_files.append(item_path) if not doc_files: return ( f"No supported documents found in {directory}. 
" f"Supported extensions: {', '.join(sorted(SUPPORTED_EXTENSIONS))}" ) # Scan all documents in parallel results = [] with ThreadPoolExecutor(max_workers=max_workers) as executor: future_to_file = { executor.submit(_preview_single_file, f, preview_chars): f for f in doc_files } for future in as_completed(future_to_file): results.append(future.result()) # Sort by filename for consistent ordering results.sort(key=lambda x: x["filename"]) # Build the summary report output = [] output.append("═══════════════════════════════════════════════════════════════") output.append(f" PARALLEL DOCUMENT SCAN: {directory}") output.append(f" Found {len(results)} documents") output.append("═══════════════════════════════════════════════════════════════") output.append("") for i, result in enumerate(results, 1): output.append("┌─────────────────────────────────────────────────────────────") output.append(f"│ [{i}/{len(results)}] {result['filename']}") output.append(f"│ Path: {result['file']}") output.append(f"│ Status: {result['status']} | Total size: {result['total_chars']:,} chars") output.append("├─────────────────────────────────────────────────────────────") if result['status'] == 'success' and result['preview']: # Indent the preview content preview_lines = result['preview'].split('\n') for line in preview_lines[:MAX_PREVIEW_LINES]: output.append(f"│ {line}") if len(preview_lines) > MAX_PREVIEW_LINES: output.append("│ ... (preview truncated)") else: output.append("│ [No preview available]") output.append("└─────────────────────────────────────────────────────────────") output.append("") output.append("═══════════════════════════════════════════════════════════════") output.append(" NEXT STEPS:") output.append(" 1. Assess which documents are RELEVANT to the user's query") output.append(" 2. Use parse_file() for DEEP DIVE into relevant documents") output.append(" 3. Watch for cross-references to other docs (may need backtracking)") output.append("═══════════════════════════════════════════════════════════════") return "\n".join(output) ================================================ FILE: src/fs_explorer/index_config.py ================================================ """ Configuration helpers for local index storage. """ from __future__ import annotations import os from pathlib import Path DEFAULT_DB_PATH = "~/.fs_explorer/index.duckdb" ENV_DB_PATH = "FS_EXPLORER_DB_PATH" def resolve_db_path(override_path: str | None = None) -> str: """ Resolve the DuckDB path from CLI override, env var, or default. Precedence: 1) explicit override_path 2) FS_EXPLORER_DB_PATH 3) default path """ raw_path = override_path or os.getenv(ENV_DB_PATH) or DEFAULT_DB_PATH resolved = Path(raw_path).expanduser().resolve() resolved.parent.mkdir(parents=True, exist_ok=True) return str(resolved) ================================================ FILE: src/fs_explorer/indexing/__init__.py ================================================ """Indexing components for FsExplorer.""" from .chunker import SmartChunker, TextChunk from .pipeline import IndexingPipeline, IndexingResult from .schema import SchemaDiscovery __all__ = [ "SmartChunker", "TextChunk", "IndexingPipeline", "IndexingResult", "SchemaDiscovery", ] ================================================ FILE: src/fs_explorer/indexing/chunker.py ================================================ """ Chunking utilities for indexing document content. 
""" from __future__ import annotations from dataclasses import dataclass @dataclass(frozen=True) class TextChunk: """A content chunk with source offsets.""" text: str position: int start_char: int end_char: int class SmartChunker: """ Paragraph-aware chunker with overlap. This implementation is char-based to keep it deterministic and lightweight. """ def __init__(self, chunk_size: int = 1500, overlap: int = 150) -> None: if chunk_size <= 0: raise ValueError("chunk_size must be > 0") if overlap < 0: raise ValueError("overlap must be >= 0") if overlap >= chunk_size: raise ValueError("overlap must be smaller than chunk_size") self.chunk_size = chunk_size self.overlap = overlap def chunk_text(self, text: str) -> list[TextChunk]: """ Split text into chunks while preferring paragraph boundaries. """ normalized = text.strip() if not normalized: return [] chunks: list[TextChunk] = [] start = 0 position = 0 total = len(normalized) while start < total: tentative_end = min(start + self.chunk_size, total) end = tentative_end if tentative_end < total: boundary = normalized.rfind("\n\n", start + (self.chunk_size // 2), tentative_end) if boundary != -1: end = boundary + 2 chunk_text = normalized[start:end].strip() if chunk_text: chunks.append( TextChunk( text=chunk_text, position=position, start_char=start, end_char=end, ) ) position += 1 if end >= total: break start = max(0, end - self.overlap) return chunks ================================================ FILE: src/fs_explorer/indexing/metadata.py ================================================ """ Metadata extraction helpers for indexed documents. """ from __future__ import annotations import copy import json import os import re from collections import defaultdict from pathlib import Path from typing import Any _CURRENCY_RE = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?") _DATE_RE = re.compile( r"\b(?:\d{4}-\d{2}-\d{2}|" r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)[a-z]*\s+\d{1,2},\s+\d{4})\b", flags=re.IGNORECASE, ) _DOC_TYPE_TOKEN_RE = re.compile(r"[a-z0-9]+") _DOC_TYPE_STOPWORDS: set[str] = { "the", "and", "for", "with", "from", "copy", "draft", "final", "version", "v1", "v2", "v3", "new", "old", "tmp", "temp", } _LANGEXTRACT_PROMPT_DESCRIPTION = ( "Extract key transaction metadata from legal and deal documents. " "Use extraction classes: organization, person, money, date, deal_term. " "Use exact spans from the source text and avoid paraphrasing." 
) _VALID_METADATA_FIELD_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$") _VALID_FIELD_TYPES: set[str] = {"string", "integer", "number", "boolean"} _VALID_RUNTIME_FIELDS: set[str] = {"enabled", "extraction_count", "entity_classes"} _FIELD_MODE_ALIASES: dict[str, str] = { "csv": "values", "list": "values", "joined": "values", "join": "values", "values": "values", "count": "count", "exists": "exists", "contains": "contains", "contains_any": "contains", } _DEFAULT_LANGEXTRACT_PROFILE: dict[str, Any] = { "name": "default_langextract", "description": "Default metadata extraction profile for legal and deal-style documents.", "prompt_description": _LANGEXTRACT_PROMPT_DESCRIPTION, "fields": [ { "name": "lx_enabled", "type": "boolean", "required": False, "description": "Whether langextract metadata extraction succeeded.", "source": "runtime", "runtime": "enabled", }, { "name": "lx_extraction_count", "type": "integer", "required": False, "description": "Number of langextract entities extracted from the document.", "source": "runtime", "runtime": "extraction_count", }, { "name": "lx_entity_classes", "type": "string", "required": False, "description": "Comma-separated extraction classes returned by langextract.", "source": "runtime", "runtime": "entity_classes", }, { "name": "lx_organizations", "type": "string", "required": False, "description": "Comma-separated organization names extracted by langextract.", "source": "entities", "source_classes": ["organization", "company", "party"], "mode": "values", }, { "name": "lx_people", "type": "string", "required": False, "description": "Comma-separated person names extracted by langextract.", "source": "entities", "source_classes": ["person", "individual", "executive"], "mode": "values", }, { "name": "lx_deal_terms", "type": "string", "required": False, "description": "Comma-separated deal terms extracted by langextract.", "source": "entities", "source_classes": ["deal_term", "term", "provision"], "mode": "values", }, { "name": "lx_money_mentions", "type": "integer", "required": False, "description": "Count of monetary amount entities from langextract.", "source": "entities", "source_classes": ["money", "amount", "currency"], "mode": "count", }, { "name": "lx_date_mentions", "type": "integer", "required": False, "description": "Count of date entities from langextract.", "source": "entities", "source_classes": ["date"], "mode": "count", }, { "name": "lx_has_earnout", "type": "boolean", "required": False, "description": "Whether extracted deal terms indicate an earnout.", "source": "entities", "source_classes": ["deal_term", "term", "provision"], "mode": "contains", "contains_any": ["earnout"], }, { "name": "lx_has_escrow", "type": "boolean", "required": False, "description": "Whether extracted deal terms indicate escrow.", "source": "entities", "source_classes": ["deal_term", "term", "provision"], "mode": "contains", "contains_any": ["escrow"], }, ], } _AUTO_PROFILE_PROMPT_TEMPLATE = ( "You are a metadata schema designer. 
Analyze the document samples below and generate " "a langextract metadata extraction profile tailored to this corpus.\n\n" "Return a JSON object with these keys:\n" '- "name": a short descriptive profile name (string)\n' '- "description": one-sentence description of the profile (string)\n' '- "prompt_description": instruction text for the extraction model (string)\n' '- "fields": array of field definitions\n\n' "Each field object must have:\n" '- "name": valid identifier starting with "lx_" (letters, digits, underscores)\n' '- "type": one of "string", "integer", "number", "boolean"\n' '- "description": what this field captures\n' '- "source": "entities"\n' '- "source_classes": array of entity class names to aggregate (e.g. ["organization", "company"])\n' '- "mode": one of "values" (comma-joined text), "count" (integer count), "exists" (boolean), ' '"contains" (boolean, requires "contains_any")\n' '- "contains_any": (only when mode is "contains") array of lowercase terms to match\n\n' "Valid entity source classes include (but are not limited to): organization, company, party, " "person, individual, executive, money, amount, currency, date, deal_term, term, provision, " "location, product, technology, regulation, clause, obligation.\n\n" "### Example profile for legal/M&A documents\n" "```json\n" '{"name": "legal_ma", "description": "Metadata extraction for legal and M&A deal documents.", ' '"prompt_description": "Extract key transaction metadata from legal and deal documents.", ' '"fields": [' '{"name": "lx_organizations", "type": "string", "description": "Organization names.", ' '"source": "entities", "source_classes": ["organization", "company", "party"], "mode": "values"}, ' '{"name": "lx_money_mentions", "type": "integer", "description": "Count of monetary amounts.", ' '"source": "entities", "source_classes": ["money", "amount"], "mode": "count"}, ' '{"name": "lx_has_escrow", "type": "boolean", "description": "Whether escrow terms are present.", ' '"source": "entities", "source_classes": ["deal_term", "provision"], "mode": "contains", ' '"contains_any": ["escrow"]}' "]}\n" "```\n\n" "### Example profile for technical/research documents\n" "```json\n" '{"name": "tech_research", "description": "Metadata extraction for technical and research documents.", ' '"prompt_description": "Extract key entities from technical and research documents.", ' '"fields": [' '{"name": "lx_technologies", "type": "string", "description": "Technology names.", ' '"source": "entities", "source_classes": ["technology", "product"], "mode": "values"}, ' '{"name": "lx_people", "type": "string", "description": "Person names.", ' '"source": "entities", "source_classes": ["person", "individual"], "mode": "values"}, ' '{"name": "lx_org_count", "type": "integer", "description": "Number of organizations mentioned.", ' '"source": "entities", "source_classes": ["organization", "company"], "mode": "count"}' "]}\n" "```\n\n" "### Document samples from the corpus\n\n" "SAMPLES_PLACEHOLDER\n\n" "Generate a profile with 4-8 entity fields (do NOT include runtime fields). " "Return ONLY the JSON object, no markdown fencing." ) def _get_genai_client(api_key: str) -> Any: """Instantiate a Google GenAI client. Separated for test patching.""" from google.genai import Client as _GenAIClient return _GenAIClient(api_key=api_key) def auto_discover_profile( folder: str, *, sample_count: int = 3, model_id: str | None = None, ) -> dict[str, Any]: """Use an LLM to generate a langextract profile tailored to the corpus. 
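When generation succeeds, the returned dict has the same shape accepted by normalize_langextract_profile, for example (illustrative; requires GOOGLE_API_KEY and a folder of supported documents):

>>> profile = auto_discover_profile("./documents", sample_count=3)
>>> [field["name"] for field in profile["fields"]]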
Falls back to the default hardcoded profile on any failure. """ from .schema import _iter_supported_files files = _iter_supported_files(folder) if not files: return default_langextract_profile() # Sample files evenly n = min(sample_count, len(files)) step = max(1, len(files) // n) sampled = [files[i * step] for i in range(n)] # Parse and truncate from ..fs import parse_file snippets: list[str] = [] for file_path in sampled: try: text = parse_file(file_path) snippets.append( f"--- {Path(file_path).name} ---\n{text[:2000]}" ) except Exception: continue if not snippets: return default_langextract_profile() api_key = os.getenv("GOOGLE_API_KEY") if not api_key: return default_langextract_profile() effective_model = model_id or os.getenv( "FS_EXPLORER_PROFILE_MODEL", "gemini-2.0-flash" ) try: client = _get_genai_client(api_key=api_key) prompt = _AUTO_PROFILE_PROMPT_TEMPLATE.replace( "SAMPLES_PLACEHOLDER", "\n\n".join(snippets) ) response = client.models.generate_content( model=effective_model, contents=prompt, ) raw_text = (response.text or "").strip() # Strip markdown fencing if present if raw_text.startswith("```"): raw_text = re.sub(r"^```[a-z]*\n?", "", raw_text) raw_text = re.sub(r"\n?```$", "", raw_text).strip() profile = json.loads(raw_text) # Add runtime fields that are always present runtime_fields = [ f for f in _DEFAULT_LANGEXTRACT_PROFILE["fields"] if f.get("source") == "runtime" ] existing_names = { str(f.get("name")) for f in profile.get("fields", []) if isinstance(f, dict) } for rf in runtime_fields: if rf["name"] not in existing_names: profile.setdefault("fields", []).insert(0, copy.deepcopy(rf)) return normalize_langextract_profile(profile) except Exception: return default_langextract_profile() def infer_document_type(file_path: str) -> str: """Infer a generic document type from filename tokens.""" stem = Path(file_path).stem.lower() tokens = [token for token in _DOC_TYPE_TOKEN_RE.findall(stem) if token] filtered = [ token for token in tokens if not token.isdigit() and len(token) > 2 and token not in _DOC_TYPE_STOPWORDS ] if filtered: return filtered[-1] if tokens: return tokens[-1] return "document" def default_langextract_profile() -> dict[str, Any]: """Return a mutable copy of the built-in metadata profile.""" return copy.deepcopy(_DEFAULT_LANGEXTRACT_PROFILE) def normalize_langextract_profile(profile: dict[str, Any] | None) -> dict[str, Any]: """ Validate and normalize user-provided langextract profile configuration. Expected shape: - prompt_description: str (optional) - max_chars: int (optional) - fields: list[{ name: str, type: string|integer|number|boolean, description: str (optional), required: bool (optional), source: runtime|entities (default entities), runtime: enabled|extraction_count|entity_classes (runtime source only), source_class: str (entities source), source_classes: list[str] (entities source), mode: values|count|exists|contains (entities source), contains_any: list[str] (contains mode), }] """ raw = default_langextract_profile() if profile is None else copy.deepcopy(profile) if not isinstance(raw, dict): raise ValueError("Metadata profile must be a JSON object.") prompt = raw.get("prompt_description") if prompt is None: prompt_description = _LANGEXTRACT_PROMPT_DESCRIPTION elif isinstance(prompt, str) and prompt.strip(): prompt_description = prompt.strip() else: raise ValueError( "Metadata profile field 'prompt_description' must be a non-empty string." 
) max_chars: int | None = None if "max_chars" in raw: max_chars = _safe_positive_int( raw.get("max_chars"), minimum=500, field_name="max_chars", ) raw_fields = raw.get("fields") if not isinstance(raw_fields, list) or not raw_fields: raise ValueError("Metadata profile must include a non-empty 'fields' array.") normalized_fields: list[dict[str, Any]] = [] seen_names: set[str] = set() for idx, raw_field in enumerate(raw_fields): if not isinstance(raw_field, dict): raise ValueError(f"Metadata field at index {idx} must be an object.") name_obj = raw_field.get("name") if not isinstance(name_obj, str) or not name_obj.strip(): raise ValueError( f"Metadata field at index {idx} is missing a valid 'name'." ) name = name_obj.strip() if not _VALID_METADATA_FIELD_NAME_RE.match(name): raise ValueError( f"Invalid metadata field name '{name}'. " "Use letters, numbers, and underscores." ) if name in seen_names: raise ValueError(f"Duplicate metadata field name '{name}'.") seen_names.add(name) field_type = str(raw_field.get("type", "string")).strip().lower() if field_type not in _VALID_FIELD_TYPES: allowed_types = ", ".join(sorted(_VALID_FIELD_TYPES)) raise ValueError( f"Metadata field '{name}' has invalid type '{field_type}'. " f"Allowed types: {allowed_types}." ) description_obj = raw_field.get("description") description = ( description_obj.strip() if isinstance(description_obj, str) and description_obj.strip() else f"Metadata field '{name}'." ) required = bool(raw_field.get("required", False)) source = str(raw_field.get("source", "entities")).strip().lower() if source not in {"runtime", "entities"}: raise ValueError( f"Metadata field '{name}' has invalid source '{source}'. " "Use 'runtime' or 'entities'." ) normalized: dict[str, Any] = { "name": name, "type": field_type, "required": required, "description": description, "source": source, } if source == "runtime": runtime = str(raw_field.get("runtime", "")).strip().lower() if runtime not in _VALID_RUNTIME_FIELDS: allowed_runtime = ", ".join(sorted(_VALID_RUNTIME_FIELDS)) raise ValueError( f"Metadata field '{name}' has invalid runtime source '{runtime}'. " f"Allowed runtime values: {allowed_runtime}." ) normalized["runtime"] = runtime normalized["mode"] = "runtime" normalized["source_classes"] = [] normalized["contains_any"] = [] normalized_fields.append(normalized) continue source_classes = _normalize_source_classes(raw_field) if not source_classes: raise ValueError( f"Metadata field '{name}' requires 'source_class' or " "'source_classes' for entity extraction." 
) requested_mode = raw_field.get("mode") mode = _normalize_field_mode(requested_mode, field_type=field_type) contains_any = _normalize_contains_any( raw_field.get("contains_any"), mode=mode, field_name=name, ) normalized["source_classes"] = source_classes normalized["mode"] = mode normalized["contains_any"] = contains_any normalized_fields.append(normalized) normalized_profile: dict[str, Any] = { "name": str(raw.get("name", "langextract_profile")), "description": str( raw.get("description", "User-defined langextract metadata profile.") ), "prompt_description": prompt_description, "fields": normalized_fields, } if max_chars is not None: normalized_profile["max_chars"] = max_chars return normalized_profile def langextract_schema_fields( profile: dict[str, Any] | None = None, ) -> list[dict[str, Any]]: """Return schema field definitions for langextract metadata.""" normalized = normalize_langextract_profile(profile) fields: list[dict[str, Any]] = [] for field in normalized["fields"]: fields.append( { "name": field["name"], "type": field["type"], "required": bool(field.get("required", False)), "description": str(field.get("description", "")), } ) return fields def langextract_field_names(profile: dict[str, Any] | None = None) -> set[str]: """Return field names used by langextract metadata extraction.""" return {field["name"] for field in langextract_schema_fields(profile)} def ensure_langextract_schema_fields( schema_def: dict[str, Any], profile: dict[str, Any] | None = None, ) -> tuple[dict[str, Any], bool]: """Ensure schema contains langextract field definitions.""" normalized_profile = normalize_langextract_profile( profile if profile is not None else _schema_profile_if_present(schema_def) ) required_fields = langextract_schema_fields(normalized_profile) fields_obj = schema_def.get("fields") fields: list[dict[str, Any]] if isinstance(fields_obj, list): fields = [dict(field) for field in fields_obj if isinstance(field, dict)] else: fields = [] existing_names = { str(field.get("name")) for field in fields if isinstance(field.get("name"), str) } updated = list(fields) changed = False for field in required_fields: if field["name"] in existing_names: continue updated.append(dict(field)) changed = True merged = dict(schema_def) if changed: merged["fields"] = updated existing_profile = _schema_profile_if_present(schema_def) if profile is not None or existing_profile is not None: if existing_profile != normalized_profile: merged["metadata_profile"] = normalized_profile changed = True elif "metadata_profile" in schema_def: merged["metadata_profile"] = existing_profile return merged, changed def extract_metadata( *, file_path: str, root_path: str, content: str, schema_def: dict[str, Any] | None = None, with_langextract: bool = False, langextract_model_id: str | None = None, langextract_profile: dict[str, Any] | None = None, ) -> dict[str, Any]: """ Build metadata used for filtering and schema-aware indexing. If a schema is provided with a `fields` list, only those keys are emitted. 
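Example (illustrative; the file must exist on disk because size and mtime come from os.stat, and parsed_markdown stands in for the parsed document text):

>>> meta = extract_metadata(
...     file_path="/corpus/01_master_agreement.pdf",
...     root_path="/corpus",
...     content=parsed_markdown,
... )
>>> meta["document_type"], meta["mentions_currency"]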
""" absolute_path = str(Path(file_path).resolve()) relative_path = os.path.relpath(absolute_path, str(Path(root_path).resolve())) extension = Path(file_path).suffix.lower() stat = os.stat(file_path) metadata: dict[str, Any] = { "filename": Path(file_path).name, "relative_path": relative_path, "extension": extension, "document_type": infer_document_type(file_path), "file_size_bytes": int(stat.st_size), "file_mtime": float(stat.st_mtime), "mentions_currency": bool(_CURRENCY_RE.search(content)), "mentions_dates": bool(_DATE_RE.search(content)), } if with_langextract: resolved_profile = _resolve_langextract_profile( schema_def=schema_def, profile_override=langextract_profile, ) metadata.update( _extract_langextract_metadata( content=content, model_id=langextract_model_id, profile=resolved_profile, ) ) if not schema_def: return metadata fields = schema_def.get("fields") if not isinstance(fields, list): return metadata allowed: set[str] = set() for field in fields: if isinstance(field, dict): name = field.get("name") if isinstance(name, str): allowed.add(name) if not allowed: return metadata return {k: v for k, v in metadata.items() if k in allowed} def _extract_langextract_metadata( *, content: str, model_id: str | None = None, profile: dict[str, Any] | None = None, ) -> dict[str, Any]: normalized_profile = normalize_langextract_profile(profile) defaults = _profile_defaults(normalized_profile) api_key = ( os.getenv("LANGEXTRACT_API_KEY") or os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY") ) if not api_key: return defaults try: import langextract as lx # type: ignore[import-not-found] except Exception: return defaults profile_max_chars_obj = normalized_profile.get("max_chars") profile_max_chars = ( _safe_positive_int( profile_max_chars_obj, minimum=500, field_name="max_chars", ) if profile_max_chars_obj is not None else None ) max_chars = profile_max_chars or _safe_int_env( "FS_EXPLORER_LANGEXTRACT_MAX_CHARS", default=6000, minimum=500, ) snippet = content[:max_chars] if not snippet.strip(): return defaults effective_model_id = model_id or os.getenv( "FS_EXPLORER_LANGEXTRACT_MODEL", "gemini-3-flash-preview", ) try: result = lx.extract( text_or_documents=snippet, prompt_description=str(normalized_profile["prompt_description"]), examples=_langextract_examples(lx), model_id=effective_model_id, api_key=api_key, max_char_buffer=min(1200, max_chars), show_progress=False, prompt_validation_level=lx.prompt_validation.PromptValidationLevel.OFF, ) except Exception: return defaults extractions = list(result.extractions or []) return _aggregate_profile_metadata( normalized_profile=normalized_profile, extractions=extractions, enabled=True, ) def _schema_profile_if_present(schema_def: dict[str, Any] | None) -> dict[str, Any] | None: if not schema_def: return None metadata_profile = schema_def.get("metadata_profile") if isinstance(metadata_profile, dict): return metadata_profile return None def _resolve_langextract_profile( *, schema_def: dict[str, Any] | None, profile_override: dict[str, Any] | None, ) -> dict[str, Any] | None: if profile_override is not None: return profile_override return _schema_profile_if_present(schema_def) def _normalize_source_classes(raw_field: dict[str, Any]) -> list[str]: classes: list[str] = [] single = raw_field.get("source_class") if isinstance(single, str) and single.strip(): classes.append(single.strip().lower()) multi = raw_field.get("source_classes") if isinstance(multi, list): for item in multi: if isinstance(item, str) and item.strip(): 
classes.append(item.strip().lower()) seen: set[str] = set() deduped: list[str] = [] for class_name in classes: if class_name in seen: continue seen.add(class_name) deduped.append(class_name) return deduped def _normalize_field_mode(mode_obj: Any, *, field_type: str) -> str: if isinstance(mode_obj, str) and mode_obj.strip(): requested = mode_obj.strip().lower() normalized = _FIELD_MODE_ALIASES.get(requested) if normalized is None: allowed = ", ".join(sorted(set(_FIELD_MODE_ALIASES.values()))) raise ValueError( f"Unsupported metadata field mode '{requested}'. " f"Allowed modes: {allowed}." ) return normalized if field_type == "boolean": return "exists" if field_type in {"integer", "number"}: return "count" return "values" def _normalize_contains_any( contains_obj: Any, *, mode: str, field_name: str, ) -> list[str]: if mode != "contains": return [] if not isinstance(contains_obj, list) or not contains_obj: raise ValueError( f"Metadata field '{field_name}' with mode 'contains' " "requires 'contains_any' list." ) terms: list[str] = [] for term in contains_obj: if isinstance(term, str) and term.strip(): terms.append(term.strip().lower()) if not terms: raise ValueError( f"Metadata field '{field_name}' with mode 'contains' " "has no valid 'contains_any' terms." ) return terms def _profile_defaults(profile: dict[str, Any]) -> dict[str, Any]: defaults: dict[str, Any] = {} for field in profile["fields"]: defaults[field["name"]] = _default_field_value(field) return defaults def _default_field_value(field: dict[str, Any]) -> Any: source = str(field.get("source", "entities")) runtime = str(field.get("runtime", "")) if source == "runtime": if runtime == "enabled": return False if runtime == "extraction_count": return 0 if runtime == "entity_classes": return "" field_type = str(field.get("type", "string")) if field_type == "boolean": return False if field_type == "integer": return 0 if field_type == "number": return 0.0 return "" def _aggregate_profile_metadata( *, normalized_profile: dict[str, Any], extractions: list[Any], enabled: bool, ) -> dict[str, Any]: classes: set[str] = set() by_class: dict[str, list[str]] = defaultdict(list) for extraction in extractions: extraction_class = str(getattr(extraction, "extraction_class", "")).strip().lower() extraction_text = str(getattr(extraction, "extraction_text", "")).strip() if not extraction_class: continue classes.add(extraction_class) if extraction_text: by_class[extraction_class].append(extraction_text) metadata: dict[str, Any] = {} for field in normalized_profile["fields"]: name = str(field["name"]) source = str(field["source"]) if source == "runtime": value = _runtime_field_value( field=field, enabled=enabled, extraction_count=len(extractions), classes=classes, ) metadata[name] = _coerce_field_value( value=value, field_type=str(field["type"]), ) continue matched_values: list[str] = [] for extraction_class in field["source_classes"]: matched_values.extend(by_class.get(extraction_class, [])) value = _entity_field_value(field=field, matched_values=matched_values) metadata[name] = _coerce_field_value(value=value, field_type=str(field["type"])) defaults = _profile_defaults(normalized_profile) for key, default_value in defaults.items(): metadata.setdefault(key, default_value) return metadata def _runtime_field_value( *, field: dict[str, Any], enabled: bool, extraction_count: int, classes: set[str], ) -> Any: runtime = str(field.get("runtime", "")) if runtime == "enabled": return enabled if runtime == "extraction_count": return extraction_count if runtime == 
"entity_classes": return ", ".join(sorted(classes)) return _default_field_value(field) def _entity_field_value(*, field: dict[str, Any], matched_values: list[str]) -> Any: mode = str(field.get("mode", "values")) if mode == "count": return len(matched_values) if mode == "exists": return bool(matched_values) if mode == "contains": terms = [str(term).lower() for term in field.get("contains_any", [])] lowered_values = [value.lower() for value in matched_values] return any(term in value for term in terms for value in lowered_values) deduped = _dedupe_preserve_order(matched_values) return ", ".join(deduped) def _coerce_field_value(*, value: Any, field_type: str) -> Any: if field_type == "boolean": return bool(value) if field_type == "integer": if isinstance(value, bool): return int(value) try: return int(value) except (TypeError, ValueError): return 0 if field_type == "number": if isinstance(value, bool): return float(int(value)) try: return float(value) except (TypeError, ValueError): return 0.0 if value is None: return "" return str(value) def _langextract_examples(lx: Any) -> list[Any]: return [ lx.data.ExampleData( text=( "TechCorp Industries will pay $45,000,000 in cash consideration, " "with a $1,500,000 escrow reserve and a $5,000,000 earnout to " "acquire StartupXYZ LLC. CTO Dr. Sarah Chen signed on January 15, 2025." ), extractions=[ lx.data.Extraction( extraction_class="organization", extraction_text="TechCorp Industries", ), lx.data.Extraction( extraction_class="organization", extraction_text="StartupXYZ LLC", ), lx.data.Extraction( extraction_class="money", extraction_text="$45,000,000", ), lx.data.Extraction( extraction_class="money", extraction_text="$1,500,000", ), lx.data.Extraction( extraction_class="money", extraction_text="$5,000,000", ), lx.data.Extraction( extraction_class="deal_term", extraction_text="cash consideration", ), lx.data.Extraction( extraction_class="deal_term", extraction_text="escrow reserve", ), lx.data.Extraction( extraction_class="deal_term", extraction_text="earnout", ), lx.data.Extraction( extraction_class="person", extraction_text="Dr. Sarah Chen", ), lx.data.Extraction( extraction_class="date", extraction_text="January 15, 2025", ), ], ) ] def _dedupe_preserve_order(values: list[str], *, max_items: int = 16) -> list[str]: seen: set[str] = set() deduped: list[str] = [] for value in values: key = value.strip() if not key: continue lower = key.lower() if lower in seen: continue seen.add(lower) deduped.append(key) if len(deduped) >= max_items: break return deduped def _safe_positive_int(value: Any, *, minimum: int, field_name: str) -> int: try: integer = int(value) except (TypeError, ValueError) as exc: raise ValueError( f"Metadata profile field '{field_name}' must be an integer." ) from exc if integer < minimum: raise ValueError( f"Metadata profile field '{field_name}' must be >= {minimum}." ) return integer def _safe_int_env(name: str, *, default: int, minimum: int) -> int: raw = os.getenv(name) if raw is None: return default try: value = int(raw) except ValueError: return default return value if value >= minimum else minimum ================================================ FILE: src/fs_explorer/indexing/pipeline.py ================================================ """ Indexing pipeline orchestration. 
""" from __future__ import annotations import hashlib import json import os from concurrent.futures import ThreadPoolExecutor from dataclasses import dataclass from pathlib import Path from typing import Any from .chunker import SmartChunker from .metadata import ( ensure_langextract_schema_fields, extract_metadata, langextract_field_names, ) from .schema import SchemaDiscovery from ..embeddings import EmbeddingProvider from ..fs import SUPPORTED_EXTENSIONS, parse_file from ..storage import ChunkRecord, DocumentRecord, DuckDBStorage, StorageBackend _PARSE_ERROR_PREFIXES: tuple[str, ...] = ( "Error parsing ", "Unsupported file extension", "No such file:", ) @dataclass(frozen=True) class IndexingResult: """Summary output for an indexing run.""" corpus_id: str indexed_files: int skipped_files: int deleted_files: int chunks_written: int active_documents: int schema_used: str | None embeddings_written: int = 0 class IndexingPipeline: """Build and update corpus indexes from filesystem documents.""" def __init__( self, storage: StorageBackend, chunker: SmartChunker | None = None, embedding_provider: EmbeddingProvider | None = None, max_workers: int = 4, ) -> None: self.storage = storage self.chunker = chunker or SmartChunker() self.embedding_provider = embedding_provider self._max_workers = max_workers def index_folder( self, folder: str, *, discover_schema: bool = False, schema_name: str | None = None, with_metadata: bool = False, metadata_profile: dict[str, Any] | None = None, ) -> IndexingResult: root = str(Path(folder).resolve()) if not os.path.exists(root) or not os.path.isdir(root): raise ValueError(f"No such directory: {root}") effective_with_metadata = with_metadata or metadata_profile is not None corpus_id = self.storage.get_or_create_corpus(root) schema_def, selected_schema_name = self._resolve_schema( corpus_id=corpus_id, root=root, discover_schema=discover_schema, schema_name=schema_name, with_metadata=effective_with_metadata, metadata_profile=metadata_profile, ) effective_profile = metadata_profile or self._schema_metadata_profile( schema_def ) # Pass 1: Parse all documents parsed_docs: list[tuple[str, str, str]] = [] # (file_path, relative_path, content) skipped_files = 0 active_paths: set[str] = set() for file_path in self._iter_supported_files(root): relative_path = os.path.relpath(file_path, root) active_paths.add(relative_path) content = parse_file(file_path) if self._is_parse_error(content): skipped_files += 1 continue parsed_docs.append((file_path, relative_path, content)) # Parallel metadata extraction across documents metadata_map = self._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=schema_def, with_langextract=effective_with_metadata, langextract_profile=effective_profile, ) # Pass 2: Chunk + upsert (sequential, DB writes) indexed_files = 0 chunks_written = 0 all_chunk_records: list[ChunkRecord] = [] for file_path, relative_path, content in parsed_docs: chunks = self.chunker.chunk_text(content) metadata = metadata_map[relative_path] metadata_json = json.dumps(metadata, sort_keys=True) stat = os.stat(file_path) doc_id = DuckDBStorage.make_document_id(corpus_id, relative_path) doc_record = DocumentRecord( id=doc_id, corpus_id=corpus_id, relative_path=relative_path, absolute_path=str(Path(file_path).resolve()), content=content, metadata_json=metadata_json, file_mtime=float(stat.st_mtime), file_size=int(stat.st_size), content_sha256=self._sha256(content), ) chunk_records: list[ChunkRecord] = [] for chunk in chunks: chunk_records.append( 
ChunkRecord( id=DuckDBStorage.make_chunk_id( doc_id, chunk.position, chunk.start_char, chunk.end_char, ), doc_id=doc_id, text=chunk.text, position=chunk.position, start_char=chunk.start_char, end_char=chunk.end_char, ) ) self.storage.upsert_document(doc_record, chunk_records) all_chunk_records.extend(chunk_records) indexed_files += 1 chunks_written += len(chunk_records) deleted_files = self.storage.mark_deleted_missing_documents( corpus_id=corpus_id, active_relative_paths=active_paths, ) active_documents = len( self.storage.list_documents(corpus_id=corpus_id, include_deleted=False) ) embeddings_written = self._generate_and_store_embeddings( corpus_id=corpus_id, all_chunk_records=all_chunk_records, ) return IndexingResult( corpus_id=corpus_id, indexed_files=indexed_files, skipped_files=skipped_files, deleted_files=deleted_files, chunks_written=chunks_written, active_documents=active_documents, schema_used=selected_schema_name, embeddings_written=embeddings_written, ) def _extract_metadata_batch( self, *, parsed_docs: list[tuple[str, str, str]], root_path: str, schema_def: dict[str, Any] | None, with_langextract: bool, langextract_profile: dict[str, Any] | None, ) -> dict[str, dict[str, Any]]: """Extract metadata for all documents in parallel using a thread pool.""" def _extract_one(item: tuple[str, str, str]) -> tuple[str, dict[str, Any]]: file_path, relative_path, content = item metadata = extract_metadata( file_path=file_path, root_path=root_path, content=content, schema_def=schema_def, with_langextract=with_langextract, langextract_profile=langextract_profile, ) return relative_path, metadata result: dict[str, dict[str, Any]] = {} if not parsed_docs: return result with ThreadPoolExecutor(max_workers=self._max_workers) as executor: for relative_path, metadata in executor.map(_extract_one, parsed_docs): result[relative_path] = metadata return result def _resolve_schema( self, *, corpus_id: str, root: str, discover_schema: bool, schema_name: str | None, with_metadata: bool, metadata_profile: dict[str, Any] | None, ) -> tuple[dict[str, Any] | None, str | None]: if discover_schema: schema_def = SchemaDiscovery().discover_from_folder( root, with_langextract=with_metadata, metadata_profile=metadata_profile, ) discovered_name = str(schema_def.get("name", f"auto_{Path(root).name}")) self.storage.save_schema( corpus_id=corpus_id, name=discovered_name, schema_def=schema_def, is_active=True, ) return schema_def, discovered_name if schema_name: schema = self.storage.get_schema_by_name( corpus_id=corpus_id, name=schema_name ) if schema is None: raise ValueError(f"Schema '{schema_name}' not found for corpus {root}") if with_metadata: return self._augment_schema_for_langextract( corpus_id=corpus_id, schema_name=schema.name, schema_def=schema.schema_def, metadata_profile=metadata_profile, ) return schema.schema_def, schema.name active = self.storage.get_active_schema(corpus_id=corpus_id) if active is None: if with_metadata: schema_def = SchemaDiscovery().discover_from_folder( root, with_langextract=True, metadata_profile=metadata_profile, ) discovered_name = str(schema_def.get("name", f"auto_{Path(root).name}")) self.storage.save_schema( corpus_id=corpus_id, name=discovered_name, schema_def=schema_def, is_active=True, ) return schema_def, discovered_name return None, None if with_metadata: return self._augment_schema_for_langextract( corpus_id=corpus_id, schema_name=active.name, schema_def=active.schema_def, metadata_profile=metadata_profile, ) return active.schema_def, active.name def 
_augment_schema_for_langextract( self, *, corpus_id: str, schema_name: str, schema_def: dict[str, Any], metadata_profile: dict[str, Any] | None, ) -> tuple[dict[str, Any], str]: effective_profile = metadata_profile or self._schema_metadata_profile( schema_def ) existing_field_names = self._schema_field_names(schema_def) required = langextract_field_names(effective_profile) if required.issubset(existing_field_names): if metadata_profile is None and ( effective_profile is None or self._schema_metadata_profile(schema_def) is not None ): return schema_def, schema_name augmented_with_profile, changed = ensure_langextract_schema_fields( schema_def, effective_profile, ) if not changed: return schema_def, schema_name self.storage.save_schema( corpus_id=corpus_id, name=schema_name, schema_def=augmented_with_profile, is_active=True, ) return augmented_with_profile, schema_name augmented_schema, _ = ensure_langextract_schema_fields( schema_def, effective_profile, ) self.storage.save_schema( corpus_id=corpus_id, name=schema_name, schema_def=augmented_schema, is_active=True, ) return augmented_schema, schema_name @staticmethod def _schema_metadata_profile( schema_def: dict[str, Any] | None, ) -> dict[str, Any] | None: if not schema_def: return None profile = schema_def.get("metadata_profile") if isinstance(profile, dict): return profile return None @staticmethod def _schema_field_names(schema_def: dict[str, Any]) -> set[str]: fields = schema_def.get("fields") if not isinstance(fields, list): return set() names: set[str] = set() for field in fields: if isinstance(field, dict): name = field.get("name") if isinstance(name, str): names.add(name) return names def _generate_and_store_embeddings( self, *, corpus_id: str, all_chunk_records: list[ChunkRecord], ) -> int: """Embed chunk texts and store in the database. Returns count written.""" if self.embedding_provider is None or not all_chunk_records: return 0 texts = [cr.text for cr in all_chunk_records] embeddings = self.embedding_provider.embed_texts(texts) pairs: list[tuple[str, list[float]]] = [ (cr.id, emb) for cr, emb in zip(all_chunk_records, embeddings) ] written = self.storage.store_chunk_embeddings( corpus_id=corpus_id, chunk_embeddings=pairs, ) if isinstance(self.storage, DuckDBStorage): self.storage.create_hnsw_index(corpus_id=corpus_id) return written @staticmethod def _iter_supported_files(root: str) -> list[str]: files: list[str] = [] for current_root, _, filenames in os.walk(root): for filename in filenames: ext = Path(filename).suffix.lower() if ext in SUPPORTED_EXTENSIONS: files.append(str(Path(current_root) / filename)) files.sort() return files @staticmethod def _sha256(content: str) -> str: return hashlib.sha256(content.encode("utf-8")).hexdigest() @staticmethod def _is_parse_error(content: str) -> bool: return content.startswith(_PARSE_ERROR_PREFIXES) ================================================ FILE: src/fs_explorer/indexing/schema.py ================================================ """ Schema discovery utilities. 
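Illustrative usage sketch (the folder path is a placeholder):

    from fs_explorer.indexing import SchemaDiscovery

    schema = SchemaDiscovery().discover_from_folder("./data/test_acquisition")
    print(schema["name"], [field["name"] for field in schema["fields"]])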
""" from __future__ import annotations import os from pathlib import Path from typing import Any from .metadata import ( auto_discover_profile, infer_document_type, langextract_schema_fields, normalize_langextract_profile, ) from ..fs import SUPPORTED_EXTENSIONS def _iter_supported_files(folder: str) -> list[str]: root = Path(folder).resolve() files: list[str] = [] for current_root, _, filenames in os.walk(root): for filename in filenames: ext = Path(filename).suffix.lower() if ext in SUPPORTED_EXTENSIONS: files.append(str(Path(current_root) / filename)) files.sort() return files class SchemaDiscovery: """Auto-discover a lightweight metadata schema from a corpus.""" def discover_from_folder( self, folder: str, *, with_langextract: bool = False, metadata_profile: dict[str, Any] | None = None, ) -> dict[str, Any]: files = _iter_supported_files(folder) document_types = sorted({infer_document_type(path) for path in files}) corpus_name = Path(folder).resolve().name or "corpus" fields: list[dict[str, Any]] = [ { "name": "filename", "type": "string", "required": True, "description": "Document filename.", }, { "name": "relative_path", "type": "string", "required": True, "description": "Path relative to corpus root.", }, { "name": "extension", "type": "string", "required": True, "description": "File extension.", }, { "name": "document_type", "type": "string", "required": True, "description": "Inferred document category.", "enum": document_types or ["other"], }, { "name": "file_size_bytes", "type": "integer", "required": True, "description": "File size in bytes.", }, { "name": "file_mtime", "type": "number", "required": True, "description": "File modification timestamp (epoch seconds).", }, { "name": "mentions_currency", "type": "boolean", "required": True, "description": "Whether text appears to contain currency amounts.", }, { "name": "mentions_dates", "type": "boolean", "required": True, "description": "Whether text appears to contain date patterns.", }, ] schema: dict[str, Any] = { "name": f"auto_{corpus_name}", "description": "Auto-discovered schema for document-level metadata filtering.", "fields": fields, } if with_langextract: if metadata_profile is None: effective_profile = auto_discover_profile(folder) else: effective_profile = normalize_langextract_profile(metadata_profile) fields.extend(langextract_schema_fields(effective_profile)) schema["metadata_profile"] = effective_profile return schema ================================================ FILE: src/fs_explorer/main.py ================================================ """ CLI entry point for the FsExplorer agent. Provides a command-line interface for running filesystem exploration tasks with rich, detailed output showing each step of the workflow. 
""" import json import asyncio import os from datetime import datetime from pathlib import Path from typer import Typer, Option, Argument, Context, BadParameter, Exit from typing import Annotated, Any from rich.markdown import Markdown from rich.panel import Panel from rich.console import Console from rich.table import Table from rich.text import Text from .embeddings import EmbeddingProvider from .index_config import resolve_db_path from .indexing import IndexingPipeline, SchemaDiscovery from .storage import DuckDBStorage from .agent import set_index_context, clear_index_context from .workflow import ( workflow, InputEvent, ToolCallEvent, GoDeeperEvent, AskHumanEvent, HumanAnswerEvent, get_agent, reset_agent, ) from .exploration_trace import ExplorationTrace, extract_cited_sources app = Typer() schema_app = Typer(help="Manage metadata schemas for indexed corpora.") app.add_typer(schema_app, name="schema") # Tool icons for visual distinction TOOL_ICONS = { "scan_folder": "📂", "preview_file": "👁️", "parse_file": "📖", "read": "📄", "grep": "🔍", "glob": "🔎", "semantic_search": "🧠", "get_document": "📚", "list_indexed_documents": "🗂️", } # Phase detection based on tool usage PHASE_DESCRIPTIONS = { "scan_folder": ("Phase 1", "Parallel Document Scan", "cyan"), "preview_file": ("Phase 1/2", "Quick Preview", "cyan"), "parse_file": ("Phase 2", "Deep Dive", "green"), "read": ("Reading", "Text File", "blue"), "grep": ("Searching", "Pattern Match", "yellow"), "glob": ("Finding", "File Search", "yellow"), "semantic_search": ("Indexed", "Semantic Retrieval", "magenta"), "get_document": ("Indexed", "Document Fetch", "green"), "list_indexed_documents": ("Indexed", "Corpus Listing", "blue"), } def _load_metadata_profile(path_value: str | None) -> dict[str, Any] | None: if path_value is None: return None resolved = Path(path_value).expanduser().resolve() if not resolved.exists() or not resolved.is_file(): raise BadParameter(f"Metadata profile file not found: {resolved}") try: payload = json.loads(resolved.read_text()) except json.JSONDecodeError as exc: raise BadParameter( f"Metadata profile file is not valid JSON: {resolved}" ) from exc if not isinstance(payload, dict): raise BadParameter("Metadata profile JSON must be an object.") return payload def format_tool_panel(event: ToolCallEvent, step_number: int) -> Panel: """Create a richly formatted panel for a tool call event.""" tool_name = event.tool_name icon = TOOL_ICONS.get(tool_name, "🔧") phase_info = PHASE_DESCRIPTIONS.get(tool_name, ("Action", "Tool Call", "yellow")) phase_label, phase_desc, color = phase_info # Build the content lines = [] # Tool and target info if "directory" in event.tool_input: target = event.tool_input["directory"] lines.append(f"**Target Directory:** `{target}`") elif "file_path" in event.tool_input: target = event.tool_input["file_path"] lines.append(f"**Target File:** `{target}`") # Additional parameters other_params = { k: v for k, v in event.tool_input.items() if k not in ("directory", "file_path") } if other_params: lines.append(f"**Parameters:** `{json.dumps(other_params)}`") lines.append("") lines.append("---") lines.append("") # Reasoning (this is the key part for visibility) lines.append("**Agent's Reasoning:**") lines.append("") lines.append(event.reason) content = "\n".join(lines) # Create title with step number and phase title = f"{icon} Step {step_number}: {tool_name} [{phase_label}: {phase_desc}]" return Panel( Markdown(content), title=title, title_align="left", border_style=f"bold {color}", padding=(1, 2), ) def 
format_navigation_panel(event: GoDeeperEvent, step_number: int) -> Panel: """Create a panel for directory navigation events.""" content = f"""**Navigating to:** `{event.directory}` --- **Agent's Reasoning:** {event.reason} """ return Panel( Markdown(content), title=f"📁 Step {step_number}: Navigate to Directory", title_align="left", border_style="bold magenta", padding=(1, 2), ) def print_workflow_header(console: Console, task: str, folder: str) -> None: """Print a header showing the task being executed.""" console.print() header = Table.grid(padding=(0, 2)) header.add_column(style="bold cyan", justify="right") header.add_column() header.add_row("🤖 FsExplorer Agent", "") header.add_row("📋 Task:", task) header.add_row("📁 Folder:", folder) header.add_row("🕐 Started:", datetime.now().strftime("%Y-%m-%d %H:%M:%S")) console.print( Panel( header, border_style="bold blue", title="Starting Exploration", title_align="left", ) ) console.print() def print_workflow_summary( console: Console, agent, step_count: int, trace: ExplorationTrace, cited_sources: list[str], ) -> None: """Print a summary of the workflow execution.""" usage = agent.token_usage # Create summary table summary = Table.grid(padding=(0, 2)) summary.add_column(style="bold", justify="right") summary.add_column() summary.add_row("Total Steps:", str(step_count)) summary.add_row("API Calls:", str(usage.api_calls)) summary.add_row("Documents Scanned:", str(usage.documents_scanned)) summary.add_row("Documents Parsed:", str(usage.documents_parsed)) summary.add_row("", "") summary.add_row("Prompt Tokens:", f"{usage.prompt_tokens:,}") summary.add_row("Completion Tokens:", f"{usage.completion_tokens:,}") summary.add_row("Total Tokens:", f"{usage.total_tokens:,}") summary.add_row("", "") # Cost calculation input_cost, output_cost, total_cost = usage._calculate_cost() summary.add_row("Est. Input Cost:", f"${input_cost:.4f}") summary.add_row("Est. Output Cost:", f"${output_cost:.4f}") summary.add_row("Est. Total Cost:", f"${total_cost:.4f}") console.print() console.print( Panel( summary, title="📊 Workflow Summary", title_align="left", border_style="bold blue", ) ) if trace.step_path: path_markdown = "\n".join(f"- `{entry}`" for entry in trace.step_path) console.print() console.print( Panel( Markdown(path_markdown), title="🧭 Exploration Path", title_align="left", border_style="bold cyan", ) ) referenced_documents = trace.sorted_documents() if referenced_documents: docs_markdown = "\n".join(f"- `{doc}`" for doc in referenced_documents) console.print() console.print( Panel( Markdown(docs_markdown), title="📚 Referenced Documents (Tool Calls)", title_align="left", border_style="bold green", ) ) if cited_sources: sources_markdown = "\n".join(f"- `{source}`" for source in cited_sources) console.print() console.print( Panel( Markdown(sources_markdown), title="🔖 Cited Sources (Final Answer)", title_align="left", border_style="bold yellow", ) ) async def run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: """ Execute the exploration workflow with detailed step-by-step output. Args: task: The user's task/question to answer. 
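        folder: Folder to explore; defaults to the current directory.
        use_index: Route retrieval through indexed tools (requires a prior `explore index` run).
        db_path: Optional path to the DuckDB index file.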
""" console = Console() resolved_folder = os.path.abspath(folder) if not os.path.exists(resolved_folder) or not os.path.isdir(resolved_folder): console.print( Panel( Text(f"No such directory: {resolved_folder}", style="bold red"), title="❌ Error", title_align="left", border_style="bold red", ) ) return resolved_db_path: str | None = None index_storage: DuckDBStorage | None = None if use_index: resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_corpus_id(resolved_folder) if corpus_id is None: console.print( Panel( Text( "No index found for this folder. " "Run `explore index ` first.", style="bold red", ), title="❌ Missing Index", title_align="left", border_style="bold red", ) ) return index_storage = storage set_index_context(resolved_folder, resolved_db_path) else: clear_index_context() try: # Reset agent for fresh state reset_agent() # Print header print_workflow_header(console, task, resolved_folder) trace = ExplorationTrace(root_directory=resolved_folder) step_number = 0 handler = workflow.run( start_event=InputEvent( task=task, folder=resolved_folder, use_index=use_index, ) ) with console.status(status="[bold cyan]🔄 Analyzing task...") as status: async for event in handler.stream_events(): if isinstance(event, ToolCallEvent): step_number += 1 resolved_document_path: str | None = None if event.tool_name == "get_document": doc_id = event.tool_input.get("doc_id") if ( index_storage is not None and isinstance(doc_id, str) and doc_id ): document = index_storage.get_document(doc_id=doc_id) if document and not document["is_deleted"]: resolved_document_path = str(document["absolute_path"]) trace.record_tool_call( step_number=step_number, tool_name=event.tool_name, tool_input=event.tool_input, resolved_document_path=resolved_document_path, ) # Update status based on tool icon = TOOL_ICONS.get(event.tool_name, "🔧") if event.tool_name == "scan_folder": status.update( f"[bold cyan]{icon} Scanning documents in parallel..." ) elif event.tool_name == "parse_file": status.update( f"[bold green]{icon} Reading document in detail..." ) elif event.tool_name == "preview_file": status.update(f"[bold cyan]{icon} Quick preview of document...") elif event.tool_name == "semantic_search": status.update(f"[bold magenta]{icon} Searching index...") elif event.tool_name == "get_document": status.update(f"[bold green]{icon} Reading indexed document...") elif event.tool_name == "list_indexed_documents": status.update(f"[bold blue]{icon} Listing indexed documents...") else: status.update( f"[bold yellow]{icon} Executing {event.tool_name}..." 
) # Print the detailed panel panel = format_tool_panel(event, step_number) console.print(panel) console.print() status.update("[bold cyan]🔄 Processing results...") elif isinstance(event, GoDeeperEvent): step_number += 1 trace.record_go_deeper( step_number=step_number, directory=event.directory ) panel = format_navigation_panel(event, step_number) console.print(panel) console.print() status.update("[bold cyan]🔄 Exploring directory...") elif isinstance(event, AskHumanEvent): status.stop() console.print() # Create a nice prompt panel question_panel = Panel( Markdown( f"**Question:** {event.question}\n\n**Why I'm asking:** {event.reason}" ), title="❓ Human Input Required", title_align="left", border_style="bold red", ) console.print(question_panel) answer = console.input("[bold cyan]Your answer:[/] ") while answer.strip() == "": console.print("[bold red]Please provide an answer.[/]") answer = console.input("[bold cyan]Your answer:[/] ") handler.ctx.send_event(HumanAnswerEvent(response=answer.strip())) console.print() status.start() status.update("[bold cyan]🔄 Processing your response...") # Get final result result = await handler status.update("[bold green]✨ Preparing final answer...") await asyncio.sleep(0.1) status.stop() # Print final result with prominent styling console.print() if result.final_result: final_panel = Panel( Markdown(result.final_result), title="✅ Final Answer", title_align="left", border_style="bold green", padding=(1, 2), ) console.print(final_panel) elif result.error: error_panel = Panel( Text(result.error, style="bold red"), title="❌ Error", title_align="left", border_style="bold red", ) console.print(error_panel) # Print workflow summary agent = get_agent() cited_sources = extract_cited_sources(result.final_result) print_workflow_summary(console, agent, step_number, trace, cited_sources) finally: clear_index_context() @app.callback(invoke_without_command=True) def main( ctx: Context, task: Annotated[ str | None, Option( "--task", "-t", help="Task that the FsExplorer Agent has to perform while exploring the current directory.", ), ] = None, folder: Annotated[ str, Option( "--folder", "-f", help="Folder to explore. Defaults to the current directory.", ), ] = ".", use_index: Annotated[ bool, Option( "--use-index", help="Use indexed retrieval tools for this run (requires prior indexing).", ), ] = False, db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, ) -> None: """ Explore documents with an agent, build indexes, and manage schema metadata. Backward-compatible mode: - `explore --task "..." 
[--folder ...]` """ if ctx.invoked_subcommand is not None: return if task is None or not task.strip(): raise BadParameter("`--task` is required unless you run a subcommand.") effective_use_index = use_index if ( not effective_use_index and os.getenv("FS_EXPLORER_AUTO_INDEX", "").strip() == "1" ): try: resolved_folder = os.path.abspath(folder) resolved_db = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db, read_only=True, initialize=False) if storage.get_corpus_id(resolved_folder) is not None: effective_use_index = True storage.close() except Exception: pass asyncio.run( run_workflow(task, folder, use_index=effective_use_index, db_path=db_path) ) @app.command("index") def index_command( folder: Annotated[ str, Argument(help="Folder to index recursively."), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, discover_schema: Annotated[ bool, Option( "--discover-schema", help="Auto-discover metadata schema and set it active for this corpus.", ), ] = False, schema_name: Annotated[ str | None, Option("--schema-name", help="Use an existing stored schema by name."), ] = None, with_metadata: Annotated[ bool, Option( "--with-metadata", help=( "Enable langextract metadata extraction (requires API key). " "Also enables schema discovery if not explicitly requested." ), ), ] = False, metadata_profile_path: Annotated[ str | None, Option( "--metadata-profile", help=( "Path to JSON profile defining dynamic langextract metadata fields " "and prompt. Implies --with-metadata." ), ), ] = None, with_embeddings: Annotated[ bool, Option( "--with-embeddings", help="Generate vector embeddings for indexed chunks (requires GOOGLE_API_KEY).", ), ] = False, ) -> None: """Build or refresh an index for a folder.""" console = Console() resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) embedding_provider: EmbeddingProvider | None = None if with_embeddings: try: embedding_provider = EmbeddingProvider() except ValueError as exc: raise BadParameter(str(exc)) from exc pipeline = IndexingPipeline( storage=storage, embedding_provider=embedding_provider, ) metadata_profile = _load_metadata_profile(metadata_profile_path) effective_with_metadata = with_metadata or metadata_profile is not None if effective_with_metadata and metadata_profile is None: console.print( "[bold cyan]🔍 Analyzing corpus to generate metadata profile...[/]" ) try: effective_discover_schema = discover_schema or effective_with_metadata result = pipeline.index_folder( folder, discover_schema=effective_discover_schema, schema_name=schema_name, with_metadata=effective_with_metadata, metadata_profile=metadata_profile, ) except ValueError as exc: raise BadParameter(str(exc)) from exc summary = Table.grid(padding=(0, 2)) summary.add_column(style="bold", justify="right") summary.add_column() summary.add_row("DB Path:", resolved_db_path) summary.add_row("Corpus ID:", result.corpus_id) summary.add_row("Indexed Files:", str(result.indexed_files)) summary.add_row("Skipped Files:", str(result.skipped_files)) summary.add_row("Deleted Files:", str(result.deleted_files)) summary.add_row("Chunks Written:", str(result.chunks_written)) summary.add_row("Active Documents:", str(result.active_documents)) summary.add_row("Embeddings Written:", str(result.embeddings_written)) summary.add_row("Schema Used:", result.schema_used or "") summary.add_row( "Metadata Mode:", "langextract" if effective_with_metadata else "heuristic", ) if metadata_profile_path: profile_label = 
str(Path(metadata_profile_path).expanduser().resolve()) elif effective_with_metadata: profile_label = "" else: profile_label = "" summary.add_row("Metadata Profile:", profile_label) console.print(Panel(summary, title="📦 Index Complete", border_style="bold green")) @app.command("query") def query_command( task: Annotated[ str, Option( "--task", "-t", help="Question to answer using indexed retrieval tools.", ), ], folder: Annotated[ str, Option( "--folder", "-f", help="Folder whose index should be queried.", ), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, ) -> None: """Run the agent with indexed retrieval enabled.""" asyncio.run(run_workflow(task, folder, use_index=True, db_path=db_path)) @schema_app.command("discover") def schema_discover_command( folder: Annotated[ str, Argument(help="Folder to inspect for schema discovery."), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, name: Annotated[ str | None, Option("--name", help="Override discovered schema name."), ] = None, activate: Annotated[ bool, Option( "--activate/--no-activate", help="Set schema as active for the corpus.", ), ] = True, with_metadata: Annotated[ bool, Option( "--with-metadata", help="Include langextract metadata fields in discovered schema.", ), ] = False, metadata_profile_path: Annotated[ str | None, Option( "--metadata-profile", help=( "Path to JSON profile defining dynamic langextract metadata fields " "and prompt. Implies --with-metadata." ), ), ] = None, ) -> None: """Auto-discover and store a metadata schema for a folder.""" console = Console() resolved_folder = str(os.path.abspath(folder)) if not os.path.isdir(resolved_folder): raise BadParameter(f"No such directory: {resolved_folder}") resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_or_create_corpus(resolved_folder) metadata_profile = _load_metadata_profile(metadata_profile_path) effective_with_metadata = with_metadata or metadata_profile is not None if effective_with_metadata and metadata_profile is None: console.print( "[bold cyan]🔍 Analyzing corpus to generate metadata profile...[/]" ) discovery = SchemaDiscovery() discovered = discovery.discover_from_folder( resolved_folder, with_langextract=effective_with_metadata, metadata_profile=metadata_profile, ) schema_name = name or str( discovered.get("name", f"auto_{os.path.basename(resolved_folder)}") ) discovered["name"] = schema_name schema_id = storage.save_schema( corpus_id=corpus_id, name=schema_name, schema_def=discovered, is_active=activate, ) output = Table.grid(padding=(0, 2)) output.add_column(style="bold", justify="right") output.add_column() output.add_row("DB Path:", resolved_db_path) output.add_row("Corpus ID:", corpus_id) output.add_row("Schema ID:", schema_id) output.add_row("Schema Name:", schema_name) output.add_row("Active:", str(activate)) output.add_row("Field Count:", str(len(discovered.get("fields", [])))) output.add_row( "Metadata Mode:", "langextract" if effective_with_metadata else "heuristic" ) if metadata_profile_path: profile_label = str(Path(metadata_profile_path).expanduser().resolve()) elif effective_with_metadata: profile_label = "" else: profile_label = "" output.add_row("Metadata Profile:", profile_label) console.print(Panel(output, title="🧩 Schema Saved", border_style="bold cyan")) console.print_json(json.dumps(discovered, indent=2)) @schema_app.command("show") def schema_show_command( folder: 
Annotated[ str, Argument(help="Folder whose schemas should be listed."), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, ) -> None: """Show saved schemas for a folder's corpus.""" console = Console() resolved_folder = str(os.path.abspath(folder)) resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_corpus_id(resolved_folder) if corpus_id is None: console.print( Panel( f"No corpus found for folder: {resolved_folder}\nRun `explore index {resolved_folder}` first.", title="⚠️ No Corpus", border_style="bold yellow", ) ) raise Exit(code=1) schemas = storage.list_schemas(corpus_id=corpus_id) if not schemas: console.print( Panel( f"No schemas saved for corpus: {corpus_id}", title="⚠️ No Schemas", border_style="bold yellow", ) ) raise Exit(code=1) table = Table(title=f"Schemas for {resolved_folder}") table.add_column("Name") table.add_column("Active") table.add_column("Created At") table.add_column("Field Count") for schema in schemas: table.add_row( schema.name, "yes" if schema.is_active else "no", schema.created_at, str(len(schema.schema_def.get("fields", []))), ) console.print(table) ================================================ FILE: src/fs_explorer/models.py ================================================ """ Pydantic models for FsExplorer agent actions. This module defines the structured data models used to represent the actions the agent can take during filesystem exploration. """ from pydantic import BaseModel, Field from typing import TypeAlias, Literal, Any # ============================================================================= # Type Aliases # ============================================================================= Tools: TypeAlias = Literal[ "read", "grep", "glob", "scan_folder", "preview_file", "parse_file", "semantic_search", "get_document", "list_indexed_documents", ] """Available tool names that the agent can invoke.""" ActionType: TypeAlias = Literal["stop", "godeeper", "toolcall", "askhuman"] """Types of actions the agent can take.""" # ============================================================================= # Action Models # ============================================================================= class StopAction(BaseModel): """ Action indicating the task is complete. Used when the agent has gathered enough information to provide a final answer to the user's query. """ final_result: str = Field( description="Final result of the operation with the answer to the user's query" ) class AskHumanAction(BaseModel): """ Action requesting clarification from the user. Used when the agent needs additional information or context to proceed with the task. """ question: str = Field( description="Clarification question to ask the user" ) class GoDeeperAction(BaseModel): """ Action to navigate into a subdirectory. Used when the agent needs to explore a subdirectory to find relevant files. """ directory: str = Field( description="Path to the directory to navigate into" ) class ToolCallArg(BaseModel): """ A single argument for a tool call. Represents a parameter name-value pair to pass to a tool. """ parameter_name: str = Field( description="Name of the parameter" ) parameter_value: Any = Field( description="Value for the parameter" ) class ToolCallAction(BaseModel): """ Action to invoke a filesystem tool. Used when the agent needs to read files, search for patterns, or parse documents to gather information. 
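    Illustrative example (the grep parameter names shown are assumptions for illustration, not the actual tool signature):

        action = ToolCallAction(
            tool_name="grep",
            tool_input=[
                ToolCallArg(parameter_name="pattern", parameter_value="earnout"),
                ToolCallArg(parameter_name="file_path", parameter_value="contract.md"),
            ],
        )
        action.to_fn_args()  # -> {"pattern": "earnout", "file_path": "contract.md"}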
""" tool_name: Tools = Field( description="Name of the tool to invoke" ) tool_input: list[ToolCallArg] = Field( description="Arguments to pass to the tool" ) def to_fn_args(self) -> dict[str, Any]: """ Convert tool input to a dictionary for function calls. Returns: Dictionary mapping parameter names to values. """ return {arg.parameter_name: arg.parameter_value for arg in self.tool_input} class Action(BaseModel): """ Container for an agent action with reasoning. Wraps any of the specific action types (stop, go deeper, tool call, ask human) along with the agent's explanation for why this action was chosen. """ action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction = Field( description="The specific action to take" ) reason: str = Field( description="Explanation for why this action was chosen" ) def to_action_type(self) -> ActionType: """ Get the type of this action. Returns: The action type string: "toolcall", "godeeper", "askhuman", or "stop". """ if isinstance(self.action, ToolCallAction): return "toolcall" elif isinstance(self.action, GoDeeperAction): return "godeeper" elif isinstance(self.action, AskHumanAction): return "askhuman" else: return "stop" ================================================ FILE: src/fs_explorer/search/__init__.py ================================================ """Search helpers for indexed corpora.""" from .filters import ( MetadataFilter, MetadataFilterParseError, parse_metadata_filters, supported_filter_syntax, ) from .query import IndexedQueryEngine, SearchHit from .ranker import RankedDocument, rank_documents from .semantic import SemanticSearchEngine __all__ = [ "MetadataFilter", "MetadataFilterParseError", "parse_metadata_filters", "supported_filter_syntax", "IndexedQueryEngine", "SearchHit", "RankedDocument", "rank_documents", "SemanticSearchEngine", ] ================================================ FILE: src/fs_explorer/search/filters.py ================================================ """ Metadata filter parsing helpers. 
""" from __future__ import annotations import re from dataclasses import dataclass from typing import Any, Literal FilterOperator = Literal["eq", "ne", "gt", "gte", "lt", "lte", "in", "contains"] @dataclass(frozen=True) class MetadataFilter: """Normalized metadata filter condition.""" field: str operator: FilterOperator value: str | bool | int | float | list[str | bool | int | float] def to_storage_dict(self) -> dict[str, Any]: return { "field": self.field, "operator": self.operator, "value": self.value, } class MetadataFilterParseError(ValueError): """Raised when metadata filter syntax is invalid.""" _FIELD_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$") _NUMBER_RE = re.compile(r"^-?\d+(?:\.\d+)?$") def supported_filter_syntax() -> str: """Return a short help text for filter syntax.""" return ( "Supported filter syntax: " "`field=value`, `field!=value`, `field>=number`, `field<=number`, " "`field>number`, `field list[MetadataFilter]: """Parse a raw filter string into normalized metadata conditions.""" if raw_filters is None or not raw_filters.strip(): return [] conditions = _split_conditions(raw_filters) parsed: list[MetadataFilter] = [] for condition in conditions: parsed.append(_parse_condition(condition, allowed_fields=allowed_fields)) return parsed def _parse_condition(condition: str, *, allowed_fields: set[str] | None) -> MetadataFilter: text = condition.strip() if not text: raise MetadataFilterParseError("Empty filter condition.") in_match = re.match(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s+in\s+(.+)\s*$", text, flags=re.IGNORECASE) if in_match: field = in_match.group(1) _validate_field(field, allowed_fields=allowed_fields) values = _parse_list_value(in_match.group(2)) if not values: raise MetadataFilterParseError(f"`in` filter has no values: {text!r}") return MetadataFilter(field=field, operator="in", value=values) op_match = re.match(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*(<=|>=|!=|=|<|>|~|:)\s*(.+)\s*$", text) if not op_match: raise MetadataFilterParseError(f"Invalid filter syntax: {text!r}") field = op_match.group(1) operator_symbol = op_match.group(2) raw_value = op_match.group(3) _validate_field(field, allowed_fields=allowed_fields) value = _parse_scalar_value(raw_value) operator_map: dict[str, FilterOperator] = { "=": "eq", ":": "eq", "!=": "ne", ">": "gt", ">=": "gte", "<": "lt", "<=": "lte", "~": "contains", } operator = operator_map[operator_symbol] if operator in {"gt", "gte", "lt", "lte"} and not isinstance(value, (int, float)): raise MetadataFilterParseError( f"Operator `{operator_symbol}` requires a numeric value: {text!r}" ) return MetadataFilter(field=field, operator=operator, value=value) def _validate_field(field: str, *, allowed_fields: set[str] | None) -> None: if not _FIELD_RE.match(field): raise MetadataFilterParseError(f"Invalid field name: {field!r}") if allowed_fields is not None and field not in allowed_fields: allowed = ", ".join(sorted(allowed_fields)) if allowed_fields else "" raise MetadataFilterParseError( f"Unknown metadata field {field!r}. 
Allowed fields: {allowed}" ) def _split_conditions(raw: str) -> list[str]: parts: list[str] = [] current: list[str] = [] quote: str | None = None paren_depth = 0 bracket_depth = 0 i = 0 while i < len(raw): ch = raw[i] if quote is not None: current.append(ch) if ch == quote: quote = None i += 1 continue if ch in {"'", '"'}: quote = ch current.append(ch) i += 1 continue if ch == "(": paren_depth += 1 current.append(ch) i += 1 continue if ch == ")": paren_depth = max(paren_depth - 1, 0) current.append(ch) i += 1 continue if ch == "[": bracket_depth += 1 current.append(ch) i += 1 continue if ch == "]": bracket_depth = max(bracket_depth - 1, 0) current.append(ch) i += 1 continue if paren_depth == 0 and bracket_depth == 0 and ch == ",": _flush_part(parts, current) i += 1 continue if ( paren_depth == 0 and bracket_depth == 0 and raw[i : i + 3].lower() == "and" and (i == 0 or raw[i - 1].isspace()) and (i + 3 == len(raw) or raw[i + 3].isspace()) ): _flush_part(parts, current) i += 3 continue current.append(ch) i += 1 _flush_part(parts, current) return parts def _flush_part(parts: list[str], current: list[str]) -> None: text = "".join(current).strip() if text: parts.append(text) current.clear() def _parse_list_value(raw_value: str) -> list[str | bool | int | float]: text = raw_value.strip() if text.startswith("(") and text.endswith(")"): text = text[1:-1] elif text.startswith("[") and text.endswith("]"): text = text[1:-1] if not text.strip(): return [] items = _split_conditions(text) return [_parse_scalar_value(item) for item in items] def _parse_scalar_value(raw_value: str) -> str | bool | int | float: text = raw_value.strip() if not text: raise MetadataFilterParseError("Missing filter value.") if (text.startswith("'") and text.endswith("'")) or ( text.startswith('"') and text.endswith('"') ): return text[1:-1] lower = text.lower() if lower == "true": return True if lower == "false": return False if _NUMBER_RE.match(text): if "." in text: return float(text) return int(text) return text ================================================ FILE: src/fs_explorer/search/query.py ================================================ """ Indexed query helpers for agent tools. 
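Illustrative sketch (assumes the folder was indexed beforehand; the DB path is a placeholder):

    from fs_explorer.search import IndexedQueryEngine
    from fs_explorer.storage import DuckDBStorage

    storage = DuckDBStorage("fs_explorer.duckdb", read_only=True, initialize=False)
    corpus_id = storage.get_corpus_id("/absolute/path/to/corpus")  # not None under the assumption above
    engine = IndexedQueryEngine(storage)
    hits = engine.search(
        corpus_id=corpus_id,
        query="escrow reserve",
        filters="document_type=contract",
        limit=3,
    )
    [(hit.relative_path, hit.matched_by, round(hit.score, 2)) for hit in hits]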
""" from __future__ import annotations from concurrent.futures import ThreadPoolExecutor from dataclasses import dataclass from typing import Any, Callable from ..embeddings import EmbeddingProvider from ..storage import DuckDBStorage, StorageBackend from .filters import MetadataFilter, parse_metadata_filters from .ranker import RankedDocument, rank_documents @dataclass(frozen=True) class SearchHit: """Ranked document hit from indexed retrieval.""" doc_id: str relative_path: str absolute_path: str position: int | None text: str semantic_score: float metadata_score: int score: float matched_by: str class IndexedQueryEngine: """Parallel retrieval engine for semantic + metadata query paths.""" def __init__( self, storage: StorageBackend, embedding_provider: EmbeddingProvider | None = None, ) -> None: self.storage = storage self.embedding_provider = embedding_provider def search( self, *, corpus_id: str, query: str, filters: str | None = None, limit: int = 5, enable_semantic: bool = True, enable_metadata: bool = True, ) -> list[SearchHit]: normalized_limit = max(limit, 1) parsed_filters = self._parse_filters(corpus_id=corpus_id, filters=filters) semantic_limit = max(normalized_limit * 4, normalized_limit) metadata_limit = max(normalized_limit * 4, normalized_limit) run_semantic = enable_semantic run_metadata = enable_metadata and bool(parsed_filters) semantic_rows: list[dict[str, Any]] metadata_rows: list[dict[str, Any]] if run_semantic and run_metadata: semantic_rows, metadata_rows = self._search_parallel( corpus_id=corpus_id, query=query, metadata_filters=parsed_filters, semantic_limit=semantic_limit, metadata_limit=metadata_limit, ) elif run_semantic: semantic_rows = self._semantic_query( corpus_id=corpus_id, query=query, limit=semantic_limit, ) metadata_rows = [] elif run_metadata: semantic_rows = [] metadata_rows = self._metadata_query( corpus_id=corpus_id, metadata_filters=parsed_filters, limit=metadata_limit, ) else: semantic_rows, metadata_rows = [], [] ranked = self._merge_and_rank( semantic_rows=semantic_rows, metadata_rows=metadata_rows, limit=normalized_limit, ) return [ SearchHit( doc_id=doc.doc_id, relative_path=doc.relative_path, absolute_path=doc.absolute_path, position=doc.position, text=doc.text, semantic_score=doc.semantic_score, metadata_score=doc.metadata_score, score=doc.combined_score, matched_by=doc.matched_by, ) for doc in ranked ] def _parse_filters( self, *, corpus_id: str, filters: str | None ) -> list[MetadataFilter]: if filters is None or not filters.strip(): return [] allowed_fields = self._allowed_filter_fields(corpus_id=corpus_id) return parse_metadata_filters(filters, allowed_fields=allowed_fields) def _allowed_filter_fields(self, *, corpus_id: str) -> set[str] | None: active_schema = self.storage.get_active_schema(corpus_id=corpus_id) if active_schema is None: return None fields = active_schema.schema_def.get("fields") if not isinstance(fields, list): return None allowed: set[str] = set() for field in fields: if isinstance(field, dict): name = field.get("name") if isinstance(name, str): allowed.add(name) return allowed if allowed else None def _search_parallel( self, *, corpus_id: str, query: str, metadata_filters: list[MetadataFilter], semantic_limit: int, metadata_limit: int, ) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]: with ThreadPoolExecutor(max_workers=2) as executor: semantic_future = executor.submit( self._semantic_query, corpus_id=corpus_id, query=query, limit=semantic_limit, ) metadata_future = executor.submit( self._metadata_query, 
corpus_id=corpus_id, metadata_filters=metadata_filters, limit=metadata_limit, ) semantic_rows = semantic_future.result() metadata_rows = metadata_future.result() return semantic_rows, metadata_rows def _semantic_query( self, *, corpus_id: str, query: str, limit: int, ) -> list[dict[str, Any]]: scoped_storage, cleanup = self._acquire_query_storage() try: if self.embedding_provider is not None and scoped_storage.has_embeddings( corpus_id=corpus_id ): query_embedding = self.embedding_provider.embed_query(query) return scoped_storage.search_chunks_semantic( corpus_id=corpus_id, query_embedding=query_embedding, limit=limit, ) return scoped_storage.search_chunks( corpus_id=corpus_id, query=query, limit=limit ) finally: cleanup() def _metadata_query( self, *, corpus_id: str, metadata_filters: list[MetadataFilter], limit: int, ) -> list[dict[str, Any]]: scoped_storage, cleanup = self._acquire_query_storage() try: return scoped_storage.search_documents_by_metadata( corpus_id=corpus_id, filters=[flt.to_storage_dict() for flt in metadata_filters], limit=limit, ) finally: cleanup() def _acquire_query_storage(self) -> tuple[StorageBackend, Callable[[], None]]: if isinstance(self.storage, DuckDBStorage): clone = DuckDBStorage( self.storage.db_path, read_only=self.storage.read_only, initialize=False, embedding_dim=self.storage.embedding_dim, ) return clone, clone.close return self.storage, lambda: None @staticmethod def _merge_and_rank( *, semantic_rows: list[dict[str, Any]], metadata_rows: list[dict[str, Any]], limit: int, ) -> list[RankedDocument]: merged: dict[str, dict[str, Any]] = {} for row in semantic_rows: doc_id = str(row["doc_id"]) score = float(row["score"]) position = int(row["position"]) entry = merged.setdefault( doc_id, { "doc_id": doc_id, "relative_path": str(row["relative_path"]), "absolute_path": str(row["absolute_path"]), "position": position, "text": str(row["text"]), "semantic_score": 0.0, "metadata_score": 0, }, ) if score > float(entry["semantic_score"]): entry["semantic_score"] = score entry["position"] = position entry["text"] = str(row["text"]) for row in metadata_rows: doc_id = str(row["doc_id"]) entry = merged.setdefault( doc_id, { "doc_id": doc_id, "relative_path": str(row["relative_path"]), "absolute_path": str(row["absolute_path"]), "position": None, "text": str(row.get("preview_text", "")), "semantic_score": 0.0, "metadata_score": 0, }, ) entry["metadata_score"] = max( int(entry["metadata_score"]), int(row.get("metadata_score", 1)), ) if not entry["text"]: entry["text"] = str(row.get("preview_text", "")) documents = [ RankedDocument( doc_id=str(entry["doc_id"]), relative_path=str(entry["relative_path"]), absolute_path=str(entry["absolute_path"]), position=int(entry["position"]) if entry["position"] is not None else None, text=str(entry["text"]), semantic_score=float(entry["semantic_score"]), metadata_score=int(entry["metadata_score"]), ) for entry in merged.values() ] return rank_documents(documents, limit=limit) ================================================ FILE: src/fs_explorer/search/ranker.py ================================================ """ Ranking helpers for merging retrieval result sets. 
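Illustrative example:

    doc = RankedDocument(
        doc_id="doc-1",
        relative_path="contract.md",
        absolute_path="/corpus/contract.md",
        position=0,
        text="...",
        semantic_score=0.82,
        metadata_score=1,
    )
    doc.combined_score  # 0.82 * 100 + 1 * 10 == 92.0
    doc.matched_by      # "semantic+metadata"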
""" from __future__ import annotations from dataclasses import dataclass @dataclass(frozen=True) class RankedDocument: """Merged retrieval candidate for a document.""" doc_id: str relative_path: str absolute_path: str position: int | None text: str semantic_score: float metadata_score: int @property def combined_score(self) -> float: # Semantic scores dominate ordering; metadata score boosts ties and # metadata-only matches into the candidate set. return float(self.semantic_score * 100 + self.metadata_score * 10) @property def matched_by(self) -> str: if self.semantic_score > 0 and self.metadata_score > 0: return "semantic+metadata" if self.semantic_score > 0: return "semantic" return "metadata" def rank_documents( documents: list[RankedDocument], *, limit: int ) -> list[RankedDocument]: """Sort merged retrieval results and apply limit.""" ordered = sorted( documents, key=lambda doc: ( -doc.combined_score, -doc.semantic_score, -doc.metadata_score, doc.position if doc.position is not None else 10**9, doc.relative_path, ), ) return ordered[: max(limit, 1)] ================================================ FILE: src/fs_explorer/search/semantic.py ================================================ """ Vector-based semantic search engine. Embeds a query and searches chunk embeddings via cosine similarity, falling back to keyword matching when embeddings are unavailable. """ from __future__ import annotations from typing import Any from ..embeddings import EmbeddingProvider from ..storage import StorageBackend class SemanticSearchEngine: """Embed a query and search stored chunk embeddings.""" def __init__( self, storage: StorageBackend, embedding_provider: EmbeddingProvider, ) -> None: self.storage = storage self.embedding_provider = embedding_provider def search( self, *, corpus_id: str, query: str, limit: int = 5, ) -> list[dict[str, Any]]: """Return ranked chunk hits using vector cosine similarity.""" query_embedding = self.embedding_provider.embed_query(query) return self.storage.search_chunks_semantic( corpus_id=corpus_id, query_embedding=query_embedding, limit=limit, ) ================================================ FILE: src/fs_explorer/server.py ================================================ """ FastAPI server for FsExplorer web UI. Provides a WebSocket endpoint for real-time workflow streaming and serves the single-page HTML interface. 
""" import asyncio from pathlib import Path from typing import Any from fastapi import FastAPI, WebSocket, WebSocketDisconnect from fastapi.responses import HTMLResponse, JSONResponse from pydantic import BaseModel from .agent import clear_index_context, set_index_context, set_search_flags from .embeddings import EmbeddingProvider from .exploration_trace import ExplorationTrace, extract_cited_sources from .index_config import resolve_db_path from .indexing import IndexingPipeline from .indexing.metadata import auto_discover_profile from .search import IndexedQueryEngine from .storage import DuckDBStorage from .workflow import ( AskHumanEvent, GoDeeperEvent, HumanAnswerEvent, InputEvent, ToolCallEvent, get_agent, reset_agent, workflow, ) app = FastAPI(title="FsExplorer", description="AI-powered filesystem exploration") _corpus_locks: dict[str, asyncio.Lock] = {} def _get_corpus_lock(folder: str) -> asyncio.Lock: """Return a per-folder asyncio lock, creating one if needed.""" normalized = str(Path(folder).resolve()) if normalized not in _corpus_locks: _corpus_locks[normalized] = asyncio.Lock() return _corpus_locks[normalized] class TaskRequest(BaseModel): """Request model for task submission.""" task: str folder: str = "." use_index: bool = False db_path: str | None = None class IndexRequest(BaseModel): """Request model for index build/refresh.""" folder: str = "." db_path: str | None = None discover_schema: bool = True schema_name: str | None = None with_metadata: bool = False metadata_profile: dict[str, Any] | None = None with_embeddings: bool = False class AutoProfileRequest(BaseModel): """Request model for auto-profile generation.""" folder: str = "." class SearchRequest(BaseModel): """Request model for search queries.""" corpus_folder: str query: str filters: str | None = None limit: int = 5 db_path: str | None = None @app.get("/", response_class=HTMLResponse) async def get_ui(): """Serve the main UI HTML file.""" html_path = Path(__file__).parent / "ui.html" if html_path.exists(): return HTMLResponse( content=html_path.read_text(encoding="utf-8"), status_code=200 ) return HTMLResponse(content="
UI not found
", status_code=404) @app.get("/api/folders") async def list_folders(path: str = "."): """ List folders in the given path. Returns list of folder names and current path info. """ try: base_path = Path(path).resolve() if not base_path.exists(): return JSONResponse({"error": "Path not found"}, status_code=404) if not base_path.is_dir(): return JSONResponse({"error": "Not a directory"}, status_code=400) # Get folders (non-hidden) folders = sorted( [ f.name for f in base_path.iterdir() if f.is_dir() and not f.name.startswith(".") ] ) # Get parent path (if not at root) parent = str(base_path.parent) if base_path != base_path.parent else None return { "current": str(base_path), "parent": parent, "folders": folders, "files_count": len([f for f in base_path.iterdir() if f.is_file()]), } except PermissionError: return JSONResponse({"error": "Permission denied"}, status_code=403) except Exception as e: return JSONResponse({"error": str(e)}, status_code=500) @app.get("/api/index/status") async def index_status(folder: str, db_path: str | None = None): """Check whether a folder has been indexed and return status details.""" try: folder_path = Path(folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): return {"indexed": False} resolved_db_path = resolve_db_path(db_path) if not Path(resolved_db_path).exists(): return {"indexed": False} try: storage = DuckDBStorage(resolved_db_path, read_only=True, initialize=False) except Exception: return {"indexed": False} try: corpus_id = storage.get_corpus_id(str(folder_path)) if corpus_id is None: storage.close() return {"indexed": False} docs = storage.list_documents(corpus_id=corpus_id, include_deleted=False) active_schema = storage.get_active_schema(corpus_id=corpus_id) has_embeddings = storage.has_embeddings(corpus_id=corpus_id) schema_name: str | None = None has_metadata = False schema_fields: list[str] = [] if active_schema is not None: schema_name = active_schema.name has_metadata = ( active_schema.schema_def.get("metadata_profile") is not None ) fields_def = active_schema.schema_def.get("fields") if isinstance(fields_def, list): for f in fields_def: if isinstance(f, dict) and isinstance(f.get("name"), str): schema_fields.append(f["name"]) storage.close() return { "indexed": True, "corpus_id": corpus_id, "document_count": len(docs), "schema_name": schema_name, "has_metadata": has_metadata, "has_embeddings": has_embeddings, "schema_fields": schema_fields, } except Exception: storage.close() return {"indexed": False} except Exception: return {"indexed": False} @app.post("/api/index/auto-profile") async def generate_auto_profile(request: AutoProfileRequest): """Generate an auto-discovered metadata profile for preview/editing.""" try: folder_path = Path(request.folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): return JSONResponse( {"error": f"Invalid folder: {request.folder}"}, status_code=400 ) profile = await asyncio.to_thread(auto_discover_profile, str(folder_path)) return {"profile": profile} except Exception as exc: return JSONResponse({"error": str(exc)}, status_code=500) @app.post("/api/index") async def build_index(request: IndexRequest): """Build or refresh the index for a selected folder.""" try: folder_path = Path(request.folder).resolve() if not folder_path.exists(): return JSONResponse({"error": "Path not found"}, status_code=404) if not folder_path.is_dir(): return JSONResponse({"error": "Not a directory"}, status_code=400) lock = _get_corpus_lock(str(folder_path)) async with lock: resolved_db_path = 
resolve_db_path(request.db_path) embedding_provider: EmbeddingProvider | None = None if request.with_embeddings: try: embedding_provider = EmbeddingProvider() except ValueError: embedding_provider = None pipeline = IndexingPipeline( storage=DuckDBStorage(resolved_db_path), embedding_provider=embedding_provider, ) effective_with_metadata = ( request.with_metadata or request.metadata_profile is not None ) discover_schema = request.discover_schema or effective_with_metadata result = pipeline.index_folder( str(folder_path), discover_schema=discover_schema, schema_name=request.schema_name, with_metadata=effective_with_metadata, metadata_profile=request.metadata_profile, ) return { "db_path": resolved_db_path, "folder": str(folder_path), "corpus_id": result.corpus_id, "indexed_files": result.indexed_files, "skipped_files": result.skipped_files, "deleted_files": result.deleted_files, "chunks_written": result.chunks_written, "active_documents": result.active_documents, "schema_used": result.schema_used, "embeddings_written": result.embeddings_written, "metadata_mode": "langextract" if effective_with_metadata else "heuristic", } except ValueError as exc: return JSONResponse({"error": str(exc)}, status_code=400) except PermissionError: return JSONResponse({"error": "Permission denied"}, status_code=403) except Exception as exc: return JSONResponse({"error": str(exc)}, status_code=500) @app.post("/api/search") async def search_index(request: SearchRequest): """Search an indexed corpus and return ranked hits.""" try: folder_path = Path(request.corpus_folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): return JSONResponse( {"error": f"Invalid folder: {request.corpus_folder}"}, status_code=400 ) resolved_db_path = resolve_db_path(request.db_path) storage = DuckDBStorage(resolved_db_path, read_only=True, initialize=False) corpus_id = storage.get_corpus_id(str(folder_path)) if corpus_id is None: storage.close() return JSONResponse( {"error": "No index found for this folder."}, status_code=404 ) embedding_provider: EmbeddingProvider | None = None if storage.has_embeddings(corpus_id=corpus_id): try: embedding_provider = EmbeddingProvider() except ValueError: pass engine = IndexedQueryEngine(storage, embedding_provider=embedding_provider) hits = engine.search( corpus_id=corpus_id, query=request.query, filters=request.filters, limit=request.limit, ) storage.close() return { "corpus_folder": str(folder_path), "query": request.query, "hits": [ { "doc_id": hit.doc_id, "relative_path": hit.relative_path, "absolute_path": hit.absolute_path, "position": hit.position, "text": hit.text, "semantic_score": hit.semantic_score, "metadata_score": hit.metadata_score, "score": hit.score, "matched_by": hit.matched_by, } for hit in hits ], } except Exception as exc: return JSONResponse({"error": str(exc)}, status_code=500) @app.websocket("/ws/explore") async def websocket_explore(websocket: WebSocket): """ WebSocket endpoint for real-time exploration streaming. Protocol: 1. Client sends: {"task": "user question"} 2. Server streams events: {"type": "...", "data": {...}} 3. 
Final event: {"type": "complete", "data": {...}} """ await websocket.accept() try: # Receive the task data = await websocket.receive_json() task = data.get("task", "") folder = data.get("folder", ".") use_index = bool(data.get("use_index", False)) db_path = data.get("db_path") enable_semantic = bool(data.get("enable_semantic", False)) enable_metadata = bool(data.get("enable_metadata", False)) index_storage: DuckDBStorage | None = None if not task: await websocket.send_json( {"type": "error", "data": {"message": "No task provided"}} ) return # Validate folder folder_path = Path(folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): await websocket.send_json( {"type": "error", "data": {"message": f"Invalid folder: {folder}"}} ) return clear_index_context() if use_index: resolved_db_path = resolve_db_path( db_path if isinstance(db_path, str) else None ) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_corpus_id(str(folder_path)) if corpus_id is None: await websocket.send_json( { "type": "error", "data": { "message": ( "No index found for the selected folder. " "Run `explore index ` first." ) }, } ) return index_storage = storage set_index_context(str(folder_path), resolved_db_path) set_search_flags( enable_semantic=enable_semantic and use_index, enable_metadata=enable_metadata and use_index, ) trace = ExplorationTrace(root_directory=str(folder_path)) # Reset agent for fresh state reset_agent() # Send start event await websocket.send_json( { "type": "start", "data": { "task": task, "folder": str(folder_path), "use_index": use_index, }, } ) # Run the workflow step_number = 0 handler = workflow.run( start_event=InputEvent( task=task, folder=str(folder_path), use_index=use_index, enable_semantic=enable_semantic and use_index, enable_metadata=enable_metadata and use_index, ) ) async for event in handler.stream_events(): if isinstance(event, ToolCallEvent): step_number += 1 resolved_document_path: str | None = None if event.tool_name == "get_document": doc_id = event.tool_input.get("doc_id") if index_storage is not None and isinstance(doc_id, str) and doc_id: document = index_storage.get_document(doc_id=doc_id) if document and not document["is_deleted"]: resolved_document_path = str(document["absolute_path"]) trace.record_tool_call( step_number=step_number, tool_name=event.tool_name, tool_input=event.tool_input, resolved_document_path=resolved_document_path, ) await websocket.send_json( { "type": "tool_call", "data": { "step": step_number, "tool_name": event.tool_name, "tool_input": event.tool_input, "reason": event.reason, }, } ) elif isinstance(event, GoDeeperEvent): step_number += 1 trace.record_go_deeper( step_number=step_number, directory=event.directory ) await websocket.send_json( { "type": "go_deeper", "data": { "step": step_number, "directory": event.directory, "reason": event.reason, }, } ) elif isinstance(event, AskHumanEvent): step_number += 1 await websocket.send_json( { "type": "ask_human", "data": { "step": step_number, "question": event.question, "reason": event.reason, }, } ) # Wait for human response response_data = await websocket.receive_json() if response_data.get("type") == "human_response": handler.ctx.send_event( HumanAnswerEvent(response=response_data.get("response", "")) ) # Get final result result = await handler cited_sources = extract_cited_sources(result.final_result) # Get token usage agent = get_agent() usage = agent.token_usage input_cost, output_cost, total_cost = usage._calculate_cost() await websocket.send_json( { "type": 
"complete", "data": { "final_result": result.final_result, "error": result.error, "stats": { "steps": step_number, "api_calls": usage.api_calls, "documents_scanned": usage.documents_scanned, "documents_parsed": usage.documents_parsed, "prompt_tokens": usage.prompt_tokens, "completion_tokens": usage.completion_tokens, "total_tokens": usage.total_tokens, "tool_result_chars": usage.tool_result_chars, "estimated_cost": round(total_cost, 6), }, "trace": { "step_path": trace.step_path, "referenced_documents": trace.sorted_documents(), "cited_sources": cited_sources, }, }, } ) except WebSocketDisconnect: pass except Exception as e: await websocket.send_json({"type": "error", "data": {"message": str(e)}}) finally: set_search_flags(enable_semantic=False, enable_metadata=False) clear_index_context() def run_server(host: str = "127.0.0.1", port: int = 8000): """Run the FastAPI server.""" import uvicorn uvicorn.run(app, host=host, port=port) if __name__ == "__main__": run_server() ================================================ FILE: src/fs_explorer/storage/__init__.py ================================================ """Storage backends for FsExplorer indexing.""" from .base import ChunkRecord, DocumentRecord, SchemaRecord, StorageBackend from .duckdb import DuckDBStorage __all__ = [ "ChunkRecord", "DocumentRecord", "SchemaRecord", "StorageBackend", "DuckDBStorage", ] ================================================ FILE: src/fs_explorer/storage/base.py ================================================ """ Storage interfaces and data models for index persistence. """ from __future__ import annotations from dataclasses import dataclass from typing import Any, Protocol @dataclass(frozen=True) class ChunkRecord: """A text chunk stored for a document.""" id: str doc_id: str text: str position: int start_char: int end_char: int embedding: list[float] | None = None @dataclass(frozen=True) class DocumentRecord: """A normalized document record for indexing.""" id: str corpus_id: str relative_path: str absolute_path: str content: str metadata_json: str file_mtime: float file_size: int content_sha256: str @dataclass(frozen=True) class SchemaRecord: """A stored schema entry.""" id: str corpus_id: str name: str schema_def: dict[str, Any] is_active: bool created_at: str class StorageBackend(Protocol): """Protocol for persistence operations used by indexing and schema workflows.""" def initialize(self) -> None: """Initialize required tables/indexes.""" def get_or_create_corpus(self, root_path: str) -> str: """Return corpus id for a root path, creating if needed.""" def get_corpus_id(self, root_path: str) -> str | None: """Return corpus id for a root path if present.""" def upsert_document( self, document: DocumentRecord, chunks: list[ChunkRecord] ) -> None: """Insert or update a document and replace its chunks.""" def mark_deleted_missing_documents( self, *, corpus_id: str, active_relative_paths: set[str], ) -> int: """Mark documents deleted when not present in the latest index run.""" def list_documents( self, *, corpus_id: str, include_deleted: bool = False, ) -> list[dict[str, Any]]: """List documents for a corpus.""" def count_chunks(self, *, corpus_id: str) -> int: """Count chunks for active documents in a corpus.""" def search_chunks( self, *, corpus_id: str, query: str, limit: int = 5, ) -> list[dict[str, Any]]: """Search indexed chunks and return ranked matches.""" def search_documents_by_metadata( self, *, corpus_id: str, filters: list[dict[str, Any]], limit: int = 20, ) -> list[dict[str, Any]]: """Search 
indexed documents by metadata filters.""" def get_document(self, *, doc_id: str) -> dict[str, Any] | None: """Get a document by id.""" def save_schema( self, *, corpus_id: str, name: str, schema_def: dict[str, Any], is_active: bool = True, ) -> str: """Create or update a schema entry.""" def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]: """List all schemas for a corpus.""" def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRecord | None: """Fetch a schema by name.""" def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None: """Fetch active schema for a corpus if present.""" def store_chunk_embeddings( self, *, corpus_id: str, chunk_embeddings: list[tuple[str, list[float]]], ) -> int: """Bulk-store (chunk_id, embedding) pairs. Return count written.""" def search_chunks_semantic( self, *, corpus_id: str, query_embedding: list[float], limit: int = 5, ) -> list[dict[str, Any]]: """Search chunks by cosine similarity against a query embedding.""" def get_metadata_field_values( self, *, corpus_id: str, field_names: list[str], max_distinct: int = 10, ) -> dict[str, list[str]]: """Return up to *max_distinct* distinct non-empty values per metadata field.""" def has_embeddings(self, *, corpus_id: str) -> bool: """Return True if the corpus has stored embeddings.""" ================================================ FILE: src/fs_explorer/storage/duckdb.py ================================================ """ DuckDB storage backend for index persistence. """ from __future__ import annotations import hashlib import json import re from pathlib import Path from typing import Any import duckdb from .base import ChunkRecord, DocumentRecord, SchemaRecord def _stable_id(prefix: str, value: str) -> str: digest = hashlib.sha1(value.encode("utf-8")).hexdigest() return f"{prefix}_{digest}" def _query_terms(query: str, max_terms: int = 8) -> list[str]: terms = re.findall(r"[a-zA-Z0-9_]{3,}", query.lower()) unique_terms: list[str] = [] for term in terms: if term not in unique_terms: unique_terms.append(term) if len(unique_terms) >= max_terms: break if unique_terms: return unique_terms fallback = query.strip().lower() return [fallback] if fallback else [] class DuckDBStorage: """DuckDB-backed persistence for corpora, documents, chunks, and schemas.""" def __init__( self, db_path: str, *, read_only: bool = False, initialize: bool = True, embedding_dim: int = 768, ) -> None: self.db_path = str(Path(db_path).expanduser().resolve()) self.read_only = read_only self.embedding_dim = embedding_dim Path(self.db_path).parent.mkdir(parents=True, exist_ok=True) self._conn = duckdb.connect(self.db_path, read_only=read_only) self._vss_available = False if initialize and not read_only: self.initialize() if not read_only: self._try_load_vss() def close(self) -> None: """Close the underlying DuckDB connection.""" self._conn.close() def initialize(self) -> None: self._conn.execute( """ CREATE TABLE IF NOT EXISTS corpora ( id VARCHAR PRIMARY KEY, root_path VARCHAR NOT NULL UNIQUE, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); """ ) self._conn.execute( """ CREATE TABLE IF NOT EXISTS documents ( id VARCHAR PRIMARY KEY, corpus_id VARCHAR NOT NULL REFERENCES corpora(id), relative_path VARCHAR NOT NULL, absolute_path VARCHAR NOT NULL, content VARCHAR NOT NULL, metadata_json VARCHAR NOT NULL DEFAULT '{}', file_mtime DOUBLE NOT NULL, file_size BIGINT NOT NULL, content_sha256 VARCHAR NOT NULL, last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, is_deleted BOOLEAN DEFAULT FALSE, UNIQUE(corpus_id, 
relative_path) ); """ ) self._conn.execute( """ CREATE TABLE IF NOT EXISTS chunks ( id VARCHAR PRIMARY KEY, doc_id VARCHAR NOT NULL REFERENCES documents(id), text VARCHAR NOT NULL, position INTEGER NOT NULL, start_char INTEGER NOT NULL, end_char INTEGER NOT NULL ); """ ) self._conn.execute( """ CREATE TABLE IF NOT EXISTS schemas ( id VARCHAR PRIMARY KEY, corpus_id VARCHAR NOT NULL REFERENCES corpora(id), name VARCHAR NOT NULL, schema_def VARCHAR NOT NULL, is_active BOOLEAN DEFAULT FALSE, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, UNIQUE(corpus_id, name) ); """ ) self._conn.execute( f""" CREATE TABLE IF NOT EXISTS chunk_embeddings ( chunk_id VARCHAR PRIMARY KEY REFERENCES chunks(id), corpus_id VARCHAR NOT NULL, embedding FLOAT[{self.embedding_dim}] NOT NULL ); """ ) def _try_load_vss(self) -> None: """Attempt to install and load the vss extension for HNSW acceleration.""" try: self._conn.execute("INSTALL vss") self._conn.execute("LOAD vss") self._vss_available = True except Exception: self._vss_available = False def get_or_create_corpus(self, root_path: str) -> str: normalized = str(Path(root_path).resolve()) corpus_id = _stable_id("corpus", normalized) self._conn.execute( """ INSERT INTO corpora (id, root_path) VALUES (?, ?) ON CONFLICT(root_path) DO NOTHING """, [corpus_id, normalized], ) row = self._conn.execute( "SELECT id FROM corpora WHERE root_path = ?", [normalized], ).fetchone() if row is None: raise RuntimeError(f"Failed to create corpus for path: {normalized}") return str(row[0]) def get_corpus_id(self, root_path: str) -> str | None: normalized = str(Path(root_path).resolve()) row = self._conn.execute( "SELECT id FROM corpora WHERE root_path = ?", [normalized], ).fetchone() if row is None: return None return str(row[0]) def upsert_document( self, document: DocumentRecord, chunks: list[ChunkRecord] ) -> None: # Cascade-delete embeddings for old chunks, then remove old chunks. self._conn.execute( """ DELETE FROM chunk_embeddings WHERE chunk_id IN (SELECT id FROM chunks WHERE doc_id = ?) """, [document.id], ) self._conn.execute("DELETE FROM chunks WHERE doc_id = ?", [document.id]) self._conn.execute( """ INSERT INTO documents ( id, corpus_id, relative_path, absolute_path, content, metadata_json, file_mtime, file_size, content_sha256, is_deleted ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, FALSE) ON CONFLICT(id) DO UPDATE SET corpus_id = excluded.corpus_id, relative_path = excluded.relative_path, absolute_path = excluded.absolute_path, content = excluded.content, metadata_json = excluded.metadata_json, file_mtime = excluded.file_mtime, file_size = excluded.file_size, content_sha256 = excluded.content_sha256, last_indexed_at = now(), is_deleted = FALSE """, [ document.id, document.corpus_id, document.relative_path, document.absolute_path, document.content, document.metadata_json, document.file_mtime, document.file_size, document.content_sha256, ], ) if chunks: self._conn.executemany( """ INSERT INTO chunks (id, doc_id, text, position, start_char, end_char) VALUES (?, ?, ?, ?, ?, ?) """, [ ( chunk.id, chunk.doc_id, chunk.text, chunk.position, chunk.start_char, chunk.end_char, ) for chunk in chunks ], ) def mark_deleted_missing_documents( self, *, corpus_id: str, active_relative_paths: set[str], ) -> int: if not active_relative_paths: self._conn.execute( """ UPDATE documents SET is_deleted = TRUE WHERE corpus_id = ? 
AND is_deleted = FALSE """, [corpus_id], ) else: placeholders = ", ".join(["?"] * len(active_relative_paths)) params: list[Any] = [corpus_id] params.extend(sorted(active_relative_paths)) self._conn.execute( f""" UPDATE documents SET is_deleted = TRUE WHERE corpus_id = ? AND is_deleted = FALSE AND relative_path NOT IN ({placeholders}) """, params, ) row = self._conn.execute( """ SELECT COUNT(*) FROM documents WHERE corpus_id = ? AND is_deleted = TRUE """, [corpus_id], ).fetchone() return int(row[0]) if row else 0 def list_documents( self, *, corpus_id: str, include_deleted: bool = False, ) -> list[dict[str, Any]]: sql = """ SELECT id, relative_path, absolute_path, file_size, file_mtime, is_deleted FROM documents WHERE corpus_id = ? """ params: list[Any] = [corpus_id] if not include_deleted: sql += " AND is_deleted = FALSE" sql += " ORDER BY relative_path" rows = self._conn.execute(sql, params).fetchall() results: list[dict[str, Any]] = [] for row in rows: results.append( { "id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "file_size": int(row[3]), "file_mtime": float(row[4]), "is_deleted": bool(row[5]), } ) return results def count_chunks(self, *, corpus_id: str) -> int: row = self._conn.execute( """ SELECT COUNT(*) FROM chunks c JOIN documents d ON d.id = c.doc_id WHERE d.corpus_id = ? AND d.is_deleted = FALSE """, [corpus_id], ).fetchone() return int(row[0]) if row else 0 def search_chunks( self, *, corpus_id: str, query: str, limit: int = 5, ) -> list[dict[str, Any]]: terms = _query_terms(query) if not terms: return [] score_expr = " + ".join( ["CASE WHEN lower(c.text) LIKE '%' || ? || '%' THEN 1 ELSE 0 END"] * len(terms) ) sql = f""" SELECT * FROM ( SELECT d.id AS doc_id, d.relative_path, d.absolute_path, c.position, c.text, ({score_expr}) AS score FROM chunks c JOIN documents d ON d.id = c.doc_id WHERE d.corpus_id = ? AND d.is_deleted = FALSE ) ranked WHERE score > 0 ORDER BY score DESC, relative_path ASC, position ASC LIMIT ? """ params: list[Any] = [] params.extend(terms) params.append(corpus_id) params.append(limit) rows = self._conn.execute(sql, params).fetchall() results: list[dict[str, Any]] = [] for row in rows: results.append( { "doc_id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "position": int(row[3]), "text": str(row[4]), "score": int(row[5]), } ) return results def search_documents_by_metadata( self, *, corpus_id: str, filters: list[dict[str, Any]], limit: int = 20, ) -> list[dict[str, Any]]: if not filters: return [] sql = """ SELECT d.id, d.relative_path, d.absolute_path, substring(d.content, 1, 320) AS preview_text FROM documents d WHERE d.corpus_id = ? AND d.is_deleted = FALSE """ params: list[Any] = [corpus_id] for flt in filters: field = str(flt["field"]) operator = str(flt["operator"]) value = flt["value"] clause, clause_params = self._metadata_clause( field=field, operator=operator, value=value, ) sql += f"\n AND {clause}" params.extend(clause_params) sql += "\nORDER BY d.relative_path ASC\nLIMIT ?" 
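        # At this point `sql` carries one AND-ed predicate per filter, built by
        # _metadata_clause() (eq/ne, gt/gte/lt/lte, contains, in), plus a
        # deterministic ORDER BY relative_path / LIMIT suffix; `params` holds the
        # bind values in the same order the placeholders appear.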
params.append(limit) rows = self._conn.execute(sql, params).fetchall() metadata_score = len(filters) results: list[dict[str, Any]] = [] for row in rows: results.append( { "doc_id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "preview_text": str(row[3]), "metadata_score": metadata_score, } ) return results def get_document(self, *, doc_id: str) -> dict[str, Any] | None: row = self._conn.execute( """ SELECT id, corpus_id, relative_path, absolute_path, content, metadata_json, is_deleted FROM documents WHERE id = ? LIMIT 1 """, [doc_id], ).fetchone() if row is None: return None return { "id": str(row[0]), "corpus_id": str(row[1]), "relative_path": str(row[2]), "absolute_path": str(row[3]), "content": str(row[4]), "metadata_json": str(row[5]), "is_deleted": bool(row[6]), } def save_schema( self, *, corpus_id: str, name: str, schema_def: dict[str, Any], is_active: bool = True, ) -> str: schema_id = _stable_id("schema", f"{corpus_id}:{name}") if is_active: self._conn.execute( "UPDATE schemas SET is_active = FALSE WHERE corpus_id = ?", [corpus_id], ) self._conn.execute( """ INSERT INTO schemas (id, corpus_id, name, schema_def, is_active) VALUES (?, ?, ?, ?, ?) ON CONFLICT(corpus_id, name) DO UPDATE SET schema_def = excluded.schema_def, is_active = excluded.is_active """, [ schema_id, corpus_id, name, json.dumps(schema_def, sort_keys=True), is_active, ], ) return schema_id def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]: rows = self._conn.execute( """ SELECT id, corpus_id, name, schema_def, is_active, created_at FROM schemas WHERE corpus_id = ? ORDER BY created_at DESC, name ASC """, [corpus_id], ).fetchall() return [self._row_to_schema_record(row) for row in rows] def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRecord | None: row = self._conn.execute( """ SELECT id, corpus_id, name, schema_def, is_active, created_at FROM schemas WHERE corpus_id = ? AND name = ? LIMIT 1 """, [corpus_id, name], ).fetchone() if row is None: return None return self._row_to_schema_record(row) def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None: row = self._conn.execute( """ SELECT id, corpus_id, name, schema_def, is_active, created_at FROM schemas WHERE corpus_id = ? AND is_active = TRUE ORDER BY created_at DESC LIMIT 1 """, [corpus_id], ).fetchone() if row is None: return None return self._row_to_schema_record(row) @staticmethod def make_document_id(corpus_id: str, relative_path: str) -> str: return _stable_id("doc", f"{corpus_id}:{relative_path}") @staticmethod def make_chunk_id( doc_id: str, position: int, start_char: int, end_char: int ) -> str: return _stable_id("chunk", f"{doc_id}:{position}:{start_char}:{end_char}") @staticmethod def _row_to_schema_record(row: tuple[Any, ...]) -> SchemaRecord: return SchemaRecord( id=str(row[0]), corpus_id=str(row[1]), name=str(row[2]), schema_def=json.loads(str(row[3])), is_active=bool(row[4]), created_at=str(row[5]), ) def store_chunk_embeddings( self, *, corpus_id: str, chunk_embeddings: list[tuple[str, list[float]]], ) -> int: """Bulk-store (chunk_id, embedding) pairs. Return count written.""" if not chunk_embeddings: return 0 self._conn.executemany( """ INSERT INTO chunk_embeddings (chunk_id, corpus_id, embedding) VALUES (?, ?, ?) 
ON CONFLICT(chunk_id) DO UPDATE SET corpus_id = excluded.corpus_id, embedding = excluded.embedding """, [(cid, corpus_id, emb) for cid, emb in chunk_embeddings], ) return len(chunk_embeddings) def search_chunks_semantic( self, *, corpus_id: str, query_embedding: list[float], limit: int = 5, ) -> list[dict[str, Any]]: """Search chunks by cosine similarity against a query embedding.""" sql = """ SELECT d.id AS doc_id, d.relative_path, d.absolute_path, c.position, c.text, array_cosine_similarity(ce.embedding, ?::FLOAT[{dim}]) AS score FROM chunk_embeddings ce JOIN chunks c ON c.id = ce.chunk_id JOIN documents d ON d.id = c.doc_id WHERE ce.corpus_id = ? AND d.is_deleted = FALSE ORDER BY score DESC LIMIT ? """.format(dim=self.embedding_dim) rows = self._conn.execute(sql, [query_embedding, corpus_id, limit]).fetchall() results: list[dict[str, Any]] = [] for row in rows: results.append( { "doc_id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "position": int(row[3]), "text": str(row[4]), "score": float(row[5]), } ) return results def get_metadata_field_values( self, *, corpus_id: str, field_names: list[str], max_distinct: int = 10, ) -> dict[str, list[str]]: """Return up to *max_distinct* distinct non-empty values per metadata field.""" result: dict[str, list[str]] = {} for field in field_names: rows = self._conn.execute( """ SELECT DISTINCT json_extract_string(d.metadata_json, ?) AS val FROM documents d WHERE d.corpus_id = ? AND d.is_deleted = FALSE AND val IS NOT NULL AND val != '' LIMIT ? """, [f"$.{field}", corpus_id, max_distinct], ).fetchall() result[field] = [str(row[0]) for row in rows] return result def has_embeddings(self, *, corpus_id: str) -> bool: """Return True if the corpus has stored embeddings.""" row = self._conn.execute( "SELECT COUNT(*) FROM chunk_embeddings WHERE corpus_id = ?", [corpus_id], ).fetchone() return bool(row and int(row[0]) > 0) def create_hnsw_index(self, *, corpus_id: str) -> bool: """Create an HNSW index on chunk embeddings if vss is available. Returns True if the index was created, False otherwise. """ if not self._vss_available: return False try: index_name = f"hnsw_{corpus_id.replace('-', '_')}" self._conn.execute( f""" CREATE INDEX IF NOT EXISTS {index_name} ON chunk_embeddings USING HNSW (embedding) WITH (metric = 'cosine') """ ) return True except Exception: return False @staticmethod def _metadata_clause( *, field: str, operator: str, value: Any, ) -> tuple[str, list[Any]]: json_expr = "json_extract_string(d.metadata_json, ?)" json_path = f"$.{field}" if operator in {"eq", "ne"}: comparator = "=" if operator == "eq" else "<>" if isinstance(value, bool): return ( f"lower(coalesce({json_expr}, '')) {comparator} ?", [json_path, "true" if value else "false"], ) if isinstance(value, (int, float)): return ( f"try_cast({json_expr} AS DOUBLE) {comparator} ?", [json_path, float(value)], ) return ( f"lower(coalesce({json_expr}, '')) {comparator} lower(?)", [json_path, str(value)], ) if operator in {"gt", "gte", "lt", "lte"}: if not isinstance(value, (int, float)): raise ValueError( f"Metadata operator {operator!r} requires numeric value for field {field!r}." ) comparator_map = { "gt": ">", "gte": ">=", "lt": "<", "lte": "<=", } comparator = comparator_map[operator] return ( f"try_cast({json_expr} AS DOUBLE) {comparator} ?", [json_path, float(value)], ) if operator == "contains": return ( f"lower(coalesce({json_expr}, '')) LIKE '%' || lower(?) 
|| '%'", [json_path, str(value)], ) if operator == "in": if not isinstance(value, list) or not value: raise ValueError( f"Metadata `in` filter for field {field!r} has no values." ) if all(isinstance(item, bool) for item in value): placeholders = ", ".join(["?"] * len(value)) return ( f"lower(coalesce({json_expr}, '')) IN ({placeholders})", [ json_path, *["true" if bool(item) else "false" for item in value], ], ) if all( isinstance(item, (int, float)) and not isinstance(item, bool) for item in value ): placeholders = ", ".join(["?"] * len(value)) return ( f"try_cast({json_expr} AS DOUBLE) IN ({placeholders})", [json_path, *[float(item) for item in value]], ) placeholders = ", ".join(["?"] * len(value)) return ( f"lower(coalesce({json_expr}, '')) IN ({placeholders})", [json_path, *[str(item).lower() for item in value]], ) raise ValueError(f"Unsupported metadata operator: {operator!r}") ================================================ FILE: src/fs_explorer/ui.html ================================================ fs-explorer

[ui.html markup stripped during extraction. Recoverable text: header "fs-explorer" with version "v0.1.0" and status "Ready"; panels for Target Folder (default "."), Query, and Retrieval; an Execution Log showing "Awaiting query..." / "Enter a question to begin document exploration"; a Response panel showing "No results yet" / "Results with citations will appear here"; footer "Powered by Gemini 3 Flash · Documents parsed with Docling".]
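The stripped page above drives the `/ws/explore` WebSocket endpoint defined in `server.py`. A minimal headless client speaking the same protocol might look like the sketch below; it assumes the third-party `websockets` package and a server running on the default 127.0.0.1:8000, and is illustrative rather than part of the repository.

```python
import asyncio
import json

import websockets  # assumed third-party client library, not a repo dependency


async def explore(task: str, folder: str = ".") -> dict:
    """Submit one task to /ws/explore and return the final complete/error event."""
    async with websockets.connect("ws://127.0.0.1:8000/ws/explore") as ws:
        await ws.send(json.dumps({"task": task, "folder": folder, "use_index": False}))
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "ask_human":
                # The server blocks until it receives a human_response message.
                answer = input(event["data"]["question"] + " ")
                await ws.send(json.dumps({"type": "human_response", "response": answer}))
            elif event["type"] in {"complete", "error"}:
                return event
            # start / tool_call / go_deeper events are progress updates only.


if __name__ == "__main__":
    final = asyncio.run(explore("What is the purchase price?", folder="data/test_acquisition"))
    print(json.dumps(final, indent=2))
```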
================================================ FILE: src/fs_explorer/workflow.py ================================================ """ Workflow orchestration for the FsExplorer agent. This module defines the event-driven workflow that coordinates the agent's exploration of the filesystem, handling tool calls, directory navigation, and human interaction. """ import contextvars import os from workflows import Workflow, Context, step from workflows.events import ( StartEvent, StopEvent, Event, InputRequiredEvent, HumanResponseEvent, ) from workflows.resource import Resource from pydantic import BaseModel from typing import Annotated, cast, Any from .agent import FsExplorerAgent from .models import GoDeeperAction, ToolCallAction, StopAction, AskHumanAction, Action from .fs import describe_dir_content # Per-asyncio-task agent storage — each WebSocket connection gets its own. _AGENT_VAR: contextvars.ContextVar[FsExplorerAgent | None] = contextvars.ContextVar( "_AGENT_VAR", default=None ) def get_agent() -> FsExplorerAgent: """Get or create the agent instance for the current context.""" agent = _AGENT_VAR.get() if agent is None: agent = FsExplorerAgent() _AGENT_VAR.set(agent) return agent def reset_agent() -> None: """Reset the agent instance for the current context.""" _AGENT_VAR.set(None) class WorkflowState(BaseModel): """State maintained throughout the workflow execution.""" initial_task: str = "" root_directory: str = "." current_directory: str = "." use_index: bool = False enable_semantic: bool = False enable_metadata: bool = False class InputEvent(StartEvent): """Initial event containing the user's task.""" task: str folder: str = "." use_index: bool = False enable_semantic: bool = False enable_metadata: bool = False class GoDeeperEvent(Event): """Event triggered when navigating into a subdirectory.""" directory: str reason: str class ToolCallEvent(Event): """Event triggered when executing a tool.""" tool_name: str tool_input: dict[str, Any] reason: str class AskHumanEvent(InputRequiredEvent): """Event triggered when human input is required.""" question: str reason: str class HumanAnswerEvent(HumanResponseEvent): """Event containing the human's response.""" response: str class ExplorationEndEvent(StopEvent): """Event signaling the end of exploration.""" final_result: str | None = None error: str | None = None # Type alias for the union of possible workflow events WorkflowEvent = ExplorationEndEvent | GoDeeperEvent | ToolCallEvent | AskHumanEvent def _handle_action_result( action: Action, action_type: str, ctx: Context[WorkflowState], ) -> WorkflowEvent: """ Convert an action result into the appropriate workflow event. This helper extracts the common logic for handling agent action results, reducing code duplication across workflow steps. 
Args: action: The action returned by the agent action_type: The type of action ("godeeper", "toolcall", "askhuman", "stop") ctx: The workflow context for state updates and event streaming Returns: The appropriate workflow event based on the action type """ if action_type == "godeeper": godeeper = cast(GoDeeperAction, action.action) event = GoDeeperEvent(directory=godeeper.directory, reason=action.reason) ctx.write_event_to_stream(event) return event elif action_type == "toolcall": toolcall = cast(ToolCallAction, action.action) event = ToolCallEvent( tool_name=toolcall.tool_name, tool_input=toolcall.to_fn_args(), reason=action.reason, ) ctx.write_event_to_stream(event) return event elif action_type == "askhuman": askhuman = cast(AskHumanAction, action.action) # InputRequiredEvent is written to the stream by default return AskHumanEvent(question=askhuman.question, reason=action.reason) else: # stop stopaction = cast(StopAction, action.action) return ExplorationEndEvent(final_result=stopaction.final_result) async def _process_agent_action( agent: FsExplorerAgent, ctx: Context[WorkflowState], update_directory: bool = False, ) -> WorkflowEvent: """ Process the agent's next action and return the appropriate event. Args: agent: The agent instance ctx: The workflow context update_directory: Whether to update the current directory on godeeper action Returns: The appropriate workflow event """ result = await agent.take_action() if result is None: return ExplorationEndEvent(error="Could not produce action to take") action, action_type = result # Update directory state if needed for godeeper actions if update_directory and action_type == "godeeper": godeeper = cast(GoDeeperAction, action.action) async with ctx.store.edit_state() as state: state.current_directory = godeeper.directory return _handle_action_result(action, action_type, ctx) class FsExplorerWorkflow(Workflow): """ Event-driven workflow for filesystem exploration. Coordinates the agent's actions through a series of steps: - start_exploration: Initial task processing - go_deeper_action: Directory navigation - tool_call_action: Tool execution - receive_human_answer: Human interaction handling """ @step async def start_exploration( self, ev: InputEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Initialize exploration with the user's task.""" root_directory = os.path.abspath(ev.folder) if not os.path.exists(root_directory) or not os.path.isdir(root_directory): return ExplorationEndEvent(error=f"No such directory: {root_directory}") async with ctx.store.edit_state() as state: state.initial_task = ev.task state.root_directory = root_directory state.current_directory = root_directory state.use_index = ev.use_index state.enable_semantic = ev.enable_semantic state.enable_metadata = ev.enable_metadata dirdescription = describe_dir_content(root_directory) if ev.enable_semantic and ev.enable_metadata: index_hint = ( "An index is available. Start with `semantic_search` (with optional " "filters) for fast retrieval, then use filesystem tools for deep dives." ) elif ev.enable_semantic: index_hint = ( "An index is available. Use `semantic_search` (no filters) for " "similarity search, then use filesystem tools for details." ) elif ev.enable_metadata: index_hint = ( "An index is available. Use `semantic_search` with metadata " "filters, then use filesystem tools for details." ) else: index_hint = "Prefer absolute paths from the directory listing when calling tools." 
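        # Seed the agent's first turn with the root-directory listing, the user's
        # task, and the retrieval hint chosen above (index-aware vs. filesystem-only).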
agent.configure_task( f"Given that the current directory ('{root_directory}') looks like this:\n\n" f"```text\n{dirdescription}\n```\n\n" f"And that the user is giving you this task: '{ev.task}', " f"what action should you take first? {index_hint}" ) return await _process_agent_action(agent, ctx, update_directory=True) @step async def go_deeper_action( self, ev: GoDeeperEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Handle navigation into a subdirectory.""" state = await ctx.store.get_state() dirdescription = describe_dir_content(state.current_directory) agent.configure_task( f"Given that the current directory ('{state.current_directory}') " f"looks like this:\n\n```text\n{dirdescription}\n```\n\n" f"And that the user is giving you this task: '{state.initial_task}', " f"what action should you take next?" ) return await _process_agent_action(agent, ctx, update_directory=True) @step async def receive_human_answer( self, ev: HumanAnswerEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Process the human's response to a question.""" state = await ctx.store.get_state() agent.configure_task( f"Human response to your question: {ev.response}\n\n" f"Based on it, proceed with your exploration based on the " f"original task: {state.initial_task}" ) return await _process_agent_action(agent, ctx, update_directory=True) @step async def tool_call_action( self, ev: ToolCallEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Process the result of a tool call.""" agent.configure_task( "Given the result from the tool call you just performed, " "what action should you take next?" ) return await _process_agent_action(agent, ctx, update_directory=True) # Workflow timeout for complex multi-document analysis (5 minutes) WORKFLOW_TIMEOUT_SECONDS = 300 workflow = FsExplorerWorkflow(timeout=WORKFLOW_TIMEOUT_SECONDS) ================================================ FILE: tests/__init__.py ================================================ ================================================ FILE: tests/conftest.py ================================================ """ Pytest fixtures and mocks for FsExplorer tests. Provides mock implementations of the Google GenAI client for unit testing without making actual API calls. """ from google.genai.types import ( HttpOptions, Content, GenerateContentResponse, Candidate, Part, GenerateContentResponseUsageMetadata, ) from fs_explorer.models import StopAction, Action class MockModels: """Mock implementation of the GenAI models interface.""" async def generate_content(self, *args, **kwargs) -> GenerateContentResponse: """Return a mock response with a stop action.""" return GenerateContentResponse( candidates=[ Candidate( content=Content( role="model", parts=[ Part.from_text( text=Action( action=StopAction( final_result="this is a final result" ), reason="I am done", ).model_dump_json() ) ], ) ) ], usage_metadata=GenerateContentResponseUsageMetadata( prompt_token_count=100, candidates_token_count=50, total_token_count=150, ), ) class MockAio: """Mock implementation of the async GenAI interface.""" @property def models(self) -> MockModels: """Return mock models interface.""" return MockModels() class MockGenAIClient: """ Mock implementation of the Google GenAI client. Provides predictable responses for testing without API calls. 
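    Every generate_content call returns a single StopAction ("this is a final
    result"), so mocked workflow runs terminate after one step.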
""" def __init__(self, api_key: str, http_options: HttpOptions) -> None: """Initialize mock client (ignores parameters).""" pass @property def aio(self) -> MockAio: """Return mock async interface.""" return MockAio() ================================================ FILE: tests/test_agent.py ================================================ """Tests for the FsExplorerAgent class.""" import pytest import os from unittest.mock import patch from google.genai import Client as GenAIClient from google.genai.types import HttpOptions from fs_explorer.agent import ( FsExplorerAgent, SYSTEM_PROMPT, TokenUsage, _build_system_prompt, set_search_flags, get_search_flags, clear_index_context, ) from fs_explorer.models import Action, StopAction from .conftest import MockGenAIClient class TestAgentInitialization: """Tests for agent initialization.""" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_agent_init_with_env_key(self) -> None: """Test agent initialization with API key from environment.""" agent = FsExplorerAgent() assert isinstance(agent._client, GenAIClient) assert len(agent._chat_history) == 0 # No system prompt in history assert isinstance(agent.token_usage, TokenUsage) def test_agent_init_with_explicit_key(self) -> None: """Test agent initialization with explicit API key.""" agent = FsExplorerAgent(api_key="explicit-test-key") assert isinstance(agent._client, GenAIClient) def test_agent_init_without_key_raises(self) -> None: """Test that initialization without API key raises ValueError.""" # Ensure no key in environment env = os.environ.copy() if "GOOGLE_API_KEY" in env: del env["GOOGLE_API_KEY"] with patch.dict(os.environ, env, clear=True): with pytest.raises(ValueError, match="GOOGLE_API_KEY not found"): FsExplorerAgent() class TestAgentConfiguration: """Tests for agent task configuration.""" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_configure_task_adds_to_history(self) -> None: """Test that configure_task adds message to chat history.""" agent = FsExplorerAgent() agent.configure_task("this is a task") assert len(agent._chat_history) == 1 assert agent._chat_history[0].role == "user" assert agent._chat_history[0].parts[0].text == "this is a task" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_multiple_configure_task_calls(self) -> None: """Test that multiple configure_task calls accumulate.""" agent = FsExplorerAgent() agent.configure_task("task 1") agent.configure_task("task 2") assert len(agent._chat_history) == 2 assert agent._chat_history[0].parts[0].text == "task 1" assert agent._chat_history[1].parts[0].text == "task 2" class TestAgentActions: """Tests for agent action handling.""" @pytest.mark.asyncio @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) async def test_take_action_returns_action(self) -> None: """Test that take_action returns an action from the model.""" agent = FsExplorerAgent() agent.configure_task("this is a task") agent._client = MockGenAIClient( api_key="test", http_options=HttpOptions(api_version="v1beta") ) result = await agent.take_action() assert result is not None action, action_type = result assert isinstance(action, Action) assert isinstance(action.action, StopAction) assert action.action.final_result == "this is a final result" assert action.reason == "I am done" assert action_type == "stop" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_reset_clears_history(self) -> None: """Test that reset clears chat history and token usage.""" agent = FsExplorerAgent() 
agent.configure_task("task 1") agent.token_usage.api_calls = 5 agent.reset() assert len(agent._chat_history) == 0 assert agent.token_usage.api_calls == 0 class TestTokenUsage: """Tests for TokenUsage tracking.""" def test_add_api_call(self) -> None: """Test adding API call metrics.""" usage = TokenUsage() usage.add_api_call(100, 50) assert usage.prompt_tokens == 100 assert usage.completion_tokens == 50 assert usage.total_tokens == 150 assert usage.api_calls == 1 def test_add_tool_result_parse_file(self) -> None: """Test tracking parse_file tool usage.""" usage = TokenUsage() usage.add_tool_result("document content here", "parse_file") assert usage.documents_parsed == 1 assert usage.tool_result_chars == len("document content here") def test_add_tool_result_scan_folder(self) -> None: """Test tracking scan_folder tool usage.""" usage = TokenUsage() # Simulating scan output with document markers result = "│ [1/3] doc1.pdf\n│ [2/3] doc2.pdf\n│ [3/3] doc3.pdf" usage.add_tool_result(result, "scan_folder") assert usage.documents_scanned == 3 def test_summary_format(self) -> None: """Test that summary produces formatted output.""" usage = TokenUsage() usage.add_api_call(1000, 500) summary = usage.summary() assert "TOKEN USAGE SUMMARY" in summary assert "1,000" in summary # Formatted prompt tokens assert "API Calls:" in summary assert "Est. Cost" in summary class TestSystemPrompt: """Tests for system prompt configuration.""" def test_system_prompt_contains_tools(self) -> None: """Test that system prompt documents all tools.""" assert "scan_folder" in SYSTEM_PROMPT assert "preview_file" in SYSTEM_PROMPT assert "parse_file" in SYSTEM_PROMPT assert "read" in SYSTEM_PROMPT assert "grep" in SYSTEM_PROMPT assert "glob" in SYSTEM_PROMPT def test_system_prompt_contains_strategy(self) -> None: """Test that system prompt includes exploration strategy.""" assert "Three-Phase" in SYSTEM_PROMPT or "PHASE" in SYSTEM_PROMPT assert "Parallel Scan" in SYSTEM_PROMPT or "PARALLEL" in SYSTEM_PROMPT assert "Backtracking" in SYSTEM_PROMPT or "BACKTRACK" in SYSTEM_PROMPT def test_system_prompt_contains_index_tools(self) -> None: """Test that system prompt documents index-aware tools.""" assert "semantic_search" in SYSTEM_PROMPT assert "get_document" in SYSTEM_PROMPT assert "list_indexed_documents" in SYSTEM_PROMPT class TestSearchFlags: """Tests for search flag state and dynamic system prompt.""" def setup_method(self) -> None: clear_index_context() def teardown_method(self) -> None: clear_index_context() def test_set_and_get_search_flags(self) -> None: assert get_search_flags() == (False, False) set_search_flags(enable_semantic=True, enable_metadata=False) assert get_search_flags() == (True, False) set_search_flags(enable_semantic=False, enable_metadata=False) assert get_search_flags() == (False, False) def test_clear_index_context_resets_flags(self) -> None: set_search_flags(enable_semantic=True, enable_metadata=True) clear_index_context() assert get_search_flags() == (False, False) def test_build_system_prompt_no_index(self) -> None: prompt = _build_system_prompt(False, False) assert prompt == SYSTEM_PROMPT def test_build_system_prompt_semantic_only(self) -> None: prompt = _build_system_prompt(True, False) assert "Semantic Only" in prompt assert "WITHOUT the `filters`" in prompt def test_build_system_prompt_metadata_only(self) -> None: prompt = _build_system_prompt(False, True) assert "Metadata Only" in prompt assert "metadata filtering" in prompt def test_build_system_prompt_both(self) -> None: prompt = 
_build_system_prompt(True, True) assert "Semantic + Metadata" in prompt @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_all_tools_always_available(self) -> None: """Filesystem and indexed tools are never blocked.""" set_search_flags(enable_semantic=False, enable_metadata=False) agent = FsExplorerAgent() agent.configure_task("test") agent.call_tool("glob", {"directory": "/tmp", "pattern": "*.md"}) last = agent._chat_history[-1] assert "not available" not in last.parts[0].text ================================================ FILE: tests/test_cli_indexing.py ================================================ """CLI tests for indexing and schema commands.""" from pathlib import Path import fs_explorer.indexing.pipeline as pipeline_module import fs_explorer.main as main_module from fs_explorer.storage import DuckDBStorage from typer.testing import CliRunner def test_root_task_mode_remains_compatible(tmp_path: Path, monkeypatch) -> None: called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["task"] = task called["folder"] = folder called["use_index"] = use_index called["db_path"] = db_path monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) runner = CliRunner() result = runner.invoke( main_module.app, ["--task", "who is the CTO?", "--folder", str(tmp_path)], ) assert result.exit_code == 0 assert called["task"] == "who is the CTO?" assert called["folder"] == str(tmp_path) assert called["use_index"] is False def test_query_command_enables_index_mode(tmp_path: Path, monkeypatch) -> None: called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["task"] = task called["folder"] = folder called["use_index"] = use_index called["db_path"] = db_path monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) runner = CliRunner() result = runner.invoke( main_module.app, [ "query", "--task", "purchase price?", "--folder", str(tmp_path), "--db-path", "tmp.duckdb", ], ) assert result.exit_code == 0 assert called["task"] == "purchase price?" assert called["folder"] == str(tmp_path) assert called["use_index"] is True assert called["db_path"] == "tmp.duckdb" def test_index_and_schema_commands(tmp_path: Path, monkeypatch) -> None: corpus = tmp_path / "corpus" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $10.") (corpus / "risk_report.md").write_text("Risk summary here.") # Replace Docling path with plain text read for this unit test. 
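    # parse_file is resolved through the pipeline module's namespace at call time,
    # so patching pipeline_module.parse_file is enough to bypass Docling here.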
monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" runner = CliRunner() index_result = runner.invoke( main_module.app, ["index", str(corpus), "--db-path", str(db_path), "--discover-schema"], ) assert index_result.exit_code == 0 assert "Index Complete" in index_result.stdout show_result = runner.invoke( main_module.app, ["schema", "show", str(corpus), "--db-path", str(db_path)], ) assert show_result.exit_code == 0 assert "auto_corpus" in show_result.stdout def test_index_command_with_metadata_forces_schema_discovery( tmp_path: Path, monkeypatch, ) -> None: called: dict[str, object] = {} class FakePipeline: def __init__(self, storage, embedding_provider=None) -> None: # noqa: ANN001 called["storage_type"] = type(storage).__name__ def index_folder( self, folder: str, *, discover_schema: bool = False, schema_name: str | None = None, with_metadata: bool = False, metadata_profile: dict | None = None, ): called["folder"] = folder called["discover_schema"] = discover_schema called["schema_name"] = schema_name called["with_metadata"] = with_metadata called["metadata_profile"] = metadata_profile return pipeline_module.IndexingResult( corpus_id="corpus_123", indexed_files=1, skipped_files=0, deleted_files=0, chunks_written=1, active_documents=1, schema_used="auto_corpus", ) monkeypatch.setattr(main_module, "IndexingPipeline", FakePipeline) db_path = tmp_path / "index.duckdb" corpus = tmp_path / "corpus" corpus.mkdir() runner = CliRunner() result = runner.invoke( main_module.app, ["index", str(corpus), "--db-path", str(db_path), "--with-metadata"], ) assert result.exit_code == 0 assert called["with_metadata"] is True assert called["discover_schema"] is True assert called["metadata_profile"] is None def test_index_command_with_metadata_profile_path( tmp_path: Path, monkeypatch, ) -> None: called: dict[str, object] = {} class FakePipeline: def __init__(self, storage, embedding_provider=None) -> None: # noqa: ANN001 called["storage_type"] = type(storage).__name__ def index_folder( self, folder: str, *, discover_schema: bool = False, schema_name: str | None = None, with_metadata: bool = False, metadata_profile: dict | None = None, ): called["folder"] = folder called["discover_schema"] = discover_schema called["schema_name"] = schema_name called["with_metadata"] = with_metadata called["metadata_profile"] = metadata_profile return pipeline_module.IndexingResult( corpus_id="corpus_123", indexed_files=1, skipped_files=0, deleted_files=0, chunks_written=1, active_documents=1, schema_used="auto_corpus", ) monkeypatch.setattr(main_module, "IndexingPipeline", FakePipeline) db_path = tmp_path / "index.duckdb" corpus = tmp_path / "corpus" corpus.mkdir() metadata_profile_path = tmp_path / "profile.json" metadata_profile_path.write_text( ( "{" '"prompt_description": "Extract organizations.", ' '"fields": [' '{"name": "org_names", "type": "string", "source_class": "organization"}' "]" "}" ) ) runner = CliRunner() result = runner.invoke( main_module.app, [ "index", str(corpus), "--db-path", str(db_path), "--metadata-profile", str(metadata_profile_path), ], ) assert result.exit_code == 0 assert called["with_metadata"] is True assert called["discover_schema"] is True assert isinstance(called["metadata_profile"], dict) assert called["metadata_profile"]["fields"][0]["name"] == "org_names" def test_index_command_with_embeddings_flag( tmp_path: Path, monkeypatch, ) -> None: """--with-embeddings creates an EmbeddingProvider and passes it to 
the pipeline.""" calls: dict[str, object] = {} class FakePipeline: def __init__(self, storage, embedding_provider=None) -> None: # noqa: ANN001 calls["has_embedding_provider"] = embedding_provider is not None def index_folder(self, folder, **kwargs): # noqa: ANN001, ANN003 return pipeline_module.IndexingResult( corpus_id="corpus_123", indexed_files=1, skipped_files=0, deleted_files=0, chunks_written=1, active_documents=1, schema_used=None, embeddings_written=5, ) class FakeEmbeddingProvider: def __init__(self, **kwargs): # noqa: ANN003 pass monkeypatch.setattr(main_module, "IndexingPipeline", FakePipeline) monkeypatch.setattr(main_module, "EmbeddingProvider", FakeEmbeddingProvider) db_path = tmp_path / "index.duckdb" corpus = tmp_path / "corpus" corpus.mkdir() runner = CliRunner() result = runner.invoke( main_module.app, ["index", str(corpus), "--db-path", str(db_path), "--with-embeddings"], ) assert result.exit_code == 0 assert calls["has_embedding_provider"] is True assert "Embeddings Written" in result.stdout def test_auto_index_env_var_enables_use_index( tmp_path: Path, monkeypatch, ) -> None: """FS_EXPLORER_AUTO_INDEX=1 auto-enables --use-index when index exists.""" called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["use_index"] = use_index monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) monkeypatch.setenv("FS_EXPLORER_AUTO_INDEX", "1") # Create a real DuckDB with a corpus entry so auto-index detection works. corpus = tmp_path / "corpus" corpus.mkdir() db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) storage.get_or_create_corpus(str(corpus.resolve())) storage.close() monkeypatch.setenv("FS_EXPLORER_DB_PATH", str(db_path)) runner = CliRunner() result = runner.invoke( main_module.app, ["--task", "test question", "--folder", str(corpus)], ) assert result.exit_code == 0 assert called["use_index"] is True def test_auto_index_env_var_silent_fallback( tmp_path: Path, monkeypatch, ) -> None: """FS_EXPLORER_AUTO_INDEX=1 gracefully falls back when no index exists.""" called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["use_index"] = use_index monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) monkeypatch.setenv("FS_EXPLORER_AUTO_INDEX", "1") corpus = tmp_path / "empty_corpus" corpus.mkdir() runner = CliRunner() result = runner.invoke( main_module.app, ["--task", "test question", "--folder", str(corpus)], ) assert result.exit_code == 0 assert called["use_index"] is False ================================================ FILE: tests/test_e2e.py ================================================ import pytest import os from workflows.testing import WorkflowTestRunner SKIP_IF, SKIP_REASON = ( os.getenv("GOOGLE_API_KEY") is None, "GOOGLE_API_KEY not available", ) @pytest.mark.asyncio @pytest.mark.skipif(condition=SKIP_IF, reason=SKIP_REASON) async def test_e2e() -> None: from fs_explorer.workflow import ( workflow, InputEvent, ExplorationEndEvent, ToolCallEvent, GoDeeperEvent, ) start_event = InputEvent( task="Starting from the current directory, individuate the python file responsible for file system operations and explain what it does" ) runner = WorkflowTestRunner(workflow=workflow) result = await runner.run(start_event=start_event) assert isinstance(result.result, ExplorationEndEvent) assert result.result.error is 
None assert result.result.final_result is not None assert len(result.collected) > 1 assert ToolCallEvent in result.event_types or GoDeeperEvent in result.event_types ================================================ FILE: tests/test_embeddings.py ================================================ """Tests for the embedding provider.""" from __future__ import annotations import os from dataclasses import dataclass from typing import Any import pytest from fs_explorer.embeddings import EmbeddingProvider # --------------------------------------------------------------------------- # Mock helpers # --------------------------------------------------------------------------- @dataclass class _FakeEmbedding: values: list[float] @dataclass class _FakeEmbedResult: embeddings: list[_FakeEmbedding] class _FakeModels: """Records calls and returns deterministic embeddings.""" def __init__(self) -> None: self.calls: list[dict[str, Any]] = [] def embed_content( self, *, model: str, contents: list[str], config: dict ) -> _FakeEmbedResult: self.calls.append({"model": model, "contents": contents, "config": config}) dim = config.get("output_dimensionality", 768) return _FakeEmbedResult( embeddings=[ _FakeEmbedding(values=[float(i)] * dim) for i in range(len(contents)) ] ) class _FakeClient: def __init__(self) -> None: self.models = _FakeModels() # --------------------------------------------------------------------------- # Unit tests (mock-based, no API key needed) # --------------------------------------------------------------------------- def test_embed_texts_returns_correct_count() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4, batch_size=50) embeddings = provider.embed_texts(["hello", "world"]) assert len(embeddings) == 2 assert len(embeddings[0]) == 4 def test_embed_texts_uses_document_task_type() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4) provider.embed_texts(["test"]) call = client.models.calls[0] assert call["config"]["task_type"] == "RETRIEVAL_DOCUMENT" def test_embed_query_uses_query_task_type() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4) result = provider.embed_query("search query") assert len(result) == 4 call = client.models.calls[0] assert call["config"]["task_type"] == "RETRIEVAL_QUERY" def test_embed_texts_batching() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4, batch_size=3) texts = [f"text_{i}" for i in range(7)] embeddings = provider.embed_texts(texts) assert len(embeddings) == 7 # 7 texts with batch_size=3 → 3 API calls (3+3+1) assert len(client.models.calls) == 3 assert len(client.models.calls[0]["contents"]) == 3 assert len(client.models.calls[1]["contents"]) == 3 assert len(client.models.calls[2]["contents"]) == 1 def test_env_overrides(monkeypatch) -> None: client = _FakeClient() monkeypatch.setenv("FS_EXPLORER_EMBEDDING_MODEL", "custom-model-001") monkeypatch.setenv("FS_EXPLORER_EMBEDDING_DIM", "256") monkeypatch.setenv("FS_EXPLORER_EMBEDDING_BATCH_SIZE", "10") provider = EmbeddingProvider(client=client) assert provider.model == "custom-model-001" assert provider.dim == 256 assert provider.batch_size == 10 provider.embed_texts(["test"]) call = client.models.calls[0] assert call["model"] == "custom-model-001" assert call["config"]["output_dimensionality"] == 256 def test_missing_api_key_raises(monkeypatch) -> None: monkeypatch.delenv("GOOGLE_API_KEY", raising=False) with pytest.raises(ValueError, match="GOOGLE_API_KEY"): 
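# With neither an injected client nor GOOGLE_API_KEY in the environment, provider
# construction is expected to fail with a ValueError naming the missing key.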
EmbeddingProvider(api_key=None, client=None) # --------------------------------------------------------------------------- # Real API integration test (skipped unless GOOGLE_API_KEY is set) # --------------------------------------------------------------------------- @pytest.mark.skipif( not os.getenv("GOOGLE_API_KEY"), reason="GOOGLE_API_KEY not set — skipping real embedding test", ) def test_real_embedding_api() -> None: provider = EmbeddingProvider(dim=128) texts = ["The purchase price is $45 million.", "Risk assessment summary."] embeddings = provider.embed_texts(texts) assert len(embeddings) == 2 assert len(embeddings[0]) == 128 assert all(isinstance(v, float) for v in embeddings[0]) query_emb = provider.embed_query("purchase price") assert len(query_emb) == 128 ================================================ FILE: tests/test_exploration_trace.py ================================================ """Tests for exploration trace helpers.""" import os from fs_explorer.exploration_trace import ( ExplorationTrace, extract_cited_sources, normalize_path, ) def test_normalize_path_relative() -> None: root = "/tmp/project" assert normalize_path("docs/file.pdf", root) == os.path.abspath("/tmp/project/docs/file.pdf") def test_normalize_path_absolute() -> None: root = "/tmp/project" assert normalize_path("/var/data/file.pdf", root) == os.path.abspath("/var/data/file.pdf") def test_trace_records_steps_and_documents() -> None: trace = ExplorationTrace(root_directory="/tmp/project") trace.record_tool_call( step_number=1, tool_name="scan_folder", tool_input={"directory": "docs"}, ) trace.record_tool_call( step_number=2, tool_name="parse_file", tool_input={"file_path": "docs/contract.pdf"}, ) trace.record_go_deeper(step_number=3, directory="docs/subdir") assert len(trace.step_path) == 3 assert "tool:scan_folder" in trace.step_path[0] assert "tool:parse_file" in trace.step_path[1] assert "godeeper" in trace.step_path[2] referenced = trace.sorted_documents() assert len(referenced) == 1 assert referenced[0].endswith("docs/contract.pdf") def test_trace_records_resolved_document_paths() -> None: trace = ExplorationTrace(root_directory="/tmp/project") trace.record_tool_call( step_number=1, tool_name="get_document", tool_input={"doc_id": "doc_123"}, resolved_document_path="/tmp/project/docs/indexed.pdf", ) assert "document=/tmp/project/docs/indexed.pdf" in trace.step_path[0] assert trace.sorted_documents() == ["/tmp/project/docs/indexed.pdf"] def test_extract_cited_sources_ordered_unique() -> None: final_result = ( "Price is $10M [Source: agreement.pdf, Section 2.1]. " "Escrow is $1M [Source: escrow.pdf, Section 3]. " "Reconfirmed [Source: agreement.pdf, Section 2.1]." 
) assert extract_cited_sources(final_result) == ["agreement.pdf", "escrow.pdf"] ================================================ FILE: tests/test_fs.py ================================================ """Tests for filesystem utility functions.""" import pytest import os import tempfile from pathlib import Path from fs_explorer.fs import ( describe_dir_content, read_file, grep_file_content, glob_paths, parse_file, preview_file, scan_folder, clear_document_cache, SUPPORTED_EXTENSIONS, ) class TestDescribeDirContent: """Tests for describe_dir_content function.""" def test_valid_directory(self) -> None: """Test describing a valid directory with files and subfolders.""" description = describe_dir_content("tests/testfiles") assert "Content of tests/testfiles" in description assert "tests/testfiles/file1.txt" in description assert "tests/testfiles/file2.md" in description assert "tests/testfiles/last" in description def test_nonexistent_directory(self) -> None: """Test describing a directory that doesn't exist.""" description = describe_dir_content("tests/testfile") assert description == "No such directory: tests/testfile" def test_directory_without_subfolders(self) -> None: """Test describing a directory that has no subdirectories.""" description = describe_dir_content("tests/testfiles/last") assert "Content of tests/testfiles/last" in description assert "tests/testfiles/last/lastfile.txt" in description assert "This folder does not have any sub-folders" in description class TestReadFile: """Tests for read_file function.""" def test_valid_file(self) -> None: """Test reading a valid text file.""" content = read_file("tests/testfiles/file1.txt") assert content.strip() == "this is a test" def test_nonexistent_file(self) -> None: """Test reading a file that doesn't exist.""" content = read_file("tests/testfiles/file2.txt") assert content == "No such file: tests/testfiles/file2.txt" class TestGrepFileContent: """Tests for grep_file_content function.""" def test_pattern_match(self) -> None: """Test searching for a pattern that exists.""" result = grep_file_content("tests/testfiles/file2.md", r"(are|is) a test") assert "MATCHES for (are|is) a test" in result assert "is" in result def test_no_match(self) -> None: """Test searching for a pattern that doesn't exist.""" result = grep_file_content("tests/testfiles/last/lastfile.txt", r"test") assert result == "No matches found" def test_nonexistent_file(self) -> None: """Test searching in a file that doesn't exist.""" result = grep_file_content("tests/testfiles/file2.txt", r"test") assert result == "No such file: tests/testfiles/file2.txt" class TestGlobPaths: """Tests for glob_paths function.""" def test_pattern_match(self) -> None: """Test finding files that match a glob pattern.""" result = glob_paths("tests/testfiles", "file?.*") assert "MATCHES for file?.* in tests/testfiles" in result assert "file1.txt" in result assert "file2.md" in result def test_no_match(self) -> None: """Test a pattern that matches nothing.""" result = glob_paths("tests/testfiles", "nonexistent*") assert result == "No matches found" def test_nonexistent_directory(self) -> None: """Test glob in a directory that doesn't exist.""" result = glob_paths("tests/nonexistent", "*.txt") assert result == "No such directory: tests/nonexistent" class TestDocumentParsing: """Tests for document parsing functions (parse_file, preview_file).""" def setup_method(self) -> None: """Clear cache before each test.""" clear_document_cache() def test_parse_file_nonexistent(self) -> None: """Test parsing 
a file that doesn't exist.""" content = parse_file("data/nonexistent.pdf") assert content == "No such file: data/nonexistent.pdf" def test_parse_file_unsupported_extension(self) -> None: """Test parsing a file with unsupported extension.""" content = parse_file("tests/testfiles/file1.txt") assert "Unsupported file extension: .txt" in content def test_preview_file_nonexistent(self) -> None: """Test previewing a file that doesn't exist.""" content = preview_file("data/nonexistent.pdf") assert content == "No such file: data/nonexistent.pdf" def test_preview_file_unsupported_extension(self) -> None: """Test previewing a file with unsupported extension.""" content = preview_file("tests/testfiles/file1.txt") assert "Unsupported file extension: .txt" in content @pytest.mark.skipif( not os.path.exists("data/large_acquisition"), reason="Test documents not generated" ) def test_parse_file_pdf(self) -> None: """Test parsing an actual PDF file.""" # Use one of the generated test PDFs pdf_files = list(Path("data/large_acquisition").glob("*.pdf")) if pdf_files: content = parse_file(str(pdf_files[0])) assert len(content) > 0 assert "Error" not in content @pytest.mark.skipif( not os.path.exists("data/large_acquisition"), reason="Test documents not generated" ) def test_preview_file_pdf(self) -> None: """Test previewing an actual PDF file.""" pdf_files = list(Path("data/large_acquisition").glob("*.pdf")) if pdf_files: content = preview_file(str(pdf_files[0]), max_chars=500) assert "=== PREVIEW of" in content # Preview should be limited assert len(content) < 2000 # Preview + header + truncation message class TestScanFolder: """Tests for scan_folder function.""" def setup_method(self) -> None: """Clear cache before each test.""" clear_document_cache() def test_nonexistent_directory(self) -> None: """Test scanning a directory that doesn't exist.""" result = scan_folder("nonexistent/path") assert result == "No such directory: nonexistent/path" def test_empty_directory(self) -> None: """Test scanning a directory with no supported documents.""" with tempfile.TemporaryDirectory() as tmpdir: # Create a non-document file Path(tmpdir, "test.txt").write_text("hello") result = scan_folder(tmpdir) assert "No supported documents found" in result @pytest.mark.skipif( not os.path.exists("data/large_acquisition"), reason="Test documents not generated" ) def test_scan_folder_with_documents(self) -> None: """Test scanning a folder with actual documents.""" result = scan_folder("data/large_acquisition", max_workers=2) assert "PARALLEL DOCUMENT SCAN" in result assert "Found" in result assert "documents" in result class TestSupportedExtensions: """Tests for supported extensions configuration.""" def test_supported_extensions_is_frozenset(self) -> None: """Verify SUPPORTED_EXTENSIONS is immutable.""" assert isinstance(SUPPORTED_EXTENSIONS, frozenset) def test_common_extensions_supported(self) -> None: """Verify common document extensions are supported.""" assert ".pdf" in SUPPORTED_EXTENSIONS assert ".docx" in SUPPORTED_EXTENSIONS assert ".md" in SUPPORTED_EXTENSIONS ================================================ FILE: tests/test_indexing.py ================================================ """Tests for indexing and schema components.""" import json import time from dataclasses import dataclass from pathlib import Path from unittest.mock import MagicMock, patch import fs_explorer.indexing.metadata as metadata_module import fs_explorer.indexing.pipeline as pipeline_module from fs_explorer.embeddings import EmbeddingProvider from 
fs_explorer.indexing.chunker import SmartChunker from fs_explorer.indexing.metadata import auto_discover_profile, normalize_langextract_profile from fs_explorer.indexing.pipeline import IndexingPipeline from fs_explorer.indexing.schema import SchemaDiscovery from fs_explorer.storage import DuckDBStorage def test_smart_chunker_overlap() -> None: text = "A" * 2500 chunker = SmartChunker(chunk_size=1000, overlap=100) chunks = chunker.chunk_text(text) assert len(chunks) == 3 assert chunks[1].start_char == chunks[0].end_char - 100 assert chunks[2].start_char == chunks[1].end_char - 100 def test_schema_discovery_from_folder(tmp_path: Path) -> None: folder = tmp_path / "corpus" folder.mkdir() (folder / "01_master_agreement.md").write_text("# agreement\nprice: $10") (folder / "04_risk_report.md").write_text("# report\nrisk summary") schema = SchemaDiscovery().discover_from_folder(str(folder)) fields = schema["fields"] field_names = {field["name"] for field in fields} assert "document_type" in field_names assert "mentions_currency" in field_names document_type_field = next( field for field in fields if field["name"] == "document_type" ) assert "agreement" in document_type_field["enum"] assert "report" in document_type_field["enum"] def test_schema_discovery_with_langextract_fields(tmp_path: Path, monkeypatch) -> None: folder = tmp_path / "corpus" folder.mkdir() (folder / "agreement.md").write_text("Purchase price with escrow and earnout.") # Mock auto_discover_profile to return the default profile so this test # stays deterministic (auto-discovery would call the real LLM). from fs_explorer.indexing.metadata import default_langextract_profile monkeypatch.setattr( "fs_explorer.indexing.schema.auto_discover_profile", lambda folder, **kwargs: default_langextract_profile(), ) schema = SchemaDiscovery().discover_from_folder( str(folder), with_langextract=True, ) field_names = {field["name"] for field in schema["fields"]} assert "lx_enabled" in field_names assert "lx_has_earnout" in field_names assert "lx_money_mentions" in field_names def test_schema_discovery_with_custom_metadata_profile(tmp_path: Path) -> None: folder = tmp_path / "corpus" folder.mkdir() (folder / "notes.md").write_text("Acme Corp retained Jane Doe for diligence.") profile = { "prompt_description": "Extract organizations and people.", "fields": [ { "name": "org_names", "type": "string", "source_class": "organization", "mode": "values", }, { "name": "person_count", "type": "integer", "source_class": "person", "mode": "count", }, ], } schema = SchemaDiscovery().discover_from_folder( str(folder), with_langextract=True, metadata_profile=profile, ) field_names = {field["name"] for field in schema["fields"]} assert "org_names" in field_names assert "person_count" in field_names assert isinstance(schema.get("metadata_profile"), dict) def test_indexing_pipeline_indexes_and_marks_deleted( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() first = corpus / "a_agreement.md" second = corpus / "b_schedule.md" first.write_text("Purchase price is $45,000,000.\n\nSection 1.2") second.write_text("Schedule details.\n\nEffective Date: January 1, 2026") # Avoid Docling in this unit test; treat markdown as plain text. 
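# The lambda below simply returns the raw file text, so these unit tests exercise
# the indexing pipeline without invoking the real document parser.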
monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) pipeline = IndexingPipeline(storage=storage) first_result = pipeline.index_folder(str(corpus), discover_schema=True) assert first_result.indexed_files == 2 assert first_result.skipped_files == 0 assert first_result.active_documents == 2 assert first_result.schema_used is not None assert storage.count_chunks(corpus_id=first_result.corpus_id) > 0 hits = storage.search_chunks( corpus_id=first_result.corpus_id, query="purchase price", limit=3, ) assert hits top_doc = storage.get_document(doc_id=hits[0]["doc_id"]) assert top_doc is not None assert "Purchase price" in top_doc["content"] metadata_hits = storage.search_documents_by_metadata( corpus_id=first_result.corpus_id, filters=[ { "field": "document_type", "operator": "eq", "value": "agreement", } ], limit=5, ) assert metadata_hits assert any(hit["relative_path"] == "a_agreement.md" for hit in metadata_hits) assert all(hit["relative_path"] != "b_schedule.md" for hit in metadata_hits) second.unlink() second_result = pipeline.index_folder(str(corpus)) assert second_result.indexed_files == 1 assert second_result.active_documents == 1 all_docs = storage.list_documents( corpus_id=first_result.corpus_id, include_deleted=True, ) deleted_paths = {doc["relative_path"] for doc in all_docs if doc["is_deleted"]} assert "b_schedule.md" in deleted_paths def test_indexing_pipeline_with_langextract_metadata( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() doc_path = corpus / "agreement.md" doc_path.write_text("Purchase price and escrow details.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) # Use the default profile so the schema includes the expected fields from fs_explorer.indexing.metadata import default_langextract_profile monkeypatch.setattr( "fs_explorer.indexing.schema.auto_discover_profile", lambda folder, **kwargs: default_langextract_profile(), ) monkeypatch.setattr( metadata_module, "_extract_langextract_metadata", lambda **_: { "lx_enabled": True, "lx_extraction_count": 3, "lx_entity_classes": "deal_term,organization", "lx_organizations": "TechCorp Industries", "lx_people": "", "lx_deal_terms": "escrow reserve", "lx_money_mentions": 1, "lx_date_mentions": 0, "lx_has_earnout": False, "lx_has_escrow": True, }, ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage) result = pipeline.index_folder( str(corpus), discover_schema=True, with_metadata=True, ) assert result.indexed_files == 1 assert result.schema_used is not None docs = storage.list_documents(corpus_id=result.corpus_id, include_deleted=False) assert len(docs) == 1 stored = storage.get_document(doc_id=docs[0]["id"]) assert stored is not None metadata = json.loads(stored["metadata_json"]) assert metadata["lx_enabled"] is True assert metadata["lx_has_escrow"] is True hits = storage.search_documents_by_metadata( corpus_id=result.corpus_id, filters=[{"field": "lx_has_escrow", "operator": "eq", "value": True}], limit=5, ) assert hits assert hits[0]["relative_path"] == "agreement.md" def test_indexing_pipeline_reuses_saved_metadata_profile( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() doc_path = corpus / "custom.md" doc_path.write_text("Acme Corp and Jane Doe signed terms.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: 
Path(file_path).read_text(), ) seen_profiles: list[dict[str, object] | None] = [] def fake_extract(**kwargs): # noqa: ANN003 seen_profiles.append(kwargs.get("profile")) return { "org_names": "Acme Corp", "person_present": True, } monkeypatch.setattr(metadata_module, "_extract_langextract_metadata", fake_extract) custom_profile = { "prompt_description": "Extract organizations and people.", "fields": [ { "name": "org_names", "type": "string", "source_class": "organization", "mode": "values", }, { "name": "person_present", "type": "boolean", "source_class": "person", "mode": "exists", }, ], } storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage) first_result = pipeline.index_folder( str(corpus), discover_schema=True, with_metadata=True, metadata_profile=custom_profile, ) assert first_result.indexed_files == 1 assert seen_profiles and isinstance(seen_profiles[0], dict) second_result = pipeline.index_folder( str(corpus), with_metadata=True, ) assert second_result.indexed_files == 1 assert len(seen_profiles) >= 2 latest_profile = seen_profiles[-1] assert isinstance(latest_profile, dict) fields_obj = latest_profile.get("fields") assert isinstance(fields_obj, list) second_fields = { str(field["name"]) for field in fields_obj if isinstance(field, dict) and isinstance(field.get("name"), str) } assert {"org_names", "person_present"}.issubset(second_fields) # --------------------------------------------------------------------------- # Auto-profile generation tests # --------------------------------------------------------------------------- def test_auto_discover_profile_with_mock_llm( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "contract.md").write_text("TechCorp acquires StartupXYZ for $10M.") (corpus / "report.md").write_text("Quarterly revenue report for FY2025.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) monkeypatch.setenv("GOOGLE_API_KEY", "fake-key") llm_response_json = json.dumps( { "name": "test_auto", "description": "Auto-generated test profile.", "prompt_description": "Extract key metadata from documents.", "fields": [ { "name": "lx_organizations", "type": "string", "description": "Organization names.", "source": "entities", "source_classes": ["organization", "company"], "mode": "values", }, { "name": "lx_money_count", "type": "integer", "description": "Count of monetary amounts.", "source": "entities", "source_classes": ["money"], "mode": "count", }, ], } ) mock_response = MagicMock() mock_response.text = llm_response_json mock_client_instance = MagicMock() mock_client_instance.models.generate_content.return_value = mock_response with patch( "fs_explorer.indexing.metadata._get_genai_client", return_value=mock_client_instance, ): profile = auto_discover_profile(str(corpus)) # Should pass validation normalized = normalize_langextract_profile(profile) field_names = {f["name"] for f in normalized["fields"]} assert "lx_organizations" in field_names assert "lx_money_count" in field_names # Runtime fields should have been added automatically assert "lx_enabled" in field_names def test_auto_discover_profile_falls_back_on_error( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "file.md").write_text("Some content.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) monkeypatch.setenv("GOOGLE_API_KEY", "fake-key") with patch( 
"fs_explorer.indexing.metadata._get_genai_client", side_effect=RuntimeError("API down"), ): profile = auto_discover_profile(str(corpus)) # Should return default profile default_names = { f["name"] for f in metadata_module._DEFAULT_LANGEXTRACT_PROFILE["fields"] } got_names = {f["name"] for f in profile["fields"]} assert default_names == got_names def test_auto_discover_profile_falls_back_without_api_key( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "file.md").write_text("Some content.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) monkeypatch.delenv("GOOGLE_API_KEY", raising=False) profile = auto_discover_profile(str(corpus)) default_names = { f["name"] for f in metadata_module._DEFAULT_LANGEXTRACT_PROFILE["fields"] } got_names = {f["name"] for f in profile["fields"]} assert default_names == got_names def test_schema_discovery_uses_auto_profile_when_no_explicit_profile( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "contract.md").write_text("Agreement terms.") # Capture what auto_discover_profile returns (mock it) auto_profile = { "name": "auto_test", "description": "Auto-generated.", "prompt_description": "Extract metadata.", "fields": [ { "name": "lx_enabled", "type": "boolean", "required": False, "description": "Whether langextract succeeded.", "source": "runtime", "runtime": "enabled", "mode": "runtime", "source_classes": [], "contains_any": [], }, { "name": "lx_orgs", "type": "string", "required": False, "description": "Organizations.", "source": "entities", "source_classes": ["organization"], "mode": "values", "contains_any": [], }, ], } monkeypatch.setattr( "fs_explorer.indexing.schema.auto_discover_profile", lambda folder, **kwargs: auto_profile, ) schema = SchemaDiscovery().discover_from_folder( str(corpus), with_langextract=True, metadata_profile=None, ) field_names = {f["name"] for f in schema["fields"]} assert "lx_orgs" in field_names assert "lx_enabled" in field_names assert schema.get("metadata_profile") == auto_profile # --------------------------------------------------------------------------- # Mock embedding helpers for indexing tests # --------------------------------------------------------------------------- @dataclass class _FakeEmbedding: values: list[float] @dataclass class _FakeEmbedResult: embeddings: list[_FakeEmbedding] class _FakeEmbedModels: def embed_content( self, *, model: str, contents: list[str], config: dict ) -> _FakeEmbedResult: dim = config.get("output_dimensionality", 4) return _FakeEmbedResult( embeddings=[ _FakeEmbedding(values=[0.1 * i] * dim) for i in range(len(contents)) ] ) class _FakeEmbedClient: def __init__(self) -> None: self.models = _FakeEmbedModels() # --------------------------------------------------------------------------- # Embedding indexing tests # --------------------------------------------------------------------------- def test_indexing_pipeline_with_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path, embedding_dim=4) provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) pipeline = IndexingPipeline(storage=storage, 
embedding_provider=provider) result = pipeline.index_folder(str(corpus), discover_schema=True) assert result.indexed_files == 2 assert result.embeddings_written > 0 assert storage.has_embeddings(corpus_id=result.corpus_id) def test_indexing_pipeline_without_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) pipeline = IndexingPipeline(storage=storage) result = pipeline.index_folder(str(corpus), discover_schema=True) assert result.embeddings_written == 0 assert not storage.has_embeddings(corpus_id=result.corpus_id) def test_embedding_cascade_on_reindex( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() doc = corpus / "agreement.md" doc.write_text("Purchase price is $45,000,000.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path, embedding_dim=4) provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) pipeline = IndexingPipeline(storage=storage, embedding_provider=provider) first = pipeline.index_folder(str(corpus), discover_schema=True) assert first.embeddings_written > 0 # Update document and re-index; old embeddings should be replaced doc.write_text("Updated purchase price is $50,000,000.") second = pipeline.index_folder(str(corpus)) assert second.embeddings_written > 0 assert storage.has_embeddings(corpus_id=second.corpus_id) # --------------------------------------------------------------------------- # Parallel metadata extraction tests # --------------------------------------------------------------------------- def test_extract_metadata_batch_returns_correct_metadata( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register summary.") (corpus / "schedule.md").write_text("Effective Date: January 1, 2026") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage, max_workers=2) root = str(corpus) parsed_docs = [] import os for f in sorted(corpus.iterdir()): content = f.read_text() rel = os.path.relpath(str(f), root) parsed_docs.append((str(f), rel, content)) metadata_map = pipeline._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) assert len(metadata_map) == 3 assert "agreement.md" in metadata_map assert "report.md" in metadata_map assert "schedule.md" in metadata_map # Check heuristic metadata assert metadata_map["agreement.md"]["mentions_currency"] is True assert metadata_map["schedule.md"]["mentions_dates"] is True assert metadata_map["report.md"]["document_type"] == "report" def test_extract_metadata_batch_parallel_is_faster_than_sequential( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() for i in range(6): (corpus / f"doc_{i}.md").write_text(f"Document {i} content. 
Price is ${i}00.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) delay = 0.1 original_extract = metadata_module.extract_metadata def slow_extract(**kwargs): time.sleep(delay) return original_extract(**kwargs) monkeypatch.setattr(pipeline_module, "extract_metadata", slow_extract) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage, max_workers=6) root = str(corpus) parsed_docs = [] import os for f in sorted(corpus.iterdir()): content = f.read_text() rel = os.path.relpath(str(f), root) parsed_docs.append((str(f), rel, content)) start = time.monotonic() metadata_map = pipeline._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) elapsed = time.monotonic() - start assert len(metadata_map) == 6 # 6 docs * 0.1s each = 0.6s sequential; parallel should finish in < 0.4s assert elapsed < 0.4, f"Parallel extraction too slow: {elapsed:.2f}s" def test_parallel_and_sequential_produce_same_results( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a.md").write_text("Purchase price is $45,000,000.") (corpus / "b.md").write_text("Effective Date: January 1, 2026. Risk summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) root = str(corpus) parsed_docs = [] import os for f in sorted(corpus.iterdir()): content = f.read_text() rel = os.path.relpath(str(f), root) parsed_docs.append((str(f), rel, content)) # Sequential (max_workers=1) pipeline_seq = IndexingPipeline(storage=storage, max_workers=1) map_seq = pipeline_seq._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) # Parallel (max_workers=4) pipeline_par = IndexingPipeline(storage=storage, max_workers=4) map_par = pipeline_par._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) assert map_seq.keys() == map_par.keys() for key in map_seq: assert map_seq[key] == map_par[key], f"Mismatch for {key}" ================================================ FILE: tests/test_models.py ================================================ from fs_explorer.models import ( ToolCallAction, Action, ToolCallArg, GoDeeperAction, StopAction, ) def test_tool_call_action_to_tool_args() -> None: tool_call_action = ToolCallAction( tool_name="glob", tool_input=[ ToolCallArg(parameter_name="directory", parameter_value="tests/testfiles"), ToolCallArg(parameter_name="pattern", parameter_value="file?.*"), ], ) assert tool_call_action.to_fn_args() == { "directory": "tests/testfiles", "pattern": "file?.*", } def test_action_to_action_type() -> None: action = Action( action=ToolCallAction( tool_name="glob", tool_input=[ ToolCallArg( parameter_name="directory", parameter_value="tests/testfiles" ), ToolCallArg(parameter_name="pattern", parameter_value="file?.*"), ], ), reason="", ) assert action.to_action_type() == "toolcall" action = Action(action=GoDeeperAction(directory="tests/testfiles/last"), reason="") assert action.to_action_type() == "godeeper" action = Action(action=StopAction(final_result="hello"), reason="") assert action.to_action_type() == "stop" ================================================ FILE: tests/test_search.py ================================================ """Tests for 
search filtering and merged retrieval ranking.""" from __future__ import annotations import time from dataclasses import dataclass from pathlib import Path import fs_explorer.indexing.pipeline as pipeline_module import pytest from fs_explorer.embeddings import EmbeddingProvider from fs_explorer.indexing.pipeline import IndexingPipeline from fs_explorer.search import ( IndexedQueryEngine, MetadataFilterParseError, parse_metadata_filters, ) from fs_explorer.storage import DuckDBStorage def test_parse_metadata_filters_supports_scalar_and_list_values() -> None: parsed = parse_metadata_filters( "document_type=agreement and mentions_currency=true, file_size_bytes>=100, " "document_type in (agreement, report)" ) assert len(parsed) == 4 assert parsed[0].field == "document_type" assert parsed[0].operator == "eq" assert parsed[0].value == "agreement" assert parsed[1].field == "mentions_currency" assert parsed[1].value is True assert parsed[2].operator == "gte" assert parsed[2].value == 100 assert parsed[3].operator == "in" assert parsed[3].value == ["agreement", "report"] def test_parse_metadata_filters_rejects_unknown_schema_fields() -> None: with pytest.raises(MetadataFilterParseError): parse_metadata_filters( "owner=finance", allowed_fields={"document_type", "mentions_currency"}, ) def test_indexed_query_engine_unions_semantic_and_metadata_results( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a_agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "b_report.md").write_text( "Risk register and litigation exposure summary." ) monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) result = IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) engine = IndexedQueryEngine(storage) hits = engine.search( corpus_id=result.corpus_id, query="purchase price", filters="document_type=report", limit=5, ) by_path = {hit.relative_path: hit for hit in hits} assert "a_agreement.md" in by_path assert "b_report.md" in by_path assert by_path["a_agreement.md"].semantic_score > 0 assert by_path["b_report.md"].metadata_score > 0 class _SlowStorage: def search_chunks(self, *, corpus_id: str, query: str, limit: int = 5): # noqa: ARG002 time.sleep(0.3) return [ { "doc_id": "doc_semantic", "relative_path": "a.md", "absolute_path": "/tmp/a.md", "position": 0, "text": "semantic hit", "score": 3, } ] def search_documents_by_metadata(self, *, corpus_id: str, filters, limit: int = 20): # noqa: ARG002 time.sleep(0.3) return [ { "doc_id": "doc_metadata", "relative_path": "b.md", "absolute_path": "/tmp/b.md", "preview_text": "metadata hit", "metadata_score": 1, } ] def get_active_schema(self, *, corpus_id: str): # noqa: ARG002 return None def test_indexed_query_engine_executes_semantic_and_metadata_in_parallel() -> None: engine = IndexedQueryEngine(_SlowStorage()) start = time.perf_counter() hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, ) elapsed = time.perf_counter() - start assert elapsed < 0.58 assert {hit.doc_id for hit in hits} == {"doc_semantic", "doc_metadata"} def test_search_enable_semantic_false_returns_only_metadata() -> None: """When enable_semantic=False, only metadata results are returned.""" engine = IndexedQueryEngine(_SlowStorage()) hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, 
enable_semantic=False, ) assert len(hits) == 1 assert hits[0].doc_id == "doc_metadata" def test_search_enable_metadata_false_returns_only_semantic() -> None: """When enable_metadata=False, only semantic results are returned.""" engine = IndexedQueryEngine(_SlowStorage()) hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, enable_metadata=False, ) assert len(hits) == 1 assert hits[0].doc_id == "doc_semantic" def test_search_both_disabled_returns_empty() -> None: """When both enable_semantic and enable_metadata are False, no results.""" engine = IndexedQueryEngine(_SlowStorage()) hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, enable_semantic=False, enable_metadata=False, ) assert hits == [] # --------------------------------------------------------------------------- # Mock embedding helpers # --------------------------------------------------------------------------- @dataclass class _FakeEmbedding: values: list[float] @dataclass class _FakeEmbedResult: embeddings: list[_FakeEmbedding] class _FakeEmbedModels: def embed_content( self, *, model: str, contents: list[str], config: dict ) -> _FakeEmbedResult: dim = config.get("output_dimensionality", 4) return _FakeEmbedResult( embeddings=[ _FakeEmbedding(values=[0.1 * (i + 1)] * dim) for i in range(len(contents)) ] ) class _FakeEmbedClient: def __init__(self) -> None: self.models = _FakeEmbedModels() # --------------------------------------------------------------------------- # Vector search tests # --------------------------------------------------------------------------- def test_vector_search_with_pre_stored_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register and litigation exposure summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path, embedding_dim=4) provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) pipeline = IndexingPipeline(storage=storage, embedding_provider=provider) result = pipeline.index_folder(str(corpus), discover_schema=True) assert result.embeddings_written > 0 engine = IndexedQueryEngine(storage, embedding_provider=provider) hits = engine.search( corpus_id=result.corpus_id, query="purchase price", limit=5, ) assert len(hits) > 0 # All hits should have float semantic scores from cosine similarity for hit in hits: assert isinstance(hit.semantic_score, float) def test_keyword_fallback_when_no_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) IndexingPipeline(storage=storage).index_folder(str(corpus), discover_schema=True) # Create engine with embedding provider but no embeddings stored provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) engine = IndexedQueryEngine(storage, embedding_provider=provider) result_corpus_id = storage.get_corpus_id(str(Path(corpus).resolve())) assert result_corpus_id is not None hits = engine.search( corpus_id=result_corpus_id, query="purchase price", limit=5, ) # Should still return results via 
keyword fallback assert len(hits) > 0 def test_get_metadata_field_values_returns_distinct_values( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a_agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "b_report.md").write_text("Risk register summary.") (corpus / "c_agreement.md").write_text("Escrow details for the deal.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) result = IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) values = storage.get_metadata_field_values( corpus_id=result.corpus_id, field_names=["document_type", "mentions_currency"], ) assert "document_type" in values assert "agreement" in values["document_type"] assert "report" in values["document_type"] assert "mentions_currency" in values def test_get_metadata_field_values_empty_corpus(tmp_path: Path) -> None: db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) corpus_id = storage.get_or_create_corpus(str(tmp_path / "empty")) values = storage.get_metadata_field_values( corpus_id=corpus_id, field_names=["document_type"], ) assert values == {"document_type": []} def test_get_metadata_field_values_respects_max_distinct( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() for i in range(5): (corpus / f"doc_{i:02d}_type{i}.md").write_text(f"Content {i}") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) result = IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) values = storage.get_metadata_field_values( corpus_id=result.corpus_id, field_names=["document_type"], max_distinct=2, ) assert len(values["document_type"]) <= 2 def test_semantic_search_includes_field_catalog_on_first_call( tmp_path: Path, monkeypatch, ) -> None: import fs_explorer.agent as agent_module corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a_agreement.md").write_text("Purchase price is $45,000,000.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) agent_module.set_index_context(str(corpus), db_path) agent_module.set_search_flags(enable_semantic=True, enable_metadata=True) try: first = agent_module.semantic_search("purchase price") assert "Available filter fields" in first assert "document_type" in first second = agent_module.semantic_search("purchase price") assert "Available filter fields" not in second finally: agent_module.clear_index_context() def test_float_scoring_in_ranked_documents() -> None: from fs_explorer.search.ranker import RankedDocument, rank_documents docs = [ RankedDocument( doc_id="d1", relative_path="a.md", absolute_path="/a.md", position=0, text="doc 1", semantic_score=0.95, metadata_score=1, ), RankedDocument( doc_id="d2", relative_path="b.md", absolute_path="/b.md", position=0, text="doc 2", semantic_score=0.5, metadata_score=2, ), ] ranked = rank_documents(docs, limit=2) assert ranked[0].doc_id == "d1" assert ranked[0].combined_score > ranked[1].combined_score ================================================ FILE: tests/test_server_search.py ================================================ """Tests for the 
/api/search and /api/index REST endpoints.""" from __future__ import annotations from pathlib import Path from unittest.mock import patch import fs_explorer.indexing.pipeline as pipeline_module import pytest from fastapi.testclient import TestClient from fs_explorer.indexing.pipeline import IndexingPipeline from fs_explorer.server import app from fs_explorer.storage import DuckDBStorage @pytest.fixture() def indexed_corpus(tmp_path: Path, monkeypatch): """Create a small indexed corpus and return (folder, db_path).""" corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register and litigation exposure summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) IndexingPipeline(storage=storage).index_folder(str(corpus), discover_schema=True) return str(corpus), db_path def test_search_endpoint_returns_hits(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": corpus_folder, "query": "purchase price", "db_path": db_path, }, ) assert response.status_code == 200 data = response.json() assert "hits" in data assert len(data["hits"]) > 0 assert data["hits"][0]["semantic_score"] > 0 def test_search_endpoint_with_filters(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": corpus_folder, "query": "litigation", "filters": "document_type=report", "db_path": db_path, }, ) assert response.status_code == 200 data = response.json() assert "hits" in data def test_search_endpoint_missing_index(tmp_path: Path) -> None: corpus = tmp_path / "empty" corpus.mkdir() db_path = str(tmp_path / "nonexistent.duckdb") client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": str(corpus), "query": "test", "db_path": db_path, }, ) assert response.status_code in (404, 500) def test_search_endpoint_invalid_folder() -> None: client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": "/nonexistent/path/abc123", "query": "test", }, ) assert response.status_code == 400 # --------------------------------------------------------------------------- # /api/index/status tests # --------------------------------------------------------------------------- def test_index_status_not_indexed(tmp_path: Path) -> None: corpus = tmp_path / "empty_folder" corpus.mkdir() db_path = str(tmp_path / "nonexistent.duckdb") client = TestClient(app) response = client.get( "/api/index/status", params={"folder": str(corpus), "db_path": db_path}, ) assert response.status_code == 200 data = response.json() assert data["indexed"] is False def test_index_status_after_indexing(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.get( "/api/index/status", params={"folder": corpus_folder, "db_path": db_path}, ) assert response.status_code == 200 data = response.json() assert data["indexed"] is True assert data["document_count"] == 2 assert data["schema_name"] is not None assert isinstance(data["has_metadata"], bool) assert isinstance(data["has_embeddings"], bool) def test_index_status_includes_schema_fields(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.get( 
"/api/index/status", params={"folder": corpus_folder, "db_path": db_path}, ) assert response.status_code == 200 data = response.json() assert "schema_fields" in data assert isinstance(data["schema_fields"], list) assert len(data["schema_fields"]) > 0 assert "document_type" in data["schema_fields"] # --------------------------------------------------------------------------- # /api/index/auto-profile tests # --------------------------------------------------------------------------- def test_auto_profile_endpoint(tmp_path: Path) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "contract.md").write_text("TechCorp acquires StartupXYZ for $10M.") fake_profile = { "name": "test_auto", "description": "Auto-generated.", "prompt_description": "Extract metadata.", "fields": [ { "name": "lx_organizations", "type": "string", "description": "Org names.", "source": "entities", "source_classes": ["organization"], "mode": "values", } ], } client = TestClient(app) with patch( "fs_explorer.server.auto_discover_profile", return_value=fake_profile, ): response = client.post( "/api/index/auto-profile", json={"folder": str(corpus)}, ) assert response.status_code == 200 data = response.json() assert "profile" in data assert data["profile"]["name"] == "test_auto" field_names = {f["name"] for f in data["profile"]["fields"]} assert "lx_organizations" in field_names def test_auto_profile_invalid_folder() -> None: client = TestClient(app) response = client.post( "/api/index/auto-profile", json={"folder": "/nonexistent/path/abc123"}, ) assert response.status_code == 400 ================================================ FILE: tests/testfiles/file1.txt ================================================ this is a test ================================================ FILE: tests/testfiles/file2.md ================================================ # this is a test! ================================================ FILE: tests/testfiles/last/lastfile.txt ================================================ hello