Repository: PromtEngineer/agentic-file-search
Branch: main
Commit: 83c5b4231f44
Files: 59
Total size: 458.6 KB
Directory structure:
gitextract_mqv4xk8i/
├── .github/
│   └── workflows/
│       ├── build.yaml
│       ├── lint.yaml
│       ├── test.yaml
│       └── typecheck.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .python-version
├── ARCHITECTURE.md
├── CLAUDE.md
├── IMPLEMENTATION_PLAN.md
├── Makefile
├── README.md
├── YOUTUBE_DEMO_TESTS.md
├── data/
│   ├── large_acquisition/
│   │   └── TEST_QUESTIONS.md
│   ├── test_acquisition/
│   │   └── TEST_QUESTIONS.md
│   └── testfile.txt
├── docker/
│   └── docker-compose.yml
├── pyproject.toml
├── scripts/
│   ├── generate_large_docs.py
│   └── generate_test_docs.py
├── src/
│   └── fs_explorer/
│       ├── __init__.py
│       ├── agent.py
│       ├── embeddings.py
│       ├── exploration_trace.py
│       ├── fs.py
│       ├── index_config.py
│       ├── indexing/
│       │   ├── __init__.py
│       │   ├── chunker.py
│       │   ├── metadata.py
│       │   ├── pipeline.py
│       │   └── schema.py
│       ├── main.py
│       ├── models.py
│       ├── search/
│       │   ├── __init__.py
│       │   ├── filters.py
│       │   ├── query.py
│       │   ├── ranker.py
│       │   └── semantic.py
│       ├── server.py
│       ├── storage/
│       │   ├── __init__.py
│       │   ├── base.py
│       │   └── duckdb.py
│       ├── ui.html
│       └── workflow.py
└── tests/
    ├── __init__.py
    ├── conftest.py
    ├── test_agent.py
    ├── test_cli_indexing.py
    ├── test_e2e.py
    ├── test_embeddings.py
    ├── test_exploration_trace.py
    ├── test_fs.py
    ├── test_indexing.py
    ├── test_models.py
    ├── test_search.py
    ├── test_server_search.py
    └── testfiles/
        ├── file1.txt
        ├── file2.md
        └── last/
            └── lastfile.txt
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/build.yaml
================================================
name: Build
on:
  pull_request:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v6
      - name: Set up Python
        run: uv python install 3.13
      - name: Build package
        run: make build
================================================
FILE: .github/workflows/lint.yaml
================================================
name: Linting
on:
  pull_request:
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v6
      - name: Set up Python
        run: uv python install 3.12
      - name: Run formatter
        shell: bash
        run: make format-check
      - name: Run linter
        shell: bash
        run: make lint
================================================
FILE: .github/workflows/test.yaml
================================================
name: CI Tests - Pull Request
on:
  pull_request:
jobs:
  testing_pr:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1
      - name: Install uv
        uses: astral-sh/setup-uv@v6
        with:
          python-version: ${{ matrix.python-version }}
          enable-cache: true
      - name: Run Tests on Main Package
        run: make test
================================================
FILE: .github/workflows/typecheck.yaml
================================================
name: Typecheck
on:
  pull_request:
jobs:
  core-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1
      - name: Install uv
        uses: astral-sh/setup-uv@v6
      - name: Set up Python
        run: uv python install
      - name: Run Mypy
        run: make typecheck
================================================
FILE: .gitignore
================================================
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info
# Virtual environments
.venv
# caches
*_cache/
# Environment
.env
# OS files
.DS_Store
================================================
FILE: .pre-commit-config.yaml
================================================
---
default_language_version:
  python: python3
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-merge-conflict
      - id: check-symlinks
      - id: check-yaml
      - id: detect-private-key
================================================
FILE: .python-version
================================================
3.13
================================================
FILE: ARCHITECTURE.md
================================================
# FsExplorer Architecture Documentation
## Table of Contents
1. [System Overview](#system-overview)
2. [Component Architecture](#component-architecture)
3. [Core Modules](#core-modules)
4. [Workflow Engine](#workflow-engine)
5. [Agent Decision Loop](#agent-decision-loop)
6. [Document Processing Pipeline](#document-processing-pipeline)
7. [Three-Phase Exploration Strategy](#three-phase-exploration-strategy)
8. [Token Tracking & Cost Estimation](#token-tracking--cost-estimation)
9. [CLI Interface](#cli-interface)
10. [Data Flow](#data-flow)
11. [File Structure](#file-structure)
12. [Extension Points](#extension-points)
---
## System Overview
FsExplorer is an AI-powered filesystem exploration agent that answers questions about documents by intelligently navigating directories, parsing files, and synthesizing information with source citations.
```mermaid
graph TB
subgraph "User Interface"
CLI[CLI Interface typer + rich]
end
subgraph "Orchestration Layer"
WF[Workflow Engine llama-index-workflows]
EVT[Event System]
end
subgraph "Intelligence Layer"
AGENT[FsExplorer Agent]
LLM[Google Gemini 2.0 Flash Structured JSON Output]
PROMPT[System Prompt Three-Phase Strategy]
end
subgraph "Tools Layer"
TOOLS[Tool Registry]
SCAN[scan_folder Parallel Scan]
PREVIEW[preview_file Quick Preview]
PARSE[parse_file Deep Read]
READ[read Text Files]
GREP[grep Pattern Search]
GLOB[glob File Search]
end
subgraph "Document Processing"
DOCLING[Docling Document Converter]
CACHE[Document Cache]
end
subgraph "Filesystem"
FS[(Local Filesystem)]
PDF[PDF Files]
DOCX[DOCX Files]
MD[Markdown Files]
OTHER[Other Formats]
end
CLI --> WF
WF --> EVT
EVT --> AGENT
AGENT --> LLM
AGENT --> PROMPT
AGENT --> TOOLS
TOOLS --> SCAN
TOOLS --> PREVIEW
TOOLS --> PARSE
TOOLS --> READ
TOOLS --> GREP
TOOLS --> GLOB
SCAN --> DOCLING
PREVIEW --> DOCLING
PARSE --> DOCLING
DOCLING --> CACHE
CACHE --> FS
FS --> PDF
FS --> DOCX
FS --> MD
FS --> OTHER
style LLM fill:#4285f4,color:#fff
style DOCLING fill:#ff6b6b,color:#fff
style CACHE fill:#ffd93d,color:#000
style AGENT fill:#6bcb77,color:#fff
```
---
## Component Architecture
### High-Level Component Diagram
```mermaid
graph LR
subgraph "Entry Point"
MAIN[main.py CLI Entry]
end
subgraph "Workflow"
WORKFLOW[workflow.py Event Orchestration]
end
subgraph "Agent"
AGENT_MOD[agent.py AI Decision Making]
end
subgraph "Models"
MODELS[models.py Pydantic Schemas]
end
subgraph "Filesystem"
FS_MOD[fs.py File Operations]
end
MAIN --> WORKFLOW
WORKFLOW --> AGENT_MOD
AGENT_MOD --> MODELS
AGENT_MOD --> FS_MOD
WORKFLOW --> MODELS
style MAIN fill:#e1f5fe
style WORKFLOW fill:#f3e5f5
style AGENT_MOD fill:#e8f5e9
style MODELS fill:#fff3e0
style FS_MOD fill:#fce4ec
```
### Module Dependencies
```mermaid
graph TD
subgraph "fs_explorer package"
INIT[__init__.py Public API Exports]
MAIN[main.py]
WORKFLOW[workflow.py]
AGENT[agent.py]
MODELS[models.py]
FS[fs.py]
end
subgraph "External Dependencies"
TYPER[typer CLI Framework]
RICH[rich Terminal UI]
WORKFLOWS[llama-index-workflows Event System]
GENAI[google-genai Gemini API]
PYDANTIC[pydantic Data Validation]
DOCLING[docling Document Parsing]
end
INIT --> AGENT
INIT --> WORKFLOW
INIT --> MODELS
MAIN --> TYPER
MAIN --> RICH
MAIN --> WORKFLOW
WORKFLOW --> WORKFLOWS
WORKFLOW --> AGENT
WORKFLOW --> MODELS
WORKFLOW --> FS
AGENT --> GENAI
AGENT --> MODELS
AGENT --> FS
MODELS --> PYDANTIC
FS --> DOCLING
style GENAI fill:#4285f4,color:#fff
style DOCLING fill:#ff6b6b,color:#fff
```
---
## Core Modules
### models.py - Data Schemas
Defines the structured output format for the AI agent using Pydantic models.
```mermaid
classDiagram
class Action {
+action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction
+reason: str
+to_action_type() ActionType
}
class ToolCallAction {
+tool_name: Tools
+tool_input: list[ToolCallArg]
+to_fn_args() dict
}
class ToolCallArg {
+parameter_name: str
+parameter_value: Any
}
class GoDeeperAction {
+directory: str
}
class StopAction {
+final_result: str
}
class AskHumanAction {
+question: str
}
Action --> ToolCallAction
Action --> GoDeeperAction
Action --> StopAction
Action --> AskHumanAction
ToolCallAction --> ToolCallArg
note for Action "Main container returned by LLM"
note for ToolCallAction "Invokes filesystem tools"
note for StopAction "Contains final answer with citations"
```
### agent.py - AI Agent
The core intelligence component that interacts with Google Gemini.
```mermaid
classDiagram
class FsExplorerAgent {
-_client: GenAIClient
-_chat_history: list[Content]
+token_usage: TokenUsage
+__init__(api_key: str)
+configure_task(task: str) void
+take_action() tuple[Action, ActionType]
+call_tool(tool_name: Tools, tool_input: dict) void
+reset() void
}
class TokenUsage {
+prompt_tokens: int
+completion_tokens: int
+total_tokens: int
+api_calls: int
+tool_result_chars: int
+documents_parsed: int
+documents_scanned: int
+add_api_call(prompt_tokens, completion_tokens) void
+add_tool_result(result, tool_name) void
+summary() str
}
class TOOLS {
+read: read_file
+grep: grep_file_content
+glob: glob_paths
+scan_folder: scan_folder
+preview_file: preview_file
+parse_file: parse_file
}
FsExplorerAgent --> TokenUsage
FsExplorerAgent --> TOOLS
```
### fs.py - Filesystem Operations
All filesystem and document parsing utilities.
```mermaid
classDiagram
class FilesystemModule {
+SUPPORTED_EXTENSIONS: frozenset
+DEFAULT_PREVIEW_CHARS: int = 3000
+DEFAULT_SCAN_PREVIEW_CHARS: int = 1500
+DEFAULT_MAX_WORKERS: int = 4
}
class DocumentCache {
-_DOCUMENT_CACHE: dict[str, str]
+clear_document_cache() void
+_get_cached_or_parse(file_path) str
}
class DirectoryOps {
+describe_dir_content(directory) str
+glob_paths(directory, pattern) str
}
class FileOps {
+read_file(file_path) str
+grep_file_content(file_path, pattern) str
}
class DocumentOps {
+preview_file(file_path, max_chars) str
+parse_file(file_path) str
+scan_folder(directory, max_workers, preview_chars) str
}
FilesystemModule --> DocumentCache
FilesystemModule --> DirectoryOps
FilesystemModule --> FileOps
FilesystemModule --> DocumentOps
DocumentOps --> DocumentCache
```
---
## Workflow Engine
The workflow engine uses an event-driven architecture based on `llama-index-workflows`.
### Workflow State Machine
```mermaid
stateDiagram-v2
[*] --> StartExploration: InputEvent(task)
StartExploration --> ToolCall: ToolCallEvent
StartExploration --> GoDeeper: GoDeeperEvent
StartExploration --> AskHuman: AskHumanEvent
StartExploration --> End: StopAction
ToolCall --> ToolCall: ToolCallEvent
ToolCall --> GoDeeper: GoDeeperEvent
ToolCall --> AskHuman: AskHumanEvent
ToolCall --> End: StopAction
GoDeeper --> ToolCall: ToolCallEvent
GoDeeper --> GoDeeper: GoDeeperEvent
GoDeeper --> AskHuman: AskHumanEvent
GoDeeper --> End: StopAction
AskHuman --> WaitForHuman: InputRequiredEvent
WaitForHuman --> ProcessHumanResponse: HumanAnswerEvent
ProcessHumanResponse --> ToolCall: ToolCallEvent
ProcessHumanResponse --> GoDeeper: GoDeeperEvent
ProcessHumanResponse --> AskHuman: AskHumanEvent
ProcessHumanResponse --> End: StopAction
End --> [*]: ExplorationEndEvent
note right of StartExploration
Initial task processing
Describes current directory
Asks LLM for first action
end note
note right of ToolCall
Executes filesystem tool
Adds result to chat history
Asks LLM for next action
end note
note right of GoDeeper
Updates current directory
Describes new directory
Asks LLM for next action
end note
```
### Event Types
```mermaid
graph TB
subgraph "Start Events"
IE[InputEvent task: str]
end
subgraph "Intermediate Events"
TCE[ToolCallEvent tool_name, tool_input, reason]
GDE[GoDeeperEvent directory, reason]
AHE[AskHumanEvent question, reason]
HAE[HumanAnswerEvent response]
end
subgraph "End Events"
EEE[ExplorationEndEvent final_result, error]
end
IE --> TCE
IE --> GDE
IE --> AHE
IE --> EEE
TCE --> TCE
TCE --> GDE
TCE --> AHE
TCE --> EEE
GDE --> TCE
GDE --> GDE
GDE --> AHE
GDE --> EEE
AHE --> HAE
HAE --> TCE
HAE --> GDE
HAE --> AHE
HAE --> EEE
style IE fill:#4caf50,color:#fff
style EEE fill:#f44336,color:#fff
style TCE fill:#2196f3,color:#fff
style GDE fill:#9c27b0,color:#fff
style AHE fill:#ff9800,color:#fff
```
### Workflow Steps
```mermaid
sequenceDiagram
participant CLI as CLI (main.py)
participant WF as Workflow
participant Agent as FsExplorerAgent
participant LLM as Gemini API
participant Tools as Tool Registry
participant FS as Filesystem
CLI->>WF: InputEvent(task)
WF->>Agent: configure_task(initial_prompt)
Agent->>LLM: generate_content(chat_history)
LLM-->>Agent: Action JSON
alt ToolCallAction
Agent->>Tools: call_tool(name, args)
Tools->>FS: execute operation
FS-->>Tools: result
Tools-->>Agent: tool result
Agent->>Agent: add to chat_history
WF-->>CLI: ToolCallEvent (stream)
WF->>Agent: configure_task("next action?")
Note over WF,Agent: Loop continues
else GoDeeperAction
WF->>WF: update current_directory
WF-->>CLI: GoDeeperEvent (stream)
WF->>Agent: configure_task("next action?")
Note over WF,Agent: Loop continues
else AskHumanAction
WF-->>CLI: AskHumanEvent (stream)
CLI->>CLI: Wait for user input
CLI->>WF: HumanAnswerEvent(response)
WF->>Agent: configure_task(response)
Note over WF,Agent: Loop continues
else StopAction
WF-->>CLI: ExplorationEndEvent(final_result)
end
```
---
## Agent Decision Loop
### Single Decision Cycle
```mermaid
flowchart TB
subgraph "Agent.take_action()"
START([Start]) --> SEND[Send chat_history to Gemini]
SEND --> RECEIVE[Receive JSON response]
RECEIVE --> TRACK[Track token usage]
TRACK --> PARSE[Parse Action from JSON]
PARSE --> CHECK{Action Type?}
CHECK -->|toolcall| EXEC[Execute Tool]
EXEC --> RESULT[Get tool result]
RESULT --> ADD[Add result to chat_history]
ADD --> RETURN1[Return Action, ActionType]
CHECK -->|godeeper| RETURN2[Return Action, ActionType]
CHECK -->|askhuman| RETURN3[Return Action, ActionType]
CHECK -->|stop| RETURN4[Return Action, ActionType]
RETURN1 --> END([End])
RETURN2 --> END
RETURN3 --> END
RETURN4 --> END
end
style START fill:#4caf50,color:#fff
style END fill:#f44336,color:#fff
style CHECK fill:#ff9800,color:#000
```
### Chat History Evolution
```mermaid
sequenceDiagram
participant User
participant Agent
participant LLM
Note over Agent: chat_history = []
User->>Agent: configure_task("Initial prompt + directory listing")
Note over Agent: chat_history = [user: initial_prompt]
Agent->>LLM: generate_content(chat_history)
LLM-->>Agent: {action: scan_folder, reason: "..."}
Note over Agent: chat_history = [user: initial_prompt, model: action1]
Agent->>Agent: Execute scan_folder, add result
Note over Agent: chat_history = [user: initial_prompt, model: action1, user: tool_result1]
User->>Agent: configure_task("What's next?")
Note over Agent: chat_history = [..., user: "What's next?"]
Agent->>LLM: generate_content(chat_history)
LLM-->>Agent: {action: parse_file, reason: "..."}
Note over Agent: chat_history = [..., model: action2]
Note over Agent: Pattern continues until StopAction
```
---
## Document Processing Pipeline
### Docling Integration
```mermaid
flowchart LR
subgraph "Input Formats"
PDF[PDF]
DOCX[DOCX]
PPTX[PPTX]
XLSX[XLSX]
HTML[HTML]
MD[Markdown]
end
subgraph "Docling"
DC[DocumentConverter]
DETECT[Format Detection]
PIPELINE[Processing Pipeline]
EXPORT[Markdown Export]
end
subgraph "Output"
MARKDOWN[Markdown Text]
end
PDF --> DC
DOCX --> DC
PPTX --> DC
XLSX --> DC
HTML --> DC
MD --> DC
DC --> DETECT
DETECT --> PIPELINE
PIPELINE --> EXPORT
EXPORT --> MARKDOWN
style DC fill:#ff6b6b,color:#fff
```
### Caching Strategy
```mermaid
flowchart TB
subgraph "Cache Key Generation"
PATH[file_path] --> ABS[os.path.abspath]
ABS --> MTIME[os.path.getmtime]
MTIME --> KEY["cache_key = f'{abs_path}:{mtime}'"]
end
subgraph "Cache Lookup"
KEY --> CHECK{Key in cache?}
CHECK -->|Yes| HIT[Return cached content]
CHECK -->|No| MISS[Parse with Docling]
MISS --> STORE[Store in cache]
STORE --> RETURN[Return content]
end
subgraph "_DOCUMENT_CACHE"
CACHE[(dict: str → str)]
end
HIT --> CACHE
STORE --> CACHE
style CACHE fill:#ffd93d,color:#000
```
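The mtime-keyed lookup in the diagram can be sketched as follows. This is a minimal illustration, not the actual `fs.py` implementation; `get_cached_or_parse` and `parse_fn` are assumed names.

```python
import os

# Hypothetical sketch of the mtime-keyed document cache (names assumed).
_DOCUMENT_CACHE: dict[str, str] = {}

def _cache_key(file_path: str) -> str:
    # The key combines absolute path and modification time,
    # so editing a file automatically invalidates its cached entry.
    abs_path = os.path.abspath(file_path)
    mtime = os.path.getmtime(abs_path)
    return f"{abs_path}:{mtime}"

def get_cached_or_parse(file_path: str, parse_fn) -> str:
    # Parse only on a cache miss; return the cached markdown otherwise.
    key = _cache_key(file_path)
    if key not in _DOCUMENT_CACHE:
        _DOCUMENT_CACHE[key] = parse_fn(file_path)
    return _DOCUMENT_CACHE[key]
```

Because the mtime is baked into the key, stale entries are never returned; they simply stop being hit and remain in memory until the process exits (or `clear_document_cache()` is called).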
### Parallel Document Scanning
```mermaid
flowchart TB
subgraph "scan_folder(directory)"
START([Start]) --> LIST[List directory files]
LIST --> FILTER[Filter by SUPPORTED_EXTENSIONS]
FILTER --> POOL[Create ThreadPoolExecutor max_workers=4]
subgraph "Parallel Processing"
POOL --> T1[Thread 1 _preview_single_file]
POOL --> T2[Thread 2 _preview_single_file]
POOL --> T3[Thread 3 _preview_single_file]
POOL --> T4[Thread 4 _preview_single_file]
end
T1 --> COLLECT[Collect Results]
T2 --> COLLECT
T3 --> COLLECT
T4 --> COLLECT
COLLECT --> SORT[Sort by filename]
SORT --> FORMAT[Format output report]
FORMAT --> END([Return summary])
end
style START fill:#4caf50,color:#fff
style END fill:#4caf50,color:#fff
style POOL fill:#2196f3,color:#fff
```
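The fan-out/fan-in shape above reduces to a few lines with `ThreadPoolExecutor`. This sketch assumes a `preview_fn` callable and a pre-filtered path list; the real `scan_folder` also handles extension filtering and report formatting.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the parallel folder scan (names assumed).
def scan_folder_sketch(paths: list[str], preview_fn, max_workers: int = 4) -> str:
    # Preview every file concurrently; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        previews = list(pool.map(preview_fn, paths))
    # Sort by filename for a deterministic report, mirroring the SORT step.
    lines = sorted(f"{p}: {preview}" for p, preview in zip(paths, previews))
    return "\n".join(lines)
```

Threads (rather than processes) suit this workload because document previews are dominated by I/O and Docling calls, not Python bytecode.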
---
## Three-Phase Exploration Strategy
### Phase Overview
```mermaid
flowchart TB
subgraph "PHASE 1: Parallel Scan"
P1_START([User Query]) --> P1_SCAN[scan_folder]
P1_SCAN --> P1_PREVIEW[Get previews of ALL documents]
P1_PREVIEW --> P1_CATEGORIZE[Categorize documents]
P1_CATEGORIZE --> REL[RELEVANT Directly related]
P1_CATEGORIZE --> MAYBE[MAYBE Potentially useful]
P1_CATEGORIZE --> SKIP[SKIP Not relevant]
end
subgraph "PHASE 2: Deep Dive"
REL --> P2_PARSE[parse_file on RELEVANT docs]
MAYBE -.->|If needed| P2_PARSE
P2_PARSE --> P2_EXTRACT[Extract key information]
P2_EXTRACT --> P2_CROSS{Cross-references found?}
end
subgraph "PHASE 3: Backtracking"
P2_CROSS -->|Yes| P3_CHECK{Referenced doc was SKIPPED?}
P3_CHECK -->|Yes| P3_BACKTRACK[Go back and parse referenced document]
P3_BACKTRACK --> P2_EXTRACT
P3_CHECK -->|No| P3_CONTINUE[Continue analysis]
P2_CROSS -->|No| P3_CONTINUE
end
subgraph "Final Answer"
P3_CONTINUE --> ANSWER[Generate answer with citations]
ANSWER --> SOURCES[List sources consulted]
SOURCES --> END([Return to user])
end
style P1_START fill:#4caf50,color:#fff
style END fill:#4caf50,color:#fff
style REL fill:#4caf50,color:#fff
style MAYBE fill:#ff9800,color:#000
style SKIP fill:#9e9e9e,color:#fff
style P3_BACKTRACK fill:#e91e63,color:#fff
```
### Cross-Reference Detection
```mermaid
flowchart LR
subgraph "Document Content"
DOC[Parsed Document]
end
subgraph "Pattern Matching"
DOC --> P1["'See Exhibit A/B/C...'"]
DOC --> P2["'As stated in [Document]...'"]
DOC --> P3["'Refer to [filename]...'"]
DOC --> P4["'per Document: [name]'"]
DOC --> P5["'[Doc #XX]'"]
end
subgraph "Action"
P1 --> FOUND[Cross-reference found]
P2 --> FOUND
P3 --> FOUND
P4 --> FOUND
P5 --> FOUND
FOUND --> CHECK{Was referenced doc SKIPPED?}
CHECK -->|Yes| BACKTRACK[Backtrack and parse]
CHECK -->|No| CONTINUE[Continue]
end
style BACKTRACK fill:#e91e63,color:#fff
```
---
## Token Tracking & Cost Estimation
### TokenUsage Class
```mermaid
flowchart TB
subgraph "Input Tracking"
API[API Call] --> PROMPT[prompt_token_count]
API --> COMPLETION[candidates_token_count]
PROMPT --> ADD_API[add_api_call]
COMPLETION --> ADD_API
end
subgraph "Tool Tracking"
TOOL[Tool Execution] --> RESULT[result string]
RESULT --> ADD_TOOL[add_tool_result]
ADD_TOOL --> CHARS[tool_result_chars += len]
ADD_TOOL --> PARSED{tool_name?}
PARSED -->|parse_file| INC_PARSED[documents_parsed++]
PARSED -->|preview_file| INC_PARSED
PARSED -->|scan_folder| INC_SCANNED[documents_scanned += count]
end
subgraph "Cost Calculation"
ADD_API --> TOTALS[Update totals]
TOTALS --> CALC[_calculate_cost]
CALC --> INPUT_COST["input_cost = prompt_tokens × $0.075/1M"]
CALC --> OUTPUT_COST["output_cost = completion_tokens × $0.30/1M"]
INPUT_COST --> TOTAL_COST[total_cost]
OUTPUT_COST --> TOTAL_COST
end
subgraph "Summary Output"
TOTAL_COST --> SUMMARY[summary]
CHARS --> SUMMARY
INC_PARSED --> SUMMARY
INC_SCANNED --> SUMMARY
end
```
### Cost Estimation Formula
```mermaid
graph LR
subgraph "Gemini 2.0 Flash Pricing"
INPUT["Input: $0.075 / 1M tokens"]
OUTPUT["Output: $0.30 / 1M tokens"]
end
subgraph "Calculation"
PROMPT[prompt_tokens] --> DIV1[÷ 1,000,000]
DIV1 --> MULT1[× $0.075]
MULT1 --> INPUT_COST[Input Cost]
COMP[completion_tokens] --> DIV2[÷ 1,000,000]
DIV2 --> MULT2[× $0.30]
MULT2 --> OUTPUT_COST[Output Cost]
INPUT_COST --> SUM[+]
OUTPUT_COST --> SUM
SUM --> TOTAL[Total Estimated Cost]
end
style TOTAL fill:#4caf50,color:#fff
```
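The formula above is simple enough to state directly in code. Prices are the Gemini 2.0 Flash rates quoted in the diagram; check current pricing before relying on the estimate.

```python
# Cost estimate from the diagram above (rates as quoted in this document).
INPUT_PRICE_PER_M = 0.075   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.30   # USD per 1M output tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Each side is tokens / 1M * rate; the total is their sum.
    input_cost = prompt_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = completion_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost
```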
---
## CLI Interface
### Output Formatting
```mermaid
flowchart TB
subgraph "Event Handling"
EVENT{Event Type}
EVENT -->|ToolCallEvent| TOOL_PANEL[format_tool_panel]
EVENT -->|GoDeeperEvent| NAV_PANEL[format_navigation_panel]
EVENT -->|AskHumanEvent| HUMAN_PANEL[Human Input Panel]
EVENT -->|ExplorationEndEvent| FINAL_PANEL[Final Answer Panel]
end
subgraph "Tool Panel Components"
TOOL_PANEL --> ICON[Tool Icon 📂📖👁️🔍]
TOOL_PANEL --> STEP[Step Number]
TOOL_PANEL --> PHASE[Phase Label]
TOOL_PANEL --> TARGET[Target File/Directory]
TOOL_PANEL --> REASON[Agent's Reasoning]
end
subgraph "Final Panel Components"
FINAL_PANEL --> ANSWER[Answer with Citations]
FINAL_PANEL --> SOURCES[Sources Consulted]
end
subgraph "Summary Panel"
SUMMARY[Workflow Summary]
SUMMARY --> STEPS[Total Steps]
SUMMARY --> CALLS[API Calls]
SUMMARY --> DOCS[Documents Scanned/Parsed]
SUMMARY --> TOKENS[Token Usage]
SUMMARY --> COST[Estimated Cost]
end
FINAL_PANEL --> SUMMARY
```
### Visual Elements
```mermaid
graph TB
subgraph "Panel Styles"
TOOL["📂 Tool Call border: yellow"]
NAV["📁 Navigation border: magenta"]
HUMAN["❓ Human Input border: red"]
FINAL["✅ Final Answer border: green"]
SUMMARY["📊 Summary border: blue"]
end
subgraph "Tool Icons"
I1["📂 scan_folder"]
I2["👁️ preview_file"]
I3["📖 parse_file"]
I4["📄 read"]
I5["🔍 grep"]
I6["🔎 glob"]
end
subgraph "Phase Labels"
PH1["Phase 1: Parallel Document Scan"]
PH2["Phase 2: Deep Dive"]
PH3["Phase 1/2: Quick Preview"]
end
style TOOL fill:#ffeb3b,color:#000
style NAV fill:#e1bee7,color:#000
style HUMAN fill:#ffcdd2,color:#000
style FINAL fill:#c8e6c9,color:#000
style SUMMARY fill:#bbdefb,color:#000
```
---
## Data Flow
### Complete Request Flow
```mermaid
sequenceDiagram
participant User
participant CLI as main.py
participant WF as Workflow
participant Agent as FsExplorerAgent
participant LLM as Gemini API
participant Tools as Tool Registry
participant Docling
participant Cache
participant FS as Filesystem
User->>CLI: uv run explore --task "..."
CLI->>CLI: print_workflow_header()
CLI->>WF: workflow.run(InputEvent)
loop Until StopAction
WF->>Agent: configure_task()
Agent->>LLM: generate_content()
LLM-->>Agent: Action JSON
Agent->>Agent: Track tokens
alt ToolCallAction
Agent->>Tools: TOOLS[name](**args)
alt Document Tool
Tools->>Cache: Check cache
alt Cache Hit
Cache-->>Tools: Cached content
else Cache Miss
Cache->>Docling: Convert document
Docling->>FS: Read file
FS-->>Docling: Raw bytes
Docling-->>Cache: Markdown content
Cache-->>Tools: Content
end
else Filesystem Tool
Tools->>FS: Execute operation
FS-->>Tools: Result
end
Tools-->>Agent: Tool result
Agent->>Agent: Track tool metrics
WF-->>CLI: ToolCallEvent
CLI->>CLI: format_tool_panel()
else GoDeeperAction
WF->>WF: Update directory state
WF-->>CLI: GoDeeperEvent
CLI->>CLI: format_navigation_panel()
else AskHumanAction
WF-->>CLI: AskHumanEvent
CLI->>User: Display question
User->>CLI: Enter response
CLI->>WF: HumanAnswerEvent
else StopAction
WF-->>CLI: ExplorationEndEvent
end
end
CLI->>CLI: Display final answer
CLI->>CLI: print_workflow_summary()
CLI-->>User: Complete output
```
---
## File Structure
```
fs-explorer/
├── src/
│   └── fs_explorer/
│       ├── __init__.py       # Public API exports
│       ├── main.py           # CLI entry point (typer)
│       ├── workflow.py       # Event-driven workflow orchestration
│       ├── agent.py          # AI agent + Gemini integration
│       ├── models.py         # Pydantic action schemas
│       └── fs.py             # Filesystem + Docling operations
├── tests/
│   ├── conftest.py           # Test fixtures and mocks
│   ├── test_agent.py         # Agent unit tests
│   ├── test_fs.py            # Filesystem function tests
│   ├── test_models.py        # Model tests
│   ├── test_e2e.py           # End-to-end integration tests
│   └── testfiles/            # Test data
├── data/
│   ├── large_acquisition/    # Sample PDF documents
│   └── test_acquisition/     # Test document set
├── scripts/
│   ├── generate_test_docs.py
│   └── generate_large_docs.py
├── pyproject.toml            # Project configuration
├── Makefile                  # Development commands
├── README.md                 # User documentation
└── ARCHITECTURE.md           # This file
```
---
## Extension Points
### Adding New Tools
```mermaid
flowchart LR
subgraph "Step 1: Define Function"
FUNC[def new_tool(args) -> str]
end
subgraph "Step 2: Register Tool"
TOOLS["TOOLS dict in agent.py"]
FUNC --> TOOLS
end
subgraph "Step 3: Update Types"
TYPES["Tools TypeAlias in models.py"]
TOOLS --> TYPES
end
subgraph "Step 4: Update Prompt"
PROMPT["SYSTEM_PROMPT in agent.py"]
TYPES --> PROMPT
end
style FUNC fill:#e3f2fd
style TOOLS fill:#f3e5f5
style TYPES fill:#fff3e0
style PROMPT fill:#e8f5e9
```
### Adding New Document Formats
```mermaid
flowchart LR
subgraph "Docling Supported"
PDF[PDF] --> DOCLING[Docling]
DOCX[DOCX] --> DOCLING
PPTX[PPTX] --> DOCLING
XLSX[XLSX] --> DOCLING
HTML[HTML] --> DOCLING
MD[Markdown] --> DOCLING
end
subgraph "To Add New Format"
NEW[New Format] --> CHECK{Docling supports?}
CHECK -->|Yes| ADD["Add to SUPPORTED_EXTENSIONS"]
CHECK -->|No| CUSTOM["Create custom handler in fs.py"]
end
DOCLING --> OUTPUT[Markdown]
ADD --> OUTPUT
CUSTOM --> OUTPUT
```
### Customizing the System Prompt
The system prompt in `agent.py` can be modified to:
1. **Add new exploration strategies**
2. **Change citation format**
3. **Adjust categorization criteria**
4. **Add domain-specific instructions**
```python
SYSTEM_PROMPT = """
# Customize this prompt to change agent behavior
## Your custom instructions here
...
"""
```
---
## Performance Characteristics
| Metric | Typical Value | Notes |
|--------|---------------|-------|
| Parallel scan threads | 4 | Configurable via `DEFAULT_MAX_WORKERS` |
| Preview size | 1500 chars | ~1 page of content |
| Full preview size | 3000 chars | ~2-3 pages |
| Document cache | In-memory | Keyed by path + mtime |
| Workflow timeout | 300 seconds | 5 minutes for complex queries |
| API model | gemini-2.0-flash | Fast, cost-effective |
---
## Security Considerations
1. **API Key**: Stored in environment variable `GOOGLE_API_KEY`
2. **Local Processing**: Documents parsed locally via Docling (no cloud upload)
3. **Filesystem Access**: Limited to current working directory
4. **No Persistent Storage**: Document cache is in-memory only
---
*Last updated: 2026-01-03*
================================================
FILE: CLAUDE.md
================================================
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Agentic File Search is an AI-powered document search agent that explores files dynamically rather than using pre-computed embeddings. It uses a three-phase strategy: parallel scan, deep dive, and backtracking for cross-references. There is also an optional DuckDB-backed indexing pipeline for pre-indexed semantic+metadata retrieval.
**Tech Stack:** Python 3.10+, Google Gemini 3 Flash, LlamaIndex Workflows, Docling (document parsing), DuckDB (indexing), langextract (optional metadata extraction), FastAPI + WebSocket, Typer + Rich CLI.
## Common Commands
```bash
# Install dependencies
uv pip install .
uv pip install -e ".[dev]" # with dev dependencies
# Run CLI (agentic exploration)
uv run explore --task "What is the purchase price?" --folder data/test_acquisition/
# Run CLI (indexed query - requires prior indexing)
uv run explore index data/test_acquisition/
uv run explore query --task "What is the purchase price?" --folder data/test_acquisition/
# Schema management
uv run explore schema discover data/test_acquisition/
uv run explore schema show data/test_acquisition/
# Run web UI
uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000
# Run tests
uv run pytest # all tests
uv run pytest tests/test_fs.py # single file
uv run pytest -k "test_name" # single test
# Lint, format, typecheck (also available via Makefile)
uv run pre-commit run -a # lint (or: make lint)
uv run ruff check . # ruff only
uv run ruff format # format (or: make format)
uv run ty check src/fs_explorer/ # typecheck (or: make typecheck)
```
Entry points defined in `pyproject.toml`: `explore` → `fs_explorer.main:app`, `explore-ui` → `fs_explorer.server:run_server`.
## Architecture
### Core Flow (Agentic Mode)
```
User Query → Workflow (LlamaIndex) → Agent (Gemini) → Tools → Docling → Filesystem
```
### Core Flow (Indexed Mode)
```
User Query → Workflow → Agent → semantic_search/get_document → DuckDB → Ranked Results
```
### Key Modules (src/fs_explorer/)
- **workflow.py**: Event-driven orchestration using `llama-index-workflows`. Defines `FsExplorerWorkflow` with steps: `start_exploration`, `go_deeper_action`, `tool_call_action`, `receive_human_answer`. Uses singleton agent via `get_agent()`.
- **agent.py**: `FsExplorerAgent` manages Gemini API interaction. Chat history accumulates in `_chat_history`. `take_action()` sends history to LLM, receives structured JSON `Action`, auto-executes tool calls. `TokenUsage` tracks costs. Also contains the `TOOLS` registry (9 tools), `SYSTEM_PROMPT`, and indexed tool functions (`semantic_search`, `get_document`, `list_indexed_documents`). Index context is managed via module-level `set_index_context()`/`clear_index_context()`.
- **models.py**: Pydantic schemas for structured LLM output. `Action` contains one of: `ToolCallAction`, `GoDeeperAction`, `StopAction`, `AskHumanAction`. `Tools` TypeAlias defines all available tool names.
- **fs.py**: Filesystem operations. `scan_folder()` uses ThreadPoolExecutor for parallel document processing. `_DOCUMENT_CACHE` (dict) caches parsed documents keyed by `path:mtime`. Docling converts PDF/DOCX/PPTX/XLSX/HTML/MD to markdown.
- **main.py**: Typer CLI entry point with subcommands: default (agentic explore), `index`, `query`, `schema discover`, `schema show`.
- **server.py**: FastAPI server with WebSocket endpoint `/ws/explore` for real-time streaming.
- **exploration_trace.py**: Records tool call paths and extracts cited sources from final answers for the CLI summary.
### Indexing Subsystem (src/fs_explorer/indexing/)
- **pipeline.py**: `IndexingPipeline` orchestrates document parsing → chunking → metadata extraction → DuckDB upsert. Walks a folder for supported files, delegates to `SmartChunker` and `extract_metadata()`, handles schema resolution and deleted-file cleanup.
- **chunker.py**: `SmartChunker` splits parsed document text into overlapping chunks.
- **schema.py**: `SchemaDiscovery` auto-discovers metadata schemas from a corpus folder (file types, heuristic boolean fields like `mentions_currency`/`mentions_dates`). Optionally includes langextract fields.
- **metadata.py**: `extract_metadata()` produces per-document metadata dicts. Heuristic fields (filename, extension, document_type, currency/date detection) are always available. Optional langextract integration calls the `langextract` library for entity extraction (organizations, people, deal terms, etc.) via configurable profiles.
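The overlapping-chunk idea behind `SmartChunker` can be sketched as below; the real class may use different sizes, boundaries, and token-aware splitting, so treat this as an illustration only.

```python
# Hypothetical sketch of overlapping chunking (not SmartChunker's actual logic).
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Advance by (size - overlap) so consecutive chunks share context.
        start += size - overlap
    return chunks
```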
### Search Subsystem (src/fs_explorer/search/)
- **query.py**: `IndexedQueryEngine` runs parallel semantic (chunk text matching) + metadata (JSON filter) retrieval paths using ThreadPoolExecutor, then merges and ranks via `RankedDocument.combined_score`.
- **filters.py**: `parse_metadata_filters()` parses a human-readable filter DSL (`field=value`, `field>=num`, `field in (a, b)`, `field~substring`) into `MetadataFilter` objects. Validates against allowed schema fields.
- **ranker.py**: `RankedDocument` dataclass with `combined_score` (semantic * 100 + metadata * 10). `rank_documents()` sorts and limits.
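The `combined_score` weighting described above can be sketched as follows; the dataclass name comes from the doc, but the field names here are assumptions.

```python
from dataclasses import dataclass

# Sketch of the ranker's weighting: semantic * 100 + metadata * 10.
@dataclass
class RankedDocument:
    path: str
    semantic_score: float
    metadata_score: float

    @property
    def combined_score(self) -> float:
        # Semantic matches dominate; metadata matches act as a tiebreaker.
        return self.semantic_score * 100 + self.metadata_score * 10

def rank_documents(docs: list[RankedDocument], limit: int = 10) -> list[RankedDocument]:
    return sorted(docs, key=lambda d: d.combined_score, reverse=True)[:limit]
```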
### Storage Subsystem (src/fs_explorer/storage/)
- **duckdb.py**: `DuckDBStorage` manages four tables: `corpora`, `documents`, `chunks`, `schemas`. Key operations: `upsert_document`, `search_chunks` (keyword-based scoring), `search_documents_by_metadata` (JSON path filtering via `json_extract_string`), schema CRUD. Corpus/doc/chunk IDs are SHA1-based stable hashes.
- **base.py**: `StorageBackend` protocol and shared dataclasses (`DocumentRecord`, `ChunkRecord`, `SchemaRecord`).
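The stable-ID scheme can be sketched as hashing each record's identifying fields; the exact fields joined into each hash here (corpus root, relative path, chunk position) are assumptions, but the SHA1-of-identity pattern is what makes re-indexing idempotent:

```python
import hashlib


def stable_id(*parts: str) -> str:
    """Deterministic ID from identifying fields: the same inputs always
    map to the same row, so upserts replace rather than duplicate."""
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


# Hypothetical derivation chain: corpus -> document -> chunk.
corpus_id = stable_id("/data/test_acquisition")
doc_id = stable_id(corpus_id, "01_master_agreement.pdf")
chunk_id = stable_id(doc_id, "0")  # chunk position as the discriminator
```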
### Index Config
- **index_config.py**: `resolve_db_path()` resolves DuckDB path with precedence: CLI `--db-path` > `FS_EXPLORER_DB_PATH` env > `~/.fs_explorer/index.duckdb`.
### Workflow Event Types
- `InputEvent` → starts exploration
- `ToolCallEvent` → tool execution
- `GoDeeperEvent` → directory navigation
- `AskHumanEvent`/`HumanAnswerEvent` → human interaction
- `ExplorationEndEvent` → completion with `final_result` or `error`
### Adding New Tools
1. Implement function in `fs.py` (filesystem) or `agent.py` (indexed) returning `str`
2. Add to `TOOLS` dict in `agent.py`
3. Add to `Tools` TypeAlias in `models.py`
4. Update `SYSTEM_PROMPT` in `agent.py`
5. Update `TOOL_ICONS` and `PHASE_DESCRIPTIONS` in `main.py`
## Environment
- `GOOGLE_API_KEY` (required) — in `.env` file or environment variable
- `FS_EXPLORER_DB_PATH` (optional) — override default DuckDB location
- `FS_EXPLORER_LANGEXTRACT_MAX_CHARS` (optional) — max chars sent to langextract (default 6000)
- `FS_EXPLORER_LANGEXTRACT_MODEL` (optional) — model for langextract (default `gemini-3-flash-preview`)
## Testing
Tests mock the Gemini client via `MockGenAIClient` in `conftest.py`. Use `reset_agent()` to clear singleton state between tests. The mock always returns a `StopAction` response.
Key test files:
- `test_agent.py` / `test_e2e.py` — agent and workflow integration
- `test_fs.py` — filesystem tools
- `test_indexing.py` / `test_cli_indexing.py` — indexing pipeline and CLI
- `test_search.py` — search/filter/ranking
- `test_exploration_trace.py` — trace and citation extraction
Test documents live in `data/test_acquisition/` and `data/large_acquisition/`. Test fixtures for unit tests are in `tests/testfiles/`.
================================================
FILE: IMPLEMENTATION_PLAN.md
================================================
# Implementation Plan: Hybrid Semantic + Agentic Search (Revised)
## Overview
Add semantic search with optional metadata filtering to `agentic-file-search` without regressing the current agentic workflow.
The revised approach keeps the current CLI and behavior stable first, introduces indexing as opt-in, and only enables auto-detection after compatibility and quality checks pass.
- Storage: DuckDB + `vss` (embedded, local file)
- Embeddings: Gemini embeddings (API-backed)
- Metadata extraction: `langextract` (optional)
- Infrastructure model: no external database service (no Docker/Postgres required)
---
## Goals
1. Preserve existing `explore --task` behavior and UX by default.
2. Add a fast indexed path for large corpora.
3. Support metadata-aware filtering when metadata is available.
4. Keep agentic deep-read and cross-reference behavior available.
## Non-Goals (Initial Release)
1. Replacing the existing agentic strategy entirely.
2. Forcing index usage for all queries.
3. Heuristic/NLP folder extraction from free-form task text.
---
## Current Codebase Constraints to Respect
1. CLI currently has one root command (`explore --task`) and no subcommands.
2. Workflow and server currently use shared/global process state (`os.chdir`, singleton agent).
3. Existing tests assert the current 6-tool model and prompt behavior.
These constraints require a staged rollout to avoid breaking current users.
---
## High-Level Architecture
```text
INDEX TIME
├── Parse documents (Docling)
├── Chunk content (paragraph/sentence-aware)
├── Generate embeddings (provider-configured dimension)
├── [optional] Extract metadata (langextract)
└── Persist in DuckDB (corpus-scoped)
QUERY TIME
├── Retrieve by semantic search
├── [optional] Retrieve by metadata filter
├── Union + rank results
├── Expand via cross-references where needed
└── Agent continues deep exploration using existing tools
```
---
## Data Model (DuckDB)
Use corpus-scoped tables and file freshness fields to prevent collisions and stale indexes.
```sql
-- Install and load extension programmatically
-- INSTALL vss; LOAD vss;

CREATE TABLE IF NOT EXISTS corpora (
    id VARCHAR PRIMARY KEY,
    root_path VARCHAR NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS documents (
    id VARCHAR PRIMARY KEY,
    corpus_id VARCHAR NOT NULL REFERENCES corpora(id),
    relative_path VARCHAR NOT NULL,
    absolute_path VARCHAR NOT NULL,
    content VARCHAR NOT NULL,
    metadata JSON NOT NULL DEFAULT '{}',
    file_mtime DOUBLE NOT NULL,
    file_size BIGINT NOT NULL,
    content_sha256 VARCHAR NOT NULL,
    last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_deleted BOOLEAN DEFAULT FALSE,
    UNIQUE(corpus_id, relative_path)
);

-- EMBEDDING_DIM is configured in code at index creation time.
CREATE TABLE IF NOT EXISTS chunks (
    id VARCHAR PRIMARY KEY,
    doc_id VARCHAR NOT NULL REFERENCES documents(id),
    text VARCHAR NOT NULL,
    embedding FLOAT[${EMBEDDING_DIM}] NOT NULL,
    embedding_dim INTEGER NOT NULL,
    position INTEGER NOT NULL,
    start_char INTEGER NOT NULL,
    end_char INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS schemas (
    id INTEGER PRIMARY KEY,
    corpus_id VARCHAR REFERENCES corpora(id),
    name VARCHAR,
    schema_def JSON NOT NULL,
    is_active BOOLEAN DEFAULT FALSE,
    UNIQUE(corpus_id, name)
);

CREATE INDEX IF NOT EXISTS idx_chunks_embedding
    ON chunks USING HNSW (embedding) WITH (metric = 'cosine');
```
### Embedding Dimension Rule
`EMBEDDING_DIM` must be a runtime config constant validated at startup. Do not hardcode `1536` across modules.
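A minimal startup check along these lines would enforce the rule (the function name is hypothetical; the real check belongs with `embeddings.py`):

```python
def validate_embedding_dim(configured_dim: int, probe_vector: list[float]) -> int:
    """Fail fast at startup if the provider's actual output dimension
    disagrees with the configured EMBEDDING_DIM.

    `probe_vector` is one embedding obtained from the configured provider.
    """
    if len(probe_vector) != configured_dim:
        raise ValueError(
            f"Embedding dimension mismatch: configured {configured_dim}, "
            f"provider returned {len(probe_vector)}"
        )
    return configured_dim
```

Running this once before creating the `chunks` table prevents a silent mismatch between the `FLOAT[${EMBEDDING_DIM}]` column and the vectors actually produced.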
### DB Location
Default: `~/.fs_explorer/index.duckdb`
Override via:
- `FS_EXPLORER_DB_PATH`
- CLI: `--db-path`
---
## CLI Contract and Rollout
### Compatibility Rules (Required)
1. `uv run explore --task "..."` must keep working as-is.
2. Existing non-indexed behavior remains default in initial rollout.
3. New indexed behavior is opt-in first.
### New Commands
```bash
# Index management
uv run explore index
uv run explore index --with-metadata
uv run explore index --schema schema.json
# Indexed query path
uv run explore query --task "..." --folder <path> [--filter "..."]
# Schema inspection
uv run explore schema --discover
uv run explore schema --show --folder <path>
# Existing command (backward-compatible)
uv run explore --task "..." [--folder <path>] [--use-index]
```
### Folder Resolution (Deterministic)
For commands that need corpus selection:
1. If `--folder` is provided, use it.
2. Else use current working directory (`.`).
3. Do not parse folder intent from natural language task text in v1.
### Auto-Detection Strategy
- v1: explicit `--use-index` only.
- v2: optional auto-detect behind feature flag `FS_EXPLORER_AUTO_INDEX=1`.
- v3: default auto-detect only after parity tests and quality benchmarks pass.
---
## Server and Concurrency Requirements
Before adding indexing/search endpoints:
1. Remove request-level `os.chdir` usage; pass absolute target folder through workflow state.
2. Avoid global singleton agent across concurrent requests; instantiate per workflow run/session.
3. Add per-corpus index lock to avoid concurrent write corruption.
4. Keep read queries concurrent-safe.
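Requirement 3 can be sketched as a per-corpus lock registry (names `corpus_lock` and `_corpus_locks` are hypothetical):

```python
import threading
from collections import defaultdict

# One lock per corpus ID: writes to the same corpus serialize,
# while different corpora can index concurrently.
_corpus_locks: dict[str, threading.Lock] = defaultdict(threading.Lock)
_registry_lock = threading.Lock()


def corpus_lock(corpus_id: str) -> threading.Lock:
    with _registry_lock:  # guard the defaultdict mutation itself
        return _corpus_locks[corpus_id]


def reindex(corpus_id: str) -> None:
    with corpus_lock(corpus_id):
        ...  # write documents/chunks for this corpus
```

Note this only protects a single process; a multi-process deployment would need a file lock next to the DuckDB file instead.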
---
## Module Structure
```text
src/fs_explorer/
├── storage/
│ ├── __init__.py
│ ├── base.py
│ └── duckdb.py
├── indexing/
│ ├── __init__.py
│ ├── pipeline.py
│ ├── chunker.py
│ ├── metadata.py
│ └── schema.py
├── search/
│ ├── __init__.py
│ ├── query.py
│ ├── semantic.py
│ ├── filters.py
│ └── ranker.py
├── embeddings.py
└── index_config.py
```
---
## Files to Modify
| File | Changes |
|------|---------|
| `src/fs_explorer/agent.py` | Add indexed tools and prompt guidance while keeping existing tools |
| `src/fs_explorer/models.py` | Extend `Tools` type alias |
| `src/fs_explorer/main.py` | Add subcommands + `--folder` + `--use-index` while preserving root command |
| `src/fs_explorer/workflow.py` | Remove global/shared run-state assumptions |
| `src/fs_explorer/fs.py` | Support safe path resolution without cwd mutation |
| `src/fs_explorer/server.py` | Add index/search endpoints and remove `os.chdir` coupling |
| `pyproject.toml` | Add `duckdb`, `langextract` |
---
## Implementation Phases
### Phase 0: Contracts and Safety (New)
1. Freeze CLI compatibility requirements (`explore --task` must remain stable).
2. Define deterministic folder resolution contract.
3. Define per-request state model for workflow/server.
4. Add failing tests for compatibility and concurrency assumptions.
### Phase 1: Storage + Embeddings
5. Implement `storage/base.py` (backend interface).
6. Implement `storage/duckdb.py` with corpus-scoped schema.
7. Implement `embeddings.py` with configurable embedding dimension.
8. Add storage/embedding tests (including dimension validation).
### Phase 2: Indexing Pipeline
9. Implement `indexing/chunker.py`.
10. Implement optional `indexing/metadata.py`.
11. Implement `indexing/schema.py`.
12. Implement `indexing/pipeline.py` with freshness checks (`mtime`, hash, deleted files).
13. Add indexing tests.
### Phase 3: Search Pipeline
14. Implement `search/filters.py`.
15. Implement `search/ranker.py`.
16. Implement `search/query.py` (parallel retrieval + union).
17. Implement cross-reference expansion hooks.
18. Add search tests.
### Phase 4: Agent Integration (Opt-in)
19. Add tools: `semantic_search`, `get_document`, `list_indexed_documents`.
20. Keep existing 6 filesystem tools available.
21. Add indexed prompt guidance without removing current strategy.
22. Add tool-selection tests for indexed and non-indexed paths.
### Phase 5: CLI + Server Integration
23. Add `explore index/query/schema` commands.
24. Add `--folder` and `--use-index` to root command.
25. Integrate indexed path into workflow when explicitly requested.
26. Add `/api/index` and `/api/search` endpoints.
27. Remove `os.chdir` in server workflow path.
### Phase 6: Auto-Detect Rollout (Guarded)
28. Add feature-flagged auto-detect (`FS_EXPLORER_AUTO_INDEX`).
29. Add parity checks between indexed and baseline runs on test corpora.
30. Keep fallback to legacy behavior on index errors.
### Phase 7: Testing and Docs
31. Full integration tests.
32. Backward compatibility tests.
33. Concurrency tests for WebSocket/API usage.
34. Performance benchmarks and docs updates.
---
## Revised Design Decisions
1. **Opt-in First**: indexed retrieval starts behind `--use-index` to avoid regressions.
2. **Deterministic Corpus Selection**: explicit `--folder` or `.` fallback only.
3. **Corpus-Scoped Storage**: avoid global path collisions by namespacing.
4. **Freshness Tracking**: incremental reindex using mtime/hash/deletion markers.
5. **No Global Request State**: remove `os.chdir` and shared singleton pitfalls in server flows.
6. **Configurable Embedding Dimension**: validated at runtime; not hardcoded everywhere.
7. **No External DB Service**: embedded local DB only; APIs are still external dependencies.
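Design decision 4 amounts to a cheap freshness predicate (a sketch; the real pipeline also marks deletions and batches work):

```python
import hashlib
from pathlib import Path


def needs_reindex(path: Path, indexed_mtime: float, indexed_sha256: str) -> bool:
    """Incremental reindex check: cheap mtime comparison first, content
    hash only when the mtime changed."""
    if not path.exists():
        return True  # caller should mark the document is_deleted
    if path.stat().st_mtime == indexed_mtime:
        return False  # unchanged since last index
    return hashlib.sha256(path.read_bytes()).hexdigest() != indexed_sha256
```

The hash step catches tools that rewrite files with identical content, so a `touch` does not force a full re-embed.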
---
## Verification Steps
```bash
# Baseline safety (must stay green)
uv run pytest tests/test_models.py tests/test_fs.py tests/test_agent.py -v
# Phase 1-3
uv run pytest tests/test_storage.py tests/test_embeddings.py tests/test_search.py -v
# Index build + inspect
uv run explore index data/test_acquisition/
uv run python -c "import duckdb, os; db=duckdb.connect(os.path.expanduser('~/.fs_explorer/index.duckdb')); print(db.execute('SELECT COUNT(*) FROM documents').fetchone())"
# Opt-in indexed execution
uv run explore --task "Search for acquisition terms" --folder data/test_acquisition --use-index
# Compatibility execution (legacy path)
uv run explore --task "Look in data/test_acquisition/. Who is the CTO?"
# CLI checks
uv run explore --help
uv run explore index --help
uv run explore query --help
uv run explore schema --help
# Full suite
uv run pytest tests/ -v
```
---
## Dependencies to Add
```toml
# pyproject.toml
dependencies = [
    # ... existing ...
    "duckdb>=1.0.0",
    "langextract>=1.0.0",
]
```
---
## Critical Files Summary
| Purpose | Path |
|---------|------|
| Storage interface | `src/fs_explorer/storage/base.py` |
| DuckDB backend | `src/fs_explorer/storage/duckdb.py` |
| Embeddings | `src/fs_explorer/embeddings.py` |
| Chunking | `src/fs_explorer/indexing/chunker.py` |
| Metadata extraction | `src/fs_explorer/indexing/metadata.py` |
| Schema discovery | `src/fs_explorer/indexing/schema.py` |
| Indexing pipeline | `src/fs_explorer/indexing/pipeline.py` |
| Query pipeline | `src/fs_explorer/search/query.py` |
| Filter parsing | `src/fs_explorer/search/filters.py` |
| Result ranking | `src/fs_explorer/search/ranker.py` |
| Agent tools/prompt | `src/fs_explorer/agent.py` |
| Tool types | `src/fs_explorer/models.py` |
| CLI commands | `src/fs_explorer/main.py` |
| Workflow safety | `src/fs_explorer/workflow.py` |
| Server safety/endpoints | `src/fs_explorer/server.py` |
================================================
FILE: Makefile
================================================
.PHONY: all test lint format format-check typecheck build

all: test lint format typecheck

test:
	$(info ****************** running tests ******************)
	uv run pytest tests

lint:
	$(info ****************** linting ******************)
	uv run pre-commit run -a

format:
	$(info ****************** formatting ******************)
	uv run ruff format

format-check:
	$(info ****************** checking formatting ******************)
	uv run ruff format --check

typecheck:
	$(info ****************** type checking ******************)
	uv run ty check src/fs_explorer/

build:
	$(info ****************** building ******************)
	uv build
================================================
FILE: README.md
================================================
# Agentic File Search
> **Based on**: [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) — The original CLI agent for filesystem exploration.
An AI-powered document search agent that explores files like a human would — scanning, reasoning, and following cross-references. Unlike traditional RAG systems that rely on pre-computed embeddings, this agent dynamically navigates documents to find answers.
## Why Agentic Search?
Traditional RAG (Retrieval-Augmented Generation) has limitations:
- **Chunks lose context** — Splitting documents destroys relationships between sections
- **Cross-references are invisible** — "See Exhibit B" means nothing to embeddings
- **Similarity ≠ Relevance** — Semantic matching misses logical connections
This system uses a **three-phase strategy**:
1. **Parallel Scan** — Preview all documents in a folder at once
2. **Deep Dive** — Full extraction on relevant documents only
3. **Backtrack** — Follow cross-references to previously skipped documents
## Watch the video
This video explains the architecture of the project and how to run it.
[Watch the architecture walkthrough on YouTube](https://www.youtube.com/watch?v=rMADSuus6jg)
## Features
- 🔍 **6 Tools**: `scan_folder`, `preview_file`, `parse_file`, `read`, `grep`, `glob`
- 📄 **Document Support**: PDF, DOCX, PPTX, XLSX, HTML, Markdown (via Docling)
- 🤖 **Powered by**: Google Gemini 3 Flash with structured JSON output
- 💰 **Cost Efficient**: ~$0.001 per query with token tracking
- 🌐 **Web UI**: Real-time WebSocket streaming interface
- 📊 **Citations**: Answers include source references
## Installation
```bash
# Clone the repository
git clone https://github.com/PromtEngineer/agentic-file-search.git
cd agentic-file-search
# Install with uv (recommended)
uv pip install .
# Or with pip
pip install .
```
## Configuration
Create a `.env` file in the project root:
```bash
GOOGLE_API_KEY=your_api_key_here
```
Get your API key from [Google AI Studio](https://aistudio.google.com/apikey).
## Usage
### CLI
```bash
# Basic query
uv run explore --task "What is the purchase price in data/test_acquisition/?"
# Multi-document query
uv run explore --task "Look in data/large_acquisition/. What are all the financial terms including adjustments and escrow?"
```
### Web UI
```bash
# Start the server
uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000
# Open http://127.0.0.1:8000 in your browser
```
The web UI provides:
- Folder browser to select target directory
- Real-time step-by-step execution log
- Final answer with citations
- Token usage and cost statistics
## Architecture
```
User Query
↓
┌─────────────────┐
│ Workflow Engine │ ←→ LlamaIndex Workflows (event-driven)
└────────┬────────┘
↓
┌─────────────────┐
│ Agent │ ←→ Gemini 3 Flash (structured JSON)
└────────┬────────┘
↓
┌────────────────────────────────────────────────────┐
│ scan_folder │ preview │ parse │ read │ grep │ glob │
└────────────────────────────────────────────────────┘
↓
Document Parser (Docling - local)
```
See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed diagrams.
## Test Documents
The repo includes test document sets for evaluation:
- `data/test_acquisition/` — 10 interconnected legal documents
- `data/large_acquisition/` — 25 documents with extensive cross-references
Example queries:
```bash
# Simple (single doc)
uv run explore --task "Look in data/test_acquisition/. Who is the CTO?"
# Cross-reference required
uv run explore --task "Look in data/test_acquisition/. What is the adjusted purchase price?"
# Multi-document synthesis
uv run explore --task "Look in data/large_acquisition/. What happens to employees after the acquisition?"
```
## Tech Stack
| Component | Technology |
|-----------|------------|
| LLM | Google Gemini 3 Flash |
| Document Parsing | Docling (local, open-source) |
| Orchestration | LlamaIndex Workflows |
| CLI | Typer + Rich |
| Web Server | FastAPI + WebSocket |
| Package Manager | uv |
## Project Structure
```
src/fs_explorer/
├── agent.py # Gemini client, token tracking
├── workflow.py # LlamaIndex workflow engine
├── fs.py # File tools: scan, parse, grep
├── models.py # Pydantic models for actions
├── main.py # CLI entry point
├── server.py # FastAPI + WebSocket server
└── ui.html # Single-file web interface
```
## Development
```bash
# Install dev dependencies
uv pip install -e ".[dev]"
# Run tests
uv run pytest
# Lint
uv run ruff check .
```
## License
MIT
## Acknowledgments
- Original concept from [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer)
- Document parsing by [Docling](https://github.com/DS4SD/docling)
- Powered by [Google Gemini](https://deepmind.google/technologies/gemini/)
## Star History
[Star History Chart](https://star-history.com/#PromtEngineer/agentic-file-search&Date)
================================================
FILE: YOUTUBE_DEMO_TESTS.md
================================================
# YouTube Demo: FS-Explorer Test Results
## System Overview
- **25 PDF documents** (~93 pages total)
- **63 cross-references** between documents
- **Parallel document scanning** using ThreadPoolExecutor
- **Three-phase exploration**: Scan → Filter → Deep Dive + Backtracking
---
## Test Results Summary
| Test | Question | Docs Parsed | Tokens Used | Cost | Time |
|------|----------|-------------|-------------|------|------|
| 1 | CTO & retention bonus | 1 | 27,376 | $0.0021 | ~30s |
| 2 | Litigation & insurance | 2 | 41,431 | $0.0033 | ~30s |
| 3 | Patents owned | 1 | 27,416 | $0.0022 | ~27s |
---
## TEST 1: Simple Fact Lookup ✅
**Question:**
```bash
uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?"
```
**Agent Answer:**
> Dr. Sarah Chen is the CTO and their retention bonus is $1,200,000.
**Verification (from 07_employee_schedule.pdf):**
```
1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000
```
| Metric | Value |
|--------|-------|
| Documents Scanned | 26 (parallel) |
| Documents Parsed | 1 |
| Tokens Used | 27,376 |
| Cost | $0.0021 |
| ✅ Accuracy | **100% Correct** |
---
## TEST 2: Cross-Reference Following ✅
**Question:**
```bash
uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?"
```
**Agent Answer:**
> The total litigation exposure is between $650,000 and $2,350,000. The Smith v. InnovateTech Solutions, Inc. case, with an exposure of $150,000 - $350,000, is covered by insurance. The DataTech LLC v. InnovateTech Solutions, Inc. case, with an exposure of $500,000 - $2,000,000, does not specify if it is covered by insurance.
**Verification (from 12_litigation_schedule.pdf):**
```
1. Smith v. InnovateTech Solutions, Inc.
Exposure: $150,000 - $350,000 (covered by insurance)
2. DataTech LLC v. InnovateTech Solutions, Inc.
Exposure: Preliminary assessment $500,000 - $2,000,000
```
**Cross-Reference Followed:**
- Agent noticed reference to "Legal Opinion Letter"
- Parsed `24_legal_opinion.pdf` to get complete picture
| Metric | Value |
|--------|-------|
| Documents Scanned | 26 (parallel) |
| Documents Parsed | 2 (litigation + legal opinion) |
| Tokens Used | 41,431 |
| Cost | $0.0033 |
| ✅ Accuracy | **100% Correct** |
---
## TEST 3: Multi-Document IP Analysis ✅
**Question:**
```bash
uv run explore --task "Look in data/large_acquisition/. What patents does the company own and are there any pending applications?"
```
**Agent Answer:**
> The company owns the following patents:
> - US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021
> - US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022
> - US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023
>
> There is one pending application: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024.
**Verification (from 06_ip_schedule.pdf):**
```
US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021
US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022
US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023
Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024
```
| Metric | Value |
|--------|-------|
| Documents Scanned | 26 (parallel) |
| Documents Parsed | 1 |
| Tokens Used | 27,416 |
| Cost | $0.0022 |
| ✅ Accuracy | **100% Correct** |
---
## Additional Demo Tests
### Purchase Price & Payment Structure
```bash
uv run explore --task "Look in data/large_acquisition/. What is the total purchase price and how is it being paid?"
```
**Expected:** $125M total ($80M cash + $30M stock + $15M escrow)
### Closing Conditions Status
```bash
uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?"
```
**Expected:** HSR ✅, State filings ✅, MegaCorp consent ✅, GlobalBank pending, Employee retention ✅, Legal opinion ✅, Good standing ordered
### Key Employee Compensation
```bash
uv run explore --task "Look in data/large_acquisition/. List all the key employees and their retention bonuses"
```
**Expected:** 5 employees totaling $3.5M in retention bonuses
---
## Key Architecture Points to Highlight
### 1. Parallel Scanning (scan_folder)
- Scans ALL 26 documents simultaneously using ThreadPoolExecutor
- Takes ~25 seconds for entire folder
- Returns quick preview of each document
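The parallel scan described above is essentially mapping a preview function over the folder with a thread pool (a sketch: `preview_one` is hypothetical, and the real tool extracts previews via Docling rather than raw text):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def preview_one(path: Path, max_chars: int = 200) -> tuple[str, str]:
    """Hypothetical per-file preview; the real scan parses PDFs via Docling."""
    return path.name, path.read_text(errors="ignore")[:max_chars]


def scan_folder(folder: Path, max_workers: int = 8) -> dict[str, str]:
    """Preview every file in the folder concurrently."""
    files = [p for p in folder.iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(preview_one, files))
```

Since previews are I/O-bound (disk reads, parsing), a thread pool gives near-linear speedup over scanning the 26 files one at a time.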
### 2. Smart Filtering
- LLM reviews all previews at once
- Identifies which documents are relevant
- Avoids parsing irrelevant documents
### 3. Cross-Reference Discovery
- Agent watches for document references like:
- "See Document: Legal Opinion Letter"
- "Per Document: Risk Assessment Memo"
- Automatically follows references (backtracking)
### 4. Document Caching
- Documents cached after first parse
- Backtracking is free (no re-parsing)
---
## Cost Analysis
| Scenario | Tokens | Est. Cost |
|----------|--------|-----------|
| Simple query (1 doc) | ~27K | $0.002 |
| Cross-ref query (2-3 docs) | ~40K | $0.003 |
| Complex synthesis (5+ docs) | ~60K | $0.005 |
| All 25 documents parsed | ~150K | $0.012 |
**Key Insight:** Even with 25 documents, costs are minimal because the system only parses what's needed!
---
## Commands to Run Demo
```bash
# Setup
cd /path/to/fs-explorer
export GOOGLE_API_KEY="your-key"
# Run any test
uv run explore --task "Look in data/large_acquisition/. [YOUR QUESTION]"
```
---
## What to Show in Video
1. **The folder scan** - Watch as 26 documents are scanned in parallel
2. **Smart filtering** - Note which documents the agent CHOOSES to parse
3. **Cross-reference following** - Show agent backtracking to referenced docs
4. **Token usage summary** - Highlight the efficiency stats at the end
5. **Verification** - Show the actual PDF content matches the answer
================================================
FILE: data/large_acquisition/TEST_QUESTIONS.md
================================================
# Test Questions for Large Document Set
## Document Overview
- 25 interconnected documents
- Each document 3-6 pages
- Extensive cross-references between documents
- Total content: ~100+ pages
## Test Questions
### Level 1: Single Document (Easy)
```bash
uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?"
uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?"
uv run explore --task "Look in data/large_acquisition/. What patents does the company own?"
```
### Level 2: Cross-Reference Required (Medium)
```bash
uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?"
uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?"
uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?"
```
### Level 3: Multi-Document Synthesis (Hard)
```bash
uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?"
uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks."
uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?"
```
### Level 4: Deep Cross-Reference (Expert)
```bash
uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?"
uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation."
uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents."
```
================================================
FILE: data/test_acquisition/TEST_QUESTIONS.md
================================================
# Test Questions for Document Exploration
These questions are designed to test the two-stage document exploration approach with cross-reference discovery.
## Test Scenario
**Context:** TechCorp Industries is acquiring StartupXYZ LLC. There are 10 documents in this folder related to the acquisition.
---
## Question Set 1: Simple (Single Document)
These questions can be answered from a single document:
```bash
# Q1: What is the purchase price?
explore --task "What is the total purchase price for the StartupXYZ acquisition?"
# Q2: When did the NDA get signed?
explore --task "When was the Non-Disclosure Agreement between TechCorp and StartupXYZ signed?"
# Q3: How many patents does StartupXYZ have?
explore --task "How many patents does StartupXYZ own?"
```
**Expected Behavior:**
- Agent should preview documents
- Identify the relevant document quickly
- Parse only that document for the answer
---
## Question Set 2: Medium (2-3 Documents with Cross-References)
These questions require following cross-references:
```bash
# Q4: What risks were identified and how were they addressed?
explore --task "What are the key risks identified in this acquisition and what mitigation measures were put in place?"
# Q5: What's the adjusted purchase price?
explore --task "The original purchase price was $45M. Were there any adjustments? What is the final amount?"
# Q6: What happened with customer consents?
explore --task "Which customers required consent for the acquisition, and was consent obtained from all of them?"
```
**Expected Behavior:**
- Agent previews documents
- Reads Risk Assessment Memo
- Notices references to Financial Adjustments, Customer Consents
- Follows cross-references to get complete picture
---
## Question Set 3: Complex (Multiple Documents, Deep Cross-References)
These questions require synthesizing information from many documents:
```bash
# Q7: Complete IP status
explore --task "Give me a complete picture of StartupXYZ's intellectual property - what do they own, is it properly certified, and are there any pending matters or risks?"
# Q8: Due diligence findings and resolution
explore --task "What did the due diligence process uncover, and how were any issues resolved before closing?"
# Q9: Full timeline and status
explore --task "Create a timeline of this acquisition from NDA signing to closing. What are the key milestones and their status?"
# Q10: Closing readiness
explore --task "Is this acquisition ready to close? What items are complete and what's still pending?"
```
**Expected Behavior:**
- Agent should preview all documents first
- Read the most relevant documents (e.g., Closing Checklist references everything)
- Follow cross-references to IP Certification, Due Diligence, Risk Assessment, etc.
- Synthesize information from 5+ documents
---
## Question Set 4: Adversarial (Tests Cross-Reference Discovery)
These questions specifically test if the agent goes back to previously-skipped documents:
```bash
# Q11: Following exhibit references
explore --task "The Acquisition Agreement mentions 'Exhibit A - Financial Terms'. What are the detailed financial terms?"
# Q12: Understanding document relationships
explore --task "How does the Legal Opinion Letter relate to other documents in this acquisition?"
# Q13: Hidden connection
explore --task "Is there anything about MegaCorp in these documents? Why are they important to this deal?"
```
**Expected Behavior:**
- Q11: Agent might initially skip Financial Adjustments, but should go back when Acquisition Agreement references Exhibit A
- Q12: Agent should trace all documents referenced BY and FROM the Legal Opinion
- Q13: MegaCorp is mentioned in Due Diligence, Risk Assessment, and Customer Consents - agent should connect the dots
---
## Scoring Rubric
| Metric | Description |
|--------|-------------|
| **Preview Usage** | Did the agent use `preview_file` before `parse_file`? |
| **Selective Parsing** | Did the agent avoid parsing irrelevant documents? |
| **Cross-Reference Discovery** | Did the agent follow document references? |
| **Backtracking** | Did the agent return to previously-skipped documents when needed? |
| **Answer Completeness** | Was the final answer comprehensive and accurate? |
---
## Running a Test
```bash
export GOOGLE_API_KEY="your-key"
cd /path/to/fs-explorer
uv run explore --task "YOUR QUESTION HERE"
```
Watch for:
1. Which documents get previewed
2. Which documents get fully parsed
3. Whether the agent mentions cross-references
4. Whether the agent goes back to read referenced documents
================================================
FILE: data/testfile.txt
================================================
This is a test.
================================================
FILE: docker/docker-compose.yml
================================================
version: '3.8'
services:
  postgres:
    image: pgvector/pgvector:pg17
    container_name: fs-explorer-db
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-fs_explorer}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-devpassword}
      POSTGRES_DB: ${POSTGRES_DB:-fs_explorer}
    ports:
      - "${POSTGRES_PORT:-5432}:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U fs_explorer -d fs_explorer"]
      interval: 5s
      timeout: 5s
      retries: 5
    restart: unless-stopped

volumes:
  postgres_data:
================================================
FILE: pyproject.toml
================================================
[build-system]
requires = ["uv_build>=0.9.10,<0.10.0"]
build-backend = "uv_build"

[project]
name = "fs-explorer"
version = "0.1.0"
description = "Explore and understand your filesystem better with AI."
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "docling>=2.55.0",
    "duckdb>=1.0.0",
    "fastapi>=0.115.0",
    "google-genai>=1.55.0",
    "langextract>=1.0.0",
    "llama-index-workflows>=2.11.5",
    "python-dotenv>=1.0.0",
    "reportlab>=4.4.7",
    "rich>=13.0.0",
    "typer>=0.12.5,<0.20.0",
    "uvicorn>=0.34.0",
    "websockets>=14.0",
]

[dependency-groups]
dev = [
    "pre-commit>=4.5.0",
    "pytest>=9.0.2",
    "pytest-asyncio>=1.3.0",
    "ruff>=0.14.9",
    "ty>=0.0.1a33",
]

[project.scripts]
explore = "fs_explorer.main:app"
explore-ui = "fs_explorer.server:run_server"
================================================
FILE: scripts/generate_large_docs.py
================================================
#!/usr/bin/env python3
"""
Generate a large set of interconnected legal documents for testing.
Creates 25 documents, each 3-5 pages, with extensive cross-references.
"""
import os
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
OUTPUT_DIR = "data/large_acquisition"
# Document metadata with cross-references
DOCUMENTS = {
"01_master_agreement": {
"title": "MASTER ACQUISITION AGREEMENT",
"refs": ["02_schedules", "03_exhibits", "04_disclosure_schedules", "05_ancillary_agreements"],
"pages": 5
},
"02_schedules": {
"title": "SCHEDULES TO ACQUISITION AGREEMENT",
"refs": ["01_master_agreement", "06_ip_schedule", "07_employee_schedule", "08_contract_schedule"],
"pages": 4
},
"03_exhibits": {
"title": "EXHIBITS TO ACQUISITION AGREEMENT",
"refs": ["01_master_agreement", "09_escrow_agreement", "10_stock_purchase"],
"pages": 3
},
"04_disclosure_schedules": {
"title": "SELLER DISCLOSURE SCHEDULES",
"refs": ["01_master_agreement", "11_financial_statements", "12_litigation_schedule"],
"pages": 5
},
"05_ancillary_agreements": {
"title": "ANCILLARY AGREEMENTS INDEX",
"refs": ["13_nda", "14_non_compete", "15_consulting_agreement", "16_transition_services"],
"pages": 2
},
"06_ip_schedule": {
"title": "SCHEDULE 3.12 - INTELLECTUAL PROPERTY",
"refs": ["01_master_agreement", "17_patent_assignments", "18_trademark_registrations"],
"pages": 4
},
"07_employee_schedule": {
"title": "SCHEDULE 3.15 - EMPLOYEE MATTERS",
"refs": ["01_master_agreement", "19_retention_agreements", "20_benefit_plans"],
"pages": 4
},
"08_contract_schedule": {
"title": "SCHEDULE 3.13 - MATERIAL CONTRACTS",
"refs": ["01_master_agreement", "21_customer_contracts", "22_vendor_contracts"],
"pages": 5
},
"09_escrow_agreement": {
"title": "ESCROW AGREEMENT",
"refs": ["01_master_agreement", "03_exhibits", "11_financial_statements"],
"pages": 4
},
"10_stock_purchase": {
"title": "STOCK PURCHASE DETAILS - EXHIBIT B",
"refs": ["01_master_agreement", "11_financial_statements"],
"pages": 3
},
"11_financial_statements": {
"title": "AUDITED FINANCIAL STATEMENTS",
"refs": ["04_disclosure_schedules", "23_audit_report"],
"pages": 6
},
"12_litigation_schedule": {
"title": "SCHEDULE 3.9 - LITIGATION AND CLAIMS",
"refs": ["04_disclosure_schedules", "24_legal_opinion"],
"pages": 3
},
"13_nda": {
"title": "NON-DISCLOSURE AGREEMENT",
"refs": ["01_master_agreement"],
"pages": 3
},
"14_non_compete": {
"title": "NON-COMPETITION AGREEMENT",
"refs": ["01_master_agreement", "07_employee_schedule"],
"pages": 3
},
"15_consulting_agreement": {
"title": "CONSULTING AGREEMENT - FOUNDER",
"refs": ["01_master_agreement", "07_employee_schedule", "19_retention_agreements"],
"pages": 4
},
"16_transition_services": {
"title": "TRANSITION SERVICES AGREEMENT",
"refs": ["01_master_agreement", "25_closing_checklist"],
"pages": 4
},
"17_patent_assignments": {
"title": "PATENT ASSIGNMENT AGREEMENTS",
"refs": ["06_ip_schedule", "01_master_agreement"],
"pages": 3
},
"18_trademark_registrations": {
"title": "TRADEMARK REGISTRATION SCHEDULE",
"refs": ["06_ip_schedule"],
"pages": 2
},
"19_retention_agreements": {
"title": "KEY EMPLOYEE RETENTION AGREEMENTS",
"refs": ["07_employee_schedule", "15_consulting_agreement"],
"pages": 4
},
"20_benefit_plans": {
"title": "EMPLOYEE BENEFIT PLAN SCHEDULE",
"refs": ["07_employee_schedule"],
"pages": 3
},
"21_customer_contracts": {
"title": "MAJOR CUSTOMER CONTRACT SUMMARIES",
"refs": ["08_contract_schedule", "01_master_agreement"],
"pages": 5
},
"22_vendor_contracts": {
"title": "MAJOR VENDOR CONTRACT SUMMARIES",
"refs": ["08_contract_schedule"],
"pages": 3
},
"23_audit_report": {
"title": "INDEPENDENT AUDITOR'S REPORT",
"refs": ["11_financial_statements", "04_disclosure_schedules"],
"pages": 4
},
"24_legal_opinion": {
"title": "LEGAL OPINION LETTER",
"refs": ["01_master_agreement", "12_litigation_schedule", "06_ip_schedule"],
"pages": 3
},
"25_closing_checklist": {
"title": "CLOSING CHECKLIST AND CONDITIONS",
"refs": ["01_master_agreement", "09_escrow_agreement", "16_transition_services",
"17_patent_assignments", "21_customer_contracts"],
"pages": 4
}
}
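The `refs` fields above define a directed graph over the 25 documents. A quick sketch of ranking documents by inbound references (run here over a trimmed sample of the table, not the full dict):

```python
from collections import Counter

# Trimmed sample of the DOCUMENTS reference table above.
SAMPLE_REFS = {
    "01_master_agreement": ["02_schedules", "03_exhibits"],
    "02_schedules": ["01_master_agreement", "06_ip_schedule"],
    "03_exhibits": ["01_master_agreement", "09_escrow_agreement"],
    "09_escrow_agreement": ["01_master_agreement", "03_exhibits"],
}

# In-degree: how many other documents point at each document.
in_degree = Counter(target for refs in SAMPLE_REFS.values() for target in refs)
print(in_degree.most_common(1))  # [('01_master_agreement', 3)]
```

On the full table the master agreement is similarly the hub, which is why the Level 3 and 4 test questions reward an agent that returns to it repeatedly.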
def generate_content(doc_id: str, meta: dict) -> list:
    """Generate realistic legal document content."""
    styles = getSampleStyleSheet()
    title_style = ParagraphStyle('Title', parent=styles['Heading1'], fontSize=16, spaceAfter=20)
    heading_style = ParagraphStyle('Heading', parent=styles['Heading2'], fontSize=12, spaceAfter=10)
    body_style = ParagraphStyle('Body', parent=styles['Normal'], fontSize=10, spaceAfter=8, leading=14)

    content = []

    # Title
    content.append(Paragraph(meta["title"], title_style))
    content.append(Spacer(1, 0.3*inch))

    # Document intro with cross-references
    refs_text = ", ".join([f"Document: {DOCUMENTS[r]['title']}" for r in meta["refs"][:3]])
    intro = f"""
    This document is part of the acquisition transaction between GlobalTech Corporation ("Buyer")
    and InnovateTech Solutions, Inc. ("Seller") dated as of February 15, 2025. This document should
    be read in conjunction with {refs_text}, and all other transaction documents.
    """
    content.append(Paragraph(intro.strip(), body_style))
    content.append(Spacer(1, 0.2*inch))

    # Generate sections based on document type
    sections = generate_sections(doc_id, meta)
    for section_title, section_content in sections:
        content.append(Paragraph(section_title, heading_style))
        for para in section_content:
            content.append(Paragraph(para, body_style))
        content.append(Spacer(1, 0.15*inch))

    return content
def generate_sections(doc_id: str, meta: dict) -> list:
    """Generate document-specific sections with legal content."""
    sections = []

    # Add document-specific content
    if "master_agreement" in doc_id:
        sections = [
            ("ARTICLE I - DEFINITIONS", [
                "1.1 'Acquisition' means the purchase by Buyer of all outstanding capital stock of Seller.",
                "1.2 'Purchase Price' means One Hundred Twenty-Five Million Dollars ($125,000,000), subject to adjustments.",
                "1.3 'Closing Date' means April 1, 2025, or such other date as mutually agreed.",
                "1.4 'Material Adverse Effect' means any change that is materially adverse to the business of Seller.",
                "1.5 'Knowledge of Seller' means the actual knowledge of the officers listed in Schedule 1.5.",
            ]),
            ("ARTICLE II - PURCHASE AND SALE", [
                "2.1 Subject to the terms hereof, Seller agrees to sell and Buyer agrees to purchase all Shares.",
                "2.2 The Purchase Price shall be paid as follows: (a) $80,000,000 in cash at Closing; "
                "(b) $30,000,000 in Buyer common stock per Document: Stock Purchase Details - Exhibit B; "
                "(c) $15,000,000 in escrow per Document: Escrow Agreement.",
                "2.3 Purchase Price adjustments are detailed in Document: Audited Financial Statements.",
                "2.4 Working capital target is $8,500,000 as calculated per Schedule 2.4.",
            ]),
            ("ARTICLE III - REPRESENTATIONS AND WARRANTIES", [
                "3.1 Organization. Seller is duly organized under Delaware law.",
                "3.9 Litigation. Except as set forth in Document: Schedule 3.9 - Litigation and Claims, "
                "there are no pending legal proceedings against Seller.",
                "3.12 Intellectual Property. All IP is listed in Document: Schedule 3.12 - Intellectual Property. "
                "Patent assignments are documented in Document: Patent Assignment Agreements.",
                "3.13 Material Contracts. All contracts exceeding $100,000 annually are in Document: Schedule 3.13 - Material Contracts.",
                "3.15 Employees. Employee matters are disclosed in Document: Schedule 3.15 - Employee Matters.",
            ]),
            ("ARTICLE IV - COVENANTS", [
                "4.1 Conduct of Business. Prior to Closing, Seller shall operate in ordinary course.",
                "4.2 Access. Seller shall provide Buyer access to facilities, books, and records.",
                "4.3 Confidentiality. Parties shall comply with Document: Non-Disclosure Agreement.",
                "4.4 Non-Competition. Key employees shall execute Document: Non-Competition Agreement.",
            ]),
            ("ARTICLE V - CONDITIONS TO CLOSING", [
                "5.1 Buyer's conditions: (a) accuracy of representations; (b) material consents obtained; "
                "(c) no Material Adverse Effect; (d) receipt of Document: Legal Opinion Letter.",
                "5.2 Regulatory approvals as specified in Document: Closing Checklist and Conditions.",
                "5.3 Third-party consents from customers in Document: Major Customer Contract Summaries.",
            ]),
        ]
    elif "financial" in doc_id:
        sections = [
            ("BALANCE SHEET", [
                "As of December 31, 2024:",
                "Total Assets: $47,250,000 (Current: $18,500,000; Non-current: $28,750,000)",
                "Total Liabilities: $12,300,000 (Current: $8,200,000; Long-term: $4,100,000)",
                "Stockholders' Equity: $34,950,000",
                "Working Capital: $10,300,000 (above target of $8,500,000 per Document: Master Acquisition Agreement)",
            ]),
            ("INCOME STATEMENT", [
                "For fiscal year ended December 31, 2024:",
                "Total Revenue: $52,400,000 (SaaS: $41,920,000; Professional Services: $10,480,000)",
                "Cost of Revenue: $15,720,000 (Gross Margin: 70%)",
                "Operating Expenses: $28,600,000 (R&D: $12,100,000; S&M: $11,500,000; G&A: $5,000,000)",
                "Operating Income: $8,080,000 (EBITDA: $11,200,000)",
                "Net Income: $6,464,000",
            ]),
            ("REVENUE BREAKDOWN BY CUSTOMER", [
                "Top 5 customers represent 62% of revenue (see Document: Major Customer Contract Summaries):",
                "1. MegaCorp Industries: $12,576,000 (24%) - Contract through 2027",
                "2. GlobalBank Holdings: $8,384,000 (16%) - Renewal pending",
                "3. HealthFirst Systems: $5,240,000 (10%) - Multi-year agreement",
                "4. RetailMax Inc.: $3,668,000 (7%) - Expansion discussion ongoing",
                "5. TechPrime Solutions: $2,620,000 (5%) - New customer 2024",
            ]),
            ("NOTES TO FINANCIAL STATEMENTS", [
                "Note 1: Significant Accounting Policies - Revenue recognized per ASC 606.",
                "Note 2: Deferred Revenue of $4,200,000 represents prepaid annual subscriptions.",
                "Note 3: Contingent liabilities detailed in Document: Schedule 3.9 - Litigation and Claims.",
                "Note 4: Related party transactions with founder disclosed in Document: Consulting Agreement - Founder.",
            ]),
        ]
    elif "ip_schedule" in doc_id or "patent" in doc_id:
        sections = [
            ("PATENTS", [
                "Seller owns or has rights to the following patents:",
                "US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021",
                "US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022",
                "US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023",
                "Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024",
                "Assignment agreements in Document: Patent Assignment Agreements.",
            ]),
            ("TRADEMARKS", [
                "Registered trademarks (see Document: Trademark Registration Schedule):",
                "INNOVATETECH (word mark) - Reg. No. 5,123,456 - Software services",
                "INNOVATETECH (logo) - Reg. No. 5,234,567 - Software services",
                "DATAFLOW PRO - Reg. No. 5,345,678 - Data analytics software",
            ]),
            ("TRADE SECRETS AND KNOW-HOW", [
                "Seller maintains trade secrets including proprietary algorithms and processes.",
                "All employees have executed invention assignment agreements per Document: Schedule 3.15 - Employee Matters.",
                "Key technical personnel retention addressed in Document: Key Employee Retention Agreements.",
            ]),
        ]
    elif "employee" in doc_id or "retention" in doc_id:
        sections = [
            ("EMPLOYEE CENSUS", [
                "Total Employees: 127 (Full-time: 120; Part-time: 7)",
                "Engineering: 68 employees (Senior: 24; Mid-level: 32; Junior: 12)",
                "Sales & Marketing: 28 employees",
                "Customer Success: 18 employees",
                "G&A: 13 employees",
            ]),
            ("KEY EMPLOYEES", [
                "The following are Key Employees subject to Document: Key Employee Retention Agreements:",
                "1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000",
                "2. Michael Rodriguez - VP Engineering - Leads 45-person team - Retention: $800,000",
                "3. Jennifer Walsh - VP Sales - $18M quota achievement - Retention: $600,000",
                "4. David Kim - Principal Architect - Core platform expertise - Retention: $500,000",
                "5. Amanda Foster - VP Customer Success - 95% retention rate - Retention: $400,000",
                "Founder consulting terms in Document: Consulting Agreement - Founder.",
            ]),
            ("BENEFIT PLANS", [
                "Active benefit plans (details in Document: Employee Benefit Plan Schedule):",
                "401(k) Plan - Company match 4% - $2.1M annual cost",
                "Health Insurance - PPO and HMO options - $1.8M annual cost",
                "Stock Option Plan - 2,500,000 shares reserved - 1,800,000 granted",
                "Treatment of equity awards addressed in Document: Master Acquisition Agreement Section 2.6.",
            ]),
        ]
    elif "customer" in doc_id or "contract_schedule" in doc_id:
        sections = [
            ("MATERIAL CUSTOMER CONTRACTS", [
                "Contracts with annual value exceeding $500,000:",
                "",
                "1. MEGACORP INDUSTRIES - Master Services Agreement",
                " Annual Value: $12,576,000 | Term: Through December 2027",
                " Change of Control: Consent required (OBTAINED February 8, 2025)",
                " Renewal Terms: Auto-renew with 90-day notice",
                "",
                "2. GLOBALBANK HOLDINGS - Enterprise License Agreement",
                " Annual Value: $8,384,000 | Term: Through June 2025",
                " Change of Control: 60-day notice required (PROVIDED January 15, 2025)",
                " Renewal: Currently in negotiation for 3-year extension",
                "",
                "3. HEALTHFIRST SYSTEMS - SaaS Subscription Agreement",
                " Annual Value: $5,240,000 | Term: Through December 2026",
                " Change of Control: No restrictions",
                "",
                "See Document: Closing Checklist and Conditions for consent status.",
            ]),
            ("CONSENT REQUIREMENTS", [
                "Customer consents required for acquisition (per Document: Master Acquisition Agreement):",
                "- MegaCorp Industries: OBTAINED (see Exhibit A hereto)",
                "- GlobalBank Holdings: NOTICE PROVIDED (awaiting acknowledgment)",
                "- Other customers: No consent required",
                "Risk assessment in Document: Legal Opinion Letter.",
            ]),
        ]
    elif "litigation" in doc_id:
        sections = [
            ("PENDING LITIGATION", [
                "1. Smith v. InnovateTech Solutions, Inc.",
                " Court: California Superior Court, Santa Clara County",
                " Claims: Wrongful termination, discrimination",
                " Status: Discovery phase; trial set for September 2025",
                " Exposure: $150,000 - $350,000 (covered by insurance)",
                " Opinion: See Document: Legal Opinion Letter",
                "",
                "2. DataTech LLC v. InnovateTech Solutions, Inc.",
                " Court: US District Court, Northern District of California",
                " Claims: Patent infringement (US Patent 9,876,543)",
                " Status: Motion to dismiss pending; hearing March 2025",
                " Exposure: Preliminary assessment $500,000 - $2,000,000",
                " IP validity analysis in Document: Schedule 3.12 - Intellectual Property",
            ]),
            ("THREATENED CLAIMS", [
                "Demand letter received from former contractor re: unpaid invoices ($45,000).",
                "Resolution expected prior to Closing per Document: Closing Checklist and Conditions.",
            ]),
            ("INSURANCE COVERAGE", [
                "D&O Insurance: $5,000,000 limit | Deductible: $50,000",
                "E&O Insurance: $3,000,000 limit | Deductible: $25,000",
                "General Liability: $2,000,000 limit",
            ]),
        ]
    elif "closing" in doc_id:
        sections = [
            ("PRE-CLOSING CONDITIONS", [
                "The following conditions must be satisfied prior to Closing:",
                "",
                "1. REGULATORY APPROVALS",
                " [X] HSR Filing - Early termination granted February 1, 2025",
                " [X] State filings - Completed in all required jurisdictions",
                "",
                "2. THIRD-PARTY CONSENTS",
                " [X] MegaCorp Industries - Obtained February 8, 2025",
                " [ ] GlobalBank Holdings - Pending (expected by March 15)",
                " Per Document: Major Customer Contract Summaries",
                "",
                "3. EMPLOYEE MATTERS",
                " [X] Key employee retention agreements executed",
                " [X] Founder consulting agreement finalized",
                " Per Document: Key Employee Retention Agreements",
                "",
                "4. LEGAL DELIVERABLES",
                " [X] Legal opinion - See Document: Legal Opinion Letter",
                " [ ] Good standing certificates - Ordered",
            ]),
            ("CLOSING DELIVERABLES", [
                "SELLER DELIVERABLES:",
                "- Stock certificates endorsed in blank",
                "- Officer's certificate re: representations",
                "- Secretary's certificate with resolutions",
                "- IP assignments per Document: Patent Assignment Agreements",
                "- Third-party consents per above",
                "",
                "BUYER DELIVERABLES:",
                "- Cash payment: $80,000,000 by wire transfer",
                "- Stock consideration: 1,500,000 shares per Document: Stock Purchase Details - Exhibit B",
                "- Escrow deposit: $15,000,000 per Document: Escrow Agreement",
            ]),
            ("POST-CLOSING OBLIGATIONS", [
                "1. Transition services per Document: Transition Services Agreement (6 months)",
                "2. Earnout payments per Exhibit C to Document: Master Acquisition Agreement",
                "3. Escrow release schedule per Document: Escrow Agreement",
                "4. Employee benefit plan merger per Document: Employee Benefit Plan Schedule",
            ]),
        ]
    elif "escrow" in doc_id:
        sections = [
            ("ESCROW TERMS", [
                "Escrow Amount: $15,000,000 (12% of Purchase Price)",
                "Escrow Agent: First National Trust Company",
                "Term: 18 months from Closing Date",
                "",
                "Release Schedule:",
                "- 6 months: $5,000,000 released (absent claims)",
                "- 12 months: $5,000,000 released (absent claims)",
                "- 18 months: Remaining balance released",
                "",
                "Claims may be made for breaches of representations in Document: Master Acquisition Agreement.",
            ]),
            ("INDEMNIFICATION", [
                "Indemnification provisions per Article VII of Document: Master Acquisition Agreement:",
                "- Basket: $500,000 (1% of escrow)",
                "- Cap: $15,000,000 (escrow amount) for general reps",
                "- Fundamental reps: Full Purchase Price cap",
                "",
                "Specific indemnities for matters in Document: Schedule 3.9 - Litigation and Claims.",
            ]),
        ]
    elif "legal_opinion" in doc_id:
        sections = [
            ("OPINIONS RENDERED", [
                "Wilson & Associates LLP, counsel to Seller, renders the following opinions:",
                "",
                "1. Seller is a corporation duly organized under Delaware law.",
                "2. Seller has corporate power to execute Document: Master Acquisition Agreement.",
                "3. Transaction documents are valid and enforceable obligations.",
                "4. No conflicts with charter documents or material agreements.",
                "5. Based on review of Document: Schedule 3.9 - Litigation and Claims, pending "
                "litigation does not pose material risk to transaction.",
                "6. IP matters reviewed per Document: Schedule 3.12 - Intellectual Property; "
                "no infringement claims other than disclosed.",
            ]),
            ("QUALIFICATIONS AND ASSUMPTIONS", [
                "This opinion is subject to standard qualifications regarding:",
                "- Bankruptcy and insolvency laws",
                "- Equitable principles",
                "- Public policy considerations",
                "",
                "We have relied upon certificates from officers of Seller and representations "
                "in Document: Seller Disclosure Schedules.",
            ]),
        ]
    elif "audit" in doc_id:
        sections = [
            ("INDEPENDENT AUDITOR'S REPORT", [
                "To the Board of Directors of InnovateTech Solutions, Inc.:",
                "",
                "We have audited the accompanying financial statements, which comprise the "
                "balance sheet as of December 31, 2024, and the related statements of income, "
                "comprehensive income, stockholders' equity, and cash flows for the year then ended.",
                "",
                "OPINION",
                "In our opinion, the financial statements present fairly, in all material respects, "
                "the financial position of InnovateTech Solutions, Inc. as of December 31, 2024, "
                "in accordance with accounting principles generally accepted in the United States.",
            ]),
            ("KEY AUDIT MATTERS", [
                "1. REVENUE RECOGNITION",
                " SaaS revenue recognized ratably over subscription period per ASC 606.",
                " Deferred revenue of $4,200,000 verified to customer contracts.",
                "",
                "2. STOCK-BASED COMPENSATION",
                " Options valued using Black-Scholes model.",
                " Expense of $2,100,000 recorded in accordance with ASC 718.",
                "",
                "3. CONTINGENCIES",
                " Litigation matters reviewed with counsel (see Document: Schedule 3.9 - Litigation and Claims).",
                " Accruals of $350,000 determined to be appropriate.",
            ]),
        ]
    else:
        # Generic sections for other documents
        sections = [
            ("OVERVIEW", [
                f"This {meta['title']} is executed in connection with the acquisition transaction.",
                f"Reference documents: {', '.join([DOCUMENTS[r]['title'] for r in meta['refs'][:2]])}.",
            ]),
            ("TERMS AND CONDITIONS", [
                "Standard terms apply as set forth in the Master Acquisition Agreement.",
                "Amendments require written consent of all parties.",
            ]),
            ("MISCELLANEOUS", [
                "Governing Law: State of Delaware",
                "Dispute Resolution: Arbitration in San Francisco, California",
                "Notices: As specified in Master Acquisition Agreement",
            ]),
        ]

    # Add boilerplate to reach target page count
    for i in range(meta["pages"] - 2):
        sections.append((f"SECTION {len(sections) + 1}", [
            f"Additional provisions related to {meta['title']}.",
            "All terms defined in Document: Master Acquisition Agreement apply herein.",
            f"Cross-reference: See {DOCUMENTS[meta['refs'][i % len(meta['refs'])]]['title']} for related provisions.",
            "The parties acknowledge receipt of all schedules and exhibits referenced herein.",
            "This section shall survive the Closing Date as specified in Article VIII of the Master Agreement.",
        ]))

    return sections
def create_pdf(doc_id: str, meta: dict, output_dir: str):
    """Create a PDF document."""
    filepath = os.path.join(output_dir, f"{doc_id}.pdf")
    doc = SimpleDocTemplate(filepath, pagesize=letter,
                            topMargin=0.75*inch, bottomMargin=0.75*inch,
                            leftMargin=1*inch, rightMargin=1*inch)
    content = generate_content(doc_id, meta)
    doc.build(content)
    print(f" Created: {filepath}")
def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    print(f"\nGenerating {len(DOCUMENTS)} large documents in {OUTPUT_DIR}/\n")

    for doc_id, meta in DOCUMENTS.items():
        create_pdf(doc_id, meta, OUTPUT_DIR)

    # Create test questions
    questions_path = os.path.join(OUTPUT_DIR, "TEST_QUESTIONS.md")
    with open(questions_path, "w") as f:
        f.write("""# Test Questions for Large Document Set

## Document Overview

- 25 interconnected documents
- Each document 3-6 pages
- Extensive cross-references between documents
- Total content: ~100+ pages

## Test Questions

### Level 1: Single Document (Easy)

```bash
uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?"
uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?"
uv run explore --task "Look in data/large_acquisition/. What patents does the company own?"
```

### Level 2: Cross-Reference Required (Medium)

```bash
uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?"
uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?"
uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?"
```

### Level 3: Multi-Document Synthesis (Hard)

```bash
uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?"
uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks."
uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?"
```

### Level 4: Deep Cross-Reference (Expert)

```bash
uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?"
uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation."
uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents."
```
""")
    print(f" Created: {questions_path}")

    # Summary
    total_pages = sum(m["pages"] for m in DOCUMENTS.values())
    total_refs = sum(len(m["refs"]) for m in DOCUMENTS.values())
    print(f"\n{'='*60}")
    print("SUMMARY")
    print(f"{'='*60}")
    print(f" Documents created: {len(DOCUMENTS)}")
    print(f" Total pages: ~{total_pages}")
    print(f" Cross-references: {total_refs}")
    print(f" Output directory: {OUTPUT_DIR}/")
    print(f"{'='*60}\n")


if __name__ == "__main__":
    main()
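Because the cross-references drive the agent tests, a table like `DOCUMENTS` is worth sanity-checking before generation: every `refs` entry should name another document id, so a dangling reference points at a typo. A minimal sketch of such a check (`dangling_refs` is a hypothetical helper, shown on a small sample rather than the full table):

```python
def dangling_refs(documents: dict) -> set[str]:
    """Return referenced ids that have no matching document entry."""
    known = set(documents)
    return {r for meta in documents.values() for r in meta["refs"]} - known

# Small sample with one deliberately broken reference.
sample = {
    "01_master_agreement": {"refs": ["02_schedules"]},
    "02_schedules": {"refs": ["01_master_agreement", "99_missing_doc"]},
}
print(dangling_refs(sample))  # {'99_missing_doc'}
```

An empty set means every cross-reference in the table resolves, so every "Document: ..." mention in the generated PDFs has a real target for the agent to follow.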
================================================
FILE: scripts/generate_test_docs.py
================================================
#!/usr/bin/env python3
"""
Generate test PDF documents for testing the two-stage document exploration approach.
Scenario: TechCorp's acquisition of StartupXYZ
Documents have cross-references to test the agent's ability to follow document relationships.
"""
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
import os
OUTPUT_DIR = "data/test_acquisition"
DOCUMENTS = {
"01_acquisition_agreement.pdf": {
"title": "ACQUISITION AGREEMENT",
"content": """
ACQUISITION AGREEMENT
This Acquisition Agreement ("Agreement") is entered into as of January 15, 2025,
by and between TechCorp Industries, Inc. ("Buyer") and StartupXYZ LLC ("Seller").
ARTICLE I - DEFINITIONS
1.1 "Acquisition" means the purchase of all outstanding shares of Seller by Buyer.
1.2 "Purchase Price" means $45,000,000 USD as detailed in Exhibit A - Financial Terms.
1.3 "Closing Date" means March 1, 2025, subject to conditions in Article IV.
1.4 "Employee Matters" shall be governed by Schedule 3 - Employee Transition Plan.
ARTICLE II - PURCHASE AND SALE
2.1 Subject to the terms and conditions of this Agreement, Seller agrees to sell,
and Buyer agrees to purchase, all of the issued and outstanding shares of Seller.
2.2 The Purchase Price shall be paid as follows:
(a) $30,000,000 in cash at Closing
(b) $10,000,000 in Buyer's common stock (see Exhibit B - Stock Valuation)
(c) $5,000,000 in earnout payments (see Exhibit C - Earnout Terms)
ARTICLE III - REPRESENTATIONS AND WARRANTIES
3.1 Seller represents and warrants that the financial statements provided in
Document: Due Diligence Report are accurate and complete.
3.2 Seller represents that all intellectual property is properly documented in
Schedule 1 - IP Assets and is free of encumbrances as certified in
Document: IP Certification Letter.
3.3 All material contracts are listed in Schedule 2 - Material Contracts.
ARTICLE IV - CONDITIONS TO CLOSING
4.1 Buyer's obligation to close is subject to:
(a) Receipt of regulatory approval as documented in Document: Regulatory Approval Letter
(b) Completion of due diligence per Document: Due Diligence Report
(c) No material adverse change as defined in Section 1.5
4.2 Both parties acknowledge the risks identified in Document: Risk Assessment Memo.
ARTICLE V - CONFIDENTIALITY
5.1 This Agreement is subject to the terms of the Document: Non-Disclosure Agreement
executed between the parties on October 1, 2024.
IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first above written.
_________________________
TechCorp Industries, Inc.
By: James Mitchell, CEO
_________________________
StartupXYZ LLC
By: Sarah Chen, Founder & CEO
"""
},
"02_due_diligence_report.pdf": {
"title": "DUE DILIGENCE REPORT",
"content": """
CONFIDENTIAL DUE DILIGENCE REPORT
Prepared for: TechCorp Industries, Inc.
Subject: StartupXYZ LLC
Date: December 20, 2024
Prepared by: Morrison & Associates, LLP
EXECUTIVE SUMMARY
This report summarizes our findings from the due diligence investigation of StartupXYZ LLC
in connection with the proposed acquisition described in the Document: Acquisition Agreement.
1. FINANCIAL REVIEW
1.1 Revenue for FY2024: $12.3 million (growth of 45% YoY)
1.2 EBITDA: $2.1 million (17% margin)
1.3 Cash position: $3.2 million as of November 30, 2024
1.4 Outstanding debt: $1.5 million (detailed in Exhibit A - Financial Terms of the Acquisition Agreement)
KEY FINDING: Financial statements are materially accurate. Minor adjustments
recommended as noted in Document: Financial Adjustments Memo.
2. INTELLECTUAL PROPERTY
2.1 StartupXYZ holds 12 patents related to AI/ML technology
2.2 All patents verified as valid per Document: IP Certification Letter
2.3 No pending litigation affecting IP (confirmed in Document: Legal Opinion Letter)
2.4 Full IP inventory in Schedule 1 - IP Assets of the Acquisition Agreement
3. EMPLOYEE MATTERS
3.1 Total employees: 47 (32 engineering, 8 sales, 7 operations)
3.2 Key employee retention risk: HIGH for 5 senior engineers
3.3 Retention bonuses recommended per Schedule 3 - Employee Transition Plan
3.4 No pending employment disputes
4. MATERIAL CONTRACTS
4.1 23 active customer contracts reviewed (see Schedule 2 - Material Contracts)
4.2 3 contracts contain change-of-control provisions requiring consent
4.3 Largest customer (MegaCorp) accounts for 28% of revenue - concentration risk noted in
Document: Risk Assessment Memo
5. REGULATORY COMPLIANCE
5.1 Company is compliant with all applicable regulations
5.2 HSR filing required - timeline in Document: Regulatory Approval Letter
6. RECOMMENDATIONS
Based on our findings, we recommend proceeding with the acquisition subject to:
(a) Obtaining customer consents for change-of-control contracts
(b) Implementing retention packages for key employees
(c) Addressing items in Document: Financial Adjustments Memo
Date: December 15, 2024
To: TechCorp Industries, Inc.
From: PatentWatch Legal Services
Re: IP Certification for StartupXYZ LLC Acquisition
Dear Mr. Mitchell,
In connection with the proposed acquisition of StartupXYZ LLC as described in the
Document: Acquisition Agreement, we have conducted a comprehensive review of
StartupXYZ's intellectual property portfolio.
CERTIFICATION
We hereby certify the following:
1. PATENTS
StartupXYZ owns 12 U.S. patents as listed in Schedule 1 - IP Assets:
- US Patent 10,123,456: "Neural Network Optimization Method"
- US Patent 10,234,567: "Distributed AI Training System"
- US Patent 10,345,678: "Real-time Data Processing Pipeline"
- [9 additional patents listed in Schedule 1]
All patents are valid, enforceable, and free of liens or encumbrances.
We have reviewed StartupXYZ's trade secret protection protocols. All employees have
signed appropriate NDAs. See Document: Non-Disclosure Agreement template.
There is one pending patent application (Application No. 17/456,789) for "Advanced
Federated Learning System" expected to issue Q2 2025. This is noted in
Document: Risk Assessment Memo as a minor risk item.
6. LITIGATION
No IP-related litigation is pending or threatened. This is confirmed in
Document: Legal Opinion Letter.
This certification is provided in connection with the due diligence process and
may be relied upon by TechCorp Industries, Inc.
To: TechCorp Board of Directors
From: Corporate Development Team
Date: December 22, 2024
Re: Risk Assessment - StartupXYZ Acquisition
This memo summarizes key risks identified in connection with the proposed acquisition
as documented in the Document: Acquisition Agreement.
1. HIGH-PRIORITY RISKS
1.1 Customer Concentration (HIGH)
- MegaCorp represents 28% of StartupXYZ revenue
- MegaCorp contract contains change-of-control clause
- Mitigation: Obtain consent prior to closing (see Document: Customer Consent Letters)
- Impact if materialized: $3.4M annual revenue at risk
1.2 Key Employee Retention (HIGH)
- 5 senior engineers critical to product development
- 2 have expressed interest in leaving post-acquisition
- Mitigation: Retention packages per Schedule 3 - Employee Transition Plan
- Estimated cost: $2.5M in retention bonuses
2. MEDIUM-PRIORITY RISKS
2.1 Earnout Structure (MEDIUM)
- $5M earnout tied to 2025-2026 performance metrics
- Metrics defined in Exhibit C - Earnout Terms of the Acquisition Agreement
- Risk: Disagreement on metric calculation methodology
- Mitigation: Clear definitions in agreement; third-party arbitration clause
2.2 Integration Costs (MEDIUM)
- Estimated integration costs: $4.2M over 18 months
- Systems integration detailed in Document: Integration Plan
- Risk: Cost overruns of 20-30% typical in tech acquisitions
3. LOW-PRIORITY RISKS
3.1 Pending Patent Application (LOW)
- One patent pending as noted in Document: IP Certification Letter
- Low risk of rejection based on patent attorney's assessment
3.2 Regulatory Approval (LOW)
- HSR filing required but expected to clear without issues
- Timeline in Document: Regulatory Approval Letter
4. FINANCIAL IMPACT SUMMARY
Total risk-adjusted impact: $6.2M - $8.7M
This is reflected in the purchase price negotiations per Document: Financial Adjustments Memo.
5. RECOMMENDATION
Despite identified risks, we recommend proceeding with the acquisition. The strategic
value of StartupXYZ's AI technology platform justifies the purchase price when
accounting for risk mitigation costs. All findings are consistent with
Document: Due Diligence Report.
To: Deal Team
From: Finance Department
Date: December 23, 2024
Re: Purchase Price Adjustments - StartupXYZ Acquisition
Following our review in connection with the Document: Due Diligence Report,
we recommend the following adjustments to the purchase price as set forth in
Exhibit A - Financial Terms of the Document: Acquisition Agreement.
1. WORKING CAPITAL ADJUSTMENT
Target working capital: $1,200,000
Estimated closing working capital: $980,000
Adjustment: ($220,000)
Deferred revenue requiring restatement: $340,000
Impact on EBITDA: ($85,000)
Implied value adjustment (at 15x): ($1,275,000)
4. CONTINGENT LIABILITY RESERVE
As noted in Document: Risk Assessment Memo, we recommend establishing
reserves for:
- Customer concentration risk: $500,000
- Integration contingency: $800,000
Total reserve: $1,300,000 (to be held in escrow per Exhibit C - Earnout Terms)
5. SUMMARY OF ADJUSTMENTS
Original Purchase Price: $45,000,000
Working Capital Adjustment: ($220,000)
Debt Adjustment: ($175,000)
Revenue Recognition: ($1,275,000)
Adjusted Purchase Price: $43,330,000
Plus escrow reserve: $1,300,000
Total Cash Required at Closing: $44,630,000
6. PAYMENT STRUCTURE
As revised from Document: Acquisition Agreement Section 2.2:
(a) Cash at closing: $28,330,000 (adjusted)
(b) Stock consideration: $10,000,000 (per Exhibit B - Stock Valuation)
(c) Earnout: $5,000,000 (unchanged, per Exhibit C - Earnout Terms)
(d) Escrow: $1,300,000 (18-month release schedule)
These adjustments have been discussed with Seller's representatives and are
subject to final negotiation.
Please refer to Document: Closing Checklist for timeline and requirements.
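As a quick consistency check (a minimal sketch; the figures below are transcribed from the adjustment summary and payment structure above, not authoritative), the memo's arithmetic can be verified:

```python
# Figures transcribed from the Financial Adjustments Memo above.
original_price = 45_000_000
adjustments = {
    "working_capital": -220_000,
    "debt": -175_000,
    "revenue_recognition": -1_275_000,
}

# Adjusted purchase price = original price plus all (negative) adjustments.
adjusted_price = original_price + sum(adjustments.values())
assert adjusted_price == 43_330_000

# Total cash required at closing adds back the escrow reserve.
escrow = 1_300_000
assert adjusted_price + escrow == 44_630_000

# The payment structure (cash, stock, earnout, escrow) should sum to the same total.
payments = [28_330_000, 10_000_000, 5_000_000, 1_300_000]
assert sum(payments) == 44_630_000
```

Note that the earnout is contingent consideration, so the cash actually wired at closing is lower than this total.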
"""
},
"06_legal_opinion.pdf": {
"title": "LEGAL OPINION LETTER",
"content": """
LEGAL OPINION LETTER
Date: December 18, 2024
TechCorp Industries, Inc.
500 Technology Drive
San Francisco, CA 94105
Re: Acquisition of StartupXYZ LLC
Ladies and Gentlemen:
We have acted as legal counsel to StartupXYZ LLC ("Company") in connection with
the proposed acquisition by TechCorp Industries, Inc. pursuant to the
Document: Acquisition Agreement dated January 15, 2025.
DOCUMENTS REVIEWED
In connection with this opinion, we have reviewed:
1. The Acquisition Agreement and all Exhibits and Schedules
2. Document: Due Diligence Report prepared by Morrison & Associates
3. Document: IP Certification Letter from PatentWatch Legal Services
4. All material contracts listed in Schedule 2 - Material Contracts
5. Corporate records and organizational documents of the Company
6. Document: Non-Disclosure Agreement between the parties
OPINIONS
Based on our review, we are of the opinion that:
1. Corporate Status
The Company is a limited liability company duly organized, validly existing, and
in good standing under the laws of Delaware.
2. Authority
The Company has full power and authority to execute and deliver the Acquisition
Agreement and to consummate the transactions contemplated thereby.
3. No Conflicts
The execution and delivery of the Acquisition Agreement does not violate any
provision of the Company's organizational documents or any material contract,
except for change-of-control provisions noted in Document: Customer Consent Letters.
4. Litigation
There is no litigation, arbitration, or governmental proceeding pending or, to
our knowledge, threatened against the Company that would have a material adverse
effect on the Company or the transactions contemplated by the Acquisition Agreement.
This opinion confirms the representations in the Document: IP Certification Letter
regarding absence of IP litigation.
5. Regulatory Compliance
The Company is in material compliance with all applicable laws and regulations.
The HSR filing requirements are addressed in Document: Regulatory Approval Letter.
QUALIFICATIONS
This opinion is subject to the following qualifications:
1. We express no opinion on tax matters (see separate tax opinion)
2. This opinion is limited to Delaware and federal law
3. Certain contracts require third-party consents as noted above
This opinion is provided solely for your benefit in connection with the
transactions contemplated by the Acquisition Agreement.
Very truly yours,
Wilson & Partners LLP
By: Jennifer Walsh, Partner
"""
},
"07_nda.pdf": {
"title": "NON-DISCLOSURE AGREEMENT",
"content": """
MUTUAL NON-DISCLOSURE AGREEMENT
This Mutual Non-Disclosure Agreement ("NDA") is entered into as of October 1, 2024,
by and between:
TechCorp Industries, Inc. ("TechCorp")
500 Technology Drive, San Francisco, CA 94105
and
StartupXYZ LLC ("StartupXYZ")
123 Innovation Way, Palo Alto, CA 94301
(each a "Party" and collectively the "Parties")
RECITALS
The Parties wish to explore a potential business relationship, including a possible
acquisition of StartupXYZ by TechCorp (the "Purpose"), which is now documented in
the Document: Acquisition Agreement.
1. DEFINITION OF CONFIDENTIAL INFORMATION
"Confidential Information" means any non-public information disclosed by either
Party, including but not limited to:
- Financial information (as contained in Document: Due Diligence Report)
- Technical information (as certified in Document: IP Certification Letter)
- Business strategies and plans
- Customer and supplier information
- Employee information (as detailed in Schedule 3 - Employee Transition Plan)
2. OBLIGATIONS
Each Party agrees to:
(a) Hold Confidential Information in strict confidence
(b) Not disclose Confidential Information to third parties without prior written consent
(c) Use Confidential Information solely for the Purpose
(d) Limit access to Confidential Information to employees with a need to know
3. TERM
This NDA shall remain in effect for three (3) years from the date first written
above, or until superseded by the confidentiality provisions in the
Document: Acquisition Agreement Article V.
4. EXCLUSIONS
Confidential Information does not include information that:
(a) Is or becomes publicly available through no fault of the receiving Party
(b) Was rightfully in the receiving Party's possession prior to disclosure
(c) Is rightfully obtained from a third party without restriction
(d) Is independently developed without use of Confidential Information
5. RETURN OF MATERIALS
Upon request or termination, each Party shall return or destroy all Confidential
Information, except as required for legal or regulatory purposes.
6. NO LICENSE
Nothing in this NDA grants any rights to intellectual property, except as
subsequently agreed in the Document: Acquisition Agreement and
Schedule 1 - IP Assets.
IN WITNESS WHEREOF, the Parties have executed this NDA as of the date first above written.
TechCorp Industries, Inc.
By: ______________________
Name: James Mitchell
Title: CEO
StartupXYZ LLC
By: ______________________
Name: Sarah Chen
Title: Founder & CEO
"""
},
"08_regulatory_approval.pdf": {
"title": "REGULATORY APPROVAL LETTER",
"content": """
FEDERAL TRADE COMMISSION
PREMERGER NOTIFICATION OFFICE
January 28, 2025
TechCorp Industries, Inc.
500 Technology Drive
San Francisco, CA 94105
StartupXYZ LLC
123 Innovation Way
Palo Alto, CA 94301
Re: Early Termination of HSR Waiting Period
Transaction: Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.
Dear Parties:
This letter confirms that the Federal Trade Commission has granted early
termination of the waiting period under the Hart-Scott-Rodino Antitrust
Improvements Act of 1976 for the above-referenced transaction.
FILING DETAILS
Filing Date: January 10, 2025
Transaction Value: $45,000,000 (as stated in Document: Acquisition Agreement)
HSR Filing Fee: $30,000
Early Termination Granted: January 28, 2025
EFFECT OF EARLY TERMINATION
The parties may now consummate the transaction at any time. This early termination
satisfies the condition precedent set forth in Article IV, Section 4.1(a) of the
Document: Acquisition Agreement.
Please note that early termination of the waiting period does not preclude the
Commission from taking any action it deems necessary to protect competition.
NEXT STEPS
Per the Document: Closing Checklist, you may now proceed with the closing
scheduled for March 1, 2025, subject to satisfaction of other conditions in the
Document: Acquisition Agreement.
The Document: Risk Assessment Memo correctly identified this as a low-risk
item. The market analysis in the Document: Due Diligence Report supported
the determination that this transaction does not raise competitive concerns.
Date: February 15, 2025
To: Deal Team
From: Legal Department
Re: Change of Control Consent Status
As required by Schedule 2 - Material Contracts of the
Document: Acquisition Agreement, this memo summarizes the status of
customer consents for contracts containing change-of-control provisions.
CONSENT STATUS SUMMARY
1. MegaCorp Inc. - OBTAINED
Contract Value: $3.4M annual
Consent Received: February 10, 2025
Notes: MegaCorp requested meeting with TechCorp leadership; meeting held 2/8/25.
Consent granted with no additional conditions. This addresses the primary concern
noted in Document: Risk Assessment Memo Section 1.1.
2. DataFlow Systems - OBTAINED
Contract Value: $1.2M annual
Consent Received: February 5, 2025
Notes: Standard consent process. No concerns raised.
3. CloudTech Partners - PENDING
Contract Value: $890K annual
Status: Consent requested February 1, 2025
Expected: February 20, 2025
Notes: Legal review in progress at CloudTech. Their counsel has reviewed the
Document: Acquisition Agreement and has no objections. Verbal confirmation
received; written consent expected shortly.
IMPACT ANALYSIS
Per Document: Due Diligence Report Section 4, there were 3 contracts requiring
consent:
- 2 obtained (representing $4.6M annual revenue)
- 1 pending (representing $890K annual revenue)
CLOSING IMPLICATIONS
The Document: Acquisition Agreement Article IV requires "material" customer
consents as a closing condition. With MegaCorp consent obtained, this condition
is substantially satisfied. The pending CloudTech consent is expected before
the March 1 closing date per Document: Closing Checklist.
ATTACHMENTS
Attached hereto:
- Exhibit A: MegaCorp Consent Letter (dated February 10, 2025)
- Exhibit B: DataFlow Systems Consent Letter (dated February 5, 2025)
- Exhibit C: CloudTech Partners Draft Consent (pending signature)
RECOMMENDATION
We recommend proceeding with closing preparations. The risk of CloudTech
withholding consent is low based on discussions with their counsel. This
is consistent with the risk mitigation strategy in Document: Risk Assessment Memo.
"""
},
"10_closing_checklist.pdf": {
"title": "CLOSING CHECKLIST",
"content": """
CLOSING CHECKLIST
Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.
Closing Date: March 1, 2025
Closing Location: Wilson & Partners LLP, San Francisco
I. PRE-CLOSING CONDITIONS
A. Regulatory
[X] HSR Filing submitted - Document: Regulatory Approval Letter
[X] Early termination received (January 28, 2025)
[ ] State regulatory filings (if required)
C. Due Diligence Completion
[X] Financial due diligence - Document: Due Diligence Report
[X] Legal due diligence - Document: Legal Opinion Letter
[X] IP due diligence - Document: IP Certification Letter
[X] Risk assessment - Document: Risk Assessment Memo
II. CLOSING DOCUMENTS
A. Transaction Documents
[ ] Executed Document: Acquisition Agreement
[ ] Bill of Sale
[ ] Assignment and Assumption Agreement
[ ] IP Assignment Agreement (per Schedule 1 - IP Assets)
B. Corporate Documents
[ ] Seller's Certificate of Good Standing
[ ] Secretary's Certificate (resolutions, incumbency)
[ ] Buyer's Certificate of Good Standing
C. Financial Documents
[ ] Closing Statement per Document: Financial Adjustments Memo
[ ] Wire transfer instructions
[ ] Escrow Agreement (per Exhibit C - Earnout Terms)
[ ] Stock certificates or book entry (per Exhibit B - Stock Valuation)
D. Employment Documents
[ ] Retention agreements per Schedule 3 - Employee Transition Plan
[ ] Offer letters for key employees
[ ] WARN Act compliance (if applicable)
III. CLOSING FUNDS
Per Document: Financial Adjustments Memo:
[ ] Cash payment: $28,330,000
[ ] Escrow deposit: $1,300,000
[ ] Stock issuance: $10,000,000
Total at Closing: $39,630,000
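The closing-funds total can be sanity-checked the same way (a minimal sketch; line items transcribed from Section III above; the $5M earnout is excluded because it is paid post-closing):

```python
# Closing funds per Section III of the checklist.
cash_payment = 28_330_000
escrow_deposit = 1_300_000
stock_issuance = 10_000_000

# The three items funded at closing should match the stated total.
assert cash_payment + escrow_deposit + stock_issuance == 39_630_000
```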