Full Code of VectifyAI/PageIndex for AI


Repository: VectifyAI/PageIndex
Branch: main
Commit: 4b4b20f9c425
Files: 37
Total size: 527.2 KB

Directory structure:
gitextract_hq6ob10f/

├── .claude/
│   └── commands/
│       └── dedupe.md
├── .gitattributes
├── .github/
│   └── workflows/
│       ├── autoclose-labeled-issues.yml
│       ├── backfill-dedupe.yml
│       ├── issue-dedupe.yml
│       └── remove-autoclose-label.yml
├── .gitignore
├── CHANGELOG.md
├── LICENSE
├── README.md
├── cookbook/
│   ├── README.md
│   ├── agentic_retrieval.ipynb
│   ├── pageIndex_chat_quickstart.ipynb
│   ├── pageindex_RAG_simple.ipynb
│   └── vision_RAG_pageindex.ipynb
├── pageindex/
│   ├── __init__.py
│   ├── config.yaml
│   ├── page_index.py
│   ├── page_index_md.py
│   └── utils.py
├── requirements.txt
├── run_pageindex.py
├── scripts/
│   ├── autoclose-labeled-issues.js
│   └── comment-on-duplicates.sh
├── tests/
│   └── results/
│       ├── 2023-annual-report-truncated_structure.json
│       ├── 2023-annual-report_structure.json
│       ├── PRML_structure.json
│       ├── Regulation Best Interest_Interpretive release_structure.json
│       ├── Regulation Best Interest_proposed rule_structure.json
│       ├── earthmover_structure.json
│       ├── four-lectures_structure.json
│       └── q1-fy25-earnings_structure.json
└── tutorials/
    ├── doc-search/
    │   ├── README.md
    │   ├── description.md
    │   ├── metadata.md
    │   └── semantics.md
    └── tree-search/
        └── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .claude/commands/dedupe.md
================================================
---
allowed-tools:
  - Bash(gh:*)
  - Bash(./scripts/comment-on-duplicates.sh:*)
---

You are a GitHub issue deduplication assistant. Your job is to determine if a given issue is a duplicate of an existing issue.

## Input

The issue to check: $ARGUMENTS

## Steps

### 1. Pre-checks

First, check if the issue should be skipped:

```
gh issue view <number> --json state,labels,title,body,comments
```

Skip if:
- The issue is already closed
- The issue already has a `duplicate` label
- The issue already has a dedupe comment (check comments for "possible duplicate")

### 2. Understand the issue

Read the issue carefully and generate a concise summary of the core problem or feature request. Extract 3-5 key technical terms or concepts.

### 3. Search for duplicates

Launch 5 parallel searches using different keyword strategies to maximize coverage:

1. **Exact terms**: Use the most specific technical terms from the issue title
2. **Synonyms**: Use alternative phrasings for the core problem
3. **Error messages**: If the issue contains error messages, search for those
4. **Component names**: Search by the specific component/module mentioned
5. **Broad category**: Search by the general category of the issue

For each search, use:
```
gh search issues "<keywords> state:open" --repo $REPOSITORY --limit 20
```

### 4. Analyze candidates

For each unique candidate issue found:
- Compare the core problem being described
- Look past superficial wording differences
- Consider whether they describe the same root cause
- Only flag as duplicate if you are at least 85% confident

### 5. Filter false positives

Remove candidates that:
- Are only superficially similar (same area but different problems)
- Are related but describe distinct issues
- Are too old or already resolved differently

### 6. Report results

If you found duplicates (max 3), call:
```
./scripts/comment-on-duplicates.sh --base-issue <number> --potential-duplicates <dup1> <dup2> ...
```

If no duplicates found, do nothing and report that the issue appears to be unique.


================================================
FILE: .gitattributes
================================================
*.ipynb linguist-vendored

================================================
FILE: .github/workflows/autoclose-labeled-issues.yml
================================================
# Auto-closes duplicate issues after 3 days if no human activity or thumbs-down reaction.
# Runs daily at 09:00 UTC.
name: Auto-close Duplicate Issues

on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:
    inputs:
      dry_run:
        description: 'Dry run - report but do not close issues'
        required: false
        default: 'false'
        type: choice
        options:
          - 'false'
          - 'true'

permissions:
  issues: write
  contents: read

jobs:
  autoclose:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Close inactive duplicate issues
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_OWNER: ${{ github.repository_owner }}
          REPO_NAME: ${{ github.event.repository.name }}
          DRY_RUN: ${{ inputs.dry_run || 'false' }}
        run: node scripts/autoclose-labeled-issues.js


================================================
FILE: .github/workflows/backfill-dedupe.yml
================================================
# Backfills duplicate detection for historical issues using Claude Code.
# Triggered manually via workflow_dispatch.
name: Backfill Duplicate Detection

on:
  workflow_dispatch:
    inputs:
      days_back:
        description: 'How many days back to look for issues (default: 30)'
        required: false
        default: '30'
        type: number

permissions:
  contents: read
  issues: write
  actions: write

jobs:
  backfill:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4

      - name: Fetch issues and run dedupe
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO: ${{ github.repository }}
          DAYS_BACK: ${{ inputs.days_back || '30' }}
        run: |
          if ! [[ "$DAYS_BACK" =~ ^[0-9]+$ ]]; then
            echo "Error: days_back must be a number"
            exit 1
          fi

          SINCE=$(date -u -d "$DAYS_BACK days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-${DAYS_BACK}d +%Y-%m-%dT%H:%M:%SZ)
          echo "Fetching open issues since $SINCE"

          # Get open issues via gh api --paginate, filter out PRs and already-labeled ones
          ISSUES=$(gh api --paginate "repos/$REPO/issues?state=open&per_page=100" \
            --jq "[.[] | select(.pull_request == null) | select(.created_at >= \"$SINCE\") | select([.labels[].name] | index(\"duplicate\") | not)] | .[].number" | xargs)

          if [ -z "$ISSUES" ]; then
            echo "No issues to process"
            exit 0
          fi

          BATCH_SIZE=10
          COUNT=0
          echo "Issues to process: $ISSUES"
          for NUMBER in $ISSUES; do
            echo "Triggering dedupe for issue #$NUMBER"
            gh workflow run issue-dedupe.yml --repo "$REPO" -f issue_number="$NUMBER"
            COUNT=$((COUNT + 1))
            if [ $((COUNT % BATCH_SIZE)) -eq 0 ]; then
              echo "Pausing 60s after $COUNT issues..."
              sleep 60
            else
              sleep 5
            fi
          done

          echo "Backfill triggered for $COUNT issues"


================================================
FILE: .github/workflows/issue-dedupe.yml
================================================
# Detects duplicate issues using Claude Code with the /dedupe command.
# Triggered automatically when a new issue is opened, or manually for a single issue.
name: Issue Duplicate Detection

on:
  issues:
    types: [opened]
  workflow_dispatch:
    inputs:
      issue_number:
        description: 'Issue number to check for duplicates'
        required: true
        type: string

permissions:
  contents: read
  issues: write

concurrency:
  group: dedupe-${{ github.event.issue.number || inputs.issue_number }}
  cancel-in-progress: true

jobs:
  detect-duplicate:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    # Skip pull-requests that surface as issues and bot-opened issues
    if: >
      (github.event_name == 'workflow_dispatch') ||
      (github.event.issue.pull_request == null &&
       !endsWith(github.actor, '[bot]') &&
       github.actor != 'github-actions')
    steps:
      - uses: actions/checkout@v4

      - name: Determine issue number
        id: issue
        env:
          EVENT_NAME: ${{ github.event_name }}
          INPUT_NUMBER: ${{ inputs.issue_number }}
          ISSUE_NUMBER: ${{ github.event.issue.number }}
        run: |
          if [ "$EVENT_NAME" = "workflow_dispatch" ]; then
            echo "number=$INPUT_NUMBER" >> "$GITHUB_OUTPUT"
          else
            echo "number=$ISSUE_NUMBER" >> "$GITHUB_OUTPUT"
          fi

      - uses: anthropics/claude-code-action@v1
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          prompt: "/dedupe ${{ github.repository }}/issues/${{ steps.issue.outputs.number }}"
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          github_token: ${{ secrets.GITHUB_TOKEN }}
          allowed_bots: "github-actions"
          allowed_non_write_users: "*"


================================================
FILE: .github/workflows/remove-autoclose-label.yml
================================================
# Removes the "duplicate" label when a human (non-bot) comments on a
# duplicate-flagged issue, signaling that the issue needs re-evaluation.
# The auto-close script also independently checks for human activity,
# so this provides an additional visible signal.
name: Remove Duplicate Label on Human Activity

on:
  issue_comment:
    types: [created]

permissions:
  issues: write

jobs:
  remove-label:
    # Only run for issue comments (not PR comments)
    if: >
      github.event.issue.pull_request == null &&
      !endsWith(github.actor, '[bot]') &&
      github.actor != 'github-actions'
    runs-on: ubuntu-latest
    steps:
      - name: Remove duplicate label if human commented
        uses: actions/github-script@v7
        with:
          script: |
            const issue = context.payload.issue;
            const labels = (issue.labels || []).map(l => l.name);

            if (!labels.includes('duplicate')) {
              core.info('Issue does not have "duplicate" label - nothing to do.');
              return;
            }

            await github.rest.issues.removeLabel({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: issue.number,
              name: 'duplicate',
            });

            core.info(
              `Removed "duplicate" label from #${issue.number} ` +
              `after human comment by ${context.actor}`
            );


================================================
FILE: .gitignore
================================================
.ipynb_checkpoints
__pycache__
files
index
temp/*
chroma-collections.parquet
chroma-embeddings.parquet
.DS_Store
.env*
notebook
SDK/*
log/*
logs/
parts/*
json_results/*


================================================
FILE: CHANGELOG.md
================================================
# Change Log
All notable changes to this project will be documented in this file.

## Beta - 2025-04-23

### Fixed
- [x] Fixed a bug introduced on April 18 where `start_index` was incorrectly passed.

## Beta - 2025-04-03

### Added
- [x] Add node_id, node summary
- [x] Add document description

### Changed
- [x] Change "child_nodes" -> "nodes" to simplify the structure


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2025 Vectify AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
<div align="center">
  
<a href="https://vectify.ai/pageindex" target="_blank">
  <img src="https://github.com/user-attachments/assets/46201e72-675b-43bc-bfbd-081cc6b65a1d" alt="PageIndex Banner" />
</a>

<br/>
<br/>

<p align="center">
  <a href="https://trendshift.io/repositories/14736" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14736" alt="VectifyAI%2FPageIndex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

# PageIndex: Vectorless, Reasoning-based RAG

<p align="center"><b>Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</b></p>

<h4 align="center">
  <a href="https://vectify.ai">🏠 Homepage</a>&nbsp; • &nbsp;
  <a href="https://chat.pageindex.ai">🖥️ Chat Platform</a>&nbsp; • &nbsp;
  <a href="https://pageindex.ai/mcp">🔌 MCP</a>&nbsp; • &nbsp;
  <a href="https://docs.pageindex.ai">📚 Docs</a>&nbsp; • &nbsp;
  <a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp; • &nbsp;
  <a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact</a>&nbsp;
</h4>
  
</div>


<details open>
<summary><h3>📢 Latest Updates</h3></summary>

 **🔥 Releases:**
- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like document-analysis agent [platform](https://chat.pageindex.ai) built for professional long documents. Can also be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).
<!-- - [**PageIndex Chat API**](https://docs.pageindex.ai/quickstart): An API that brings PageIndex's advanced long-document intelligence directly into your applications and workflows. -->
<!-- - [PageIndex MCP](https://pageindex.ai/mcp): Bring PageIndex into Claude, Cursor, or any MCP-enabled agent. Chat with long PDFs in a reasoning-based, human-like way. -->
 
 **📝 Articles:**
- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context* *tree index* that enables LLMs to perform *reasoning-based*, *human-like retrieval* over long documents, without vector DB or chunking.
<!-- - [Do We Still Need OCR?](https://pageindex.ai/blog/do-we-need-ocr): Explores how vision-based, reasoning-native RAG challenges the traditional OCR pipeline, and why the future of document AI might be *vectorless* and *vision-based*. -->

 **🧪 Cookbooks:**
- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, hands-on example of reasoning-based RAG using PageIndex. No vectors, no chunking, and human-like retrieval.
- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): OCR-free, vision-only RAG with PageIndex's reasoning-native retrieval workflow that works directly over PDF page images.
</details>

---

# 📑 Introduction to PageIndex

Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.

Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a **vectorless**, **reasoning-based RAG** system that builds a **hierarchical tree index** from long documents and uses LLMs to **reason** *over that index* for **agentic, context-aware retrieval**.
It simulates how *human experts* navigate and extract knowledge from complex documents through *tree search*, enabling LLMs to *think* and *reason* their way to the most relevant document sections. PageIndex performs retrieval in two steps:

1. Generate a “Table-of-Contents” **tree structure index** of documents
2. Perform reasoning-based retrieval through **tree search**

<div align="center">
  <a href="https://pageindex.ai/blog/pageindex-intro" target="_blank" title="The PageIndex Framework">
    <img src="https://docs.pageindex.ai/images/cookbook/vectorless-rag.png" width="70%">
  </a>
</div>

### 🎯 Core Features 

Compared to traditional vector-based RAG, **PageIndex** features:
- **No Vector DB**: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
- **No Chunking**: Documents are organized into natural sections, not artificial chunks.
- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.
- **Better Explainability and Traceability**: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search (“vibe retrieval”).

PageIndex powers a reasoning-based RAG system that achieved **state-of-the-art** [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis (see our [blog post](https://vectify.ai/blog/Mafin2.5) for details).

### 📍 Explore PageIndex

To learn more, please see a detailed introduction of the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out this GitHub repo for open-source code, and the [cookbooks](https://docs.pageindex.ai/cookbook), [tutorials](https://docs.pageindex.ai/tutorials), and [blog](https://pageindex.ai/blog) for additional usage guides and examples. 

The PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), or can be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).

### 🛠️ Deployment Options
- Self-host — run locally with this open-source repo.
- Cloud Service — try instantly with our [Chat Platform](https://chat.pageindex.ai/), or integrate with [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).
- _Enterprise_ — private or on-prem deployment. [Contact us](https://ii2abc2jejf.typeform.com/to/tK3AXl8T) or [book a demo](https://calendly.com/pageindex/meet) for more details.

### 🧪 Quick Hands-on

- Try the [**Vectorless RAG**](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) notebook — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.
- Experiment with [*Vision-based Vectorless RAG*](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.
  
<div align="center">
  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb" target="_blank" rel="noopener">
    <img src="https://img.shields.io/badge/Open_In_Colab-Vectorless_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vectorless RAG" />
  </a>
  &nbsp;&nbsp;
  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb" target="_blank" rel="noopener">
    <img src="https://img.shields.io/badge/Open_In_Colab-Vision_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vision RAG" />
  </a>
</div>

---

# 🌲 PageIndex Tree Structure
PageIndex can transform lengthy PDF documents into a semantic **tree structure**, similar to a _"table of contents"_ but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

Below is an example PageIndex tree structure. Also see more example [documents](https://github.com/VectifyAI/PageIndex/tree/main/tests/pdfs) and generated [tree structures](https://github.com/VectifyAI/PageIndex/tree/main/tests/results).

```jsonc
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
```
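
As a rough illustration of how such a tree can be consumed, the sketch below walks the nested `nodes` lists and collects the `node_id` of every node whose page range covers a given page. The helper name `find_nodes_for_page` is hypothetical, and treating `start_index` and `end_index` as inclusive page numbers is an assumption made for this example, not a documented guarantee.

```python
from typing import Any, Dict, List

# Sample tree mirroring the fields shown above (summaries omitted for brevity).
tree: Dict[str, Any] = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
        },
    ],
}


def find_nodes_for_page(node: Dict[str, Any], page: int) -> List[str]:
    """Return node_ids (depth-first, parents first) whose range covers `page`.

    Assumption: start_index and end_index are inclusive page numbers.
    """
    matches: List[str] = []
    if node["start_index"] <= page <= node["end_index"]:
        matches.append(node["node_id"])
    # Children are always visited: in the sample above, child ranges extend
    # beyond the parent's own range, so pruning on the parent would miss them.
    for child in node.get("nodes", []):
        matches.extend(find_nodes_for_page(child, page))
    return matches


print(find_nodes_for_page(tree, 22))  # ['0006', '0007']
print(find_nodes_for_page(tree, 28))  # ['0007', '0008']
```

Because adjacent sections can share a boundary page, a single page may map to more than one node, as page 28 does here.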

You can generate the PageIndex tree structure with this open-source repo, or use our [API](https://docs.pageindex.ai/quickstart).

---

# ⚙️ Package Usage

You can follow these steps to generate a PageIndex tree from a PDF document.

### 1. Install dependencies

```bash
pip3 install --upgrade -r requirements.txt
```

### 2. Set your OpenAI API key

Create a `.env` file in the root directory and add your API key:

```bash
CHATGPT_API_KEY=your_openai_key_here
```
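
If you want to confirm that the key is picked up before running anything, a minimal standard-library sketch such as the one below can parse a `.env` file of simple `KEY=VALUE` lines. The helper names `parse_env_text` and `load_env_file` are hypothetical and only illustrate the file format; this is not how the repository itself necessarily loads the key.

```python
import os
from pathlib import Path
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Load a .env file into os.environ without overwriting existing variables."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

After calling `load_env_file()`, checking `os.environ.get("CHATGPT_API_KEY")` tells you whether the key is available to the process.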

### 3. Run PageIndex on your PDF

```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```

<details>
<summary><strong>Optional parameters</strong></summary>
<br>
You can customize the processing with additional optional arguments:

```
--model                 OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages       Pages to check for table of contents (default: 20)
--max-pages-per-node    Max pages per node (default: 10)
--max-tokens-per-node   Max tokens per node (default: 20000)
--if-add-node-id        Add node ID (yes/no, default: yes)
--if-add-node-summary   Add node summary (yes/no, default: yes)
--if-add-doc-description Add doc description (yes/no, default: yes)
```
</details>
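
To make the flags and defaults above concrete, here is a small `argparse` sketch that mirrors them. It is an illustration only, not the repository's `run_pageindex.py`; the `build_parser` function and the help strings are assumptions, while the flag names and default values are copied from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="PageIndex options (sketch)")
    parser.add_argument("--pdf_path", type=str, help="Path to the PDF file")
    parser.add_argument("--model", type=str, default="gpt-4o-2024-11-20")
    parser.add_argument("--toc-check-pages", type=int, default=20)
    parser.add_argument("--max-pages-per-node", type=int, default=10)
    parser.add_argument("--max-tokens-per-node", type=int, default=20000)
    # The yes/no switches are modeled as string choices, as in the list above.
    parser.add_argument("--if-add-node-id", choices=["yes", "no"], default="yes")
    parser.add_argument("--if-add-node-summary", choices=["yes", "no"], default="yes")
    parser.add_argument("--if-add-doc-description", choices=["yes", "no"], default="yes")
    return parser


# argparse converts hyphens in flag names to underscores on the namespace.
args = build_parser().parse_args(
    ["--pdf_path", "report.pdf", "--max-pages-per-node", "5"]
)
print(args.max_pages_per_node)  # 5
print(args.model)  # gpt-4o-2024-11-20
```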

<details>
<summary><strong>Markdown support</strong></summary>
<br>
We also provide markdown support for PageIndex. You can use the `--md_path` flag to generate a tree structure for a markdown file.

```bash
python3 run_pageindex.py --md_path /path/to/your/document.md
```

> Note: in this function, we use "#" to determine node headings and their levels. For example, "##" is level 2, "###" is level 3, etc. Make sure your markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this function, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our [PageIndex OCR](https://pageindex.ai/blog/ocr), which is designed to preserve the original hierarchy, to convert the PDF to a markdown file and then use this function.
</details>
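
The heading-level rule in the note above can be sketched as follows. This is a minimal illustration of nesting headings by their number of "#" characters, not the code in `page_index_md.py`; the function name `build_heading_tree`, the output shape, and the decision to skip fenced code blocks are all assumptions made for the example.

```python
import re
from typing import Any, Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def build_heading_tree(markdown_text: str) -> List[Dict[str, Any]]:
    """Nest headings by '#' count: '##' becomes a child of the nearest '#'."""
    root: Dict[str, Any] = {"title": None, "level": 0, "nodes": []}
    stack = [root]  # stack[-1] is the most recently opened heading
    in_code_fence = False
    for line in markdown_text.splitlines():
        # Lines inside fenced code blocks are not headings, even if they start with '#'.
        if line.lstrip().startswith(FENCE):
            in_code_fence = not in_code_fence
            continue
        if in_code_fence:
            continue
        match = HEADING_RE.match(line)
        if not match:
            continue
        node = {"title": match.group(2), "level": len(match.group(1)), "nodes": []}
        # Close headings at the same or a deeper level before attaching.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root["nodes"]


sample = "# Intro\n## Background\n### Details\n## Scope\n# Methods\n"
heading_tree = build_heading_tree(sample)
print([n["title"] for n in heading_tree])  # ['Intro', 'Methods']
print([n["title"] for n in heading_tree[0]["nodes"]])  # ['Background', 'Scope']
```

A file whose heading levels were flattened during conversion (for example, every heading emitted as "##") would produce a flat list here, which is why the note recommends a hierarchy-preserving conversion first.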

<!-- 
# ☁️ Improved Tree Generation with PageIndex OCR

This repo is designed for generating PageIndex tree structure for simple PDFs, but many real-world use cases involve complex PDFs that are hard to parse by classic Python tools. However, extracting high-quality text from PDF documents remains a non-trivial challenge. Most OCR tools only extract page-level content, losing the broader document context and hierarchy.

To address this, we introduced PageIndex OCR — the first long-context OCR model designed to preserve the global structure of documents. PageIndex OCR significantly outperforms other leading OCR tools, such as those from Mistral and Contextual AI, in recognizing true hierarchy and semantic relationships across document pages.

- Experience next-level OCR quality with PageIndex OCR at our [Dashboard](https://dash.pageindex.ai/).
- Integrate PageIndex OCR seamlessly into your stack via our [API](https://docs.pageindex.ai/quickstart).

<p align="center">
  <img src="https://github.com/user-attachments/assets/eb35d8ae-865c-4e60-a33b-ebbd00c41732" width="80%">
</p>
-->

---

# 📈 Case Study: PageIndex Leads Finance QA Benchmark

[Mafin 2.5](https://vectify.ai/mafin) is a reasoning-based RAG system for financial document analysis, powered by **PageIndex**. It achieved a state-of-the-art [**98.7% accuracy**](https://vectify.ai/blog/Mafin2.5) on the [FinanceBench](https://arxiv.org/abs/2311.11944) benchmark, significantly outperforming traditional vector-based RAG systems.

PageIndex's hierarchical indexing and reasoning-driven retrieval enable precise navigation and extraction of relevant context from complex financial reports, such as SEC filings and earnings disclosures.

Explore the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) and our [blog post](https://vectify.ai/blog/Mafin2.5) for detailed comparisons and performance metrics.

<div align="center">
  <a href="https://github.com/VectifyAI/Mafin2.5-FinanceBench">
    <img src="https://github.com/user-attachments/assets/571aa074-d803-43c7-80c4-a04254b782a3" width="70%">
  </a>
</div>

---

# 🧭 Resources

* 🧪 [Cookbooks](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): hands-on, runnable examples and advanced use cases.
* 📖 [Tutorials](https://docs.pageindex.ai/doc-search): practical guides and strategies, including *Document Search* and *Tree Search*.
* 📝 [Blog](https://pageindex.ai/blog): technical articles, research insights, and product updates.
* 🔌 [MCP setup](https://pageindex.ai/mcp#quick-setup) & [API docs](https://docs.pageindex.ai/quickstart): integration details and configuration options.

---

# ⭐ Support Us
Please cite this work as:
```
Mingtian Zhang, Yu Tang and PageIndex Team,
"PageIndex: Next-Generation Vectorless, Reasoning-based RAG",
PageIndex Blog, Sep 2025.
```

Or use the BibTeX citation:

```
@article{zhang2025pageindex,
  author = {Mingtian Zhang and Yu Tang and PageIndex Team},
  title = {PageIndex: Next-Generation Vectorless, Reasoning-based RAG},
  journal = {PageIndex Blog},
  year = {2025},
  month = {September},
  note = {https://pageindex.ai/blog/pageindex-intro},
}
```

Leave us a star 🌟 if you like our project. Thank you!  

<p>
  <img src="https://github.com/user-attachments/assets/eae4ff38-48ae-4a7c-b19f-eab81201d794" width="80%">
</p>

### Connect with Us

[![Twitter](https://img.shields.io/badge/Twitter-000000?style=for-the-badge&logo=x&logoColor=white)](https://x.com/PageIndexAI)&nbsp;
[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/company/vectify-ai/)&nbsp;
[![Discord](https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.com/invite/VuXuf29EUj)&nbsp;
[![Contact Us](https://img.shields.io/badge/Contact_Us-3B82F6?style=for-the-badge&logo=envelope&logoColor=white)](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)

---

© 2025 [Vectify AI](https://vectify.ai)


================================================
FILE: cookbook/README.md
================================================
### 🧪 Cookbooks:

* [**Vectorless RAG notebook**](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb): A *minimal*, hands-on example of reasoning-based RAG using **PageIndex** — no vectors, no chunking, and human-like retrieval.
* [Vision-based Vectorless RAG notebook](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb): no OCR; reasoning-native RAG pipeline that retrieves and reasons directly over page images.

<div align="center">
  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb" target="_blank" rel="noopener">
    <img src="https://img.shields.io/badge/Open_In_Colab-Vectorless_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vectorless RAG" />
  </a>
  &nbsp;&nbsp;
  <a href="https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb" target="_blank" rel="noopener">
    <img src="https://img.shields.io/badge/Open_In_Colab-Vision_RAG-orange?style=for-the-badge&logo=googlecolab" alt="Open in Colab: Vision RAG" />
  </a>
</div>

================================================
FILE: cookbook/agentic_retrieval.ipynb
================================================
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XTboY7brzyp2"
      },
      "source": [
        "![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EtjMbl9Pz3S-"
      },
      "source": [
        "<p align=\"center\">Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</p>\n",
        "\n",
        "<p align=\"center\">\n",
        "  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://chat.pageindex.ai\">🖥️ Platform</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://github.com/VectifyAI/PageIndex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
        "</p>\n",
        "\n",
        "<div align=\"center\">\n",
        "\n",
        "[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\n",
        "\n",
        "</div>\n",
        "\n",
        "---\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bbC9uLWCz8zl"
      },
      "source": [
        "# Agentic Retrieval with PageIndex Chat API\n",
        "\n",
        "Similarity-based RAG built on vector databases has shown significant limitations in recent AI applications, and reasoning-based or agentic retrieval has become increasingly important. However, unlike a classic RAG pipeline (embedding input, top-K chunk retrieval, re-ranking), what should an agentic-native retrieval API look like?\n",
        "\n",
        "For an agentic-native retrieval system, we need the ability to prompt for retrieval just as naturally as you interact with ChatGPT. Below, we provide an example of how the PageIndex Chat API enables this style of prompt-driven retrieval.\n",
        "\n",
        "\n",
        "## PageIndex Chat API\n",
        "[PageIndex Chat](https://chat.pageindex.ai/) is an AI assistant that lets you chat with multiple very long documents without worrying about limited context or context rot. It is based on [PageIndex](https://pageindex.ai/blog/pageindex-intro), a vectorless, reasoning-based RAG framework that gives more transparent and reliable results, much like a human expert.\n",
        "<div align=\"center\">\n",
        "  <img src=\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\" width=\"70%\">\n",
        "</div>\n",
        "\n",
        "You can now access PageIndex Chat via the API or SDK.\n",
        "\n",
        "## 📝 Notebook Overview\n",
        "\n",
        "This notebook demonstrates a simple, minimal example of agentic retrieval with PageIndex. You will learn:\n",
        "- [x] How to use PageIndex Chat API.\n",
        "- [x] How to prompt the PageIndex Chat to make it a retrieval system"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "77SQbPoe-LTN"
      },
      "source": [
        "### Install PageIndex SDK"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 36,
      "metadata": {
        "id": "6Eiv_cHf0OXz"
      },
      "outputs": [],
      "source": [
        "%pip install -q --upgrade pageindex"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UR9-qkdD-Om7"
      },
      "source": [
        "### Setup PageIndex"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 60,
      "metadata": {
        "id": "AFzsW4gq0fjh"
      },
      "outputs": [],
      "source": [
        "from pageindex import PageIndexClient\n",
        "\n",
        "# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
        "PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\"\n",
        "pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uvzf9oWL-Ts9"
      },
      "source": [
        "### Upload a document"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 39,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qf7sNRoL0hGw",
        "outputId": "529f53c1-c827-45a7-cf01-41f567d4feaa"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Downloaded https://arxiv.org/pdf/2507.13334.pdf\n",
            "Document Submitted: pi-cmi34m6jy01sg0bqzofch62n8\n"
          ]
        }
      ],
      "source": [
        "import os, requests\n",
        "\n",
        "pdf_url = \"https://arxiv.org/pdf/2507.13334.pdf\"\n",
        "pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
        "os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
        "\n",
        "response = requests.get(pdf_url)\n",
        "with open(pdf_path, \"wb\") as f:\n",
        "    f.write(response.content)\n",
        "print(f\"Downloaded {pdf_url}\")\n",
        "\n",
        "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
        "print('Document Submitted:', doc_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U4hpLB4T-fCt"
      },
      "source": [
        "### Check the processing status"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 61,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "PB1S_CWd2n87",
        "outputId": "472a64ab-747d-469c-9e46-3329456df212"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'createdAt': '2025-11-16T08:36:41.177000',\n",
            " 'description': 'This survey provides a comprehensive overview and taxonomy of '\n",
            "                'Context Engineering for Large Language Models, covering '\n",
            "                'foundational components, system implementations, evaluation '\n",
            "                'methods, and future research directions.',\n",
            " 'id': 'pi-cmi1gp1hg01t20do2l3bgzwz1',\n",
            " 'name': '2507.13334_19.pdf',\n",
            " 'pageNum': 166,\n",
            " 'status': 'completed'}\n",
            "\n",
            " Document ready! (166 pages)\n"
          ]
        }
      ],
      "source": [
        "from pprint import pprint\n",
        "\n",
        "doc_info = pi_client.get_document(doc_id)\n",
        "pprint(doc_info)\n",
        "\n",
        "if doc_info['status'] == 'completed':\n",
        "  print(f\"\\n Document ready! ({doc_info['pageNum']} pages)\")\n",
        "elif doc_info['status'] == 'processing':\n",
        "  print(\"\\n Document is still processing. Please wait and check again.\")"
      ]
    },
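    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The status check above runs only once. If processing takes a while, a small polling loop can wait for the document instead. The cell below is a minimal sketch, not part of the PageIndex SDK: `wait_until_ready`, `interval_s`, and `timeout_s` are hypothetical names chosen for illustration, and the loop assumes that any status other than `completed` means the document is not ready yet."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "# Hypothetical helper (not part of the PageIndex SDK): poll get_document()\n",
        "# until the status is 'completed' or the timeout expires.\n",
        "def wait_until_ready(client, doc_id, interval_s=5, timeout_s=600):\n",
        "    deadline = time.monotonic() + timeout_s\n",
        "    while True:\n",
        "        info = client.get_document(doc_id)\n",
        "        if info['status'] == 'completed':\n",
        "            return info\n",
        "        # Assumption: any other status means the document is not ready yet.\n",
        "        if time.monotonic() >= deadline:\n",
        "            raise TimeoutError(f\"Document {doc_id} not ready after {timeout_s}s\")\n",
        "        time.sleep(interval_s)\n",
        "\n",
        "doc_info = wait_until_ready(pi_client, doc_id)\n",
        "print(f\"Document ready! ({doc_info['pageNum']} pages)\")"
      ]
    },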
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z1C9FOvO-p1m"
      },
      "source": [
        "### Ask a question about this document"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 55,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "X3RbQvy_0nt7",
        "outputId": "9bfb314b-24ad-4eb2-d26c-01be5728d3cc"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I'll help you find the evaluation methods used in this paper. Let me start by examining the document structure to locate the relevant sections.{\"doc_name\": \"2507.13334_19.pdf\"}Perfect! I can see there's a dedicated section on \"Evaluation\" (node_id: 0015) that covers pages 45-50. Let me extract the content from those pages to get detailed information about the evaluation methods.{\"doc_name\": \"2507.13334_19.pdf\", \"pages\": \"45-50\"}Based on the comprehensive evaluation section of the paper, here are the **evaluation methods** used:\n",
            "\n",
            "## Evaluation Framework Overview\n",
            "\n",
            "The paper presents a comprehensive evaluation framework organized into **Component-Level Assessment** and **System-Level Integration Assessment**.\n",
            "\n",
            "### 1. **Component-Level Assessment (Intrinsic Evaluation)**\n",
            "\n",
            "#### Prompt Engineering Evaluation:\n",
            "- **Semantic similarity metrics**\n",
            "- **Response quality assessment**\n",
            "- **Robustness testing** across diverse input variations\n",
            "- **Contextual calibration** assessment\n",
            "\n",
            "#### Long Context Processing Evaluation:\n",
            "- **\"Needle in a haystack\"** evaluation paradigm - tests models' ability to retrieve specific information embedded within long contexts\n",
            "- **Multi-document reasoning tasks** - assess synthesis capabilities\n",
            "- **Position interpolation techniques** evaluation\n",
            "- **Information retention, positional bias, and reasoning coherence** metrics\n",
            "\n",
            "#### Self-Contextualization Evaluation:\n",
            "- **Meta-learning assessments**\n",
            "- **Adaptation speed measurements**\n",
            "- **Consistency analysis** across multiple iterations\n",
            "- Self-refinement frameworks: **Self-Refine, Reflexion, N-CRITICS**\n",
            "- Performance improvements measured (~20% improvement with GPT-4)\n",
            "\n",
            "#### Structured/Relational Data Integration:\n",
            "- **Knowledge graph traversal accuracy**\n",
            "- **Table comprehension assessment**\n",
            "- **Database query generation evaluation**\n",
            "\n",
            "### 2. **System-Level Integration Assessment (Extrinsic Evaluation)**\n",
            "\n",
            "#### Retrieval-Augmented Generation (RAG):\n",
            "- **Precision, recall, relevance metrics**\n",
            "- **Factual accuracy assessment**\n",
            "- **Task decomposition accuracy**\n",
            "- **Multi-plan selection effectiveness**\n",
            "- Memory-augmented planning evaluation\n",
            "\n",
            "#### Memory Systems Evaluation:\n",
            "- **LongMemEval benchmark** (500 curated questions covering):\n",
            "  - Information extraction\n",
            "  - Temporal reasoning\n",
            "  - Multi-session reasoning\n",
            "  - Knowledge updates\n",
            "- Dedicated benchmarks: **NarrativeQA, QMSum, QuALITY, MEMENTO**\n",
            "- Accuracy degradation tracking (~30% degradation in extended interactions)\n",
            "\n",
            "#### Tool-Integrated Reasoning:\n",
            "- **MCP-RADAR framework** for standardized evaluation\n",
            "- **Berkeley Function Calling Leaderboard (BFCL)** - 2,000 test cases\n",
            "- **T-Eval** - 553 tool-use cases\n",
            "- **API-Bank** - 73 APIs, 314 dialogues\n",
            "- **ToolHop** - 995 queries, 3,912 tools\n",
            "- **StableToolBench** for API instability\n",
            "- **WebArena** and **Mind2Web** for web agents\n",
            "- **VideoWebArena** for multimodal agents\n",
            "- Metrics: tool selection accuracy, parameter extraction precision, execution success rates, error recovery\n",
            "\n",
            "#### Multi-Agent Systems:\n",
            "- **Communication effectiveness metrics**\n",
            "- **Coordination efficiency assessment**\n",
            "- **Protocol adherence evaluation**\n",
            "- **Task decomposition accuracy**\n",
            "- **Emergent collaborative behaviors** assessment\n",
            "- Context handling and transaction support evaluation\n",
            "\n",
            "### 3. **Emerging Evaluation Paradigms**\n",
            "\n",
            "#### Self-Refinement Evaluation:\n",
            "- Iterative improvement assessment across multiple cycles\n",
            "- Multi-dimensional feedback mechanisms\n",
            "- Ensemble-based evaluation approaches\n",
            "\n",
            "#### Multi-Aspect Feedback:\n",
            "- Correctness, relevance, clarity, and robustness dimensions\n",
            "- Self-rewarding mechanisms for autonomous evolution\n",
            "\n",
            "#### Criticism-Guided Evaluation:\n",
            "- Specialized critic models providing detailed feedback\n",
            "- Fine-grained assessment of reasoning quality, factual accuracy, logical consistency\n",
            "\n",
            "### 4. **Safety and Robustness Assessment**\n",
            "\n",
            "- **Adversarial attack resistance testing**\n",
            "- **Distribution shift evaluation**\n",
            "- **Input perturbation testing**\n",
            "- **Alignment assessment** (adherence to intended behaviors)\n",
            "- **Graceful degradation strategies**\n",
            "- **Error recovery protocols**\n",
            "- **Long-term behavior consistency** evaluation\n",
            "\n",
            "### Key Benchmarks Mentioned:\n",
            "- GAIA (general assistant tasks - 92% human vs 15% GPT-4 accuracy)\n",
            "- GTA benchmark (GPT-4 <50% task completion vs 92% human)\n",
            "- WebArena Leaderboard (with success rates ranging from 23.5% to 61.7%)\n",
            "\n",
            "### Challenges Identified:\n",
            "- Traditional metrics (BLEU, ROUGE, perplexity) inadequate for complex systems\n",
            "- Need for \"living\" benchmarks that co-evolve with AI capabilities\n",
            "- Longitudinal evaluation frameworks for tracking memory fidelity over time\n",
            "- Compositional generalization assessment\n",
            "- Evaluation of \"unknown unknowns\" in multi-agent systems\n",
            "\n",
            "The paper emphasizes a **paradigm shift from static benchmarks to dynamic, holistic assessments** that evaluate not just task success but reasoning quality, robustness, and long-term autonomy."
          ]
        }
      ],
      "source": [
        "query = \"What are the evaluation methods used in this paper?\"\n",
        "\n",
        "for chunk in pi_client.chat_completions(\n",
        "    messages=[{\"role\": \"user\", \"content\": query}],\n",
        "    doc_id=doc_id,\n",
        "    stream=True\n",
        "):\n",
        "    print(chunk, end='', flush=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RXrdnxv92LY1"
      },
      "source": [
        "## Agentic Retrieval with PageIndex Chat API\n",
        "\n",
        "You can now easily prompt the PageIndex Chat API to be a retrieval assistant."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 62,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "puyOGkAi0wrH",
        "outputId": "b08c682f-abfb-4381-8bb0-726d210120ad"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I'll help you find the evaluation methods discussed in this paper. Let me start by examining the document structure to locate relevant sections.{\"doc_name\": \"2507.13334_22.pdf\"}Now I can see the document structure. Section 6 is dedicated to \"Evaluation\" and spans pages 45-50. Let me retrieve the content from those pages to get the detailed evaluation methods.{\"doc_name\": \"2507.13334_22.pdf\", \"pages\": \"45-50\"}Perfect! I have retrieved the comprehensive evaluation section. Now let me compile the evaluation methods in the requested JSON format.\n",
            "\n",
            "```json\n",
            "[\n",
            "  {\n",
            "    \"page\": 45,\n",
            "    \"content\": \"## 6. Evaluation\\n\\nThe evaluation of context-engineered systems presents unprecedented challenges that transcend traditional language model assessment paradigms. These systems exhibit complex, multi-component architectures with dynamic, context-dependent behaviors requiring comprehensive evaluation frameworks that assess component-level diagnostics, task-based performance, and overall system robustness [841, 1141].\\n\\nThe heterogeneous nature of context engineering components-spanning retrieval mechanisms, memory systems, reasoning chains, and multi-agent coordination-demands evaluation methodologies that can capture both individual component effectiveness and emergent system-level behaviors [314, 939].\\n\\n### 6.1. Evaluation Frameworks and Methodologies\\n\\nThis subsection presents comprehensive approaches for evaluating both individual components and integrated systems in context engineering.\\n\\n#### 6.1.1. Component-Level Assessment\\n\\nIntrinsic evaluation focuses on the performance of individual components in isolation, providing foundational insights into system capabilities and failure modes.\\n\\nFor prompt engineering components, evaluation encompasses prompt effectiveness measurement through semantic similarity metrics, response quality assessment, and robustness testing across diverse input variations. Current approaches reveal brittleness and robustness challenges in prompt design, necessitating more sophisticated evaluation frameworks that can assess contextual calibration and adaptive prompt optimization $[1141,669]$.\"\n",
            "  },\n",
            "  {\n",
            "    \"page\": 46,\n",
            "    \"content\": \"Long context processing evaluation requires specialized metrics addressing information retention, positional bias, and reasoning coherence across extended sequences. The \\\"needle in a haystack\\\" evaluation paradigm tests models' ability to retrieve specific information embedded within long contexts, while multi-document reasoning tasks assess synthesis capabilities across multiple information sources. Position interpolation techniques and ultra-long sequence processing methods face significant computational challenges that limit practical evaluation scenarios [737, 299].\\n\\nSelf-contextualization mechanisms undergo evaluation through meta-learning assessments, adaptation speed measurements, and consistency analysis across multiple iterations. Self-refinement frameworks including Self-Refine, Reflexion, and N-CRITICS demonstrate substantial performance improvements, with GPT-4 achieving approximately 20\\\\% improvement through iterative self-refinement processes [741, 964, 795]. Multi-dimensional feedback mechanisms and ensemble-based evaluation approaches provide comprehensive assessment of autonomous evolution capabilities [583, 710].\\n\\nStructured and relational data integration evaluation examines accuracy in knowledge graph traversal, table comprehension, and database query generation. However, current evaluation frameworks face significant limitations in assessing structural reasoning capabilities, with high-quality structured training data development presenting ongoing challenges. LSTM-based models demonstrate increased errors when sequential and structural information conflict, highlighting the need for more sophisticated benchmarks testing structural understanding $[769,674,167]$.\\n\\n#### 6.1.2. 
System-Level Integration Assessment\\n\\nExtrinsic evaluation measures end-to-end performance on downstream tasks, providing holistic assessments of system utility through comprehensive benchmarks spanning question answering, reasoning, and real-world applications.\\n\\nSystem-level evaluation must capture emergent behaviors arising from component interactions, including synergistic effects where combined components exceed individual performance and potential interference patterns where component integration degrades overall effectiveness [841, 1141].\\n\\nRetrieval-Augmented Generation evaluation encompasses both retrieval quality and generation effectiveness through comprehensive metrics addressing precision, recall, relevance, and factual accuracy. Agentic RAG systems introduce additional complexity requiring evaluation of task decomposition accuracy, multi-plan selection effectiveness, and memory-augmented planning capabilities. Self-reflection mechanisms demonstrate iterative improvement through feedback loops, with MemoryBank implementations incorporating Ebbinghaus Forgetting Curve principles for enhanced memory evaluation [444, 166, 1372, 1192, 41].\\n\\nMemory systems evaluation encounters substantial difficulties stemming from the absence of standardized assessment frameworks and the inherently stateless characteristics of contemporary LLMs. LongMemEval offers 500 carefully curated questions that evaluate fundamental capabilities encompassing information extraction, temporal reasoning, multi-session reasoning, and knowledge updates. Commercial AI assistants exhibit $30 \\\\%$ accuracy degradation throughout extended interactions, underscoring significant deficiencies in memory persistence and retrieval effectiveness [1340, 1180, 463, 847, 390]. 
Dedicated benchmarks such as NarrativeQA, QMSum, QuALITY, and MEMENTO tackle episodic memory evaluation challenges [556, 572].\\n\\nTool-integrated reasoning systems require comprehensive evaluation covering the entire interaction trajectory, including tool selection accuracy, parameter extraction precision, execution success rates, and error recovery capabilities. The MCP-RADAR framework provides standardized evaluation employing objective metrics for software engineering and mathematical reasoning domains. Real-world evaluation reveals\"\n",
            "  },\n",
            "  {\n",
            "    \"page\": 47,\n",
            "    \"content\": \"significant performance gaps, with GPT-4 completing less than 50\\\\% of tasks in the GTA benchmark, compared to human performance of $92 \\\\%$ [314, 1098, 126, 939]. Advanced benchmarks including BFCL (2,000 testing cases), T-Eval (553 tool-use cases), API-Bank (73 APIs, 314 dialogues), and ToolHop ( 995 queries, 3,912 tools) address multi-turn interactions and nested tool calling scenarios [263, 363, 377, 1264, 160, 835].\\n\\nMulti-agent systems evaluation captures communication effectiveness, coordination efficiency, and collective outcome quality through specialized metrics addressing protocol adherence, task decomposition accuracy, and emergent collaborative behaviors. Contemporary orchestration frameworks including LangGraph, AutoGen, and CAMEL demonstrate insufficient transaction support, with validation limitations emerging as systems rely exclusively on LLM self-validation capabilities without independent validation procedures. Context handling failures compound challenges as agents struggle with long-term context maintenance encompassing both episodic and semantic information [128, 394, 901].\\n\\n### 6.2. Benchmark Datasets and Evaluation Paradigms\\n\\nThis subsection reviews specialized benchmarks and evaluation paradigms designed for assessing context engineering system performance.\\n\\n#### 6.2.1. Foundational Component Benchmarks\\n\\nLong context processing evaluation employs specialized benchmark suites designed to test information retention, reasoning, and synthesis across extended sequences. Current benchmarks face significant computational complexity challenges, with $\\\\mathrm{O}\\\\left(\\\\mathrm{n}^{2}\\\\right)$ scaling limitations in attention mechanisms creating substantial memory constraints for ultra-long sequences. 
Position interpolation and extension techniques require sophisticated evaluation frameworks that can assess both computational efficiency and reasoning quality across varying sequence lengths [737, 299, 1236].\\n\\nAdvanced architectures including LongMamba and specialized position encoding methods demonstrate promising directions for long context processing, though evaluation reveals persistent challenges in maintaining coherence across extended sequences. The development of sliding attention mechanisms and memory-efficient implementations requires comprehensive benchmarks that can assess both computational tractability and task performance [1267, 351].\\n\\nStructured and relational data integration benchmarks encompass diverse knowledge representation formats and reasoning patterns. However, current evaluation frameworks face limitations in assessing structural reasoning capabilities, with the development of high-quality structured training data presenting ongoing challenges. Evaluation must address the fundamental tension between sequential and structural information processing, particularly in scenarios where these information types conflict [769, 674, 167].\\n\\n#### 6.2.2. System Implementation Benchmarks\\n\\nRetrieval-Augmented Generation evaluation leverages comprehensive benchmark suites addressing diverse retrieval and generation challenges. Modular RAG architectures demonstrate enhanced flexibility through specialized modules for retrieval, augmentation, and generation, enabling fine-grained evaluation of individual components and their interactions. Graph-enhanced RAG systems incorporating GraphRAG and LightRAG demonstrate improved performance in complex reasoning scenarios, though evaluation frameworks must address the additional complexity of graph traversal and multi-hop reasoning assessment [316, 973, 364].\\n\\nAgentic RAG systems introduce sophisticated planning and reflection mechanisms requiring evaluation\"\n",
            "  },\n",
            "  {\n",
            "    \"page\": 48,\n",
            "    \"content\": \"of task decomposition accuracy, multi-plan selection effectiveness, and iterative refinement capabilities. Real-time and streaming RAG applications present unique evaluation challenges in assessing both latency and accuracy under dynamic information conditions [444, 166, 1192].\\n\\nTool-integrated reasoning system evaluation employs comprehensive benchmarks spanning diverse tool usage scenarios and complexity levels. The Berkeley Function Calling Leaderboard (BFCL) provides 2,000 testing cases with step-by-step and end-to-end assessments measuring call accuracy, pass rates, and win rates across increasingly complex scenarios. T-Eval contributes 553 tool-use cases testing multi-turn interactions and nested tool calling capabilities [263, 1390, 835]. Advanced benchmarks including StableToolBench address API instability challenges, while NesTools evaluates nested tool scenarios and ToolHop assesses multi-hop tool usage across 995 queries and 3,912 tools [363, 377, 1264].\\n\\nWeb agent evaluation frameworks including WebArena and Mind2Web provide comprehensive assessment across thousands of tasks spanning 137 websites, revealing significant performance gaps in current LLM capabilities for complex web interactions. VideoWebArena extends evaluation to multimodal agents, while Deep Research Bench and DeepShop address specialized evaluation for research and shopping agents respectively $[1378,206,87,482]$.\\n\\nMulti-agent system evaluation employs specialized frameworks addressing coordination, communication, and collective intelligence. However, current frameworks face significant challenges in transactional integrity across complex workflows, with many systems lacking adequate compensation mechanisms for partial failures. 
Orchestration evaluation must address context management, coordination strategy effectiveness, and the ability to maintain system coherence under varying operational conditions [128, 901].\\n\\n| Release Date | Open Source | Method / Model | Success Rate (\\\\%) | Source |\\n| :-- | :--: | :-- | :--: | :-- |\\n| $2025-02$ | $\\\\times$ | IBM CUGA | 61.7 | $[753]$ |\\n| $2025-01$ | $\\\\times$ | OpenAI Operator | 58.1 | $[813]$ |\\n| $2024-08$ | $\\\\times$ | Jace.AI | 57.1 | $[476]$ |\\n| $2024-12$ | $\\\\times$ | ScribeAgent + GPT-4o | 53.0 | $[950]$ |\\n| $2025-01$ | $\\\\checkmark$ | AgentSymbiotic | 52.1 | $[1323]$ |\\n| $2025-01$ | $\\\\checkmark$ | Learn-by-Interact | 48.0 | $[998]$ |\\n| $2024-10$ | $\\\\checkmark$ | AgentOccam-Judge | 45.7 | $[1231]$ |\\n| $2024-08$ | $\\\\times$ | WebPilot | 37.2 | $[1331]$ |\\n| $2024-10$ | $\\\\checkmark$ | GUI-API Hybrid Agent | 35.8 | $[988]$ |\\n| $2024-09$ | $\\\\checkmark$ | Agent Workflow Memory | 35.5 | $[1144]$ |\\n| $2024-04$ | $\\\\checkmark$ | SteP | 33.5 | $[979]$ |\\n| $2025-06$ | $\\\\checkmark$ | TTI | 26.1 | $[951]$ |\\n| $2024-04$ | $\\\\checkmark$ | BrowserGym + GPT-4 | 23.5 | $[238]$ |\\n\\nTable 8: WebArena [1378] Leaderboard: Top performing models with their success rates and availability status.\\n\\n### 6.3. Evaluation Challenges and Emerging Paradigms\\n\\nThis subsection identifies current limitations in evaluation methodologies and explores emerging approaches for more effective assessment.\"\n",
            "  },\n",
            "  {\n",
            "    \"page\": 49,\n",
            "    \"content\": \"#### 6.3.1. Methodological Limitations and Biases\\n\\nTraditional evaluation metrics prove fundamentally inadequate for capturing the nuanced, dynamic behaviors exhibited by context-engineered systems. Static metrics like BLEU, ROUGE, and perplexity, originally designed for simpler text generation tasks, fail to assess complex reasoning chains, multi-step interactions, and emergent system behaviors. The inherent complexity and interdependencies of multi-component systems create attribution challenges where isolating failures and identifying root causes becomes computationally and methodologically intractable. Future metrics must evolve to capture not just task success, but the quality and robustness of the underlying reasoning process, especially in scenarios requiring compositional generalization and creative problem-solving [841, 1141].\\n\\nMemory system evaluation faces particular challenges due to the lack of standardized benchmarks and the stateless nature of current LLMs. Automated memory testing frameworks must address the isolation problem where different memory testing stages cannot be effectively separated, leading to unreliable assessment results. Commercial AI assistants demonstrate significant performance degradation during sustained interactions, with accuracy drops of up to $30 \\\\%$ highlighting critical gaps in current evaluation methodologies and pointing to the need for longitudinal evaluation frameworks that track memory fidelity over time $[1340,1180,463]$.\\n\\nTool-integrated reasoning system evaluation reveals substantial performance gaps between current systems and human-level capabilities. The GAIA benchmark demonstrates that while humans achieve $92 \\\\%$ accuracy on general assistant tasks, advanced models like GPT-4 achieve only $15 \\\\%$ accuracy, indicating fundamental limitations in current evaluation frameworks and system capabilities [778, 1098, 126]. 
Evaluation frameworks must address the complexity of multi-tool coordination, error recovery, and adaptive tool selection across diverse operational contexts [314, 939].\\n\\n#### 6.3.2. Emerging Evaluation Paradigms\\n\\nSelf-refinement evaluation paradigms leverage iterative improvement mechanisms to assess system capabilities across multiple refinement cycles. Frameworks including Self-Refine, Reflexion, and N-CRITICS demonstrate substantial performance improvements through multi-dimensional feedback and ensemblebased evaluation approaches. GPT-4 achieves approximately 20\\\\% improvement through self-refinement processes, highlighting the importance of evaluating systems across multiple iteration cycles rather than single-shot assessments. However, a key future challenge lies in evaluating the meta-learning capability itself—not just whether the system improves, but how efficiently and robustly it learns to refine its strategies over time $[741,964,795,583]$.\\n\\nMulti-aspect feedback evaluation incorporates diverse feedback dimensions including correctness, relevance, clarity, and robustness, providing comprehensive assessment of system outputs. Self-rewarding mechanisms enable autonomous evolution and meta-learning assessment, allowing systems to develop increasingly sophisticated evaluation criteria through iterative refinement [710].\\n\\nCriticism-guided evaluation employs specialized critic models to provide detailed feedback on system outputs, enabling fine-grained assessment of reasoning quality, factual accuracy, and logical consistency. 
These approaches address the limitations of traditional metrics by providing contextual, content-aware evaluation that can adapt to diverse task requirements and output formats [795, 583].\\n\\nOrchestration evaluation frameworks address the unique challenges of multi-agent coordination by incorporating transactional integrity assessment, context management evaluation, and coordination strategy effectiveness measurement. Advanced frameworks including SagaLLM provide transaction support and\"\n",
            "  },\n",
            "  {\n",
            "    \"page\": 50,\n",
            "    \"content\": \"independent validation procedures to address the limitations of systems that rely exclusively on LLM selfvalidation capabilities $[128,394]$.\\n\\n#### 6.3.3. Safety and Robustness Assessment\\n\\nSafety-oriented evaluation incorporates comprehensive robustness testing, adversarial attack resistance, and alignment assessment to ensure responsible development of context-engineered systems. Particular attention must be paid to the evaluation of agentic systems that can operate autonomously across extended periods, as these systems present unique safety challenges that traditional evaluation frameworks cannot adequately address $[973,364]$.\\n\\nRobustness evaluation must assess system performance under distribution shifts, input perturbations, and adversarial conditions through comprehensive stress testing protocols. Multi-agent systems face additional challenges in coordination failure scenarios, where partial system failures can cascade through the entire agent network. Evaluation frameworks must address graceful degradation strategies, error recovery protocols, and the ability to maintain system functionality under adverse conditions. Beyond predefined failure modes, future evaluation must grapple with assessing resilience to \\\"unknown unknowns\\\"-emergent and unpredictable failure cascades in highly complex, autonomous multi-agent systems [128, 394].\\n\\nAlignment evaluation measures system adherence to intended behaviors, value consistency, and beneficial outcome optimization through specialized assessment frameworks. Context engineering systems present unique alignment challenges due to their dynamic adaptation capabilities and complex interaction patterns across multiple components. 
Long-term evaluation must assess whether systems maintain beneficial behaviors as they adapt and evolve through extended operational periods [901].\\n\\nLooking ahead, the evaluation of context-engineered systems requires a paradigm shift from static benchmarks to dynamic, holistic assessments. Future frameworks must move beyond measuring task success to evaluating compositional generalization for novel problems and tracking long-term autonomy in interactive environments. The development of 'living' benchmarks that co-evolve with AI capabilities, alongside the integration of socio-technical and economic metrics, will be critical for ensuring these advanced systems are not only powerful but also reliable, efficient, and aligned with human values in real-world applications $[314,1378,1340]$.\\n\\nThe evaluation landscape for context-engineered systems continues evolving rapidly as new architectures, capabilities, and applications emerge. Future evaluation paradigms must address increasing system complexity while providing reliable, comprehensive, and actionable insights for system improvement and deployment decisions. The integration of multiple evaluation approaches-from component-level assessment to systemwide robustness testing-represents a critical research priority for ensuring the reliable deployment of context-engineered systems in real-world applications [841, 1141].\"\n",
            "  }\n",
            "]\n",
            "```"
          ]
        }
      ],
      "source": [
        "retrieval_prompt = f\"\"\"\n",
        "Your job is to retrieve the raw relevant content from the document based on the user's query.\n",
        "\n",
        "Query: {query}\n",
        "\n",
        "Return in JSON format:\n",
        "```json\n",
        "[\n",
        "  {{\n",
        "    \"page\": <number>,\n",
        "    \"content\": \"<raw text>\"\n",
        "  }},\n",
        "  ...\n",
        "]\n",
        "```\n",
        "\"\"\"\n",
        "\n",
        "# Stream the reply and accumulate the chunks so the full JSON can be parsed afterwards.\n",
        "full_response = \"\"\n",
        "\n",
        "for chunk in pi_client.chat_completions(\n",
        "    messages=[{\"role\": \"user\", \"content\": retrieval_prompt}],\n",
        "    doc_id=doc_id,\n",
        "    stream=True\n",
        "):\n",
        "    print(chunk, end='', flush=True)\n",
        "    full_response += chunk"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d-Y9towQ_CiF"
      },
      "source": [
        "### Extract the retrieved JSON results"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 59,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "rwjC65oB05Tt",
        "outputId": "64504ad5-1778-463f-989b-46e18aba2ea6"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Note: you may need to restart the kernel to use updated packages.\n",
            "[{'content': '## 6. Evaluation\\n'\n",
            "             '\\n'\n",
            "             'The evaluation of context-engineered systems presents '\n",
            "             'unprecedented challenges that transcend traditional language '\n",
            "             'model assessment paradigms. These systems exhibit complex, '\n",
            "             'multi-component architectures with dynamic, context-dependent '\n",
            "             'behaviors requiring comprehensive evaluation frameworks that '\n",
            "             'assess component-level diagnostics, task-based performance, and '\n",
            "             'overall system robustness [841, 1141].\\n'\n",
            "             '\\n'\n",
            "             'The heterogeneous nature of context engineering '\n",
            "             'components-spanning retrieval mechanisms, memory systems, '\n",
            "             'reasoning chains, and multi-agent coordination-demands '\n",
            "             'evaluation methodologies that can capture both individual '\n",
            "             'component effectiveness and emergent system-level behaviors '\n",
            "             '[314, 939].\\n'\n",
            "             '\\n'\n",
            "             '### 6.1. Evaluation Frameworks and Methodologies\\n'\n",
            "             '\\n'\n",
            "             'This subsection presents comprehensive approaches for evaluating '\n",
            "             'both individual components and integrated systems in context '\n",
            "             'engineering.\\n'\n",
            "             '\\n'\n",
            "             '#### 6.1.1. Component-Level Assessment\\n'\n",
            "             '\\n'\n",
            "             'Intrinsic evaluation focuses on the performance of individual '\n",
            "             'components in isolation, providing foundational insights into '\n",
            "             'system capabilities and failure modes.\\n'\n",
            "             '\\n'\n",
            "             'For prompt engineering components, evaluation encompasses prompt '\n",
            "             'effectiveness measurement through semantic similarity metrics, '\n",
            "             'response quality assessment, and robustness testing across '\n",
            "             'diverse input variations. Current approaches reveal brittleness '\n",
            "             'and robustness challenges in prompt design, necessitating more '\n",
            "             'sophisticated evaluation frameworks that can assess contextual '\n",
            "             'calibration and adaptive prompt optimization $[1141,669]$.',\n",
            "  'page': 45},\n",
            " {'content': 'Long context processing evaluation requires specialized metrics '\n",
            "             'addressing information retention, positional bias, and reasoning '\n",
            "             'coherence across extended sequences. The \"needle in a haystack\" '\n",
            "             \"evaluation paradigm tests models' ability to retrieve specific \"\n",
            "             'information embedded within long contexts, while multi-document '\n",
            "             'reasoning tasks assess synthesis capabilities across multiple '\n",
            "             'information sources. Position interpolation techniques and '\n",
            "             'ultra-long sequence processing methods face significant '\n",
            "             'computational challenges that limit practical evaluation '\n",
            "             'scenarios [737, 299].\\n'\n",
            "             '\\n'\n",
            "             'Self-contextualization mechanisms undergo evaluation through '\n",
            "             'meta-learning assessments, adaptation speed measurements, and '\n",
            "             'consistency analysis across multiple iterations. Self-refinement '\n",
            "             'frameworks including Self-Refine, Reflexion, and N-CRITICS '\n",
            "             'demonstrate substantial performance improvements, with GPT-4 '\n",
            "             'achieving approximately 20\\\\% improvement through iterative '\n",
            "             'self-refinement processes [741, 964, 795]. Multi-dimensional '\n",
            "             'feedback mechanisms and ensemble-based evaluation approaches '\n",
            "             'provide comprehensive assessment of autonomous evolution '\n",
            "             'capabilities [583, 710].\\n'\n",
            "             '\\n'\n",
            "             'Structured and relational data integration evaluation examines '\n",
            "             'accuracy in knowledge graph traversal, table comprehension, and '\n",
            "             'database query generation. However, current evaluation '\n",
            "             'frameworks face significant limitations in assessing structural '\n",
            "             'reasoning capabilities, with high-quality structured training '\n",
            "             'data development presenting ongoing challenges. LSTM-based '\n",
            "             'models demonstrate increased errors when sequential and '\n",
            "             'structural information conflict, highlighting the need for more '\n",
            "             'sophisticated benchmarks testing structural understanding '\n",
            "             '$[769,674,167]$.\\n'\n",
            "             '\\n'\n",
            "             '#### 6.1.2. System-Level Integration Assessment\\n'\n",
            "             '\\n'\n",
            "             'Extrinsic evaluation measures end-to-end performance on '\n",
            "             'downstream tasks, providing holistic assessments of system '\n",
            "             'utility through comprehensive benchmarks spanning question '\n",
            "             'answering, reasoning, and real-world applications.\\n'\n",
            "             '\\n'\n",
            "             'System-level evaluation must capture emergent behaviors arising '\n",
            "             'from component interactions, including synergistic effects where '\n",
            "             'combined components exceed individual performance and potential '\n",
            "             'interference patterns where component integration degrades '\n",
            "             'overall effectiveness [841, 1141].\\n'\n",
            "             '\\n'\n",
            "             'Retrieval-Augmented Generation evaluation encompasses both '\n",
            "             'retrieval quality and generation effectiveness through '\n",
            "             'comprehensive metrics addressing precision, recall, relevance, '\n",
            "             'and factual accuracy. Agentic RAG systems introduce additional '\n",
            "             'complexity requiring evaluation of task decomposition accuracy, '\n",
            "             'multi-plan selection effectiveness, and memory-augmented '\n",
            "             'planning capabilities. Self-reflection mechanisms demonstrate '\n",
            "             'iterative improvement through feedback loops, with MemoryBank '\n",
            "             'implementations incorporating Ebbinghaus Forgetting Curve '\n",
            "             'principles for enhanced memory evaluation [444, 166, 1372, 1192, '\n",
            "             '41].\\n'\n",
            "             '\\n'\n",
            "             'Memory systems evaluation encounters substantial difficulties '\n",
            "             'stemming from the absence of standardized assessment frameworks '\n",
            "             'and the inherently stateless characteristics of contemporary '\n",
            "             'LLMs. LongMemEval offers 500 carefully curated questions that '\n",
            "             'evaluate fundamental capabilities encompassing information '\n",
            "             'extraction, temporal reasoning, multi-session reasoning, and '\n",
            "             'knowledge updates. Commercial AI assistants exhibit $30 \\\\%$ '\n",
            "             'accuracy degradation throughout extended interactions, '\n",
            "             'underscoring significant deficiencies in memory persistence and '\n",
            "             'retrieval effectiveness [1340, 1180, 463, 847, 390]. Dedicated '\n",
            "             'benchmarks such as NarrativeQA, QMSum, QuALITY, and MEMENTO '\n",
            "             'tackle episodic memory evaluation challenges [556, 572].\\n'\n",
            "             '\\n'\n",
            "             'Tool-integrated reasoning systems require comprehensive '\n",
            "             'evaluation covering the entire interaction trajectory, including '\n",
            "             'tool selection accuracy, parameter extraction precision, '\n",
            "             'execution success rates, and error recovery capabilities. The '\n",
            "             'MCP-RADAR framework provides standardized evaluation employing '\n",
            "             'objective metrics for software engineering and mathematical '\n",
            "             'reasoning domains. Real-world evaluation reveals',\n",
            "  'page': 46},\n",
            " {'content': 'significant performance gaps, with GPT-4 completing less than '\n",
            "             '50\\\\% of tasks in the GTA benchmark, compared to human '\n",
            "             'performance of $92 \\\\%$ [314, 1098, 126, 939]. Advanced '\n",
            "             'benchmarks including BFCL (2,000 testing cases), T-Eval (553 '\n",
            "             'tool-use cases), API-Bank (73 APIs, 314 dialogues), and ToolHop '\n",
            "             '( 995 queries, 3,912 tools) address multi-turn interactions and '\n",
            "             'nested tool calling scenarios [263, 363, 377, 1264, 160, 835].\\n'\n",
            "             '\\n'\n",
            "             'Multi-agent systems evaluation captures communication '\n",
            "             'effectiveness, coordination efficiency, and collective outcome '\n",
            "             'quality through specialized metrics addressing protocol '\n",
            "             'adherence, task decomposition accuracy, and emergent '\n",
            "             'collaborative behaviors. Contemporary orchestration frameworks '\n",
            "             'including LangGraph, AutoGen, and CAMEL demonstrate insufficient '\n",
            "             'transaction support, with validation limitations emerging as '\n",
            "             'systems rely exclusively on LLM self-validation capabilities '\n",
            "             'without independent validation procedures. Context handling '\n",
            "             'failures compound challenges as agents struggle with long-term '\n",
            "             'context maintenance encompassing both episodic and semantic '\n",
            "             'information [128, 394, 901].\\n'\n",
            "             '\\n'\n",
            "             '### 6.2. Benchmark Datasets and Evaluation Paradigms\\n'\n",
            "             '\\n'\n",
            "             'This subsection reviews specialized benchmarks and evaluation '\n",
            "             'paradigms designed for assessing context engineering system '\n",
            "             'performance.\\n'\n",
            "             '\\n'\n",
            "             '#### 6.2.1. Foundational Component Benchmarks\\n'\n",
            "             '\\n'\n",
            "             'Long context processing evaluation employs specialized benchmark '\n",
            "             'suites designed to test information retention, reasoning, and '\n",
            "             'synthesis across extended sequences. Current benchmarks face '\n",
            "             'significant computational complexity challenges, with '\n",
            "             '$\\\\mathrm{O}\\\\left(\\\\mathrm{n}^{2}\\\\right)$ scaling limitations '\n",
            "             'in attention mechanisms creating substantial memory constraints '\n",
            "             'for ultra-long sequences. Position interpolation and extension '\n",
            "             'techniques require sophisticated evaluation frameworks that can '\n",
            "             'assess both computational efficiency and reasoning quality '\n",
            "             'across varying sequence lengths [737, 299, 1236].\\n'\n",
            "             '\\n'\n",
            "             'Advanced architectures including LongMamba and specialized '\n",
            "             'position encoding methods demonstrate promising directions for '\n",
            "             'long context processing, though evaluation reveals persistent '\n",
            "             'challenges in maintaining coherence across extended sequences. '\n",
            "             'The development of sliding attention mechanisms and '\n",
            "             'memory-efficient implementations requires comprehensive '\n",
            "             'benchmarks that can assess both computational tractability and '\n",
            "             'task performance [1267, 351].\\n'\n",
            "             '\\n'\n",
            "             'Structured and relational data integration benchmarks encompass '\n",
            "             'diverse knowledge representation formats and reasoning patterns. '\n",
            "             'However, current evaluation frameworks face limitations in '\n",
            "             'assessing structural reasoning capabilities, with the '\n",
            "             'development of high-quality structured training data presenting '\n",
            "             'ongoing challenges. Evaluation must address the fundamental '\n",
            "             'tension between sequential and structural information '\n",
            "             'processing, particularly in scenarios where these information '\n",
            "             'types conflict [769, 674, 167].\\n'\n",
            "             '\\n'\n",
            "             '#### 6.2.2. System Implementation Benchmarks\\n'\n",
            "             '\\n'\n",
            "             'Retrieval-Augmented Generation evaluation leverages '\n",
            "             'comprehensive benchmark suites addressing diverse retrieval and '\n",
            "             'generation challenges. Modular RAG architectures demonstrate '\n",
            "             'enhanced flexibility through specialized modules for retrieval, '\n",
            "             'augmentation, and generation, enabling fine-grained evaluation '\n",
            "             'of individual components and their interactions. Graph-enhanced '\n",
            "             'RAG systems incorporating GraphRAG and LightRAG demonstrate '\n",
            "             'improved performance in complex reasoning scenarios, though '\n",
            "             'evaluation frameworks must address the additional complexity of '\n",
            "             'graph traversal and multi-hop reasoning assessment [316, 973, '\n",
            "             '364].\\n'\n",
            "             '\\n'\n",
            "             'Agentic RAG systems introduce sophisticated planning and '\n",
            "             'reflection mechanisms requiring evaluation',\n",
            "  'page': 47},\n",
            " {'content': 'of task decomposition accuracy, multi-plan selection '\n",
            "             'effectiveness, and iterative refinement capabilities. Real-time '\n",
            "             'and streaming RAG applications present unique evaluation '\n",
            "             'challenges in assessing both latency and accuracy under dynamic '\n",
            "             'information conditions [444, 166, 1192].\\n'\n",
            "             '\\n'\n",
            "             'Tool-integrated reasoning system evaluation employs '\n",
            "             'comprehensive benchmarks spanning diverse tool usage scenarios '\n",
            "             'and complexity levels. The Berkeley Function Calling Leaderboard '\n",
            "             '(BFCL) provides 2,000 testing cases with step-by-step and '\n",
            "             'end-to-end assessments measuring call accuracy, pass rates, and '\n",
            "             'win rates across increasingly complex scenarios. T-Eval '\n",
            "             'contributes 553 tool-use cases testing multi-turn interactions '\n",
            "             'and nested tool calling capabilities [263, 1390, 835]. Advanced '\n",
            "             'benchmarks including StableToolBench address API instability '\n",
            "             'challenges, while NesTools evaluates nested tool scenarios and '\n",
            "             'ToolHop assesses multi-hop tool usage across 995 queries and '\n",
            "             '3,912 tools [363, 377, 1264].\\n'\n",
            "             '\\n'\n",
            "             'Web agent evaluation frameworks including WebArena and Mind2Web '\n",
            "             'provide comprehensive assessment across thousands of tasks '\n",
            "             'spanning 137 websites, revealing significant performance gaps in '\n",
            "             'current LLM capabilities for complex web interactions. '\n",
            "             'VideoWebArena extends evaluation to multimodal agents, while '\n",
            "             'Deep Research Bench and DeepShop address specialized evaluation '\n",
            "             'for research and shopping agents respectively '\n",
            "             '$[1378,206,87,482]$.\\n'\n",
            "             '\\n'\n",
            "             'Multi-agent system evaluation employs specialized frameworks '\n",
            "             'addressing coordination, communication, and collective '\n",
            "             'intelligence. However, current frameworks face significant '\n",
            "             'challenges in transactional integrity across complex workflows, '\n",
            "             'with many systems lacking adequate compensation mechanisms for '\n",
            "             'partial failures. Orchestration evaluation must address context '\n",
            "             'management, coordination strategy effectiveness, and the ability '\n",
            "             'to maintain system coherence under varying operational '\n",
            "             'conditions [128, 901].\\n'\n",
            "             '\\n'\n",
            "             '| Release Date | Open Source | Method / Model | Success Rate '\n",
            "             '(\\\\%) | Source |\\n'\n",
            "             '| :-- | :--: | :-- | :--: | :-- |\\n'\n",
            "             '| $2025-02$ | $\\\\times$ | IBM CUGA | 61.7 | $[753]$ |\\n'\n",
            "             '| $2025-01$ | $\\\\times$ | OpenAI Operator | 58.1 | $[813]$ |\\n'\n",
            "             '| $2024-08$ | $\\\\times$ | Jace.AI | 57.1 | $[476]$ |\\n'\n",
            "             '| $2024-12$ | $\\\\times$ | ScribeAgent + GPT-4o | 53.0 | $[950]$ '\n",
            "             '|\\n'\n",
            "             '| $2025-01$ | $\\\\checkmark$ | AgentSymbiotic | 52.1 | $[1323]$ '\n",
            "             '|\\n'\n",
            "             '| $2025-01$ | $\\\\checkmark$ | Learn-by-Interact | 48.0 | $[998]$ '\n",
            "             '|\\n'\n",
            "             '| $2024-10$ | $\\\\checkmark$ | AgentOccam-Judge | 45.7 | $[1231]$ '\n",
            "             '|\\n'\n",
            "             '| $2024-08$ | $\\\\times$ | WebPilot | 37.2 | $[1331]$ |\\n'\n",
            "             '| $2024-10$ | $\\\\checkmark$ | GUI-API Hybrid Agent | 35.8 | '\n",
            "             '$[988]$ |\\n'\n",
            "             '| $2024-09$ | $\\\\checkmark$ | Agent Workflow Memory | 35.5 | '\n",
            "             '$[1144]$ |\\n'\n",
            "             '| $2024-04$ | $\\\\checkmark$ | SteP | 33.5 | $[979]$ |\\n'\n",
            "             '| $2025-06$ | $\\\\checkmark$ | TTI | 26.1 | $[951]$ |\\n'\n",
            "             '| $2024-04$ | $\\\\checkmark$ | BrowserGym + GPT-4 | 23.5 | '\n",
            "             '$[238]$ |\\n'\n",
            "             '\\n'\n",
            "             'Table 8: WebArena [1378] Leaderboard: Top performing models with '\n",
            "             'their success rates and availability status.\\n'\n",
            "             '\\n'\n",
            "             '### 6.3. Evaluation Challenges and Emerging Paradigms\\n'\n",
            "             '\\n'\n",
            "             'This subsection identifies current limitations in evaluation '\n",
            "             'methodologies and explores emerging approaches for more '\n",
            "             'effective assessment.',\n",
            "  'page': 48},\n",
            " {'content': '#### 6.3.1. Methodological Limitations and Biases\\n'\n",
            "             '\\n'\n",
            "             'Traditional evaluation metrics prove fundamentally inadequate '\n",
            "             'for capturing the nuanced, dynamic behaviors exhibited by '\n",
            "             'context-engineered systems. Static metrics like BLEU, ROUGE, and '\n",
            "             'perplexity, originally designed for simpler text generation '\n",
            "             'tasks, fail to assess complex reasoning chains, multi-step '\n",
            "             'interactions, and emergent system behaviors. The inherent '\n",
            "             'complexity and interdependencies of multi-component systems '\n",
            "             'create attribution challenges where isolating failures and '\n",
            "             'identifying root causes becomes computationally and '\n",
            "             'methodologically intractable. Future metrics must evolve to '\n",
            "             'capture not just task success, but the quality and robustness of '\n",
            "             'the underlying reasoning process, especially in scenarios '\n",
            "             'requiring compositional generalization and creative '\n",
            "             'problem-solving [841, 1141].\\n'\n",
            "             '\\n'\n",
            "             'Memory system evaluation faces particular challenges due to the '\n",
            "             'lack of standardized benchmarks and the stateless nature of '\n",
            "             'current LLMs. Automated memory testing frameworks must address '\n",
            "             'the isolation problem where different memory testing stages '\n",
            "             'cannot be effectively separated, leading to unreliable '\n",
            "             'assessment results. Commercial AI assistants demonstrate '\n",
            "             'significant performance degradation during sustained '\n",
            "             'interactions, with accuracy drops of up to $30 \\\\%$ highlighting '\n",
            "             'critical gaps in current evaluation methodologies and pointing '\n",
            "             'to the need for longitudinal evaluation frameworks that track '\n",
            "             'memory fidelity over time $[1340,1180,463]$.\\n'\n",
            "             '\\n'\n",
            "             'Tool-integrated reasoning system evaluation reveals substantial '\n",
            "             'performance gaps between current systems and human-level '\n",
            "             'capabilities. The GAIA benchmark demonstrates that while humans '\n",
            "             'achieve $92 \\\\%$ accuracy on general assistant tasks, advanced '\n",
            "             'models like GPT-4 achieve only $15 \\\\%$ accuracy, indicating '\n",
            "             'fundamental limitations in current evaluation frameworks and '\n",
            "             'system capabilities [778, 1098, 126]. Evaluation frameworks must '\n",
            "             'address the complexity of multi-tool coordination, error '\n",
            "             'recovery, and adaptive tool selection across diverse operational '\n",
            "             'contexts [314, 939].\\n'\n",
            "             '\\n'\n",
            "             '#### 6.3.2. Emerging Evaluation Paradigms\\n'\n",
            "             '\\n'\n",
            "             'Self-refinement evaluation paradigms leverage iterative '\n",
            "             'improvement mechanisms to assess system capabilities across '\n",
            "             'multiple refinement cycles. Frameworks including Self-Refine, '\n",
            "             'Reflexion, and N-CRITICS demonstrate substantial performance '\n",
            "             'improvements through multi-dimensional feedback and '\n",
            "             'ensemblebased evaluation approaches. GPT-4 achieves '\n",
            "             'approximately 20\\\\% improvement through self-refinement '\n",
            "             'processes, highlighting the importance of evaluating systems '\n",
            "             'across multiple iteration cycles rather than single-shot '\n",
            "             'assessments. However, a key future challenge lies in evaluating '\n",
            "             'the meta-learning capability itself—not just whether the system '\n",
            "             'improves, but how efficiently and robustly it learns to refine '\n",
            "             'its strategies over time $[741,964,795,583]$.\\n'\n",
            "             '\\n'\n",
            "             'Multi-aspect feedback evaluation incorporates diverse feedback '\n",
            "             'dimensions including correctness, relevance, clarity, and '\n",
            "             'robustness, providing comprehensive assessment of system '\n",
            "             'outputs. Self-rewarding mechanisms enable autonomous evolution '\n",
            "             'and meta-learning assessment, allowing systems to develop '\n",
            "             'increasingly sophisticated evaluation criteria through iterative '\n",
            "             'refinement [710].\\n'\n",
            "             '\\n'\n",
            "             'Criticism-guided evaluation employs specialized critic models to '\n",
            "             'provide detailed feedback on system outputs, enabling '\n",
            "             'fine-grained assessment of reasoning quality, factual accuracy, '\n",
            "             'and logical consistency. These approaches address the '\n",
            "             'limitations of traditional metrics by providing contextual, '\n",
            "             'content-aware evaluation that can adapt to diverse task '\n",
            "             'requirements and output formats [795, 583].\\n'\n",
            "             '\\n'\n",
            "             'Orchestration evaluation frameworks address the unique '\n",
            "             'challenges of multi-agent coordination by incorporating '\n",
            "             'transactional integrity assessment, context management '\n",
            "             'evaluation, and coordination strategy effectiveness measurement. '\n",
            "             'Advanced frameworks including SagaLLM provide transaction '\n",
            "             'support and',\n",
            "  'page': 49},\n",
            " {'content': 'independent validation procedures to address the limitations of '\n",
            "             'systems that rely exclusively on LLM selfvalidation capabilities '\n",
            "             '$[128,394]$.\\n'\n",
            "             '\\n'\n",
            "             '#### 6.3.3. Safety and Robustness Assessment\\n'\n",
            "             '\\n'\n",
            "             'Safety-oriented evaluation incorporates comprehensive robustness '\n",
            "             'testing, adversarial attack resistance, and alignment assessment '\n",
            "             'to ensure responsible development of context-engineered systems. '\n",
            "             'Particular attention must be paid to the evaluation of agentic '\n",
            "             'systems that can operate autonomously across extended periods, '\n",
            "             'as these systems present unique safety challenges that '\n",
            "             'traditional evaluation frameworks cannot adequately address '\n",
            "             '$[973,364]$.\\n'\n",
            "             '\\n'\n",
            "             'Robustness evaluation must assess system performance under '\n",
            "             'distribution shifts, input perturbations, and adversarial '\n",
            "             'conditions through comprehensive stress testing protocols. '\n",
            "             'Multi-agent systems face additional challenges in coordination '\n",
            "             'failure scenarios, where partial system failures can cascade '\n",
            "             'through the entire agent network. Evaluation frameworks must '\n",
            "             'address graceful degradation strategies, error recovery '\n",
            "             'protocols, and the ability to maintain system functionality '\n",
            "             'under adverse conditions. Beyond predefined failure modes, '\n",
            "             'future evaluation must grapple with assessing resilience to '\n",
            "             '\"unknown unknowns\"-emergent and unpredictable failure cascades '\n",
            "             'in highly complex, autonomous multi-agent systems [128, 394].\\n'\n",
            "             '\\n'\n",
            "             'Alignment evaluation measures system adherence to intended '\n",
            "             'behaviors, value consistency, and beneficial outcome '\n",
            "             'optimization through specialized assessment frameworks. Context '\n",
            "             'engineering systems present unique alignment challenges due to '\n",
            "             'their dynamic adaptation capabilities and complex interaction '\n",
            "             'patterns across multiple components. Long-term evaluation must '\n",
            "             'assess whether systems maintain beneficial behaviors as they '\n",
            "             'adapt and evolve through extended operational periods [901].\\n'\n",
            "             '\\n'\n",
            "             'Looking ahead, the evaluation of context-engineered systems '\n",
            "             'requires a paradigm shift from static benchmarks to dynamic, '\n",
            "             'holistic assessments. Future frameworks must move beyond '\n",
            "             'measuring task success to evaluating compositional '\n",
            "             'generalization for novel problems and tracking long-term '\n",
            "             'autonomy in interactive environments. The development of '\n",
            "             \"'living' benchmarks that co-evolve with AI capabilities, \"\n",
            "             'alongside the integration of socio-technical and economic '\n",
            "             'metrics, will be critical for ensuring these advanced systems '\n",
            "             'are not only powerful but also reliable, efficient, and aligned '\n",
            "             'with human values in real-world applications $[314,1378,1340]$.\\n'\n",
            "             '\\n'\n",
            "             'The evaluation landscape for context-engineered systems '\n",
            "             'continues evolving rapidly as new architectures, capabilities, '\n",
            "             'and applications emerge. Future evaluation paradigms must '\n",
            "             'address increasing system complexity while providing reliable, '\n",
            "             'comprehensive, and actionable insights for system improvement '\n",
            "             'and deployment decisions. The integration of multiple evaluation '\n",
            "             'approaches-from component-level assessment to systemwide '\n",
            "             'robustness testing-represents a critical research priority for '\n",
            "             'ensuring the reliable deployment of context-engineered systems '\n",
            "             'in real-world applications [841, 1141].',\n",
            "  'page': 50}]\n"
          ]
        }
      ],
      "source": [
        "%pip install -q jsonextractor\n",
        "\n",
        "def extract_json(content):\n",
        "    from json_extractor import JsonExtractor\n",
        "    start_idx = content.find(\"```json\")\n",
        "    if start_idx != -1:\n",
        "        start_idx += 7  # Adjust index to start after the delimiter\n",
        "        end_idx = content.rfind(\"```\")\n",
        "        json_content = content[start_idx:end_idx].strip()\n",
        "    return JsonExtractor.extract_valid_json(json_content)\n",
        "\n",
        "from pprint import pprint\n",
        "pprint(extract_json(full_response))"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}


================================================
FILE: cookbook/pageIndex_chat_quickstart.ipynb
================================================
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XTboY7brzyp2"
      },
      "source": [
        "![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EtjMbl9Pz3S-"
      },
      "source": [
        "<p align=\"center\">Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</p>\n",
        "\n",
        "<p align=\"center\">\n",
        "  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://chat.pageindex.ai\">🖥️ Platform</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://github.com/VectifyAI/PageIndex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
        "</p>\n",
        "\n",
        "<div align=\"center\">\n",
        "\n",
        "[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\n",
        "\n",
        "</div>\n",
        "\n",
        "---\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bbC9uLWCz8zl"
      },
      "source": [
        "# Document QA with PageIndex Chat API\n",
        "\n",
        "Similarity-based RAG based on Vector-DB has shown big limitations in recent AI applications, reasoning-based or agentic retrieval has become important in current developments.\n",
        "\n",
        "[PageIndex Chat](https://chat.pageindex.ai/) is a AI assistant that allow you chat with multiple super-long documents without worrying about limited context or context rot problem. It is based on [PageIndex](https://pageindex.ai/blog/pageindex-intro), a vectorless reasoning-based RAG framework which gives more transparent and reliable results like a human expert.\n",
        "<div align=\"center\">\n",
        "  <img src=\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\" width=\"70%\">\n",
        "</div>\n",
        "\n",
        "You can now access PageIndex Chat with API or SDK.\n",
        "\n",
        "## 📝 Notebook Overview\n",
        "\n",
        "This notebook demonstrates a simple, minimal example of doing document analysis with PageIndex Chat API on the recently released [NVIDA 10Q report](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "77SQbPoe-LTN"
      },
      "source": [
        "### Install PageIndex SDK"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "6Eiv_cHf0OXz"
      },
      "outputs": [],
      "source": [
        "%pip install -q --upgrade pageindex"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UR9-qkdD-Om7"
      },
      "source": [
        "### Setup PageIndex"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "id": "AFzsW4gq0fjh"
      },
      "outputs": [],
      "source": [
        "from pageindex import PageIndexClient\n",
        "\n",
        "# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
        "PAGEINDEX_API_KEY = \"Your API KEY\"\n",
        "pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uvzf9oWL-Ts9"
      },
      "source": [
        "### Upload a document"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qf7sNRoL0hGw",
        "outputId": "e8c2f3c1-1d1e-4932-f8e9-3272daae6781"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Downloaded https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\n",
            "Document Submitted: pi-cmi73f7r7022y09nwn40paaom\n"
          ]
        }
      ],
      "source": [
        "import os, requests\n",
        "\n",
        "pdf_url = \"https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\"\n",
        "pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
        "os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
        "\n",
        "response = requests.get(pdf_url)\n",
        "with open(pdf_path, \"wb\") as f:\n",
        "    f.write(response.content)\n",
        "print(f\"Downloaded {pdf_url}\")\n",
        "\n",
        "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
        "print('Document Submitted:', doc_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U4hpLB4T-fCt"
      },
      "source": [
        "### Check the processing status"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "PB1S_CWd2n87",
        "outputId": "c1416161-a1d6-4f9e-873c-7f6e26c8fa5f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'createdAt': '2025-11-20T07:11:44.669000',\n",
            " 'description': \"This document is NVIDIA Corporation's Form 10-Q Quarterly \"\n",
            "                'Report for the period ending October 26, 2025, detailing its '\n",
            "                'financial performance, operational results, market risks, and '\n",
            "                'legal proceedings.',\n",
            " 'id': 'pi-cmi73f7r7022y09nwn40paaom',\n",
            " 'name': '13e6981b-95ed-4aac-a602-ebc5865d0590.pdf',\n",
            " 'pageNum': 48,\n",
            " 'status': 'completed'}\n",
            "\n",
            " Document ready! (48 pages)\n"
          ]
        }
      ],
      "source": [
        "from pprint import pprint\n",
        "\n",
        "doc_info = pi_client.get_document(doc_id)\n",
        "pprint(doc_info)\n",
        "\n",
        "if doc_info['status'] == 'completed':\n",
        "  print(f\"\\n Document ready! ({doc_info['pageNum']} pages)\")\n",
        "elif doc_info['status'] == 'processing':\n",
        "  print(\"\\n Document is still processing. Please wait and check again.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z1C9FOvO-p1m"
      },
      "source": [
        "### Ask a question about this document"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "X3RbQvy_0nt7",
        "outputId": "a8bb5468-3f5f-4225-a5fd-60004041a644"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I'll help you find the revenue information from NVIDIA's 10-Q report. Let me first check the document structure to locate the financial information.{\"doc_name\": \"13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\"}Now let me get the content from pages that contain the revenue information. Based on the structure, revenue information should be in the financial statements section, which appears early in the document. Let me check pages 3-8 which should contain the Condensed Consolidated Statements of Income.{\"doc_name\": \"13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\", \"pages\": \"3-5\"}Perfect! I found the revenue information. Let me provide you with the details:\n",
            "\n",
            "## Revenue Information from NVIDIA's 10-Q Report\n",
            "\n",
            "**You can find the revenue on Page 3** of the document in the **Condensed Consolidated Statements of Income**.\n",
            "\n",
            "### Revenue Details:\n",
            "\n",
            "**For the Three Months Ended October 26, 2025 (Q3 FY2026):**\n",
            "- **Revenue: $57,006 million** ($57.0 billion)\n",
            "- Compared to Q3 FY2025 (October 27, 2024): $35,082 million\n",
            "\n",
            "**For the Nine Months Ended October 26, 2025 (First 9 months of FY2026):**\n",
            "- **Revenue: $147,811 million** ($147.8 billion)\n",
            "- Compared to same period FY2025: $91,166 million\n",
            "\n",
            "### Key Highlights:\n",
            "- Q3 revenue increased by **62.5%** year-over-year ($21.9 billion increase)\n",
            "- Nine-month revenue increased by **62.1%** year-over-year ($56.6 billion increase)\n",
            "- This represents strong growth driven primarily by Data Center compute and networking platforms for AI and accelerated computing, with Blackwell architectures being a major contributor\n",
            "\n",
            "The revenue figures are clearly displayed at the top of the Condensed Consolidated Statements of Income on **Page 3** of the 10-Q report."
          ]
        }
      ],
      "source": [
        "query = \"what is the revenue? Also show me which page I can find it.\"\n",
        "\n",
        "for chunk in pi_client.chat_completions(\n",
        "    messages=[{\"role\": \"user\", \"content\": query}],\n",
        "    doc_id=doc_id,\n",
        "    stream=True\n",
        "):\n",
        "    print(chunk, end='', flush=True)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}


================================================
FILE: cookbook/pageindex_RAG_simple.ipynb
================================================
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TCh9BTedHJK1"
      },
      "source": [
        "![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nD0hb4TFHWTt"
      },
      "source": [
        "<p align=\"center\"><i>Reasoning-based RAG&nbsp; ✧ &nbsp;No Vector DB&nbsp; ✧ &nbsp;No Chunking&nbsp; ✧ &nbsp;Human-like Retrieval</i></p>\n",
        "\n",
        "<p align=\"center\">\n",
        "  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://github.com/VectifyAI/PageIndex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
        "</p>\n",
        "\n",
        "<div align=\"center\">\n",
        "\n",
        "[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\n",
        "\n",
        "</div>\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ebvn5qfpcG1K"
      },
      "source": [
        "# Simple Vectorless RAG with PageIndex"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## PageIndex Introduction\n",
        "PageIndex is a new **reasoning-based**, **vectorless RAG** framework that performs retrieval in two steps:  \n",
        "1. Generate a tree structure index of documents  \n",
        "2. Perform reasoning-based retrieval through tree search  \n",
        "\n",
        "<div align=\"center\">\n",
        "  <img src=\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\" width=\"70%\">\n",
        "</div>\n",
        "\n",
        "Compared to traditional vector-based RAG, PageIndex features:\n",
        "- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.\n",
        "- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.\n",
        "- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents. \n",
        "- **Transparent Retrieval Process**: Retrieval based on reasoning — say goodbye to approximate semantic search (\"vibe retrieval\")."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 📝 Notebook Overview\n",
        "\n",
        "This notebook demonstrates a simple, minimal example of **vectorless RAG** with PageIndex. You will learn how to:\n",
        "- [x] Build a PageIndex tree structure of a document\n",
        "- [x] Perform reasoning-based retrieval with tree search\n",
        "- [x] Generate answers based on the retrieved context\n",
        "\n",
        "> ⚡ Note: This is a **minimal example** to illustrate PageIndex's core philosophy and idea, not its full capabilities. More advanced examples are coming soon.\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7ziuTbbWcG1L"
      },
      "source": [
        "## Step 0: Preparation\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "edTfrizMFK4c"
      },
      "source": [
        "#### 0.1 Install PageIndex"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "id": "LaoB58wQFNDh"
      },
      "outputs": [],
      "source": [
        "%pip install -q --upgrade pageindex"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WVEWzPKGcG1M"
      },
      "source": [
        "#### 0.2 Setup PageIndex"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "StvqfcK4cG1M"
      },
      "outputs": [],
      "source": [
        "from pageindex import PageIndexClient\n",
        "import pageindex.utils as utils\n",
        "\n",
        "# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
        "PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\"\n",
        "pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 0.3 Setup LLM\n",
        "\n",
        "Choose your preferred LLM for reasoning-based retrieval. In this example, we use OpenAI’s GPT-4.1."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import openai\n",
        "OPENAI_API_KEY = \"YOUR_OPENAI_API_KEY\"\n",
        "\n",
        "async def call_llm(prompt, model=\"gpt-4.1\", temperature=0):\n",
        "    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\n",
        "    response = await client.chat.completions.create(\n",
        "        model=model,\n",
        "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
        "        temperature=temperature\n",
        "    )\n",
        "    return response.choices[0].message.content.strip()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "heGtIMOVcG1N"
      },
      "source": [
        "## Step 1: PageIndex Tree Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Mzd1VWjwMUJL"
      },
      "source": [
        "#### 1.1 Submit a document for generating PageIndex tree"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "f6--eZPLcG1N",
        "outputId": "ca688cfd-6c4b-4a57-dac2-f3c2604c4112"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Downloaded https://arxiv.org/pdf/2501.12948.pdf\n",
            "Document Submitted: pi-cmeseq08w00vt0bo3u6tr244g\n"
          ]
        }
      ],
      "source": [
        "import os, requests\n",
        "\n",
        "# You can also use our GitHub repo to generate PageIndex tree\n",
        "# https://github.com/VectifyAI/PageIndex\n",
        "\n",
        "pdf_url = \"https://arxiv.org/pdf/2501.12948.pdf\"\n",
        "pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
        "os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
        "\n",
        "response = requests.get(pdf_url)\n",
        "with open(pdf_path, \"wb\") as f:\n",
        "    f.write(response.content)\n",
        "print(f\"Downloaded {pdf_url}\")\n",
        "\n",
        "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
        "print('Document Submitted:', doc_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4-Hrh0azcG1N"
      },
      "source": [
        "#### 1.2 Get the generated PageIndex tree structure"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "b1Q1g6vrcG1O",
        "outputId": "dc944660-38ad-47ea-d358-be422edbae53"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Simplified Tree Structure of the Document:\n",
            "[{'title': 'DeepSeek-R1: Incentivizing Reasoning Cap...',\n",
            "  'node_id': '0000',\n",
            "  'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning C...',\n",
            "  'nodes': [{'title': 'Abstract',\n",
            "             'node_id': '0001',\n",
            "             'summary': 'The partial document introduces two reas...'},\n",
            "            {'title': 'Contents',\n",
            "             'node_id': '0002',\n",
            "             'summary': 'This partial document provides a detaile...'},\n",
            "            {'title': '1. Introduction',\n",
            "             'node_id': '0003',\n",
            "             'prefix_summary': 'The partial document introduces recent a...',\n",
            "             'nodes': [{'title': '1.1. Contributions',\n",
            "                        'node_id': '0004',\n",
            "                        'summary': 'This partial document outlines the main ...'},\n",
            "                       {'title': '1.2. Summary of Evaluation Results',\n",
            "                        'node_id': '0005',\n",
            "                        'summary': 'The partial document provides a summary ...'}]},\n",
            "            {'title': '2. Approach',\n",
            "             'node_id': '0006',\n",
            "             'prefix_summary': '## 2. Approach\\n',\n",
            "             'nodes': [{'title': '2.1. Overview',\n",
            "                        'node_id': '0007',\n",
            "                        'summary': '### 2.1. Overview\\n\\nPrevious work has hea...'},\n",
            "                       {'title': '2.2. DeepSeek-R1-Zero: Reinforcement Lea...',\n",
            "                        'node_id': '0008',\n",
            "                        'prefix_summary': '### 2.2. DeepSeek-R1-Zero: Reinforcement...',\n",
            "                        'nodes': [{'title': '2.2.1. Reinforcement Learning Algorithm',\n",
            "                                   'node_id': '0009',\n",
            "                                   'summary': 'The partial document describes the Group...'},\n",
            "                                  {'title': '2.2.2. Reward Modeling',\n",
            "                                   'node_id': '0010',\n",
            "                                   'summary': 'This partial document discusses the rewa...'},\n",
            "                                  {'title': '2.2.3. Training Template',\n",
            "                                   'node_id': '0011',\n",
            "                                   'summary': '#### 2.2.3. Training Template\\n\\nTo train ...'},\n",
            "                                  {'title': '2.2.4. Performance, Self-evolution Proce...',\n",
            "                                   'node_id': '0012',\n",
            "                                   'summary': 'This partial document discusses the perf...'}]},\n",
            "                       {'title': '2.3. DeepSeek-R1: Reinforcement Learning...',\n",
            "                        'node_id': '0013',\n",
            "                        'summary': 'This partial document describes the trai...'},\n",
            "                       {'title': '2.4. Distillation: Empower Small Models ...',\n",
            "                        'node_id': '0014',\n",
            "                        'summary': 'This partial document discusses the proc...'}]},\n",
            "            {'title': '3. Experiment',\n",
            "             'node_id': '0015',\n",
            "             'prefix_summary': 'The partial document describes the exper...',\n",
            "             'nodes': [{'title': '3.1. DeepSeek-R1 Evaluation',\n",
            "                        'node_id': '0016',\n",
            "                        'summary': 'This partial document presents a compreh...'},\n",
            "                       {'title': '3.2. Distilled Model Evaluation',\n",
            "                        'node_id': '0017',\n",
            "                        'summary': 'This partial document presents an evalua...'}]},\n",
            "            {'title': '4. Discussion',\n",
            "             'node_id': '0018',\n",
            "             'summary': 'This partial document discusses the comp...'},\n",
            "            {'title': '5. Conclusion, Limitations, and Future W...',\n",
            "             'node_id': '0019',\n",
            "             'summary': 'This partial document presents the concl...'},\n",
            "            {'title': 'References',\n",
            "             'node_id': '0020',\n",
            "             'summary': 'This partial document consists of the re...'},\n",
            "            {'title': 'Appendix', 'node_id': '0021', 'summary': '## Appendix\\n'},\n",
            "            {'title': 'A. Contributions and Acknowledgments',\n",
            "             'node_id': '0022',\n",
            "             'summary': 'This partial document section details th...'}]}]\n"
          ]
        }
      ],
      "source": [
        "if pi_client.is_retrieval_ready(doc_id):\n",
        "    tree = pi_client.get_tree(doc_id, node_summary=True)['result']\n",
        "    print('Simplified Tree Structure of the Document:')\n",
        "    utils.print_tree(tree)\n",
        "else:\n",
        "    print(\"Processing document, please try again later...\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "USoCLOiQcG1O"
      },
      "source": [
        "## Step 2: Reasoning-Based Retrieval with Tree Search"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 2.1 Use LLM for tree search and identify nodes that might contain relevant context"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "id": "LLHNJAtTcG1O"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "query = \"What are the conclusions in this document?\"\n",
        "\n",
        "tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])\n",
        "\n",
        "search_prompt = f\"\"\"\n",
        "You are given a question and a tree structure of a document.\n",
        "Each node contains a node id, node title, and a corresponding summary.\n",
        "Your task is to find all nodes that are likely to contain the answer to the question.\n",
        "\n",
        "Question: {query}\n",
        "\n",
        "Document tree structure:\n",
        "{json.dumps(tree_without_text, indent=2)}\n",
        "\n",
        "Please reply in the following JSON format:\n",
        "{{\n",
        "    \"thinking\": \"<Your thinking process on which nodes are relevant to the question>\",\n",
        "    \"node_list\": [\"node_id_1\", \"node_id_2\", ..., \"node_id_n\"]\n",
        "}}\n",
        "Directly return the final JSON structure. Do not output anything else.\n",
        "\"\"\"\n",
        "\n",
        "tree_search_result = await call_llm(search_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 2.2 Print retrieved nodes and reasoning process"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 57,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "P8DVUOuAen5u",
        "outputId": "6bb6d052-ef30-4716-f88e-be98bcb7ebdb"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Reasoning Process:\n",
            "The question asks for the conclusions in the document. Typically, conclusions are found in sections\n",
            "explicitly titled 'Conclusion' or in sections summarizing the findings and implications of the work.\n",
            "In this document tree, node 0019 ('5. Conclusion, Limitations, and Future Work') is the most\n",
            "directly relevant, as it is dedicated to the conclusion and related topics. Additionally, the\n",
            "'Abstract' (node 0001) may contain a high-level summary that sometimes includes concluding remarks,\n",
            "but it is less likely to contain the full conclusions. Other sections like 'Discussion' (node 0018)\n",
            "may discuss implications but are not explicitly conclusions. Therefore, the primary node is 0019.\n",
            "\n",
            "Retrieved Nodes:\n",
            "Node ID: 0019\t Page: 16\t Title: 5. Conclusion, Limitations, and Future Work\n"
          ]
        }
      ],
      "source": [
        "node_map = utils.create_node_mapping(tree)\n",
        "tree_search_result_json = json.loads(tree_search_result)\n",
        "\n",
        "print('Reasoning Process:')\n",
        "utils.print_wrapped(tree_search_result_json['thinking'])\n",
        "\n",
        "print('\\nRetrieved Nodes:')\n",
        "for node_id in tree_search_result_json[\"node_list\"]:\n",
        "    node = node_map[node_id]\n",
        "    print(f\"Node ID: {node['node_id']}\\t Page: {node['page_index']}\\t Title: {node['title']}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "10wOZDG_cG1O"
      },
      "source": [
        "## Step 3: Answer Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 3.1 Extract relevant context from retrieved nodes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 58,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 279
        },
        "id": "a7UCBnXlcG1O",
        "outputId": "8a026ea3-4ef3-473a-a57b-b4565409749e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Retrieved Context:\n",
            "\n",
            "## 5. Conclusion, Limitations, and Future Work\n",
            "\n",
            "In this work, we share our journey in enhancing model reasoning abilities through reinforcement\n",
            "learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data,\n",
            "achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-\n",
            "start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance\n",
            "comparable to OpenAI-o1-1217 on a range of tasks.\n",
            "\n",
            "We further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1\n",
            "as the teacher model to generate 800K training samples, and fine-tune several small dense models.\n",
            "The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on\n",
            "math benchmarks with $28.9 \\%$ on AIME and $83.9 \\%$ on MATH. Other dense models also achieve\n",
            "impressive results, significantly outperforming other instructiontuned models based on the same\n",
            "underlying checkpoints.\n",
            "\n",
            "In the fut...\n"
          ]
        }
      ],
      "source": [
        "node_list = tree_search_result_json[\"node_list\"]  # reuse the result parsed in step 2.2\n",
        "relevant_content = \"\\n\\n\".join(node_map[node_id][\"text\"] for node_id in node_list)\n",
        "\n",
        "print('Retrieved Context:\\n')\n",
        "utils.print_wrapped(relevant_content[:1000] + '...')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 3.2 Generate answer based on retrieved context"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 59,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 210
        },
        "id": "tcp_PhHzcG1O",
        "outputId": "187ff116-9bb0-4ab4-bacb-13944460b5ff"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Generated Answer:\n",
            "\n",
            "The conclusions in this document are:\n",
            "\n",
            "- DeepSeek-R1-Zero, a pure reinforcement learning (RL) approach without cold-start data, achieves\n",
            "strong performance across various tasks.\n",
            "- DeepSeek-R1, which combines cold-start data with iterative RL fine-tuning, is more powerful and\n",
            "achieves performance comparable to OpenAI-o1-1217 on a range of tasks.\n",
            "- Distilling DeepSeek-R1’s reasoning capabilities into smaller dense models is promising; for\n",
            "example, DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks,\n",
            "and other dense models also show significant improvements over similar instruction-tuned models.\n",
            "\n",
            "These results demonstrate the effectiveness of the RL-based approach and the potential for\n",
            "distilling reasoning abilities into smaller models.\n"
          ]
        }
      ],
      "source": [
        "answer_prompt = f\"\"\"\n",
        "Answer the question based on the context:\n",
        "\n",
        "Question: {query}\n",
        "Context: {relevant_content}\n",
        "\n",
        "Provide a clear, concise answer based only on the context provided.\n",
        "\"\"\"\n",
        "\n",
        "print('Generated Answer:\\n')\n",
        "answer = await call_llm(answer_prompt)\n",
        "utils.print_wrapped(answer)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_1kaGD3GcG1O"
      },
      "source": [
        "---\n",
        "\n",
        "## 🎯 What's Next\n",
        "\n",
        "This notebook has demonstrated a **basic**, **minimal** example of **reasoning-based**, **vectorless** RAG with PageIndex. The workflow illustrates the core idea:\n",
        "> *Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context, without relying on a vector database or top-k similarity search*.\n",
        "\n",
        "While this notebook highlights a minimal workflow, the PageIndex framework is built to support **far more advanced** use cases. In upcoming tutorials, we will introduce:\n",
        "* **Multi-Node Reasoning with Content Extraction** — Scale tree search to extract and select relevant content from multiple nodes.\n",
        "* **Multi-Document Search** — Enable reasoning-based navigation across large document collections, extending beyond a single file.\n",
        "* **Efficient Tree Search** — Improve tree search efficiency for long documents with a large number of nodes.\n",
        "* **Expert Knowledge Integration and Preference Alignment** — Incorporate user preferences or expert insights by adding knowledge directly into the LLM tree search, without the need for fine-tuning.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 🔎 Learn More About PageIndex\n",
        "  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://dash.pageindex.ai\">🖥️ Dashboard</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API Docs</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://github.com/VectifyAI/PageIndex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>\n",
        "\n",
        "<br>\n",
        "\n",
        "© 2025 [Vectify AI](https://vectify.ai)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}


================================================
FILE: cookbook/vision_RAG_pageindex.ipynb
================================================
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TCh9BTedHJK1"
      },
      "source": [
        "![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nD0hb4TFHWTt"
      },
      "source": [
        "<div align=\"center\">\n",
        "<p><i>Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</i></p>\n",
        "</div>\n",
        "\n",
        "<div align=\"center\">\n",
        "<p>\n",
        "  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://chat.pageindex.ai\">💻 Chat</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://pageindex.ai/mcp\">🔌 MCP</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://docs.pageindex.ai/quickstart\">📚 API</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://github.com/VectifyAI/PageIndex\">📦 GitHub</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n",
        "  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n",
        "</p>\n",
        "</div>\n",
        "\n",
        "<div align=\"center\">\n",
        "\n",
        "[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\n",
        "\n",
        "</div>\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> Check out our blog post, \"[Do We Still Need OCR?](https://pageindex.ai/blog/do-we-need-ocr)\", for a more detailed discussion."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ebvn5qfpcG1K"
      },
      "source": [
        "# A Vision-based, Vectorless RAG System for Long Documents\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In modern document question answering (QA) systems, Optical Character Recognition (OCR) plays an important role by converting PDF pages into text that Large Language Models (LLMs) can process. The resulting text provides the contextual input that lets LLMs answer questions about the document content.\n",
        "\n",
        "Traditional OCR systems typically use a two-stage process: they first detect the layout of a PDF, dividing it into text, tables, and images, and then recognize and convert these elements into plain text. With the rise of vision-language models (VLMs) such as [Qwen-VL](https://github.com/QwenLM/Qwen3-VL) and [GPT-4.1](https://openai.com/index/gpt-4-1/), new end-to-end OCR models like [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout detection step.\n",
        "\n",
        "However, this paradigm shift raises an important question:\n",
        "\n",
        "> **If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?**\n",
        "\n",
        "In this notebook, we present a practical implementation of a vision-based question-answering system for long documents that does not rely on OCR. Specifically, we use PageIndex as a reasoning-based retrieval layer and OpenAI's multimodal GPT-4.1 as the VLM for visual reasoning and answer generation.\n",
        "\n",
        "See the original [blog post](https://pageindex.ai/blog/do-we-need-ocr) for a more detailed discussion on how VLMs can replace traditional OCR pipelines in document question-answering."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 📝 Notebook Overview\n",
        "\n",
        "This notebook demonstrates a *minimal*, **vision-based vectorless RAG** pipeline for long documents with PageIndex, using only visual context from PDF pages. You will learn how to:\n",
        "- [x] Build a PageIndex tree structure of a document\n",
        "- [x] Perform reasoning-based retrieval with tree search\n",
        "- [x] Extract PDF page images of retrieved tree nodes for visual context\n",
        "- [x] Generate answers using VLM with PDF image inputs only (no OCR required)\n",
        "\n",
        "> ⚡ Note: This example uses PageIndex's reasoning-based retrieval with OpenAI's multimodal GPT-4.1 model for both tree search and visual context reasoning.\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7ziuTbbWcG1L"
      },
      "source": [
        "## Step 0: Preparation\n",
        "\n",
        "This notebook demonstrates **Vision-based RAG** with PageIndex, using PDF page images as visual context for retrieval and answer generation.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "edTfrizMFK4c"
      },
      "source": [
        "#### 0.1 Install PageIndex"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "id": "LaoB58wQFNDh"
      },
      "outputs": [],
      "source": [
        "%pip install -q --upgrade pageindex requests openai PyMuPDF"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WVEWzPKGcG1M"
      },
      "source": [
        "#### 0.2 Set up PageIndex"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "StvqfcK4cG1M"
      },
      "outputs": [],
      "source": [
        "from pageindex import PageIndexClient\n",
        "import pageindex.utils as utils\n",
        "\n",
        "# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\n",
        "PAGEINDEX_API_KEY = \"YOUR_PAGEINDEX_API_KEY\"\n",
        "pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 0.3 Set up VLM\n",
        "\n",
        "Choose your preferred VLM. In this notebook, we use OpenAI's multimodal GPT-4.1."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import openai, fitz, base64, os\n",
        "\n",
        "# Set up the OpenAI client\n",
        "OPENAI_API_KEY = \"YOUR_OPENAI_API_KEY\"\n",
        "\n",
        "async def call_vlm(prompt, image_paths=None, model=\"gpt-4.1\"):\n",
        "    # Send a text prompt, plus optional page images, to the VLM and return its reply\n",
        "    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\n",
        "    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
        "    if image_paths:\n",
        "        # Attach each image that exists on disk as a base64-encoded JPEG data URL\n",
        "        content = [{\"type\": \"text\", \"text\": prompt}]\n",
        "        for image in image_paths:\n",
        "            if os.path.exists(image):\n",
        "                with open(image, \"rb\") as image_file:\n",
        "                    image_data = base64.b64encode(image_file.read()).decode('utf-8')\n",
        "                    content.append({\n",
        "                        \"type\": \"image_url\",\n",
        "                        \"image_url\": {\n",
        "                            \"url\": f\"data:image/jpeg;base64,{image_data}\"\n",
        "                        }\n",
        "                    })\n",
        "        messages[0][\"content\"] = content\n",
        "    response = await client.chat.completions.create(model=model, messages=messages, temperature=0)\n",
        "    return response.choices[0].message.content.strip()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 0.4 PDF Image Extraction Helper Functions\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def extract_pdf_page_images(pdf_path, output_dir=\"pdf_images\"):\n",
        "    os.makedirs(output_dir, exist_ok=True)\n",
        "    pdf_document = fitz.open(pdf_path)\n",
        "    page_images = {}\n",
        "    total_pages = len(pdf_document)\n",
        "    for page_number in range(total_pages):\n",
        "        page = pdf_document.load_page(page_number)\n",
        "        # Convert page to image\n",
        "        mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for better quality\n",
        "        pix = page.get_pixmap(matrix=mat)\n",
        "        img_data = pix.tobytes(\"jpeg\")\n",
        "        image_path = os.path.join(output_dir, f\"page_{page_number + 1}.jpg\")\n",
        "        with open(image_path, \"wb\") as image_file:\n",
        "            image_file.write(img_data)\n",
        "        page_images[page_number + 1] = image_path\n",
        "        print(f\"Saved page {page_number + 1} image: {image_path}\")\n",
        "    pdf_document.close()\n",
        "    return page_images, total_pages\n",
        "\n",
        "def get_page_images_for_nodes(node_list, node_map, page_images):\n",
        "    # Get PDF page images for retrieved nodes\n",
        "    image_paths = []\n",
        "    seen_pages = set()\n",
        "    for node_id in node_list:\n",
        "        node_info = node_map[node_id]\n",
        "        for page_num in range(node_info['start_index'], node_info['end_index'] + 1):\n",
        "            if page_num not in seen_pages:\n",
        "                image_paths.append(page_images[page_num])\n",
        "                seen_pages.add(page_num)\n",
        "    return image_paths\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "heGtIMOVcG1N"
      },
      "source": [
        "## Step 1: PageIndex Tree Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Mzd1VWjwMUJL"
      },
      "source": [
        "#### 1.1 Submit a document to generate a PageIndex tree"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "f6--eZPLcG1N",
        "outputId": "ca688cfd-6c4b-4a57-dac2-f3c2604c4112"
      },
      "outputs": [],
      "source": [
        "import os, requests\n",
        "\n",
        "# You can also use our GitHub repo to generate a PageIndex tree\n",
        "# https://github.com/VectifyAI/PageIndex\n",
        "\n",
        "pdf_url = \"https://arxiv.org/pdf/1706.03762.pdf\"  # the \"Attention Is All You Need\" paper\n",
        "pdf_path = os.path.join(\"../data\", pdf_url.split('/')[-1])\n",
        "os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\n",
        "\n",
        "response = requests.get(pdf_url)\n",
        "response.raise_for_status()  # fail early instead of saving an error page as the PDF\n",
        "with open(pdf_path, \"wb\") as f:\n",
        "    f.write(response.content)\n",
        "print(f\"Downloaded {pdf_url}\\n\")\n",
        "\n",
        "# Extract page images from PDF\n",
        "print(\"Extracting page images...\")\n",
        "page_images, total_pages = extract_pdf_page_images(pdf_path)\n",
        "print(f\"Extracted {len(page_images)} page images from {total_pages} total pages.\\n\")\n",
        "\n",
        "doc_id = pi_client.submit_document(pdf_path)[\"doc_id\"]\n",
        "print('Document Submitted:', doc_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4-Hrh0azcG1N"
      },
      "source": [
        "#### 1.2 Get the generated PageIndex tree structure"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 65,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "b1Q1g6vrcG1O",
        "outputId": "dc944660-38ad-47ea-d358-be422edbae53"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Simplified Tree Structure of the Document:\n",
            "[{'title': 'Attention Is All You Need',\n",
            "  'node_id': '0000',\n",
            "  'page_index': 1,\n",
            "  'prefix_summary': '# Attention Is All You Need\\n\\nAshish Vasw...',\n",
            "  'nodes': [{'title': 'Abstract',\n",
            "             'node_id': '0001',\n",
            "             'page_index': 1,\n",
            "             'summary': 'The text introduces the Transformer, a n...'},\n",
            "            {'title': '1 Introduction',\n",
            "             'node_id': '0002',\n",
            "             'page_index': 2,\n",
            "             'summary': 'The text introduces the Transformer, a n...'},\n",
            "            {'title': '2 Background',\n",
            "             'node_id': '0003',\n",
            "             'page_index': 2,\n",
            "             'summary': 'This section discusses the Transformer m...'},\n",
            "            {'title': '3 Model Architecture',\n",
            "             'node_id': '0004',\n",
            "             'page_index': 2,\n",
            "             'prefix_summary': 'The text describes the encoder-decoder a...',\n",
            "             'nodes': [{'title': '3.1 Encoder and Decoder Stacks',\n",
            "                        'node_id': '0005',\n",
            "                        'page_index': 3,\n",
            "                        'summary': 'The text describes the encoder and decod...'},\n",
            "                       {'title': '3.2 Attention',\n",
            "                        'node_id': '0006',\n",
            "                        'page_index': 3,\n",
            "                        'prefix_summary': '### 3.2 Attention\\n\\nAn attention function...',\n",
            "                        'nodes': [{'title': '3.2.1 Scaled Dot-Product Attention',\n",
            "                                   'node_id': '0007',\n",
            "                                   'page_index': 4,\n",
            "                                   'summary': 'The text describes Scaled Dot-Product At...'},\n",
            "                                  {'title': '3.2.2 Multi-Head Attention',\n",
            "                                   'node_id': '0008',\n",
            "                                   'page_index': 4,\n",
            "                                   'summary': 'The text describes Multi-Head Attention,...'},\n",
            "                                  {'title': '3.2.3 Applications of Attention in our M...',\n",
            "                                   'node_id': '0009',\n",
            "                                   'page_index': 5,\n",
            "                                   'summary': 'The text describes the three application...'}]},\n",
            "                       {'title': '3.3 Position-wise Feed-Forward Networks',\n",
            "                        'node_id': '0010',\n",
            "                        'page_index': 5,\n",
            "                        'summary': '### 3.3 Position-wise Feed-Forward Netwo...'},\n",
            "                       {'title': '3.4 Embeddings and Softmax',\n",
            "                        'node_id': '0011',\n",
            "                        'page_index': 5,\n",
            "                        'summary': 'The text describes the use of learned em...'},\n",
            "                       {'title': '3.5 Positional Encoding',\n",
            "                        'node_id': '0012',\n",
            "                        'page_index': 6,\n",
            "                        'summary': 'This section explains the necessity of p...'}]},\n",
            "            {'title': '4 Why Self-Attention',\n",
            "             'node_id': '0013',\n",
            "             'page_index': 6,\n",
            "             'summary': 'This text compares self-attention layers...'},\n",
            "            {'title': '5 Training',\n",
            "             'node_id': '0014',\n",
            "             'page_index': 7,\n",
            "             'prefix_summary': '## 5 Training\\n\\nThis section describes th...',\n",
            "             'nodes': [{'title': '5.1 Training Data and Batching',\n",
            "                        'node_id': '0015',\n",
            "                        'page_index': 7,\n",
            "                        'summary': '### 5.1 Training Data and Batching\\n\\nWe t...'},\n",
            "                       {'title': '5.2 Hardware and Schedule',\n",
            "                        'node_id': '0016',\n",
            "                        'page_index': 7,\n",
            "                        'summary': '### 5.2 Hardware and Schedule\\n\\nWe traine...'},\n",
            "                       {'title': '5.3 Optimizer',\n",
            "                        'node_id': '0017',\n",
            "                        'page_index': 7,\n",
            "                        'summary': '### 5.3 Optimizer\\n\\nWe used the Adam opti...'},\n",
            "                       {'title': '5.4 Regularization',\n",
            "                        'node_id': '0018',\n",
            "                        'page_index': 7,\n",
            "                        'summary': 'The text details three regularization te...'}]},\n",
            "            {'title': '6 Results',\n",
            "             'node_id': '0019',\n",
            "             'page_index': 8,\n",
            "             'prefix_summary': '## 6 Results\\n',\n",
            "             'nodes': [{'title': '6.1 Machine Translation',\n",
            "                        'node_id': '0020',\n",
            "                        'page_index': 8,\n",
            "                        'summary': 'The text details the performance of a Tr...'},\n",
            "                       {'title': '6.2 Model Variations',\n",
            "                        'node_id': '0021',\n",
            "                        'page_index': 8,\n",
            "                        'summary': 'This text details experiments varying co...'},\n",
            "                       {'title': '6.3 English Constituency Parsing',\n",
            "                        'node_id': '0022',\n",
            "                        'page_index': 9,\n",
            "                        'summary': 'The text describes experiments evaluatin...'}]},\n",
            "            {'title': '7 Conclusion',\n",
            "             'node_id': '0023',\n",
            "             'page_index': 10,\n",
            "             'summary': 'This text concludes by presenting the Tr...'},\n",
            "            {'title': 'References',\n",
            "             'node_id': '0024',\n",
            "             'page_index': 10,\n",
            "             'summary': 'The provided text is a collection of ref...'},\n",
            "            {'title': 'Attention Visualizations',\n",
            "             'node_id': '0025',\n",
            "             'page_index': 13,\n",
            "             'summary': 'The text provides examples of attention ...'}]}]\n"
          ]
        }
      ],
      "source": [
        "if pi_client.is_retrieval_ready(doc_id):\n",
        "    tree = pi_client.get_tree(doc_id, node_summary=True)['result']\n",
        "    print('Simplified Tree Structure of the Document:')\n",
        "    utils.print_tree(tree, exclude_fields=['text'])\n",
        "else:\n",
        "    print(\"Processing document, please try again later...\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "USoCLOiQcG1O"
      },
      "source": [
        "## Step 2: Reasoning-Based Retrieval with Tree Search"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 2.1 Reasoning-based retrieval with PageIndex to identify nodes that might contain relevant context"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LLHNJAtTcG1O"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "query = \"What is the last operation in the Scaled Dot-Product Attention figure?\"\n",
        "\n",
        "tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])\n",
        "\n",
        "search_prompt = f\"\"\"\n",
        "You are given a question and a tree structure of a document.\n",
        "Each node contains a node id, node title, and a corresponding summary.\n",
        "Your task is to find all tree nodes that are likely to contain the answer to the question.\n",
        "\n",
        "Question: {query}\n",
        "\n",
        "Document tree structure:\n",
        "{json.dumps(tree_without_text, indent=2)}\n",
        "\n",
        "Please reply in the following JSON format:\n",
        "{{\n",
        "    \"thinking\": \"<Your thinking process on which nodes are relevant to the question>\",\n",
        "    \"node_list\": [\"node_id_1\", \"node_id_2\", ..., \"node_id_n\"]\n",
        "}}\n",
        "Directly return the final JSON structure. Do not output anything else.\n",
        "\"\"\"\n",
        "\n",
        "tree_search_result = await call_vlm(search_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 2.2 Print retrieved nodes and reasoning process"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 87,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "P8DVUOuAen5u",
        "outputId": "6bb6d052-ef30-4716-f88e-be98bcb7ebdb"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Reasoning Process:\n",
            "\n",
            "The question asks about the last operation in the Scaled Dot-Product Attention figure. The most\n",
            "relevant section is the one that describes Scaled Dot-Product Attention in detail, including its\n",
            "computation and the figure itself. This is likely found in section 3.2.1 'Scaled Dot-Product\n",
            "Attention' (node_id: 0007), which is a subsection of 3.2 'Attention' (node_id: 0006). The parent\n",
            "section 3.2 may also contain the figure and its caption, as the summary mentions Figure 2 (which is\n",
            "the Scaled Dot-Product Attention figure). Therefore, both node 0006 and node 0007 are likely to\n",
            "contain the answer.\n",
            "\n",
            "Retrieved Nodes:\n",
            "\n",
            "Node ID: 0006\t Pages: 3-4\t Title: 3.2 Attention\n",
            "Node ID: 0007\t Pages: 4\t Title: 3.2.1 Scaled Dot-Product Attention\n"
          ]
        }
      ],
      "source": [
        "node_map = utils.create_node_mapping(tree, include_page_ranges=True, max_page=total_pages)\n",
        "tree_search_result_json = json.loads(tree_search_result)\n",
        "\n",
        "print('Reasoning Process:\\n')\n",
        "utils.print_wrapped(tree_search_result_json['thinking'])\n",
        "\n",
        "print('\\nRetrieved Nodes:\\n')\n",
        "for node_id in tree_search_result_json[\"node_list\"]:\n",
        "    node_info = node_map[node_id]\n",
        "    node = node_info['node']\n",
        "    start_page = node_info['start_index']\n",
        "    end_page = node_info['end_index']\n",
        "    page_range = start_page if start_page == end_page else f\"{start_page}-{end_page}\"\n",
        "    print(f\"Node ID: {node['node_id']}\\t Pages: {page_range}\\t Title: {node['title']}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 2.3 Get corresponding PDF page images of retrieved nodes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 81,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "Retrieved 2 PDF page image(s) for visual context.\n"
          ]
        }
      ],
      "source": [
        "retrieved_nodes = tree_search_result_json[\"node_list\"]\n",
        "retrieved_page_images = get_page_images_for_nodes(retrieved_nodes, node_map, page_images)\n",
        "print(f'\\nRetrieved {len(retrieved_page_images)} PDF page image(s) for visual context.')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "10wOZDG_cG1O"
      },
      "source": [
        "## Step 3: Answer Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### 3.1 Generate answer using VLM with visual context"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 210
        },
        "id": "tcp_PhHzcG1O",
        "outputId": "187ff116-9bb0-4ab4-bacb-13944460b5ff"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Generated answer using VLM with retrieved PDF page images as visual context:\n",
            "\n",
            "The last operation in the **Scaled Dot-Product Attention** figure is a **MatMul** (matrix\n",
            "multiplication). This operation multiplies the attention weights (after softmax) by the value matrix\n",
            "\\( V \\).\n"
          ]
        }
      ],
      "source": [
        "# Generate answer using VLM with only PDF page images as visual context\n",
        "answer_prompt = f\"\"\"\n",
        "Answer the question based on the images of the document pages as context.\n",
        "\n",
        "Question: {query}\n",
        "\n",
        "Provide a clear, concise answer based only on the context provided.\n",
        "\"\"\"\n",
        "\n",
        "print('Generated answer using VLM with retrieved PDF page images as visual context:\\n')\n",
        "answer = await call_vlm(answer_prompt, retrieved_page_images)\n",
        "utils.print_wrapped(answer)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusion\n",
        "\n",
        "In this notebook, we demonstrated a *minimal* **vision-based, vectorless RAG pipeline** using PageIndex and a VLM. The system retrieves relevant pages by reasoning over the document’s hierarchical tree index and answers questions directly from PDF images — no OCR required.\n",
        "\n",
        "If you’re interested in building your own **reasoning-based document QA system**, try [PageIndex Chat](https://chat.pageindex.ai), or integrate via [PageIndex MCP](https://pageindex.ai/mcp) and the [API](https://docs.pageindex.ai/quickstart). You can also explore the [GitHub repo](https://github.com/VectifyAI/PageIndex) for open-source implementations and additional examples."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n",
        "\n",
        "© 2025 [Vectify AI](https://vectify.ai)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}


================================================
FILE: pageindex/__init__.py
================================================
from .page_index import *
from .page_index_md import md_to_tree

================================================
FILE: pageindex/config.yaml
================================================
model: "gpt-4o-2024-11-20"
toc_check_page_num: 20
max_page_num_each_node: 10
max_token_num_each_node: 20000
if_add_node_id: "yes"
if_add_node_summary: "yes"
if_add_doc_description: "no"
if_add_node_text: "no"

================================================
FILE: pageindex/page_index.py
================================================
import os
import json
import copy
import math
import random
import re
from .utils import *
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed


################### check title in page #########################################################
async def check_title_appearance(item, page_list, start_index=1, model=None):    
    title=item['title']
    if 'physical_index' not in item or item['physical_index'] is None:
        return {'list_index': item.get('list_index'), 'answer': 'no', 'title':title, 'page_number': None}
    
    
    page_number = item['physical_index']
    page_text = page_list[page_number-start_index][0]

    
    prompt = f"""
    Your job is to check if the given section appears or starts in the given page_text.

    Note: do fuzzy matching, ignore any space inconsistency in the page_text.

    The given section title is {title}.
    The given page_text is {page_text}.
    
    Reply format:
    {{
        "thinking": <why do you think the section appears or starts in the page_text>,
        "answer": "yes or no" (yes if the section appears or starts in the page_text, no otherwise)
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = await ChatGPT_API_async(model=model, prompt=prompt)
    response = extract_json(response)
    if 'answer' in response:
        answer = response['answer']
    else:
        answer = 'no'
    return {'list_index': item.get('list_index'), 'answer': answer, 'title': title, 'page_number': page_number}


async def check_title_appearance_in_start(title, page_text, model=None, logger=None):    
    prompt = f"""
    You will be given the current section title and the current page_text.
    Your job is to check if the current section starts at the beginning of the given page_text.
    If there is other content before the current section title, then the current section does not start at the beginning of the given page_text.
    If the current section title is the first content in the given page_text, then the current section starts at the beginning of the given page_text.

    Note: do fuzzy matching, ignore any space inconsistency in the page_text.

    The given section title is {title}.
    The given page_text is {page_text}.
    
    reply format:
    {{
        "thinking": <why do you think the section appears or starts in the page_text>
        "start_begin": "yes or no" (yes if the section starts in the beginning of the page_text, no otherwise)
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = await ChatGPT_API_async(model=model, prompt=prompt)
    response = extract_json(response)
    if logger:
        logger.info(f"Response: {response}")
    return response.get("start_begin", "no")


async def check_title_appearance_in_start_concurrent(structure, page_list, model=None, logger=None):
    if logger:
        logger.info("Checking title appearance in start concurrently")
    
    # skip items without physical_index
    for item in structure:
        if item.get('physical_index') is None:
            item['appear_start'] = 'no'

    # only for items with valid physical_index
    tasks = []
    valid_items = []
    for item in structure:
        if item.get('physical_index') is not None:
            page_text = page_list[item['physical_index'] - 1][0]
            tasks.append(check_title_appearance_in_start(item['title'], page_text, model=model, logger=logger))
            valid_items.append(item)

    results = await asyncio.gather(*tasks, return_exceptions=True)
    for item, result in zip(valid_items, results):
        if isinstance(result, Exception):
            if logger:
                logger.error(f"Error checking start for {item['title']}: {result}")
            item['appear_start'] = 'no'
        else:
            item['appear_start'] = result

    return structure


def toc_detector_single_page(content, model=None):
    prompt = f"""
    Your job is to detect if there is a table of contents provided in the given text.

    Given text: {content}

    Return the following JSON format:
    {{
        "thinking": <why do you think there is a table of contents in the given text>,
        "toc_detected": "<yes or no>"
    }}

    Directly return the final JSON structure. Do not output anything else.
    Please note: abstract, summary, notation list, figure list, table list, etc. are not tables of contents."""

    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    # Treat a missing answer as "no" instead of raising KeyError.
    return json_content.get('toc_detected', 'no')


def check_if_toc_extraction_is_complete(content, toc, model=None):
    prompt = f"""
    You are given a partial document and a table of contents.
    Your job is to check if the table of contents is complete, i.e. whether it contains all the main sections in the partial document.

    Reply format:
    {{
        "thinking": <why do you think the table of contents is complete or not>
        "completed": "yes" or "no"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\n Document:\n' + content + '\n Table of contents:\n' + toc
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['completed']


def check_if_toc_transformation_is_complete(content, toc, model=None):
    prompt = f"""
    You are given a raw table of contents and a cleaned table of contents.
    Your job is to check if the cleaned table of contents is complete.

    Reply format:
    {{
        "thinking": <why do you think the cleaned table of contents is complete or not>
        "completed": "yes" or "no"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\n Raw Table of contents:\n' + content + '\n Cleaned Table of contents:\n' + toc
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['completed']

def extract_toc_content(content, model=None):
    prompt = f"""
    Your job is to extract the full table of contents from the given text, replace ... with :

    Given text: {content}

    Directly return the full table of contents content. Do not output anything else."""

    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)
    
    if_complete = check_if_toc_transformation_is_complete(content, response, model)
    if if_complete == "yes" and finish_reason == "finished":
        return response
    
    # Keep the original extraction prompt in the chat history of every
    # continuation request; the continuation instruction is sent separately.
    continue_prompt = "please continue the generation of table of contents , directly output the remaining part of the structure"
    chat_history = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=continue_prompt, chat_history=chat_history)
    response = response + new_response
    if_complete = check_if_toc_transformation_is_complete(content, response, model)
    
    attempt = 0
    max_attempts = 5

    while not (if_complete == "yes" and finish_reason == "finished"):
        attempt += 1
        if attempt > max_attempts:
            raise Exception('Failed to complete table of contents after maximum retries')

        chat_history = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=continue_prompt, chat_history=chat_history)
        response = response + new_response
        if_complete = check_if_toc_transformation_is_complete(content, response, model)
    
    return response

def detect_page_index(toc_content, model=None):
    print('start detect_page_index')
    prompt = f"""
    You will be given a table of contents.

    Your job is to detect if there are page numbers/indices given within the table of contents.

    Given text: {toc_content}

    Reply format:
    {{
        "thinking": <why do you think there are page numbers/indices given within the table of contents>
        "page_index_given_in_toc": "<yes or no>"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['page_index_given_in_toc']

def toc_extractor(page_list, toc_page_list, model):
    def transform_dots_to_colon(text):
        text = re.sub(r'\.{5,}', ': ', text)
        # Handle dots separated by spaces
        text = re.sub(r'(?:\. ){5,}\.?', ': ', text)
        return text
    
    toc_content = ""
    for page_index in toc_page_list:
        toc_content += page_list[page_index][0]
    toc_content = transform_dots_to_colon(toc_content)
    has_page_index = detect_page_index(toc_content, model=model)
    
    return {
        "toc_content": toc_content,
        "page_index_given_in_toc": has_page_index
    }




def toc_index_extractor(toc, content, model=None):
    print('start toc_index_extractor')
    toc_extractor_prompt = """
    You are given a table of contents in a json format and several pages of a document, your job is to add the physical_index to the table of contents in the json format.

    The provided pages contain tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X.

    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    The response should be in the following JSON format: 
    [
        {
            "structure": <structure index, "x.x.x" or None> (string),
            "title": <title of the section>,
            "physical_index": "<physical_index_X>" (keep the format)
        },
        ...
    ]

    Only add the physical_index to the sections that are in the provided pages.
    If the section is not in the provided pages, do not add the physical_index to it.
    Directly return the final JSON structure. Do not output anything else."""

    prompt = toc_extractor_prompt + '\nTable of contents:\n' + str(toc) + '\nDocument pages:\n' + content
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)    
    return json_content



def toc_transformer(toc_content, model=None):
    print('start toc_transformer')
    init_prompt = """
    You are given a table of contents. Your job is to transform the whole table of contents into a JSON format that includes table_of_contents.

    structure is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    The response should be in the following JSON format: 
    {
    "table_of_contents": [
        {
            "structure": <structure index, "x.x.x" or None> (string),
            "title": <title of the section>,
            "page": <page number or None>
        },
        ...
        ]
    }
    You should transform the full table of contents in one go.
    Directly return the final JSON structure, do not output anything else. """

    prompt = init_prompt + '\n Given table of contents\n:' + toc_content
    last_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)
    if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
    if if_complete == "yes" and finish_reason == "finished":
        last_complete = extract_json(last_complete)
        cleaned_response=convert_page_to_int(last_complete['table_of_contents'])
        return cleaned_response
    
    last_complete = get_json_content(last_complete)
    while not (if_complete == "yes" and finish_reason == "finished"):
        position = last_complete.rfind('}')
        if position != -1:
            last_complete = last_complete[:position+2]
        prompt = f"""
        Your task is to continue the table of contents json structure, directly output the remaining part of the json structure.

        The raw table of contents json structure is:
        {toc_content}

        The incomplete transformed table of contents json structure is:
        {last_complete}

        Please continue the json structure, directly output the remaining part of the json structure."""

        new_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)

        # Strip a Markdown code fence if present, then always append the
        # continuation so unfenced responses also make progress.
        if new_complete.startswith('```json'):
            new_complete = get_json_content(new_complete)
        last_complete = last_complete + new_complete

        if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
        

    last_complete = json.loads(last_complete)

    cleaned_response=convert_page_to_int(last_complete['table_of_contents'])
    return cleaned_response
    



def find_toc_pages(start_page_index, page_list, opt, logger=None):
    print('start find_toc_pages')
    last_page_is_yes = False
    toc_page_list = []
    i = start_page_index
    
    while i < len(page_list):
        # Only check beyond max_pages if we're still finding TOC pages
        if i >= opt.toc_check_page_num and not last_page_is_yes:
            break
        detected_result = toc_detector_single_page(page_list[i][0],model=opt.model)
        if detected_result == 'yes':
            if logger:
                logger.info(f'Page {i} has toc')
            toc_page_list.append(i)
            last_page_is_yes = True
        elif detected_result == 'no' and last_page_is_yes:
            if logger:
                logger.info(f'Found the last page with toc: {i-1}')
            break
        i += 1
    
    if not toc_page_list and logger:
        logger.info('No toc found')
        
    return toc_page_list

def remove_page_number(data):
    if isinstance(data, dict):
        # TOC items produced by toc_transformer store the number under 'page'.
        data.pop('page_number', None)
        data.pop('page', None)
        for key in list(data.keys()):
            if 'nodes' in key:
                remove_page_number(data[key])
    elif isinstance(data, list):
        for item in data:
            remove_page_number(item)
    return data

def extract_matching_page_pairs(toc_page, toc_physical_index, start_page_index):
    pairs = []
    for phy_item in toc_physical_index:
        for page_item in toc_page:
            if phy_item.get('title') == page_item.get('title'):
                physical_index = phy_item.get('physical_index')
                if physical_index is not None and int(physical_index) >= start_page_index:
                    pairs.append({
                        'title': phy_item.get('title'),
                        'page': page_item.get('page'),
                        'physical_index': physical_index
                    })
    return pairs


def calculate_page_offset(pairs):
    differences = []
    for pair in pairs:
        try:
            physical_index = pair['physical_index']
            page_number = pair['page']
            difference = physical_index - page_number
            differences.append(difference)
        except (KeyError, TypeError):
            continue
    
    if not differences:
        return None
    
    difference_counts = {}
    for diff in differences:
        difference_counts[diff] = difference_counts.get(diff, 0) + 1
    
    most_common = max(difference_counts.items(), key=lambda x: x[1])[0]
    
    return most_common

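def add_page_offset_to_toc_json(data, offset):
    # calculate_page_offset returns None when no pair could be matched;
    # leave the items untouched in that case instead of raising TypeError.
    if offset is None:
        return data

    for item in data:
        if isinstance(item.get('page'), int):
            item['physical_index'] = item['page'] + offset
            del item['page']
    
    return data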



def page_list_to_group_text(page_contents, token_lengths, max_tokens=20000, overlap_page=1):    
    num_tokens = sum(token_lengths)
    
    if num_tokens <= max_tokens:
        # merge all pages into one text
        page_text = "".join(page_contents)
        return [page_text]
    
    subsets = []
    current_subset = []
    current_token_count = 0

    expected_parts_num = math.ceil(num_tokens / max_tokens)
    average_tokens_per_part = math.ceil(((num_tokens / expected_parts_num) + max_tokens) / 2)
    
    for i, (page_content, page_tokens) in enumerate(zip(page_contents, token_lengths)):
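        # Only close a group that already holds pages; otherwise an oversized
        # first page would emit an empty group.
        if current_subset and current_token_count + page_tokens > average_tokens_per_part: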
            subsets.append(''.join(current_subset))
            # Start new subset from overlap if specified
            overlap_start = max(i - overlap_page, 0)
            current_subset = page_contents[overlap_start:i]
            current_token_count = sum(token_lengths[overlap_start:i])
        
        # Add current page to the subset
        current_subset.append(page_content)
        current_token_count += page_tokens

    # Add the last subset if it contains any pages
    if current_subset:
        subsets.append(''.join(current_subset))
    
    print('divide page_list to groups', len(subsets))
    return subsets

def add_page_number_to_toc(part, structure, model=None):
    fill_prompt_seq = """
    You are given a JSON structure of a document and a partial part of the document. Your task is to check if the title that is described in the structure starts in the given partial document.

    The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X. 

    If the full target section starts in the given partial document, insert "start": "yes" and "physical_index": "<physical_index_X>" into the given JSON structure.

    If the full target section does not start in the given partial document, insert "start": "no" and "physical_index": None.

    The response should be in the following format. 
        [
            {
                "structure": <structure index, "x.x.x" or None> (string),
                "title": <title of the section>,
                "start": "<yes or no>",
                "physical_index": "<physical_index_X> (keep the format)" or None
            },
            ...
        ]    
    The given structure contains the result of the previous part, you need to fill the result of the current part, do not change the previous result.
    Directly return the final JSON structure. Do not output anything else."""

    prompt = fill_prompt_seq + f"\n\nCurrent Partial Document:\n{part}\n\nGiven Structure\n{json.dumps(structure, indent=2)}\n"
    current_json_raw = ChatGPT_API(model=model, prompt=prompt)
    json_result = extract_json(current_json_raw)
    
    for item in json_result:
        if 'start' in item:
            del item['start']
    return json_result


def remove_first_physical_index_section(text):
    """
    Removes the first section between <physical_index_X> and <physical_index_X> tags,
    and returns the remaining text.
    """
    pattern = r'<physical_index_\d+>.*?<physical_index_\d+>'
    match = re.search(pattern, text, re.DOTALL)
    if match:
        # Remove the first matched section
        return text.replace(match.group(0), '', 1)
    return text

### add verify completeness
def generate_toc_continue(toc_content, part, model="gpt-4o-2024-11-20"):
    print('start generate_toc_continue')
    prompt = """
    You are an expert in extracting hierarchical tree structure.
    You are given a tree structure of the previous part and the text of the current part.
    Your task is to continue the tree structure from the previous part to include the current part.

    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    For the title, you need to extract the original title from the text, only fix the space inconsistency.

    The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the start and end of page X.
    
    For the physical_index, you need to extract the physical index of the start of the section from the text. Keep the <physical_index_X> format.

    The response should be in the following format. 
        [
            {
                "structure": <structure index, "x.x.x"> (string),
                "title": <title of the section, keep the original title>,
                "physical_index": "<physical_index_X> (keep the format)"
            },
            ...
        ]    

    Directly return the additional part of the final JSON structure. Do not output anything else."""

    prompt = prompt + '\nGiven text\n:' + part + '\nPrevious tree structure\n:' + json.dumps(toc_content, indent=2)
    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)
    if finish_reason == 'finished':
        return extract_json(response)
    else:
        raise Exception(f'finish reason: {finish_reason}')
    
### add verify completeness
def generate_toc_init(part, model=None):
    print('start generate_toc_init')
    prompt = """
    You are an expert in extracting hierarchical tree structure, your task is to generate the tree structure of the document.

    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    For the title, you need to extract the original title from the text, only fix the space inconsistency.

    The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the start and end of page X. 

    For the physical_index, you need to extract the physical index of the start of the section from the text. Keep the <physical_index_X> format.

    The response should be in the following format. 
        [
            {
                "structure": <structure index, "x.x.x"> (string),
                "title": <title of the section, keep the original title>,
                "physical_index": "<physical_index_X> (keep the format)"
            },
            ...
        ]

    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\nGiven text\n:' + part
    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)

    if finish_reason == 'finished':
        return extract_json(response)
    else:
        raise Exception(f'finish reason: {finish_reason}')

def process_no_toc(page_list, start_index=1, model=None, logger=None):
    page_contents=[]
    token_lengths=[]
    for page_index in range(start_index, start_index+len(page_list)):
        page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n"
        page_contents.append(page_text)
        token_lengths.append(count_tokens(page_text, model))
    group_texts = page_list_to_group_text(page_contents, token_lengths)
    logger.info(f'len(group_texts): {len(group_texts)}')

    toc_with_page_number= generate_toc_init(group_texts[0], model)
    for group_text in group_texts[1:]:
        toc_with_page_number_additional = generate_toc_continue(toc_with_page_number, group_text, model)    
        toc_with_page_number.extend(toc_with_page_number_additional)
    logger.info(f'generate_toc: {toc_with_page_number}')

    toc_with_page_number = convert_physical_index_to_int(toc_with_page_number)
    logger.info(f'convert_physical_index_to_int: {toc_with_page_number}')

    return toc_with_page_number

def process_toc_no_page_numbers(toc_content, toc_page_list, page_list,  start_index=1, model=None, logger=None):
    page_contents=[]
    token_lengths=[]
    toc_content = toc_transformer(toc_content, model)
    logger.info(f'toc_transformer: {toc_content}')
    for page_index in range(start_index, start_index+len(page_list)):
        page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n"
        page_contents.append(page_text)
        token_lengths.append(count_tokens(page_text, model))
    
    group_texts = page_list_to_group_text(page_contents, token_lengths)
    logger.info(f'len(group_texts): {len(group_texts)}')

    toc_with_page_number=copy.deepcopy(toc_content)
    for group_text in group_texts:
        toc_with_page_number = add_page_number_to_toc(group_text, toc_with_page_number, model)
    logger.info(f'add_page_number_to_toc: {toc_with_page_number}')

    toc_with_page_number = convert_physical_index_to_int(toc_with_page_number)
    logger.info(f'convert_physical_index_to_int: {toc_with_page_number}')

    return toc_with_page_number



def process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=None, model=None, logger=None):
    toc_with_page_number = toc_transformer(toc_content, model)
    logger.info(f'toc_with_page_number: {toc_with_page_number}')

    toc_no_page_number = remove_page_number(copy.deepcopy(toc_with_page_number))
    
    start_page_index = toc_page_list[-1] + 1
    main_content = ""
    for page_index in range(start_page_index, min(start_page_index + toc_check_page_num, len(page_list))):
        main_content += f"<physical_index_{page_index+1}>\n{page_list[page_index][0]}\n<physical_index_{page_index+1}>\n\n"

    toc_with_physical_index = toc_index_extractor(toc_no_page_number, main_content, model)
    logger.info(f'toc_with_physical_index: {toc_with_physical_index}')

    toc_with_physical_index = convert_physical_index_to_int(toc_with_physical_index)
    logger.info(f'toc_with_physical_index: {toc_with_physical_index}')

    matching_pairs = extract_matching_page_pairs(toc_with_page_number, toc_with_physical_index, start_page_index)
    logger.info(f'matching_pairs: {matching_pairs}')

    offset = calculate_page_offset(matching_pairs)
    logger.info(f'offset: {offset}')

    toc_with_page_number = add_page_offset_to_toc_json(toc_with_page_number, offset)
    logger.info(f'toc_with_page_number: {toc_with_page_number}')

    toc_with_page_number = process_none_page_numbers(toc_with_page_number, page_list, model=model)
    logger.info(f'toc_with_page_number: {toc_with_page_number}')

    return toc_with_page_number



##check if needed to process none page numbers
def process_none_page_numbers(toc_items, page_list, start_index=1, model=None):
    for i, item in enumerate(toc_items):
        if "physical_index" not in item:
            # logger.info(f"fix item: {item}")
            # Find previous physical_index
            prev_physical_index = 0  # Default if no previous item exists
            for j in range(i - 1, -1, -1):
                if toc_items[j].get('physical_index') is not None:
                    prev_physical_index = toc_items[j]['physical_index']
                    break
            
            # Find next physical_index
            next_physical_index = -1  # Default if no next item exists (yields an empty page range below)
            for j in range(i + 1, len(toc_items)):
                if toc_items[j].get('physical_index') is not None:
                    next_physical_index = toc_items[j]['physical_index']
                    break

            page_contents = []
            for page_index in range(prev_physical_index, next_physical_index+1):
                # Skip physical indices that fall outside page_list
                list_index = page_index - start_index
                if 0 <= list_index < len(page_list):
                    page_text = f"<physical_index_{page_index}>\n{page_list[list_index][0]}\n<physical_index_{page_index}>\n\n"
                    page_contents.append(page_text)

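            item_copy = copy.deepcopy(item)
            # pop() avoids a KeyError when the item has no 'page' key
            item_copy.pop('page', None)
            # Pass the tagged pages as one string, matching the other call site of add_page_number_to_toc
            result = add_page_number_to_toc(''.join(page_contents), item_copy, model)
            if isinstance(result[0]['physical_index'], str) and result[0]['physical_index'].startswith('<physical_index'):
                item['physical_index'] = int(result[0]['physical_index'].split('_')[-1].rstrip('>').strip())
                item.pop('page', None)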
    
    return toc_items




def check_toc(page_list, opt=None):
    toc_page_list = find_toc_pages(start_page_index=0, page_list=page_list, opt=opt)
    if len(toc_page_list) == 0:
        print('no toc found')
        return {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'}
    else:
        print('toc found')
        toc_json = toc_extractor(page_list, toc_page_list, opt.model)

        if toc_json['page_index_given_in_toc'] == 'yes':
            print('index found')
            return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'yes'}
        else:
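            # The first TOC has no page numbers. Keep looking for a later TOC that
            # does, but only within the first opt.toc_check_page_num pages.
            current_start_index = toc_page_list[-1] + 1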
            
            while (toc_json['page_index_given_in_toc'] == 'no' and 
                   current_start_index < len(page_list) and 
                   current_start_index < opt.toc_check_page_num):
                
                additional_toc_pages = find_toc_pages(
                    start_page_index=current_start_index,
                    page_list=page_list,
                    opt=opt
                )
                
                if len(additional_toc_pages) == 0:
                    break

                additional_toc_json = toc_extractor(page_list, additional_toc_pages, opt.model)
                if additional_toc_json['page_index_given_in_toc'] == 'yes':
                    print('index found')
                    return {'toc_content': additional_toc_json['toc_content'], 'toc_page_list': additional_toc_pages, 'page_index_given_in_toc': 'yes'}

                else:
                    current_start_index = additional_toc_pages[-1] + 1
            print('index not found')
            return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'no'}






################### fix incorrect toc #########################################################
def single_toc_item_index_fixer(section_title, content, model="gpt-4o-2024-11-20"):
    toc_extractor_prompt = """
    You are given a section title and several pages of a document. Your job is to find the physical index of the start page of the section in the partial document.

    The provided pages contain tags like <physical_index_X> and <physical_index_X> to indicate the physical location of page X.

    Reply in a JSON format:
    {
        "thinking": <explain which page, started and closed by <physical_index_X>, contains the start of this section>,
        "physical_index": "<physical_index_X>" (keep the format)
    }
    Directly return the final JSON structure. Do not output anything else."""

    prompt = toc_extractor_prompt + '\nSection Title:\n' + str(section_title) + '\nDocument pages:\n' + content
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)    
    return convert_physical_index_to_int(json_content['physical_index'])



async def fix_incorrect_toc(toc_with_page_number, page_list, incorrect_results, start_index=1, model=None, logger=None):
    print(f'start fix_incorrect_toc with {len(incorrect_results)} incorrect results')
    incorrect_indices = {result['list_index'] for result in incorrect_results}
    
    end_index = len(page_list) + start_index - 1
    
    incorrect_results_and_range_logs = []
    # Helper function to process and check a single incorrect item
    async def process_and_check_item(incorrect_item):
        list_index = incorrect_item['list_index']
        
        # Check if list_index is valid
        if list_index < 0 or list_index >= len(toc_with_page_number):
            # Return an invalid result for out-of-bounds indices
            return {
                'list_index': list_index,
                'title': incorrect_item['title'],
                'physical_index': incorrect_item.get('physical_index'),
                'is_valid': False
            }
        
        # Find the previous correct item
        prev_correct = None
        for i in range(list_index-1, -1, -1):
            if i not in incorrect_indices:
                physical_index = toc_with_page_number[i].get('physical_index')
                if physical_index is not None:
                    prev_correct = physical_index
                    break
        # If no previous correct item is found, start just before the first page
        if prev_correct is None:
            prev_correct = start_index - 1
        
        # Find the next correct item
        next_correct = None
        for i in range(list_index+1, len(toc_with_page_number)):
            if i not in incorrect_indices:
                physical_index = toc_with_page_number[i].get('physical_index')
                if physical_index is not None:
                    next_correct = physical_index
                    break
        # If no next correct item is found, use the last page (end_index)
        if next_correct is None:
            next_correct = end_index
        
        incorrect_results_and_range_logs.append({
            'list_index': list_index,
            'title': incorrect_item['title'],
            'prev_correct': prev_correct,
            'next_correct': next_correct
        })

        page_contents = []
        for page_index in range(prev_correct, next_correct+1):
            # Skip physical indices that fall outside page_list
            page_list_idx = page_index - start_index
            if 0 <= page_list_idx < len(page_list):
                page_text = f"<physical_index_{page_index}>\n{page_list[page_list_idx][0]}\n<physical_index_{page_index}>\n\n"
                page_contents.append(page_text)
        content_range = ''.join(page_contents)
        
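        # single_toc_item_index_fixer makes a blocking API call; run it in a worker
        # thread so the items gathered below actually overlap (requires Python 3.9+).
        physical_index_int = await asyncio.to_thread(
            single_toc_item_index_fixer, incorrect_item['title'], content_range, model
        )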
        
        # Check if the result is correct
        check_item = incorrect_item.copy()
        check_item['physical_index'] = physical_index_int
        check_result = await check_title_appearance(check_item, page_list, start_index, model)

        return {
            'list_index': list_index,
            'title': incorrect_item['title'],
            'physical_index': physical_index_int,
            'is_valid': check_result['answer'] == 'yes'
        }

    # Process incorrect items concurrently
    tasks = [
        process_and_check_item(item)
        for item in incorrect_results
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Report failed items, then keep only the successful results
    for item, result in zip(incorrect_results, results):
        if isinstance(result, Exception):
            print(f"Processing item {item} generated an exception: {result}")
    results = [result for result in results if not isinstance(result, Exception)]

    # Update the toc_with_page_number with the fixed indices and check for any invalid results
    invalid_results = []
    for result in results:
        if result['is_valid']:
            # Add bounds checking to prevent IndexError
            list_idx = result['list_index']
            if 0 <= list_idx < len(toc_with_page_number):
                toc_with_page_number[list_idx]['physical_index'] = result['physical_index']
            else:
                # Index is out of bounds, treat as invalid
                invalid_results.append({
                    'list_index': result['list_index'],
                    'title': result['title'],
                    'physical_index': result['physical_index'],
                })
        else:

SYMBOL INDEX (115 symbols across 4 files)

FILE: pageindex/page_index.py
  function check_title_appearance (line 13) | async def check_title_appearance(item, page_list, start_index=1, model=N...
  function check_title_appearance_in_start (line 48) | async def check_title_appearance_in_start(title, page_text, model=None, ...
  function check_title_appearance_in_start_concurrent (line 74) | async def check_title_appearance_in_start_concurrent(structure, page_lis...
  function toc_detector_single_page (line 104) | def toc_detector_single_page(content, model=None):
  function check_if_toc_extraction_is_complete (line 125) | def check_if_toc_extraction_is_complete(content, toc, model=None):
  function check_if_toc_transformation_is_complete (line 143) | def check_if_toc_transformation_is_complete(content, toc, model=None):
  function extract_toc_content (line 160) | def extract_toc_content(content, model=None):
  function detect_page_index (line 202) | def detect_page_index(toc_content, model=None):
  function toc_extractor (line 222) | def toc_extractor(page_list, toc_page_list, model):
  function toc_index_extractor (line 243) | def toc_index_extractor(toc, content, model=None):
  function toc_transformer (line 273) | def toc_transformer(toc_content, model=None):
  function find_toc_pages (line 336) | def find_toc_pages(start_page_index, page_list, opt, logger=None):
  function remove_page_number (line 363) | def remove_page_number(data):
  function extract_matching_page_pairs (line 374) | def extract_matching_page_pairs(toc_page, toc_physical_index, start_page...
  function calculate_page_offset (line 389) | def calculate_page_offset(pairs):
  function add_page_offset_to_toc_json (line 411) | def add_page_offset_to_toc_json(data, offset):
  function page_list_to_group_text (line 421) | def page_list_to_group_text(page_contents, token_lengths, max_tokens=200...
  function add_page_number_to_toc (line 456) | def add_page_number_to_toc(part, structure, model=None):
  function remove_first_physical_index_section (line 489) | def remove_first_physical_index_section(text):
  function generate_toc_continue (line 502) | def generate_toc_continue(toc_content, part, model="gpt-4o-2024-11-20"):
  function generate_toc_init (line 537) | def generate_toc_init(part, model=None):
  function process_no_toc (line 571) | def process_no_toc(page_list, start_index=1, model=None, logger=None):
  function process_toc_no_page_numbers (line 592) | def process_toc_no_page_numbers(toc_content, toc_page_list, page_list,  ...
  function process_toc_with_page_numbers (line 617) | def process_toc_with_page_numbers(toc_content, toc_page_list, page_list,...
  function process_none_page_numbers (line 651) | def process_none_page_numbers(toc_items, page_list, start_index=1, model...
  function check_toc (line 691) | def check_toc(page_list, opt=None):
  function single_toc_item_index_fixer (line 735) | def single_toc_item_index_fixer(section_title, content, model="gpt-4o-20...
  function fix_incorrect_toc (line 755) | async def fix_incorrect_toc(toc_with_page_number, page_list, incorrect_r...
  function fix_incorrect_toc_with_retries (line 873) | async def fix_incorrect_toc_with_retries(toc_with_page_number, page_list...
  function verify_toc (line 895) | async def verify_toc(page_list, list_result, start_index=1, N=None, mode...
  function meta_processor (line 954) | async def meta_processor(page_list, mode=None, toc_content=None, toc_pag...
  function process_large_node_recursively (line 995) | async def process_large_node_recursively(node, page_list, opt=None, logg...
  function tree_parser (line 1024) | async def tree_parser(page_list, opt, doc=None, logger=None):
  function page_index_main (line 1061) | def page_index_main(doc, opt=None):
  function page_index (line 1106) | def page_index(doc, model=None, toc_check_page_num=None, max_page_num_ea...
  function validate_and_truncate_physical_indices (line 1117) | def validate_and_truncate_physical_indices(toc_with_page_number, page_li...

FILE: pageindex/page_index_md.py
  function get_node_summary (line 10) | async def get_node_summary(node, summary_token_threshold=200, model=None):
  function generate_summaries_for_structure_md (line 19) | async def generate_summaries_for_structure_md(structure, summary_token_t...
  function extract_nodes_from_markdown (line 32) | def extract_nodes_from_markdown(markdown_content):
  function extract_node_text_content (line 62) | def extract_node_text_content(node_list, markdown_lines):
  function update_node_list_with_text_token_count (line 89) | def update_node_list_with_text_token_count(node_list, model=None):
  function tree_thinning_for_index (line 135) | def tree_thinning_for_index(node_list, min_node_token=None, model=None):
  function build_tree_from_nodes (line 190) | def build_tree_from_nodes(node_list):
  function clean_tree_for_output (line 224) | def clean_tree_for_output(tree_nodes):
  function md_to_tree (line 243) | async def md_to_tree(md_path, if_thinning=False, min_token_threshold=Non...

FILE: pageindex/utils.py
  function count_tokens (line 22) | def count_tokens(text, model=None):
  function ChatGPT_API_with_finish_reason (line 29) | def ChatGPT_API_with_finish_reason(model, prompt, api_key=CHATGPT_API_KE...
  function ChatGPT_API (line 61) | def ChatGPT_API(model, prompt, api_key=CHATGPT_API_KEY, chat_history=None):
  function ChatGPT_API_async (line 89) | async def ChatGPT_API_async(model, prompt, api_key=CHATGPT_API_KEY):
  function get_json_content (line 111) | def get_json_content(response):
  function extract_json (line 125) | def extract_json(content):
  function write_node_id (line 158) | def write_node_id(data, node_id=0):
  function get_nodes (line 170) | def get_nodes(structure):
  function structure_to_list (line 185) | def structure_to_list(structure):
  function get_leaf_nodes (line 199) | def get_leaf_nodes(structure):
  function is_leaf_node (line 217) | def is_leaf_node(data, node_id):
  function get_last_node (line 243) | def get_last_node(structure):
  function extract_text_from_pdf (line 247) | def extract_text_from_pdf(pdf_path):
  function get_pdf_title (line 256) | def get_pdf_title(pdf_path):
  function get_text_of_pages (line 262) | def get_text_of_pages(pdf_path, start_page, end_page, tag=True):
  function get_first_start_page_from_text (line 274) | def get_first_start_page_from_text(text):
  function get_last_start_page_from_text (line 281) | def get_last_start_page_from_text(text):
  function sanitize_filename (line 292) | def sanitize_filename(filename, replacement='-'):
  function get_pdf_name (line 297) | def get_pdf_name(pdf_path):
  class JsonLogger (line 309) | class JsonLogger:
    method __init__ (line 310) | def __init__(self, file_path):
    method log (line 320) | def log(self, level, message, **kwargs):
    method info (line 331) | def info(self, message, **kwargs):
    method error (line 334) | def error(self, message, **kwargs):
    method debug (line 337) | def debug(self, message, **kwargs):
    method exception (line 340) | def exception(self, message, **kwargs):
    method _filepath (line 344) | def _filepath(self):
  function list_to_tree (line 350) | def list_to_tree(data):
  function add_preface_if_needed (line 398) | def add_preface_if_needed(data):
  function get_page_tokens (line 413) | def get_page_tokens(pdf_path, model="gpt-4o-2024-11-20", pdf_parser="PyP...
  function get_text_of_pdf_pages (line 441) | def get_text_of_pdf_pages(pdf_pages, start_page, end_page):
  function get_text_of_pdf_pages_with_labels (line 447) | def get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page):
  function get_number_of_pages (line 453) | def get_number_of_pages(pdf_path):
  function post_processing (line 460) | def post_processing(structure, end_physical_index):
  function clean_structure_post (line 481) | def clean_structure_post(data):
  function remove_fields (line 493) | def remove_fields(data, fields=['text']):
  function print_toc (line 501) | def print_toc(tree, indent=0):
  function print_json (line 507) | def print_json(data, max_len=40, indent=2):
  function remove_structure_text (line 522) | def remove_structure_text(data):
  function check_token_limit (line 533) | def check_token_limit(structure, limit=110000):
  function convert_physical_index_to_int (line 545) | def convert_physical_index_to_int(data):
  function convert_page_to_int (line 568) | def convert_page_to_int(data):
  function add_node_text (line 579) | def add_node_text(node, pdf_pages):
  function add_node_text_with_labels (line 592) | def add_node_text_with_labels(node, pdf_pages):
  function generate_node_summary (line 605) | async def generate_node_summary(node, model=None):
  function generate_summaries_for_structure (line 616) | async def generate_summaries_for_structure(structure, model=None):
  function create_clean_structure_for_description (line 626) | def create_clean_structure_for_description(structure):
  function generate_doc_description (line 649) | def generate_doc_description(structure, model=None):
  function reorder_dict (line 661) | def reorder_dict(data, key_order):
  function format_structure (line 667) | def format_structure(structure, order=None):
  class ConfigLoader (line 681) | class ConfigLoader:
    method __init__ (line 682) | def __init__(self, default_path: str = None):
    method _load_yaml (line 688) | def _load_yaml(path):
    method _validate_keys (line 692) | def _validate_keys(self, user_dict):
    method load (line 697) | def load(self, user_opt=None) -> config:

FILE: scripts/autoclose-labeled-issues.js
  constant GITHUB_TOKEN (line 22) | const GITHUB_TOKEN = process.env.GITHUB_TOKEN;
  constant REPO_OWNER (line 23) | const REPO_OWNER   = process.env.REPO_OWNER;
  constant REPO_NAME (line 24) | const REPO_NAME    = process.env.REPO_NAME;
  constant DRY_RUN (line 25) | const DRY_RUN      = process.env.DRY_RUN === 'true';
  constant THREE_DAYS_MS (line 27) | const THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000;
  function githubRequest (line 29) | function githubRequest(method, path, body = null, retried = false) {
  function fetchDuplicateIssues (line 91) | async function fetchDuplicateIssues() {
  function isBot (line 109) | function isBot(user) {
  function findDuplicateComment (line 116) | function findDuplicateComment(comments) {
  function hasHumanCommentAfter (line 125) | function hasHumanCommentAfter(comments, afterDate) {
  function fetchAllComments (line 136) | async function fetchAllComments(issueNumber) {
  function hasThumbsDownReaction (line 155) | async function hasThumbsDownReaction(commentId) {
  function closeAsDuplicate (line 166) | async function closeAsDuplicate(issueNumber) {
  function processIssue (line 185) | async function processIssue(issue) {
  function main (line 231) | async function main() {

Condensed preview: 37 files, each showing path, character count, and a content snippet.

[
  {
    "path": ".claude/commands/dedupe.md",
    "chars": 2052,
    "preview": "---\nallowed-tools:\n  - Bash(gh:*)\n  - Bash(./scripts/comment-on-duplicates.sh:*)\n---\n\nYou are a GitHub issue deduplicati"
  },
  {
    "path": ".gitattributes",
    "chars": 25,
    "preview": "*.ipynb linguist-vendored"
  },
  {
    "path": ".github/workflows/autoclose-labeled-issues.yml",
    "chars": 955,
    "preview": "# Auto-closes duplicate issues after 3 days if no human activity or thumbs-down reaction.\n# Runs daily at 09:00 UTC.\nnam"
  },
  {
    "path": ".github/workflows/backfill-dedupe.yml",
    "chars": 2078,
    "preview": "# Backfills duplicate detection for historical issues using Claude Code.\n# Triggered manually via workflow_dispatch.\nnam"
  },
  {
    "path": ".github/workflows/issue-dedupe.yml",
    "chars": 1787,
    "preview": "# Detects duplicate issues using Claude Code with the /dedupe command.\n# Triggered automatically when a new issue is ope"
  },
  {
    "path": ".github/workflows/remove-autoclose-label.yml",
    "chars": 1432,
    "preview": "# Removes the \"duplicate\" label when a human (non-bot) comments on a\n# duplicate-flagged issue, signaling that the issue"
  },
  {
    "path": ".gitignore",
    "chars": 169,
    "preview": ".ipynb_checkpoints\n__pycache__\nfiles\nindex\ntemp/*\nchroma-collections.parquet\nchroma-embeddings.parquet\n.DS_Store\n.env*\nn"
  },
  {
    "path": "CHANGELOG.md",
    "chars": 373,
    "preview": "# Change Log\nAll notable changes to this project will be documented in this file.\n\n## Beta - 2025-04-23\n\n### Fixed\n- [x]"
  },
  {
    "path": "LICENSE",
    "chars": 1067,
    "preview": "MIT License\n\nCopyright (c) 2025 Vectify AI\n\nPermission is hereby granted, free of charge, to any person obtaining a copy"
  },
  {
    "path": "README.md",
    "chars": 14503,
    "preview": "<div align=\"center\">\n  \n<a href=\"https://vectify.ai/pageindex\" target=\"_blank\">\n  <img src=\"https://github.com/user-atta"
  },
  {
    "path": "cookbook/README.md",
    "chars": 1146,
    "preview": "### 🧪 Cookbooks:\n\n* [**Vectorless RAG notebook**](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RA"
  },
  {
    "path": "cookbook/agentic_retrieval.ipynb",
    "chars": 72233,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"XTboY7brzyp2\"\n      },\n      \"sou"
  },
  {
    "path": "cookbook/pageIndex_chat_quickstart.ipynb",
    "chars": 10265,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"XTboY7brzyp2\"\n      },\n      \"sou"
  },
  {
    "path": "cookbook/pageindex_RAG_simple.ipynb",
    "chars": 24658,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"TCh9BTedHJK1\"\n      },\n      \"sou"
  },
  {
    "path": "cookbook/vision_RAG_pageindex.ipynb",
    "chars": 28316,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"TCh9BTedHJK1\"\n      },\n      \"sou"
  },
  {
    "path": "pageindex/__init__.py",
    "chars": 63,
    "preview": "from .page_index import *\nfrom .page_index_md import md_to_tree"
  },
  {
    "path": "pageindex/config.yaml",
    "chars": 208,
    "preview": "model: \"gpt-4o-2024-11-20\"\ntoc_check_page_num: 20\nmax_page_num_each_node: 10\nmax_token_num_each_node: 20000\nif_add_node_"
  },
  {
    "path": "pageindex/page_index.py",
    "chars": 48693,
    "preview": "import os\nimport json\nimport copy\nimport math\nimport random\nimport re\nfrom .utils import *\nimport os\nfrom concurrent.fut"
  },
  {
    "path": "pageindex/page_index_md.py",
    "chars": 11879,
    "preview": "import asyncio\nimport json\nimport re\nimport os\ntry:\n    from .utils import *\nexcept:\n    from utils import *\n\nasync def "
  },
  {
    "path": "pageindex/utils.py",
    "chars": 24356,
    "preview": "import tiktoken\nimport openai\nimport logging\nimport os\nfrom datetime import datetime\nimport time\nimport json\nimport PyPD"
  },
  {
    "path": "requirements.txt",
    "chars": 98,
    "preview": "openai==1.101.0\npymupdf==1.26.4\nPyPDF2==3.0.1\npython-dotenv==1.1.0\ntiktoken==0.11.0\npyyaml==6.0.2\n"
  },
  {
    "path": "run_pageindex.py",
    "chars": 5978,
    "preview": "import argparse\nimport os\nimport json\nfrom pageindex import *\nfrom pageindex.page_index_md import md_to_tree\n\nif __name_"
  },
  {
    "path": "scripts/autoclose-labeled-issues.js",
    "chars": 8093,
    "preview": "/**\n * scripts/autoclose-labeled-issues.js\n *\n * Auto-closes issues that have a bot \"possible duplicate\" comment older t"
  },
  {
    "path": "scripts/comment-on-duplicates.sh",
    "chars": 2117,
    "preview": "#!/usr/bin/env bash\n#\n# comment-on-duplicates.sh - Posts a duplicate issue comment with auto-close warning.\n#\n# Usage:\n#"
  },
  {
    "path": "tests/results/2023-annual-report-truncated_structure.json",
    "chars": 1844,
    "preview": "{\n  \"doc_name\": \"2023-annual-report-truncated.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\""
  },
  {
    "path": "tests/results/2023-annual-report_structure.json",
    "chars": 13499,
    "preview": "{\n  \"doc_name\": \"2023-annual-report.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n     "
  },
  {
    "path": "tests/results/PRML_structure.json",
    "chars": 49206,
    "preview": "{\n  \"doc_name\": \"PRML.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end_index\": "
  },
  {
    "path": "tests/results/Regulation Best Interest_Interpretive release_structure.json",
    "chars": 12338,
    "preview": "{\n  \"doc_name\": \"Regulation Best Interest_Interpretive release.pdf\",\n  \"doc_description\": \"A detailed analysis of the SE"
  },
  {
    "path": "tests/results/Regulation Best Interest_proposed rule_structure.json",
    "chars": 131846,
    "preview": "{\n  \"doc_name\": \"Regulation Best Interest_proposed rule.pdf\",\n  \"doc_description\": \"The document provides a comprehensiv"
  },
  {
    "path": "tests/results/earthmover_structure.json",
    "chars": 2971,
    "preview": "{\n  \"doc_name\": \"earthmover.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Earth Mover\\u2019s Distance based Similarity Se"
  },
  {
    "path": "tests/results/four-lectures_structure.json",
    "chars": 7724,
    "preview": "{\n  \"doc_name\": \"four-lectures.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end"
  },
  {
    "path": "tests/results/q1-fy25-earnings_structure.json",
    "chars": 49396,
    "preview": "{\n  \"doc_name\": \"q1-fy25-earnings.pdf\",\n  \"doc_description\": \"A comprehensive financial report detailing The Walt Disney"
  },
  {
    "path": "tutorials/doc-search/README.md",
    "chars": 689,
    "preview": "\n\n## Document Search Examples\n\n\nPageIndex currently enables reasoning-based RAG within a single document by default.\nFor"
  },
  {
    "path": "tutorials/doc-search/description.md",
    "chars": 2014,
    "preview": "\n## Document Search by Description\n\nFor documents that don't have metadata, you can use LLM-generated descriptions to he"
  },
  {
    "path": "tutorials/doc-search/metadata.md",
    "chars": 1371,
    "preview": "\n\n## Document Search by Metadata\n<callout>PageIndex with metadata support is in closed beta. Fill out this form to reque"
  },
  {
    "path": "tutorials/doc-search/semantics.md",
    "chars": 1805,
    "preview": "## Document Search by Semantics\n\nFor documents that cover diverse topics, one can also use vector-based semantic search "
  },
  {
    "path": "tutorials/tree-search/README.md",
    "chars": 2605,
    "preview": "## Tree Search Examples\nThis tutorial provides a basic example of how to perform retrieval using the PageIndex tree.\n\n##"
  }
]
