[
  {
    "path": ".claude/commands/dedupe.md",
    "content": "---\nallowed-tools:\n  - Bash(gh:*)\n  - Bash(./scripts/comment-on-duplicates.sh:*)\n---\n\nYou are a GitHub issue deduplication assistant. Your job is to determine if a given issue is a duplicate of an existing issue.\n\n## Input\n\nThe issue to check: $ARGUMENTS\n\n## Steps\n\n### 1. Pre-checks\n\nFirst, check if the issue should be skipped:\n\n```\ngh issue view <number> --json state,labels,title,body,comments\n```\n\nSkip if:\n- The issue is already closed\n- The issue already has a `duplicate` label\n- The issue already has a dedupe comment (check comments for \"possible duplicate\")\n\n### 2. Understand the issue\n\nRead the issue carefully and generate a concise summary of the core problem or feature request. Extract 3-5 key technical terms or concepts.\n\n### 3. Search for duplicates\n\nLaunch 5 parallel searches using different keyword strategies to maximize coverage:\n\n1. **Exact terms**: Use the most specific technical terms from the issue title\n2. **Synonyms**: Use alternative phrasings for the core problem\n3. **Error messages**: If the issue contains error messages, search for those\n4. **Component names**: Search by the specific component/module mentioned\n5. **Broad category**: Search by the general category of the issue\n\nFor each search, use:\n```\ngh search issues \"<keywords> state:open\" --repo $REPOSITORY --limit 20\n```\n\n### 4. Analyze candidates\n\nFor each unique candidate issue found:\n- Compare the core problem being described\n- Look past superficial wording differences\n- Consider whether they describe the same root cause\n- Only flag as duplicate if you are at least 85% confident\n\n### 5. Filter false positives\n\nRemove candidates that:\n- Are only superficially similar (same area but different problems)\n- Are related but describe distinct issues\n- Are too old or already resolved differently\n\n### 6. 
Report results\n\nIf you found duplicates (max 3), call:\n```\n./scripts/comment-on-duplicates.sh --base-issue <number> --potential-duplicates <dup1> <dup2> ...\n```\n\nIf no duplicates found, do nothing and report that the issue appears to be unique.\n"
  },
  {
    "path": ".gitattributes",
    "content": "*.ipynb linguist-vendored"
  },
  {
    "path": ".github/workflows/autoclose-labeled-issues.yml",
    "content": "# Auto-closes duplicate issues after 3 days if no human activity or thumbs-down reaction.\n# Runs daily at 09:00 UTC.\nname: Auto-close Duplicate Issues\n\non:\n  schedule:\n    - cron: '0 9 * * *'\n  workflow_dispatch:\n    inputs:\n      dry_run:\n        description: 'Dry run - report but do not close issues'\n        required: false\n        default: 'false'\n        type: choice\n        options:\n          - 'false'\n          - 'true'\n\npermissions:\n  issues: write\n  contents: read\n\njobs:\n  autoclose:\n    runs-on: ubuntu-latest\n    timeout-minutes: 10\n    steps:\n      - name: Checkout repository\n        uses: actions/checkout@v4\n\n      - name: Close inactive duplicate issues\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n          REPO_OWNER: ${{ github.repository_owner }}\n          REPO_NAME: ${{ github.event.repository.name }}\n          DRY_RUN: ${{ inputs.dry_run || 'false' }}\n        run: node scripts/autoclose-labeled-issues.js\n"
  },
  {
    "path": ".github/workflows/backfill-dedupe.yml",
    "content": "# Backfills duplicate detection for historical issues using Claude Code.\n# Triggered manually via workflow_dispatch.\nname: Backfill Duplicate Detection\n\non:\n  workflow_dispatch:\n    inputs:\n      days_back:\n        description: 'How many days back to look for issues (default: 30)'\n        required: false\n        default: '30'\n        type: number\n\npermissions:\n  contents: read\n  issues: write\n  actions: write\n\njobs:\n  backfill:\n    runs-on: ubuntu-latest\n    timeout-minutes: 10\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Fetch issues and run dedupe\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n          REPO: ${{ github.repository }}\n          DAYS_BACK: ${{ inputs.days_back || '30' }}\n        run: |\n          if ! [[ \"$DAYS_BACK\" =~ ^[0-9]+$ ]]; then\n            echo \"Error: days_back must be a number\"\n            exit 1\n          fi\n\n          SINCE=$(date -u -d \"$DAYS_BACK days ago\" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-${DAYS_BACK}d +%Y-%m-%dT%H:%M:%SZ)\n          echo \"Fetching open issues since $SINCE\"\n\n          # Get open issues via gh api --paginate, filter out PRs and already-labeled ones\n          ISSUES=$(gh api --paginate \"repos/$REPO/issues?state=open&per_page=100\" \\\n            --jq \"[.[] | select(.pull_request == null) | select(.created_at >= \\\"$SINCE\\\") | select([.labels[].name] | index(\\\"duplicate\\\") | not)] | .[].number\" | xargs)\n\n          if [ -z \"$ISSUES\" ]; then\n            echo \"No issues to process\"\n            exit 0\n          fi\n\n          BATCH_SIZE=10\n          COUNT=0\n          echo \"Issues to process: $ISSUES\"\n          for NUMBER in $ISSUES; do\n            echo \"Triggering dedupe for issue #$NUMBER\"\n            gh workflow run issue-dedupe.yml --repo \"$REPO\" -f issue_number=\"$NUMBER\"\n            COUNT=$((COUNT + 1))\n            if [ $((COUNT % BATCH_SIZE)) -eq 0 ]; then\n              echo 
\"Pausing 60s after $COUNT issues...\"\n              sleep 60\n            else\n              sleep 5\n            fi\n          done\n\n          echo \"Backfill triggered for $COUNT issues\"\n"
  },
  {
    "path": ".github/workflows/issue-dedupe.yml",
    "content": "# Detects duplicate issues using Claude Code with the /dedupe command.\n# Triggered automatically when a new issue is opened, or manually for a single issue.\nname: Issue Duplicate Detection\n\non:\n  issues:\n    types: [opened]\n  workflow_dispatch:\n    inputs:\n      issue_number:\n        description: 'Issue number to check for duplicates'\n        required: true\n        type: string\n\npermissions:\n  contents: read\n  issues: write\n\nconcurrency:\n  group: dedupe-${{ github.event.issue.number || inputs.issue_number }}\n  cancel-in-progress: true\n\njobs:\n  detect-duplicate:\n    runs-on: ubuntu-latest\n    timeout-minutes: 10\n    # Skip pull-requests that surface as issues and bot-opened issues\n    if: >\n      (github.event_name == 'workflow_dispatch') ||\n      (github.event.issue.pull_request == null &&\n       !endsWith(github.actor, '[bot]') &&\n       github.actor != 'github-actions')\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Determine issue number\n        id: issue\n        env:\n          EVENT_NAME: ${{ github.event_name }}\n          INPUT_NUMBER: ${{ inputs.issue_number }}\n          ISSUE_NUMBER: ${{ github.event.issue.number }}\n        run: |\n          if [ \"$EVENT_NAME\" = \"workflow_dispatch\" ]; then\n            echo \"number=$INPUT_NUMBER\" >> \"$GITHUB_OUTPUT\"\n          else\n            echo \"number=$ISSUE_NUMBER\" >> \"$GITHUB_OUTPUT\"\n          fi\n\n      - uses: anthropics/claude-code-action@v1\n        env:\n          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n        with:\n          prompt: \"/dedupe ${{ github.repository }}/issues/${{ steps.issue.outputs.number }}\"\n          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}\n          github_token: ${{ secrets.GITHUB_TOKEN }}\n          allowed_bots: \"github-actions\"\n          allowed_non_write_users: \"*\"\n"
  },
  {
    "path": ".github/workflows/remove-autoclose-label.yml",
    "content": "# Removes the \"duplicate\" label when a human (non-bot) comments on a\n# duplicate-flagged issue, signaling that the issue needs re-evaluation.\n# The auto-close script also independently checks for human activity,\n# so this provides an additional visible signal.\nname: Remove Duplicate Label on Human Activity\n\non:\n  issue_comment:\n    types: [created]\n\npermissions:\n  issues: write\n\njobs:\n  remove-label:\n    # Only run for issue comments (not PR comments)\n    if: >\n      github.event.issue.pull_request == null &&\n      !endsWith(github.actor, '[bot]') &&\n      github.actor != 'github-actions'\n    runs-on: ubuntu-latest\n    steps:\n      - name: Remove duplicate label if human commented\n        uses: actions/github-script@v7\n        with:\n          script: |\n            const issue = context.payload.issue;\n            const labels = (issue.labels || []).map(l => l.name);\n\n            if (!labels.includes('duplicate')) {\n              core.info('Issue does not have \"duplicate\" label - nothing to do.');\n              return;\n            }\n\n            await github.rest.issues.removeLabel({\n              owner: context.repo.owner,\n              repo: context.repo.repo,\n              issue_number: issue.number,\n              name: 'duplicate',\n            });\n\n            core.info(\n              `Removed \"duplicate\" label from #${issue.number} ` +\n              `after human comment by ${context.actor}`\n            );\n"
  },
  {
    "path": ".gitignore",
    "content": ".ipynb_checkpoints\n__pycache__\nfiles\nindex\ntemp/*\nchroma-collections.parquet\nchroma-embeddings.parquet\n.DS_Store\n.env*\nnotebook\nSDK/*\nlog/*\nlogs/\nparts/*\njson_results/*\n"
  },
  {
    "path": "CHANGELOG.md",
    "content": "# Change Log\nAll notable changes to this project will be documented in this file.\n\n## Beta - 2025-04-23\n\n### Fixed\n- [x] Fixed a bug introduced on April 18 where `start_index` was incorrectly passed.\n\n## Beta - 2025-04-03\n\n### Added\n- [x] Add node_id, node summary\n- [x] Add document description\n\n### Changed\n- [x] Change \"child_nodes\" -> \"nodes\" to simplify the structure\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2025 Vectify AI\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "<div align=\"center\">\n  \n<a href=\"https://vectify.ai/pageindex\" target=\"_blank\">\n  <img src=\"https://github.com/user-attachments/assets/46201e72-675b-43bc-bfbd-081cc6b65a1d\" alt=\"PageIndex Banner\" />\n</a>\n\n<br/>\n<br/>\n\n<p align=\"center\">\n  <a href=\"https://trendshift.io/repositories/14736\" target=\"_blank\"><img src=\"https://trendshift.io/api/badge/repositories/14736\" alt=\"VectifyAI%2FPageIndex | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"/></a>\n</p>\n\n# PageIndex: Vectorless, Reasoning-based RAG\n\n<p align=\"center\"><b>Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</b></p>\n\n<h4 align=\"center\">\n  <a href=\"https://vectify.ai\">🏠 Homepage</a>&nbsp; • &nbsp;\n  <a href=\"https://chat.pageindex.ai\">🖥️ Chat Platform</a>&nbsp; • &nbsp;\n  <a href=\"https://pageindex.ai/mcp\">🔌 MCP</a>&nbsp; • &nbsp;\n  <a href=\"https://docs.pageindex.ai\">📚 Docs</a>&nbsp; • &nbsp;\n  <a href=\"https://discord.com/invite/VuXuf29EUj\">💬 Discord</a>&nbsp; • &nbsp;\n  <a href=\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\">✉️ Contact</a>&nbsp;\n</h4>\n  \n</div>\n\n\n<details open>\n<summary><h3>📢 Latest Updates</h3></summary>\n\n **🔥 Releases:**\n- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like document-analysis agent [platform](https://chat.pageindex.ai) built for professional long documents. Can also be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).\n<!-- - [**PageIndex Chat API**](https://docs.pageindex.ai/quickstart): An API that brings PageIndex's advanced long-document intelligence directly into your applications and workflows. -->\n<!-- - [PageIndex MCP](https://pageindex.ai/mcp): Bring PageIndex into Claude, Cursor, or any MCP-enabled agent. Chat with long PDFs in a reasoning-based, human-like way. 
-->\n \n **📝 Articles:**\n- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context* *tree index* that enables LLMs to perform *reasoning-based*, *human-like retrieval* over long documents, without vector DB or chunking.\n<!-- - [Do We Still Need OCR?](https://pageindex.ai/blog/do-we-need-ocr): Explores how vision-based, reasoning-native RAG challenges the traditional OCR pipeline, and why the future of document AI might be *vectorless* and *vision-based*. -->\n\n **🧪 Cookbooks:**\n- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, hands-on example of reasoning-based RAG using PageIndex. No vectors, no chunking, and human-like retrieval.\n- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): OCR-free, vision-only RAG with PageIndex's reasoning-native retrieval workflow that works directly over PDF page images.\n</details>\n\n---\n\n# 📑 Introduction to PageIndex\n\nAre you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.\n\nInspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a **vectorless**, **reasoning-based RAG** system that builds a **hierarchical tree index** from long documents and uses LLMs to **reason** *over that index* for **agentic, context-aware retrieval**.\nIt simulates how *human experts* navigate and extract knowledge from complex documents through *tree search*, enabling LLMs to *think* and *reason* their way to the most relevant document sections. PageIndex performs retrieval in two steps:\n\n1. 
Generate a “Table-of-Contents” **tree structure index** of documents\n2. Perform reasoning-based retrieval through **tree search**\n\n<div align=\"center\">\n  <a href=\"https://pageindex.ai/blog/pageindex-intro\" target=\"_blank\" title=\"The PageIndex Framework\">\n    <img src=\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\" width=\"70%\">\n  </a>\n</div>\n\n### 🎯 Core Features \n\nCompared to traditional vector-based RAG, **PageIndex** features:\n- **No Vector DB**: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.\n- **No Chunking**: Documents are organized into natural sections, not artificial chunks.\n- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents.\n- **Better Explainability and Traceability**: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search (“vibe retrieval”).\n\nPageIndex powers a reasoning-based RAG system that achieved **state-of-the-art** [98.7% accuracy](https://github.com/VectifyAI/Mafin2.5-FinanceBench) on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis (see our [blog post](https://vectify.ai/blog/Mafin2.5) for details).\n\n### 📍 Explore PageIndex\n\nTo learn more, please see a detailed introduction of the [PageIndex framework](https://pageindex.ai/blog/pageindex-intro). Check out this GitHub repo for open-source code, and the [cookbooks](https://docs.pageindex.ai/cookbook), [tutorials](https://docs.pageindex.ai/tutorials), and [blog](https://pageindex.ai/blog) for additional usage guides and examples. 
\n\nThe PageIndex service is available as a ChatGPT-style [chat platform](https://chat.pageindex.ai), or can be integrated via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).\n\n### 🛠️ Deployment Options\n- Self-host — run locally with this open-source repo.\n- Cloud Service — try instantly with our [Chat Platform](https://chat.pageindex.ai/), or integrate with [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart).\n- _Enterprise_ — private or on-prem deployment. [Contact us](https://ii2abc2jejf.typeform.com/to/tK3AXl8T) or [book a demo](https://calendly.com/pageindex/meet) for more details.\n\n### 🧪 Quick Hands-on\n\n- Try the [**Vectorless RAG**](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb) notebook — a *minimal*, hands-on example of reasoning-based RAG using PageIndex.\n- Experiment with [*Vision-based Vectorless RAG*](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb) — no OCR; a minimal, reasoning-native RAG pipeline that works directly over page images.\n  \n<div align=\"center\">\n  <a href=\"https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb\" target=\"_blank\" rel=\"noopener\">\n    <img src=\"https://img.shields.io/badge/Open_In_Colab-Vectorless_RAG-orange?style=for-the-badge&logo=googlecolab\" alt=\"Open in Colab: Vectorless RAG\" />\n  </a>\n  &nbsp;&nbsp;\n  <a href=\"https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb\" target=\"_blank\" rel=\"noopener\">\n    <img src=\"https://img.shields.io/badge/Open_In_Colab-Vision_RAG-orange?style=for-the-badge&logo=googlecolab\" alt=\"Open in Colab: Vision RAG\" />\n  </a>\n</div>\n\n---\n\n# 🌲 PageIndex Tree Structure\nPageIndex can transform lengthy PDF documents into a semantic **tree structure**, similar to a _\"table of contents\"_ but optimized for use with Large 
Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.\n\nBelow is an example PageIndex tree structure. Also see more example [documents](https://github.com/VectifyAI/PageIndex/tree/main/tests/pdfs) and generated [tree structures](https://github.com/VectifyAI/PageIndex/tree/main/tests/results).\n\n```jsonc\n...\n{\n  \"title\": \"Financial Stability\",\n  \"node_id\": \"0006\",\n  \"start_index\": 21,\n  \"end_index\": 22,\n  \"summary\": \"The Federal Reserve ...\",\n  \"nodes\": [\n    {\n      \"title\": \"Monitoring Financial Vulnerabilities\",\n      \"node_id\": \"0007\",\n      \"start_index\": 22,\n      \"end_index\": 28,\n      \"summary\": \"The Federal Reserve's monitoring ...\"\n    },\n    {\n      \"title\": \"Domestic and International Cooperation and Coordination\",\n      \"node_id\": \"0008\",\n      \"start_index\": 28,\n      \"end_index\": 31,\n      \"summary\": \"In 2023, the Federal Reserve collaborated ...\"\n    }\n  ]\n}\n...\n```\n\nYou can generate the PageIndex tree structure with this open-source repo, or use our [API](https://docs.pageindex.ai/quickstart) \n\n---\n\n# ⚙️ Package Usage\n\nYou can follow these steps to generate a PageIndex tree from a PDF document.\n\n### 1. Install dependencies\n\n```bash\npip3 install --upgrade -r requirements.txt\n```\n\n### 2. Set your OpenAI API key\n\nCreate a `.env` file in the root directory and add your API key:\n\n```bash\nCHATGPT_API_KEY=your_openai_key_here\n```\n\n### 3. 
Run PageIndex on your PDF\n\n```bash\npython3 run_pageindex.py --pdf_path /path/to/your/document.pdf\n```\n\n<details>\n<summary><strong>Optional parameters</strong></summary>\n<br>\nYou can customize the processing with additional optional arguments:\n\n```\n--model                 OpenAI model to use (default: gpt-4o-2024-11-20)\n--toc-check-pages       Pages to check for table of contents (default: 20)\n--max-pages-per-node    Max pages per node (default: 10)\n--max-tokens-per-node   Max tokens per node (default: 20000)\n--if-add-node-id        Add node ID (yes/no, default: yes)\n--if-add-node-summary   Add node summary (yes/no, default: yes)\n--if-add-doc-description Add doc description (yes/no, default: yes)\n```\n</details>\n\n<details>\n<summary><strong>Markdown support</strong></summary>\n<br>\nWe also provide Markdown support for PageIndex. You can use the `--md_path` flag to generate a tree structure for a Markdown file.\n\n```bash\npython3 run_pageindex.py --md_path /path/to/your/document.md\n```\n\n> Note: in this function, we use \"#\" to determine node headings and their levels. For example, \"##\" is level 2, \"###\" is level 3, etc. Make sure your Markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this function, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our [PageIndex OCR](https://pageindex.ai/blog/ocr), which is designed to preserve the original hierarchy, to convert the PDF to a Markdown file and then use this function.\n</details>\n\n<!-- \n# ☁️ Improved Tree Generation with PageIndex OCR\n\nThis repo is designed for generating PageIndex tree structure for simple PDFs, but many real-world use cases involve complex PDFs that are hard to parse by classic Python tools. However, extracting high-quality text from PDF documents remains a non-trivial challenge. 
Most OCR tools only extract page-level content, losing the broader document context and hierarchy.\n\nTo address this, we introduced PageIndex OCR — the first long-context OCR model designed to preserve the global structure of documents. PageIndex OCR significantly outperforms other leading OCR tools, such as those from Mistral and Contextual AI, in recognizing true hierarchy and semantic relationships across document pages.\n\n- Experience next-level OCR quality with PageIndex OCR at our [Dashboard](https://dash.pageindex.ai/).\n- Integrate PageIndex OCR seamlessly into your stack via our [API](https://docs.pageindex.ai/quickstart).\n\n<p align=\"center\">\n  <img src=\"https://github.com/user-attachments/assets/eb35d8ae-865c-4e60-a33b-ebbd00c41732\" width=\"80%\">\n</p>\n-->\n\n---\n\n# 📈 Case Study: PageIndex Leads Finance QA Benchmark\n\n[Mafin 2.5](https://vectify.ai/mafin) is a reasoning-based RAG system for financial document analysis, powered by **PageIndex**. It achieved a state-of-the-art [**98.7% accuracy**](https://vectify.ai/blog/Mafin2.5) on the [FinanceBench](https://arxiv.org/abs/2311.11944) benchmark, significantly outperforming traditional vector-based RAG systems.\n\nPageIndex's hierarchical indexing and reasoning-driven retrieval enable precise navigation and extraction of relevant context from complex financial reports, such as SEC filings and earnings disclosures.\n\nExplore the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) and our [blog post](https://vectify.ai/blog/Mafin2.5) for detailed comparisons and performance metrics.\n\n<div align=\"center\">\n  <a href=\"https://github.com/VectifyAI/Mafin2.5-FinanceBench\">\n    <img src=\"https://github.com/user-attachments/assets/571aa074-d803-43c7-80c4-a04254b782a3\" width=\"70%\">\n  </a>\n</div>\n\n---\n\n# 🧭 Resources\n\n* 🧪 [Cookbooks](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): hands-on, runnable examples and advanced use cases.\n* 📖 
[Tutorials](https://docs.pageindex.ai/doc-search): practical guides and strategies, including *Document Search* and *Tree Search*.\n* 📝 [Blog](https://pageindex.ai/blog): technical articles, research insights, and product updates.\n* 🔌 [MCP setup](https://pageindex.ai/mcp#quick-setup) & [API docs](https://docs.pageindex.ai/quickstart): integration details and configuration options.\n\n---\n\n# ⭐ Support Us\nPlease cite this work as:\n```\nMingtian Zhang, Yu Tang and PageIndex Team,\n\"PageIndex: Next-Generation Vectorless, Reasoning-based RAG\",\nPageIndex Blog, Sep 2025.\n```\n\nOr use the BibTeX citation:\n\n```\n@article{zhang2025pageindex,\n  author = {Mingtian Zhang and Yu Tang and PageIndex Team},\n  title = {PageIndex: Next-Generation Vectorless, Reasoning-based RAG},\n  journal = {PageIndex Blog},\n  year = {2025},\n  month = {September},\n  note = {https://pageindex.ai/blog/pageindex-intro},\n}\n```\n\nLeave us a star 🌟 if you like our project. Thank you!  \n\n<p>\n  <img src=\"https://github.com/user-attachments/assets/eae4ff38-48ae-4a7c-b19f-eab81201d794\" width=\"80%\">\n</p>\n\n### Connect with Us\n\n[![Twitter](https://img.shields.io/badge/Twitter-000000?style=for-the-badge&logo=x&logoColor=white)](https://x.com/PageIndexAI)&nbsp;\n[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/company/vectify-ai/)&nbsp;\n[![Discord](https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.com/invite/VuXuf29EUj)&nbsp;\n[![Contact Us](https://img.shields.io/badge/Contact_Us-3B82F6?style=for-the-badge&logo=envelope&logoColor=white)](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)\n\n---\n\n© 2025 [Vectify AI](https://vectify.ai)\n"
  },
  {
    "path": "cookbook/README.md",
    "content": "### 🧪 Cookbooks:\n\n* [**Vectorless RAG notebook**](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb): A *minimal*, hands-on example of reasoning-based RAG using **PageIndex** — no vectors, no chunking, and human-like retrieval.\n* [Vision-based Vectorless RAG notebook](https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb): no OCR; reasoning-native RAG pipeline that retrieves and reasons directly over page images.\n\n<div align=\"center\">\n  <a href=\"https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/pageindex_RAG_simple.ipynb\" target=\"_blank\" rel=\"noopener\">\n    <img src=\"https://img.shields.io/badge/Open_In_Colab-Vectorless_RAG-orange?style=for-the-badge&logo=googlecolab\" alt=\"Open in Colab: Vectorless RAG\" />\n  </a>\n  &nbsp;&nbsp;\n  <a href=\"https://colab.research.google.com/github/VectifyAI/PageIndex/blob/main/cookbook/vision_RAG_pageindex.ipynb\" target=\"_blank\" rel=\"noopener\">\n    <img src=\"https://img.shields.io/badge/Open_In_Colab-Vision_RAG-orange?style=for-the-badge&logo=googlecolab\" alt=\"Open in Colab: Vision RAG\" />\n  </a>\n</div>"
  },
  {
    "path": "cookbook/agentic_retrieval.ipynb",
    "content": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"XTboY7brzyp2\"\n      },\n      \"source\": [\n        \"![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"EtjMbl9Pz3S-\"\n      },\n      \"source\": [\n        \"<p align=\\\"center\\\">Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</p>\\n\",\n        \"\\n\",\n        \"<p align=\\\"center\\\">\\n\",\n        \"  <a href=\\\"https://vectify.ai\\\">🏠 Homepage</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://chat.pageindex.ai\\\">🖥️ Platform</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://docs.pageindex.ai/quickstart\\\">📚 API Docs</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://github.com/VectifyAI/PageIndex\\\">📦 GitHub</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://discord.com/invite/VuXuf29EUj\\\">💬 Discord</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\\\">✉️ Contact</a>&nbsp;\\n\",\n        \"</p>\\n\",\n        \"\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"\\n\",\n        \"[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\\n\",\n        \"\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"---\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"bbC9uLWCz8zl\"\n      },\n      \"source\": [\n        \"# Agentic Retrieval with PageIndex Chat API\\n\",\n        \"\\n\",\n        \"Similarity-based RAG based on 
Vector-DB has shown significant limitations in recent AI applications, and reasoning-based or agentic retrieval has become increasingly important. However, unlike a classic RAG pipeline with embedding input, top-K chunk returns, and re-ranking, what should an agentic-native retrieval API look like?\\n\",\n        \"\\n\",\n        \"For an agentic-native retrieval system, we need the ability to prompt for retrieval just as naturally as you interact with ChatGPT. Below, we provide an example of how the PageIndex Chat API enables this style of prompt-driven retrieval.\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"## PageIndex Chat API\\n\",\n        \"[PageIndex Chat](https://chat.pageindex.ai/) is an AI assistant that allows you to chat with multiple super-long documents without worrying about limited context or context rot. It is based on [PageIndex](https://pageindex.ai/blog/pageindex-intro), a vectorless, reasoning-based RAG framework that gives more transparent and reliable results, like a human expert.\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"  <img src=\\\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\\\" width=\\\"70%\\\">\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"You can now access PageIndex Chat via the API or SDK.\\n\",\n        \"\\n\",\n        \"## 📝 Notebook Overview\\n\",\n        \"\\n\",\n        \"This notebook demonstrates a simple, minimal example of agentic retrieval with PageIndex. 
You will learn:\\n\",\n        \"- [x] How to use PageIndex Chat API.\\n\",\n        \"- [x] How to prompt the PageIndex Chat to make it a retrieval system\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"77SQbPoe-LTN\"\n      },\n      \"source\": [\n        \"### Install PageIndex SDK\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 36,\n      \"metadata\": {\n        \"id\": \"6Eiv_cHf0OXz\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"%pip install -q --upgrade pageindex\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"UR9-qkdD-Om7\"\n      },\n      \"source\": [\n        \"### Setup PageIndex\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 60,\n      \"metadata\": {\n        \"id\": \"AFzsW4gq0fjh\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"from pageindex import PageIndexClient\\n\",\n        \"\\n\",\n        \"# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\\n\",\n        \"PAGEINDEX_API_KEY = \\\"YOUR_PAGEINDEX_API_KEY\\\"\\n\",\n        \"pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"uvzf9oWL-Ts9\"\n      },\n      \"source\": [\n        \"### Upload a document\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 39,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"qf7sNRoL0hGw\",\n        \"outputId\": \"529f53c1-c827-45a7-cf01-41f567d4feaa\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Downloaded https://arxiv.org/pdf/2507.13334.pdf\\n\",\n            \"Document Submitted: 
pi-cmi34m6jy01sg0bqzofch62n8\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"import os, requests\\n\",\n        \"\\n\",\n        \"pdf_url = \\\"https://arxiv.org/pdf/2507.13334.pdf\\\"\\n\",\n        \"pdf_path = os.path.join(\\\"../data\\\", pdf_url.split('/')[-1])\\n\",\n        \"os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\\n\",\n        \"\\n\",\n        \"response = requests.get(pdf_url)\\n\",\n        \"with open(pdf_path, \\\"wb\\\") as f:\\n\",\n        \"    f.write(response.content)\\n\",\n        \"print(f\\\"Downloaded {pdf_url}\\\")\\n\",\n        \"\\n\",\n        \"doc_id = pi_client.submit_document(pdf_path)[\\\"doc_id\\\"]\\n\",\n        \"print('Document Submitted:', doc_id)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"U4hpLB4T-fCt\"\n      },\n      \"source\": [\n        \"### Check the processing status\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 61,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"PB1S_CWd2n87\",\n        \"outputId\": \"472a64ab-747d-469c-9e46-3329456df212\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"{'createdAt': '2025-11-16T08:36:41.177000',\\n\",\n            \" 'description': 'This survey provides a comprehensive overview and taxonomy of '\\n\",\n            \"                'Context Engineering for Large Language Models, covering '\\n\",\n            \"                'foundational components, system implementations, evaluation '\\n\",\n            \"                'methods, and future research directions.',\\n\",\n            \" 'id': 'pi-cmi1gp1hg01t20do2l3bgzwz1',\\n\",\n            \" 'name': '2507.13334_19.pdf',\\n\",\n            \" 'pageNum': 166,\\n\",\n            \" 'status': 
'completed'}\\n\",\n            \"\\n\",\n            \" Document ready! (166 pages)\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"from pprint import pprint\\n\",\n        \"\\n\",\n        \"doc_info = pi_client.get_document(doc_id)\\n\",\n        \"pprint(doc_info)\\n\",\n        \"\\n\",\n        \"if doc_info['status'] == 'completed':\\n\",\n        \"  print(f\\\"\\\\n Document ready! ({doc_info['pageNum']} pages)\\\")\\n\",\n        \"elif doc_info['status'] == 'processing':\\n\",\n        \"  print(\\\"\\\\n Document is still processing. Please wait and check again.\\\")\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"z1C9FOvO-p1m\"\n      },\n      \"source\": [\n        \"### Ask a question about this document\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 55,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"X3RbQvy_0nt7\",\n        \"outputId\": \"9bfb314b-24ad-4eb2-d26c-01be5728d3cc\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"I'll help you find the evaluation methods used in this paper. Let me start by examining the document structure to locate the relevant sections.{\\\"doc_name\\\": \\\"2507.13334_19.pdf\\\"}Perfect! I can see there's a dedicated section on \\\"Evaluation\\\" (node_id: 0015) that covers pages 45-50. 
Let me extract the content from those pages to get detailed information about the evaluation methods.{\\\"doc_name\\\": \\\"2507.13334_19.pdf\\\", \\\"pages\\\": \\\"45-50\\\"}Based on the comprehensive evaluation section of the paper, here are the **evaluation methods** used:\\n\",\n            \"\\n\",\n            \"## Evaluation Framework Overview\\n\",\n            \"\\n\",\n            \"The paper presents a comprehensive evaluation framework organized into **Component-Level Assessment** and **System-Level Integration Assessment**.\\n\",\n            \"\\n\",\n            \"### 1. **Component-Level Assessment (Intrinsic Evaluation)**\\n\",\n            \"\\n\",\n            \"#### Prompt Engineering Evaluation:\\n\",\n            \"- **Semantic similarity metrics**\\n\",\n            \"- **Response quality assessment**\\n\",\n            \"- **Robustness testing** across diverse input variations\\n\",\n            \"- **Contextual calibration** assessment\\n\",\n            \"\\n\",\n            \"#### Long Context Processing Evaluation:\\n\",\n            \"- **\\\"Needle in a haystack\\\"** evaluation paradigm - tests models' ability to retrieve specific information embedded within long contexts\\n\",\n            \"- **Multi-document reasoning tasks** - assess synthesis capabilities\\n\",\n            \"- **Position interpolation techniques** evaluation\\n\",\n            \"- **Information retention, positional bias, and reasoning coherence** metrics\\n\",\n            \"\\n\",\n            \"#### Self-Contextualization Evaluation:\\n\",\n            \"- **Meta-learning assessments**\\n\",\n            \"- **Adaptation speed measurements**\\n\",\n            \"- **Consistency analysis** across multiple iterations\\n\",\n            \"- Self-refinement frameworks: **Self-Refine, Reflexion, N-CRITICS**\\n\",\n            \"- Performance improvements measured (~20% improvement with GPT-4)\\n\",\n            \"\\n\",\n            \"#### Structured/Relational 
Data Integration:\\n\",\n            \"- **Knowledge graph traversal accuracy**\\n\",\n            \"- **Table comprehension assessment**\\n\",\n            \"- **Database query generation evaluation**\\n\",\n            \"\\n\",\n            \"### 2. **System-Level Integration Assessment (Extrinsic Evaluation)**\\n\",\n            \"\\n\",\n            \"#### Retrieval-Augmented Generation (RAG):\\n\",\n            \"- **Precision, recall, relevance metrics**\\n\",\n            \"- **Factual accuracy assessment**\\n\",\n            \"- **Task decomposition accuracy**\\n\",\n            \"- **Multi-plan selection effectiveness**\\n\",\n            \"- Memory-augmented planning evaluation\\n\",\n            \"\\n\",\n            \"#### Memory Systems Evaluation:\\n\",\n            \"- **LongMemEval benchmark** (500 curated questions covering):\\n\",\n            \"  - Information extraction\\n\",\n            \"  - Temporal reasoning\\n\",\n            \"  - Multi-session reasoning\\n\",\n            \"  - Knowledge updates\\n\",\n            \"- Dedicated benchmarks: **NarrativeQA, QMSum, QuALITY, MEMENTO**\\n\",\n            \"- Accuracy degradation tracking (~30% degradation in extended interactions)\\n\",\n            \"\\n\",\n            \"#### Tool-Integrated Reasoning:\\n\",\n            \"- **MCP-RADAR framework** for standardized evaluation\\n\",\n            \"- **Berkeley Function Calling Leaderboard (BFCL)** - 2,000 test cases\\n\",\n            \"- **T-Eval** - 553 tool-use cases\\n\",\n            \"- **API-Bank** - 73 APIs, 314 dialogues\\n\",\n            \"- **ToolHop** - 995 queries, 3,912 tools\\n\",\n            \"- **StableToolBench** for API instability\\n\",\n            \"- **WebArena** and **Mind2Web** for web agents\\n\",\n            \"- **VideoWebArena** for multimodal agents\\n\",\n            \"- Metrics: tool selection accuracy, parameter extraction precision, execution success rates, error recovery\\n\",\n            \"\\n\",\n       
     \"#### Multi-Agent Systems:\\n\",\n            \"- **Communication effectiveness metrics**\\n\",\n            \"- **Coordination efficiency assessment**\\n\",\n            \"- **Protocol adherence evaluation**\\n\",\n            \"- **Task decomposition accuracy**\\n\",\n            \"- **Emergent collaborative behaviors** assessment\\n\",\n            \"- Context handling and transaction support evaluation\\n\",\n            \"\\n\",\n            \"### 3. **Emerging Evaluation Paradigms**\\n\",\n            \"\\n\",\n            \"#### Self-Refinement Evaluation:\\n\",\n            \"- Iterative improvement assessment across multiple cycles\\n\",\n            \"- Multi-dimensional feedback mechanisms\\n\",\n            \"- Ensemble-based evaluation approaches\\n\",\n            \"\\n\",\n            \"#### Multi-Aspect Feedback:\\n\",\n            \"- Correctness, relevance, clarity, and robustness dimensions\\n\",\n            \"- Self-rewarding mechanisms for autonomous evolution\\n\",\n            \"\\n\",\n            \"#### Criticism-Guided Evaluation:\\n\",\n            \"- Specialized critic models providing detailed feedback\\n\",\n            \"- Fine-grained assessment of reasoning quality, factual accuracy, logical consistency\\n\",\n            \"\\n\",\n            \"### 4. 
**Safety and Robustness Assessment**\\n\",\n            \"\\n\",\n            \"- **Adversarial attack resistance testing**\\n\",\n            \"- **Distribution shift evaluation**\\n\",\n            \"- **Input perturbation testing**\\n\",\n            \"- **Alignment assessment** (adherence to intended behaviors)\\n\",\n            \"- **Graceful degradation strategies**\\n\",\n            \"- **Error recovery protocols**\\n\",\n            \"- **Long-term behavior consistency** evaluation\\n\",\n            \"\\n\",\n            \"### Key Benchmarks Mentioned:\\n\",\n            \"- GAIA (general assistant tasks - 92% human vs 15% GPT-4 accuracy)\\n\",\n            \"- GTA benchmark (GPT-4 <50% task completion vs 92% human)\\n\",\n            \"- WebArena Leaderboard (with success rates ranging from 23.5% to 61.7%)\\n\",\n            \"\\n\",\n            \"### Challenges Identified:\\n\",\n            \"- Traditional metrics (BLEU, ROUGE, perplexity) inadequate for complex systems\\n\",\n            \"- Need for \\\"living\\\" benchmarks that co-evolve with AI capabilities\\n\",\n            \"- Longitudinal evaluation frameworks for tracking memory fidelity over time\\n\",\n            \"- Compositional generalization assessment\\n\",\n            \"- Evaluation of \\\"unknown unknowns\\\" in multi-agent systems\\n\",\n            \"\\n\",\n            \"The paper emphasizes a **paradigm shift from static benchmarks to dynamic, holistic assessments** that evaluate not just task success but reasoning quality, robustness, and long-term autonomy.\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"query = \\\"What are the evaluation methods used in this paper?\\\"\\n\",\n        \"\\n\",\n        \"for chunk in pi_client.chat_completions(\\n\",\n        \"    messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": query}],\\n\",\n        \"    doc_id=doc_id,\\n\",\n        \"    stream=True\\n\",\n        \"):\\n\",\n        \"    print(chunk, 
end='', flush=True)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"RXrdnxv92LY1\"\n      },\n      \"source\": [\n        \"## Agentic Retrieval with PageIndex Chat API\\n\",\n        \"\\n\",\n        \"You can now easily prompt the PageIndex Chat API to be a retrieval assistant.\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 62,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"puyOGkAi0wrH\",\n        \"outputId\": \"b08c682f-abfb-4381-8bb0-726d210120ad\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"I'll help you find the evaluation methods discussed in this paper. Let me start by examining the document structure to locate relevant sections.{\\\"doc_name\\\": \\\"2507.13334_22.pdf\\\"}Now I can see the document structure. Section 6 is dedicated to \\\"Evaluation\\\" and spans pages 45-50. Let me retrieve the content from those pages to get the detailed evaluation methods.{\\\"doc_name\\\": \\\"2507.13334_22.pdf\\\", \\\"pages\\\": \\\"45-50\\\"}Perfect! I have retrieved the comprehensive evaluation section. Now let me compile the evaluation methods in the requested JSON format.\\n\",\n            \"\\n\",\n            \"```json\\n\",\n            \"[\\n\",\n            \"  {\\n\",\n            \"    \\\"page\\\": 45,\\n\",\n            \"    \\\"content\\\": \\\"## 6. Evaluation\\\\n\\\\nThe evaluation of context-engineered systems presents unprecedented challenges that transcend traditional language model assessment paradigms. 
These systems exhibit complex, multi-component architectures with dynamic, context-dependent behaviors requiring comprehensive evaluation frameworks that assess component-level diagnostics, task-based performance, and overall system robustness [841, 1141].\\\\n\\\\nThe heterogeneous nature of context engineering components-spanning retrieval mechanisms, memory systems, reasoning chains, and multi-agent coordination-demands evaluation methodologies that can capture both individual component effectiveness and emergent system-level behaviors [314, 939].\\\\n\\\\n### 6.1. Evaluation Frameworks and Methodologies\\\\n\\\\nThis subsection presents comprehensive approaches for evaluating both individual components and integrated systems in context engineering.\\\\n\\\\n#### 6.1.1. Component-Level Assessment\\\\n\\\\nIntrinsic evaluation focuses on the performance of individual components in isolation, providing foundational insights into system capabilities and failure modes.\\\\n\\\\nFor prompt engineering components, evaluation encompasses prompt effectiveness measurement through semantic similarity metrics, response quality assessment, and robustness testing across diverse input variations. Current approaches reveal brittleness and robustness challenges in prompt design, necessitating more sophisticated evaluation frameworks that can assess contextual calibration and adaptive prompt optimization $[1141,669]$.\\\"\\n\",\n            \"  },\\n\",\n            \"  {\\n\",\n            \"    \\\"page\\\": 46,\\n\",\n            \"    \\\"content\\\": \\\"Long context processing evaluation requires specialized metrics addressing information retention, positional bias, and reasoning coherence across extended sequences. The \\\\\\\"needle in a haystack\\\\\\\" evaluation paradigm tests models' ability to retrieve specific information embedded within long contexts, while multi-document reasoning tasks assess synthesis capabilities across multiple information sources. 
Position interpolation techniques and ultra-long sequence processing methods face significant computational challenges that limit practical evaluation scenarios [737, 299].\\\\n\\\\nSelf-contextualization mechanisms undergo evaluation through meta-learning assessments, adaptation speed measurements, and consistency analysis across multiple iterations. Self-refinement frameworks including Self-Refine, Reflexion, and N-CRITICS demonstrate substantial performance improvements, with GPT-4 achieving approximately 20\\\\\\\\% improvement through iterative self-refinement processes [741, 964, 795]. Multi-dimensional feedback mechanisms and ensemble-based evaluation approaches provide comprehensive assessment of autonomous evolution capabilities [583, 710].\\\\n\\\\nStructured and relational data integration evaluation examines accuracy in knowledge graph traversal, table comprehension, and database query generation. However, current evaluation frameworks face significant limitations in assessing structural reasoning capabilities, with high-quality structured training data development presenting ongoing challenges. LSTM-based models demonstrate increased errors when sequential and structural information conflict, highlighting the need for more sophisticated benchmarks testing structural understanding $[769,674,167]$.\\\\n\\\\n#### 6.1.2. 
System-Level Integration Assessment\\\\n\\\\nExtrinsic evaluation measures end-to-end performance on downstream tasks, providing holistic assessments of system utility through comprehensive benchmarks spanning question answering, reasoning, and real-world applications.\\\\n\\\\nSystem-level evaluation must capture emergent behaviors arising from component interactions, including synergistic effects where combined components exceed individual performance and potential interference patterns where component integration degrades overall effectiveness [841, 1141].\\\\n\\\\nRetrieval-Augmented Generation evaluation encompasses both retrieval quality and generation effectiveness through comprehensive metrics addressing precision, recall, relevance, and factual accuracy. Agentic RAG systems introduce additional complexity requiring evaluation of task decomposition accuracy, multi-plan selection effectiveness, and memory-augmented planning capabilities. Self-reflection mechanisms demonstrate iterative improvement through feedback loops, with MemoryBank implementations incorporating Ebbinghaus Forgetting Curve principles for enhanced memory evaluation [444, 166, 1372, 1192, 41].\\\\n\\\\nMemory systems evaluation encounters substantial difficulties stemming from the absence of standardized assessment frameworks and the inherently stateless characteristics of contemporary LLMs. LongMemEval offers 500 carefully curated questions that evaluate fundamental capabilities encompassing information extraction, temporal reasoning, multi-session reasoning, and knowledge updates. Commercial AI assistants exhibit $30 \\\\\\\\%$ accuracy degradation throughout extended interactions, underscoring significant deficiencies in memory persistence and retrieval effectiveness [1340, 1180, 463, 847, 390]. 
Dedicated benchmarks such as NarrativeQA, QMSum, QuALITY, and MEMENTO tackle episodic memory evaluation challenges [556, 572].\\\\n\\\\nTool-integrated reasoning systems require comprehensive evaluation covering the entire interaction trajectory, including tool selection accuracy, parameter extraction precision, execution success rates, and error recovery capabilities. The MCP-RADAR framework provides standardized evaluation employing objective metrics for software engineering and mathematical reasoning domains. Real-world evaluation reveals\\\"\\n\",\n            \"  },\\n\",\n            \"  {\\n\",\n            \"    \\\"page\\\": 47,\\n\",\n            \"    \\\"content\\\": \\\"significant performance gaps, with GPT-4 completing less than 50\\\\\\\\% of tasks in the GTA benchmark, compared to human performance of $92 \\\\\\\\%$ [314, 1098, 126, 939]. Advanced benchmarks including BFCL (2,000 testing cases), T-Eval (553 tool-use cases), API-Bank (73 APIs, 314 dialogues), and ToolHop ( 995 queries, 3,912 tools) address multi-turn interactions and nested tool calling scenarios [263, 363, 377, 1264, 160, 835].\\\\n\\\\nMulti-agent systems evaluation captures communication effectiveness, coordination efficiency, and collective outcome quality through specialized metrics addressing protocol adherence, task decomposition accuracy, and emergent collaborative behaviors. Contemporary orchestration frameworks including LangGraph, AutoGen, and CAMEL demonstrate insufficient transaction support, with validation limitations emerging as systems rely exclusively on LLM self-validation capabilities without independent validation procedures. Context handling failures compound challenges as agents struggle with long-term context maintenance encompassing both episodic and semantic information [128, 394, 901].\\\\n\\\\n### 6.2. 
Benchmark Datasets and Evaluation Paradigms\\\\n\\\\nThis subsection reviews specialized benchmarks and evaluation paradigms designed for assessing context engineering system performance.\\\\n\\\\n#### 6.2.1. Foundational Component Benchmarks\\\\n\\\\nLong context processing evaluation employs specialized benchmark suites designed to test information retention, reasoning, and synthesis across extended sequences. Current benchmarks face significant computational complexity challenges, with $\\\\\\\\mathrm{O}\\\\\\\\left(\\\\\\\\mathrm{n}^{2}\\\\\\\\right)$ scaling limitations in attention mechanisms creating substantial memory constraints for ultra-long sequences. Position interpolation and extension techniques require sophisticated evaluation frameworks that can assess both computational efficiency and reasoning quality across varying sequence lengths [737, 299, 1236].\\\\n\\\\nAdvanced architectures including LongMamba and specialized position encoding methods demonstrate promising directions for long context processing, though evaluation reveals persistent challenges in maintaining coherence across extended sequences. The development of sliding attention mechanisms and memory-efficient implementations requires comprehensive benchmarks that can assess both computational tractability and task performance [1267, 351].\\\\n\\\\nStructured and relational data integration benchmarks encompass diverse knowledge representation formats and reasoning patterns. However, current evaluation frameworks face limitations in assessing structural reasoning capabilities, with the development of high-quality structured training data presenting ongoing challenges. Evaluation must address the fundamental tension between sequential and structural information processing, particularly in scenarios where these information types conflict [769, 674, 167].\\\\n\\\\n#### 6.2.2. 
System Implementation Benchmarks\\\\n\\\\nRetrieval-Augmented Generation evaluation leverages comprehensive benchmark suites addressing diverse retrieval and generation challenges. Modular RAG architectures demonstrate enhanced flexibility through specialized modules for retrieval, augmentation, and generation, enabling fine-grained evaluation of individual components and their interactions. Graph-enhanced RAG systems incorporating GraphRAG and LightRAG demonstrate improved performance in complex reasoning scenarios, though evaluation frameworks must address the additional complexity of graph traversal and multi-hop reasoning assessment [316, 973, 364].\\\\n\\\\nAgentic RAG systems introduce sophisticated planning and reflection mechanisms requiring evaluation\\\"\\n\",\n            \"  },\\n\",\n            \"  {\\n\",\n            \"    \\\"page\\\": 48,\\n\",\n            \"    \\\"content\\\": \\\"of task decomposition accuracy, multi-plan selection effectiveness, and iterative refinement capabilities. Real-time and streaming RAG applications present unique evaluation challenges in assessing both latency and accuracy under dynamic information conditions [444, 166, 1192].\\\\n\\\\nTool-integrated reasoning system evaluation employs comprehensive benchmarks spanning diverse tool usage scenarios and complexity levels. The Berkeley Function Calling Leaderboard (BFCL) provides 2,000 testing cases with step-by-step and end-to-end assessments measuring call accuracy, pass rates, and win rates across increasingly complex scenarios. T-Eval contributes 553 tool-use cases testing multi-turn interactions and nested tool calling capabilities [263, 1390, 835]. 
Advanced benchmarks including StableToolBench address API instability challenges, while NesTools evaluates nested tool scenarios and ToolHop assesses multi-hop tool usage across 995 queries and 3,912 tools [363, 377, 1264].\\\\n\\\\nWeb agent evaluation frameworks including WebArena and Mind2Web provide comprehensive assessment across thousands of tasks spanning 137 websites, revealing significant performance gaps in current LLM capabilities for complex web interactions. VideoWebArena extends evaluation to multimodal agents, while Deep Research Bench and DeepShop address specialized evaluation for research and shopping agents respectively $[1378,206,87,482]$.\\\\n\\\\nMulti-agent system evaluation employs specialized frameworks addressing coordination, communication, and collective intelligence. However, current frameworks face significant challenges in transactional integrity across complex workflows, with many systems lacking adequate compensation mechanisms for partial failures. Orchestration evaluation must address context management, coordination strategy effectiveness, and the ability to maintain system coherence under varying operational conditions [128, 901].\\\\n\\\\n| Release Date | Open Source | Method / Model | Success Rate (\\\\\\\\%) | Source |\\\\n| :-- | :--: | :-- | :--: | :-- |\\\\n| $2025-02$ | $\\\\\\\\times$ | IBM CUGA | 61.7 | $[753]$ |\\\\n| $2025-01$ | $\\\\\\\\times$ | OpenAI Operator | 58.1 | $[813]$ |\\\\n| $2024-08$ | $\\\\\\\\times$ | Jace.AI | 57.1 | $[476]$ |\\\\n| $2024-12$ | $\\\\\\\\times$ | ScribeAgent + GPT-4o | 53.0 | $[950]$ |\\\\n| $2025-01$ | $\\\\\\\\checkmark$ | AgentSymbiotic | 52.1 | $[1323]$ |\\\\n| $2025-01$ | $\\\\\\\\checkmark$ | Learn-by-Interact | 48.0 | $[998]$ |\\\\n| $2024-10$ | $\\\\\\\\checkmark$ | AgentOccam-Judge | 45.7 | $[1231]$ |\\\\n| $2024-08$ | $\\\\\\\\times$ | WebPilot | 37.2 | $[1331]$ |\\\\n| $2024-10$ | $\\\\\\\\checkmark$ | GUI-API Hybrid Agent | 35.8 | $[988]$ |\\\\n| $2024-09$ | 
$\\\\\\\\checkmark$ | Agent Workflow Memory | 35.5 | $[1144]$ |\\\\n| $2024-04$ | $\\\\\\\\checkmark$ | SteP | 33.5 | $[979]$ |\\\\n| $2025-06$ | $\\\\\\\\checkmark$ | TTI | 26.1 | $[951]$ |\\\\n| $2024-04$ | $\\\\\\\\checkmark$ | BrowserGym + GPT-4 | 23.5 | $[238]$ |\\\\n\\\\nTable 8: WebArena [1378] Leaderboard: Top performing models with their success rates and availability status.\\\\n\\\\n### 6.3. Evaluation Challenges and Emerging Paradigms\\\\n\\\\nThis subsection identifies current limitations in evaluation methodologies and explores emerging approaches for more effective assessment.\\\"\\n\",\n            \"  },\\n\",\n            \"  {\\n\",\n            \"    \\\"page\\\": 49,\\n\",\n            \"    \\\"content\\\": \\\"#### 6.3.1. Methodological Limitations and Biases\\\\n\\\\nTraditional evaluation metrics prove fundamentally inadequate for capturing the nuanced, dynamic behaviors exhibited by context-engineered systems. Static metrics like BLEU, ROUGE, and perplexity, originally designed for simpler text generation tasks, fail to assess complex reasoning chains, multi-step interactions, and emergent system behaviors. The inherent complexity and interdependencies of multi-component systems create attribution challenges where isolating failures and identifying root causes becomes computationally and methodologically intractable. Future metrics must evolve to capture not just task success, but the quality and robustness of the underlying reasoning process, especially in scenarios requiring compositional generalization and creative problem-solving [841, 1141].\\\\n\\\\nMemory system evaluation faces particular challenges due to the lack of standardized benchmarks and the stateless nature of current LLMs. Automated memory testing frameworks must address the isolation problem where different memory testing stages cannot be effectively separated, leading to unreliable assessment results. 
Commercial AI assistants demonstrate significant performance degradation during sustained interactions, with accuracy drops of up to $30 \\\\\\\\%$ highlighting critical gaps in current evaluation methodologies and pointing to the need for longitudinal evaluation frameworks that track memory fidelity over time $[1340,1180,463]$.\\\\n\\\\nTool-integrated reasoning system evaluation reveals substantial performance gaps between current systems and human-level capabilities. The GAIA benchmark demonstrates that while humans achieve $92 \\\\\\\\%$ accuracy on general assistant tasks, advanced models like GPT-4 achieve only $15 \\\\\\\\%$ accuracy, indicating fundamental limitations in current evaluation frameworks and system capabilities [778, 1098, 126]. Evaluation frameworks must address the complexity of multi-tool coordination, error recovery, and adaptive tool selection across diverse operational contexts [314, 939].\\\\n\\\\n#### 6.3.2. Emerging Evaluation Paradigms\\\\n\\\\nSelf-refinement evaluation paradigms leverage iterative improvement mechanisms to assess system capabilities across multiple refinement cycles. Frameworks including Self-Refine, Reflexion, and N-CRITICS demonstrate substantial performance improvements through multi-dimensional feedback and ensemblebased evaluation approaches. GPT-4 achieves approximately 20\\\\\\\\% improvement through self-refinement processes, highlighting the importance of evaluating systems across multiple iteration cycles rather than single-shot assessments. However, a key future challenge lies in evaluating the meta-learning capability itself—not just whether the system improves, but how efficiently and robustly it learns to refine its strategies over time $[741,964,795,583]$.\\\\n\\\\nMulti-aspect feedback evaluation incorporates diverse feedback dimensions including correctness, relevance, clarity, and robustness, providing comprehensive assessment of system outputs. 
Self-rewarding mechanisms enable autonomous evolution and meta-learning assessment, allowing systems to develop increasingly sophisticated evaluation criteria through iterative refinement [710].\\\\n\\\\nCriticism-guided evaluation employs specialized critic models to provide detailed feedback on system outputs, enabling fine-grained assessment of reasoning quality, factual accuracy, and logical consistency. These approaches address the limitations of traditional metrics by providing contextual, content-aware evaluation that can adapt to diverse task requirements and output formats [795, 583].\\\\n\\\\nOrchestration evaluation frameworks address the unique challenges of multi-agent coordination by incorporating transactional integrity assessment, context management evaluation, and coordination strategy effectiveness measurement. Advanced frameworks including SagaLLM provide transaction support and\\\"\\n\",\n            \"  },\\n\",\n            \"  {\\n\",\n            \"    \\\"page\\\": 50,\\n\",\n            \"    \\\"content\\\": \\\"independent validation procedures to address the limitations of systems that rely exclusively on LLM selfvalidation capabilities $[128,394]$.\\\\n\\\\n#### 6.3.3. Safety and Robustness Assessment\\\\n\\\\nSafety-oriented evaluation incorporates comprehensive robustness testing, adversarial attack resistance, and alignment assessment to ensure responsible development of context-engineered systems. Particular attention must be paid to the evaluation of agentic systems that can operate autonomously across extended periods, as these systems present unique safety challenges that traditional evaluation frameworks cannot adequately address $[973,364]$.\\\\n\\\\nRobustness evaluation must assess system performance under distribution shifts, input perturbations, and adversarial conditions through comprehensive stress testing protocols. 
Multi-agent systems face additional challenges in coordination failure scenarios, where partial system failures can cascade through the entire agent network. Evaluation frameworks must address graceful degradation strategies, error recovery protocols, and the ability to maintain system functionality under adverse conditions. Beyond predefined failure modes, future evaluation must grapple with assessing resilience to \\\\\\\"unknown unknowns\\\\\\\"-emergent and unpredictable failure cascades in highly complex, autonomous multi-agent systems [128, 394].\\\\n\\\\nAlignment evaluation measures system adherence to intended behaviors, value consistency, and beneficial outcome optimization through specialized assessment frameworks. Context engineering systems present unique alignment challenges due to their dynamic adaptation capabilities and complex interaction patterns across multiple components. Long-term evaluation must assess whether systems maintain beneficial behaviors as they adapt and evolve through extended operational periods [901].\\\\n\\\\nLooking ahead, the evaluation of context-engineered systems requires a paradigm shift from static benchmarks to dynamic, holistic assessments. Future frameworks must move beyond measuring task success to evaluating compositional generalization for novel problems and tracking long-term autonomy in interactive environments. The development of 'living' benchmarks that co-evolve with AI capabilities, alongside the integration of socio-technical and economic metrics, will be critical for ensuring these advanced systems are not only powerful but also reliable, efficient, and aligned with human values in real-world applications $[314,1378,1340]$.\\\\n\\\\nThe evaluation landscape for context-engineered systems continues evolving rapidly as new architectures, capabilities, and applications emerge. 
Future evaluation paradigms must address increasing system complexity while providing reliable, comprehensive, and actionable insights for system improvement and deployment decisions. The integration of multiple evaluation approaches-from component-level assessment to systemwide robustness testing-represents a critical research priority for ensuring the reliable deployment of context-engineered systems in real-world applications [841, 1141].\\\"\\n\",\n            \"  }\\n\",\n            \"]\\n\",\n            \"```\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"retrieval_prompt = f\\\"\\\"\\\"\\n\",\n        \"Your job is to retrieve the raw relevant content from the document based on the user's query.\\n\",\n        \"\\n\",\n        \"Query: {query}\\n\",\n        \"\\n\",\n        \"Return in JSON format:\\n\",\n        \"```json\\n\",\n        \"[\\n\",\n        \"  {{\\n\",\n        \"    \\\"page\\\": <number>,\\n\",\n        \"    \\\"content\\\": \\\"<raw text>\\\"\\n\",\n        \"  }},\\n\",\n        \"  ...\\n\",\n        \"]\\n\",\n        \"```\\n\",\n        \"\\\"\\\"\\\"\\n\",\n        \"\\n\",\n        \"full_response = \\\"\\\"\\n\",\n        \"\\n\",\n        \"for chunk in pi_client.chat_completions(\\n\",\n        \"    messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": retrieval_prompt}],\\n\",\n        \"    doc_id=doc_id,\\n\",\n        \"    stream=True\\n\",\n        \"):\\n\",\n        \"    print(chunk, end='', flush=True)\\n\",\n        \"    full_response += chunk\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"d-Y9towQ_CiF\"\n      },\n      \"source\": [\n        \"### Extract the JSON retrieved results\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 59,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"rwjC65oB05Tt\",\n        
\"outputId\": \"64504ad5-1778-463f-989b-46e18aba2ea6\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Note: you may need to restart the kernel to use updated packages.\\n\",\n            \"[{'content': '## 6. Evaluation\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'The evaluation of context-engineered systems presents '\\n\",\n            \"             'unprecedented challenges that transcend traditional language '\\n\",\n            \"             'model assessment paradigms. These systems exhibit complex, '\\n\",\n            \"             'multi-component architectures with dynamic, context-dependent '\\n\",\n            \"             'behaviors requiring comprehensive evaluation frameworks that '\\n\",\n            \"             'assess component-level diagnostics, task-based performance, and '\\n\",\n            \"             'overall system robustness [841, 1141].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'The heterogeneous nature of context engineering '\\n\",\n            \"             'components-spanning retrieval mechanisms, memory systems, '\\n\",\n            \"             'reasoning chains, and multi-agent coordination-demands '\\n\",\n            \"             'evaluation methodologies that can capture both individual '\\n\",\n            \"             'component effectiveness and emergent system-level behaviors '\\n\",\n            \"             '[314, 939].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '### 6.1. 
Evaluation Frameworks and Methodologies\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'This subsection presents comprehensive approaches for evaluating '\\n\",\n            \"             'both individual components and integrated systems in context '\\n\",\n            \"             'engineering.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '#### 6.1.1. Component-Level Assessment\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Intrinsic evaluation focuses on the performance of individual '\\n\",\n            \"             'components in isolation, providing foundational insights into '\\n\",\n            \"             'system capabilities and failure modes.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'For prompt engineering components, evaluation encompasses prompt '\\n\",\n            \"             'effectiveness measurement through semantic similarity metrics, '\\n\",\n            \"             'response quality assessment, and robustness testing across '\\n\",\n            \"             'diverse input variations. Current approaches reveal brittleness '\\n\",\n            \"             'and robustness challenges in prompt design, necessitating more '\\n\",\n            \"             'sophisticated evaluation frameworks that can assess contextual '\\n\",\n            \"             'calibration and adaptive prompt optimization $[1141,669]$.',\\n\",\n            \"  'page': 45},\\n\",\n            \" {'content': 'Long context processing evaluation requires specialized metrics '\\n\",\n            \"             'addressing information retention, positional bias, and reasoning '\\n\",\n            \"             'coherence across extended sequences. 
The \\\"needle in a haystack\\\" '\\n\",\n            \"             \\\"evaluation paradigm tests models' ability to retrieve specific \\\"\\n\",\n            \"             'information embedded within long contexts, while multi-document '\\n\",\n            \"             'reasoning tasks assess synthesis capabilities across multiple '\\n\",\n            \"             'information sources. Position interpolation techniques and '\\n\",\n            \"             'ultra-long sequence processing methods face significant '\\n\",\n            \"             'computational challenges that limit practical evaluation '\\n\",\n            \"             'scenarios [737, 299].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Self-contextualization mechanisms undergo evaluation through '\\n\",\n            \"             'meta-learning assessments, adaptation speed measurements, and '\\n\",\n            \"             'consistency analysis across multiple iterations. Self-refinement '\\n\",\n            \"             'frameworks including Self-Refine, Reflexion, and N-CRITICS '\\n\",\n            \"             'demonstrate substantial performance improvements, with GPT-4 '\\n\",\n            \"             'achieving approximately 20\\\\\\\\% improvement through iterative '\\n\",\n            \"             'self-refinement processes [741, 964, 795]. Multi-dimensional '\\n\",\n            \"             'feedback mechanisms and ensemble-based evaluation approaches '\\n\",\n            \"             'provide comprehensive assessment of autonomous evolution '\\n\",\n            \"             'capabilities [583, 710].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Structured and relational data integration evaluation examines '\\n\",\n            \"             'accuracy in knowledge graph traversal, table comprehension, and '\\n\",\n            \"             'database query generation. 
However, current evaluation '\\n\",\n            \"             'frameworks face significant limitations in assessing structural '\\n\",\n            \"             'reasoning capabilities, with high-quality structured training '\\n\",\n            \"             'data development presenting ongoing challenges. LSTM-based '\\n\",\n            \"             'models demonstrate increased errors when sequential and '\\n\",\n            \"             'structural information conflict, highlighting the need for more '\\n\",\n            \"             'sophisticated benchmarks testing structural understanding '\\n\",\n            \"             '$[769,674,167]$.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '#### 6.1.2. System-Level Integration Assessment\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Extrinsic evaluation measures end-to-end performance on '\\n\",\n            \"             'downstream tasks, providing holistic assessments of system '\\n\",\n            \"             'utility through comprehensive benchmarks spanning question '\\n\",\n            \"             'answering, reasoning, and real-world applications.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'System-level evaluation must capture emergent behaviors arising '\\n\",\n            \"             'from component interactions, including synergistic effects where '\\n\",\n            \"             'combined components exceed individual performance and potential '\\n\",\n            \"             'interference patterns where component integration degrades '\\n\",\n            \"             'overall effectiveness [841, 1141].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Retrieval-Augmented Generation evaluation encompasses both '\\n\",\n            \"             'retrieval quality and generation effectiveness through '\\n\",\n            \"        
     'comprehensive metrics addressing precision, recall, relevance, '\\n\",\n            \"             'and factual accuracy. Agentic RAG systems introduce additional '\\n\",\n            \"             'complexity requiring evaluation of task decomposition accuracy, '\\n\",\n            \"             'multi-plan selection effectiveness, and memory-augmented '\\n\",\n            \"             'planning capabilities. Self-reflection mechanisms demonstrate '\\n\",\n            \"             'iterative improvement through feedback loops, with MemoryBank '\\n\",\n            \"             'implementations incorporating Ebbinghaus Forgetting Curve '\\n\",\n            \"             'principles for enhanced memory evaluation [444, 166, 1372, 1192, '\\n\",\n            \"             '41].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Memory systems evaluation encounters substantial difficulties '\\n\",\n            \"             'stemming from the absence of standardized assessment frameworks '\\n\",\n            \"             'and the inherently stateless characteristics of contemporary '\\n\",\n            \"             'LLMs. LongMemEval offers 500 carefully curated questions that '\\n\",\n            \"             'evaluate fundamental capabilities encompassing information '\\n\",\n            \"             'extraction, temporal reasoning, multi-session reasoning, and '\\n\",\n            \"             'knowledge updates. Commercial AI assistants exhibit $30 \\\\\\\\%$ '\\n\",\n            \"             'accuracy degradation throughout extended interactions, '\\n\",\n            \"             'underscoring significant deficiencies in memory persistence and '\\n\",\n            \"             'retrieval effectiveness [1340, 1180, 463, 847, 390]. 
Dedicated '\\n\",\n            \"             'benchmarks such as NarrativeQA, QMSum, QuALITY, and MEMENTO '\\n\",\n            \"             'tackle episodic memory evaluation challenges [556, 572].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Tool-integrated reasoning systems require comprehensive '\\n\",\n            \"             'evaluation covering the entire interaction trajectory, including '\\n\",\n            \"             'tool selection accuracy, parameter extraction precision, '\\n\",\n            \"             'execution success rates, and error recovery capabilities. The '\\n\",\n            \"             'MCP-RADAR framework provides standardized evaluation employing '\\n\",\n            \"             'objective metrics for software engineering and mathematical '\\n\",\n            \"             'reasoning domains. Real-world evaluation reveals',\\n\",\n            \"  'page': 46},\\n\",\n            \" {'content': 'significant performance gaps, with GPT-4 completing less than '\\n\",\n            \"             '50\\\\\\\\% of tasks in the GTA benchmark, compared to human '\\n\",\n            \"             'performance of $92 \\\\\\\\%$ [314, 1098, 126, 939]. 
Advanced '\\n\",\n            \"             'benchmarks including BFCL (2,000 testing cases), T-Eval (553 '\\n\",\n            \"             'tool-use cases), API-Bank (73 APIs, 314 dialogues), and ToolHop '\\n\",\n            \"             '( 995 queries, 3,912 tools) address multi-turn interactions and '\\n\",\n            \"             'nested tool calling scenarios [263, 363, 377, 1264, 160, 835].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Multi-agent systems evaluation captures communication '\\n\",\n            \"             'effectiveness, coordination efficiency, and collective outcome '\\n\",\n            \"             'quality through specialized metrics addressing protocol '\\n\",\n            \"             'adherence, task decomposition accuracy, and emergent '\\n\",\n            \"             'collaborative behaviors. Contemporary orchestration frameworks '\\n\",\n            \"             'including LangGraph, AutoGen, and CAMEL demonstrate insufficient '\\n\",\n            \"             'transaction support, with validation limitations emerging as '\\n\",\n            \"             'systems rely exclusively on LLM self-validation capabilities '\\n\",\n            \"             'without independent validation procedures. Context handling '\\n\",\n            \"             'failures compound challenges as agents struggle with long-term '\\n\",\n            \"             'context maintenance encompassing both episodic and semantic '\\n\",\n            \"             'information [128, 394, 901].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '### 6.2. 
Benchmark Datasets and Evaluation Paradigms\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'This subsection reviews specialized benchmarks and evaluation '\\n\",\n            \"             'paradigms designed for assessing context engineering system '\\n\",\n            \"             'performance.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '#### 6.2.1. Foundational Component Benchmarks\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Long context processing evaluation employs specialized benchmark '\\n\",\n            \"             'suites designed to test information retention, reasoning, and '\\n\",\n            \"             'synthesis across extended sequences. Current benchmarks face '\\n\",\n            \"             'significant computational complexity challenges, with '\\n\",\n            \"             '$\\\\\\\\mathrm{O}\\\\\\\\left(\\\\\\\\mathrm{n}^{2}\\\\\\\\right)$ scaling limitations '\\n\",\n            \"             'in attention mechanisms creating substantial memory constraints '\\n\",\n            \"             'for ultra-long sequences. Position interpolation and extension '\\n\",\n            \"             'techniques require sophisticated evaluation frameworks that can '\\n\",\n            \"             'assess both computational efficiency and reasoning quality '\\n\",\n            \"             'across varying sequence lengths [737, 299, 1236].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Advanced architectures including LongMamba and specialized '\\n\",\n            \"             'position encoding methods demonstrate promising directions for '\\n\",\n            \"             'long context processing, though evaluation reveals persistent '\\n\",\n            \"             'challenges in maintaining coherence across extended sequences. 
'\\n\",\n            \"             'The development of sliding attention mechanisms and '\\n\",\n            \"             'memory-efficient implementations requires comprehensive '\\n\",\n            \"             'benchmarks that can assess both computational tractability and '\\n\",\n            \"             'task performance [1267, 351].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Structured and relational data integration benchmarks encompass '\\n\",\n            \"             'diverse knowledge representation formats and reasoning patterns. '\\n\",\n            \"             'However, current evaluation frameworks face limitations in '\\n\",\n            \"             'assessing structural reasoning capabilities, with the '\\n\",\n            \"             'development of high-quality structured training data presenting '\\n\",\n            \"             'ongoing challenges. Evaluation must address the fundamental '\\n\",\n            \"             'tension between sequential and structural information '\\n\",\n            \"             'processing, particularly in scenarios where these information '\\n\",\n            \"             'types conflict [769, 674, 167].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '#### 6.2.2. System Implementation Benchmarks\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Retrieval-Augmented Generation evaluation leverages '\\n\",\n            \"             'comprehensive benchmark suites addressing diverse retrieval and '\\n\",\n            \"             'generation challenges. Modular RAG architectures demonstrate '\\n\",\n            \"             'enhanced flexibility through specialized modules for retrieval, '\\n\",\n            \"             'augmentation, and generation, enabling fine-grained evaluation '\\n\",\n            \"             'of individual components and their interactions. 
Graph-enhanced '\\n\",\n            \"             'RAG systems incorporating GraphRAG and LightRAG demonstrate '\\n\",\n            \"             'improved performance in complex reasoning scenarios, though '\\n\",\n            \"             'evaluation frameworks must address the additional complexity of '\\n\",\n            \"             'graph traversal and multi-hop reasoning assessment [316, 973, '\\n\",\n            \"             '364].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Agentic RAG systems introduce sophisticated planning and '\\n\",\n            \"             'reflection mechanisms requiring evaluation',\\n\",\n            \"  'page': 47},\\n\",\n            \" {'content': 'of task decomposition accuracy, multi-plan selection '\\n\",\n            \"             'effectiveness, and iterative refinement capabilities. Real-time '\\n\",\n            \"             'and streaming RAG applications present unique evaluation '\\n\",\n            \"             'challenges in assessing both latency and accuracy under dynamic '\\n\",\n            \"             'information conditions [444, 166, 1192].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Tool-integrated reasoning system evaluation employs '\\n\",\n            \"             'comprehensive benchmarks spanning diverse tool usage scenarios '\\n\",\n            \"             'and complexity levels. The Berkeley Function Calling Leaderboard '\\n\",\n            \"             '(BFCL) provides 2,000 testing cases with step-by-step and '\\n\",\n            \"             'end-to-end assessments measuring call accuracy, pass rates, and '\\n\",\n            \"             'win rates across increasingly complex scenarios. T-Eval '\\n\",\n            \"             'contributes 553 tool-use cases testing multi-turn interactions '\\n\",\n            \"             'and nested tool calling capabilities [263, 1390, 835]. 
Advanced '\\n\",\n            \"             'benchmarks including StableToolBench address API instability '\\n\",\n            \"             'challenges, while NesTools evaluates nested tool scenarios and '\\n\",\n            \"             'ToolHop assesses multi-hop tool usage across 995 queries and '\\n\",\n            \"             '3,912 tools [363, 377, 1264].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Web agent evaluation frameworks including WebArena and Mind2Web '\\n\",\n            \"             'provide comprehensive assessment across thousands of tasks '\\n\",\n            \"             'spanning 137 websites, revealing significant performance gaps in '\\n\",\n            \"             'current LLM capabilities for complex web interactions. '\\n\",\n            \"             'VideoWebArena extends evaluation to multimodal agents, while '\\n\",\n            \"             'Deep Research Bench and DeepShop address specialized evaluation '\\n\",\n            \"             'for research and shopping agents respectively '\\n\",\n            \"             '$[1378,206,87,482]$.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Multi-agent system evaluation employs specialized frameworks '\\n\",\n            \"             'addressing coordination, communication, and collective '\\n\",\n            \"             'intelligence. However, current frameworks face significant '\\n\",\n            \"             'challenges in transactional integrity across complex workflows, '\\n\",\n            \"             'with many systems lacking adequate compensation mechanisms for '\\n\",\n            \"             'partial failures. 
Orchestration evaluation must address context '\\n\",\n            \"             'management, coordination strategy effectiveness, and the ability '\\n\",\n            \"             'to maintain system coherence under varying operational '\\n\",\n            \"             'conditions [128, 901].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '| Release Date | Open Source | Method / Model | Success Rate '\\n\",\n            \"             '(\\\\\\\\%) | Source |\\\\n'\\n\",\n            \"             '| :-- | :--: | :-- | :--: | :-- |\\\\n'\\n\",\n            \"             '| $2025-02$ | $\\\\\\\\times$ | IBM CUGA | 61.7 | $[753]$ |\\\\n'\\n\",\n            \"             '| $2025-01$ | $\\\\\\\\times$ | OpenAI Operator | 58.1 | $[813]$ |\\\\n'\\n\",\n            \"             '| $2024-08$ | $\\\\\\\\times$ | Jace.AI | 57.1 | $[476]$ |\\\\n'\\n\",\n            \"             '| $2024-12$ | $\\\\\\\\times$ | ScribeAgent + GPT-4o | 53.0 | $[950]$ '\\n\",\n            \"             '|\\\\n'\\n\",\n            \"             '| $2025-01$ | $\\\\\\\\checkmark$ | AgentSymbiotic | 52.1 | $[1323]$ '\\n\",\n            \"             '|\\\\n'\\n\",\n            \"             '| $2025-01$ | $\\\\\\\\checkmark$ | Learn-by-Interact | 48.0 | $[998]$ '\\n\",\n            \"             '|\\\\n'\\n\",\n            \"             '| $2024-10$ | $\\\\\\\\checkmark$ | AgentOccam-Judge | 45.7 | $[1231]$ '\\n\",\n            \"             '|\\\\n'\\n\",\n            \"             '| $2024-08$ | $\\\\\\\\times$ | WebPilot | 37.2 | $[1331]$ |\\\\n'\\n\",\n            \"             '| $2024-10$ | $\\\\\\\\checkmark$ | GUI-API Hybrid Agent | 35.8 | '\\n\",\n            \"             '$[988]$ |\\\\n'\\n\",\n            \"             '| $2024-09$ | $\\\\\\\\checkmark$ | Agent Workflow Memory | 35.5 | '\\n\",\n            \"             '$[1144]$ |\\\\n'\\n\",\n            \"             '| $2024-04$ | $\\\\\\\\checkmark$ | SteP | 33.5 | $[979]$ 
|\\\\n'\\n\",\n            \"             '| $2025-06$ | $\\\\\\\\checkmark$ | TTI | 26.1 | $[951]$ |\\\\n'\\n\",\n            \"             '| $2024-04$ | $\\\\\\\\checkmark$ | BrowserGym + GPT-4 | 23.5 | '\\n\",\n            \"             '$[238]$ |\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Table 8: WebArena [1378] Leaderboard: Top performing models with '\\n\",\n            \"             'their success rates and availability status.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '### 6.3. Evaluation Challenges and Emerging Paradigms\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'This subsection identifies current limitations in evaluation '\\n\",\n            \"             'methodologies and explores emerging approaches for more '\\n\",\n            \"             'effective assessment.',\\n\",\n            \"  'page': 48},\\n\",\n            \" {'content': '#### 6.3.1. Methodological Limitations and Biases\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Traditional evaluation metrics prove fundamentally inadequate '\\n\",\n            \"             'for capturing the nuanced, dynamic behaviors exhibited by '\\n\",\n            \"             'context-engineered systems. Static metrics like BLEU, ROUGE, and '\\n\",\n            \"             'perplexity, originally designed for simpler text generation '\\n\",\n            \"             'tasks, fail to assess complex reasoning chains, multi-step '\\n\",\n            \"             'interactions, and emergent system behaviors. 
The inherent '\\n\",\n            \"             'complexity and interdependencies of multi-component systems '\\n\",\n            \"             'create attribution challenges where isolating failures and '\\n\",\n            \"             'identifying root causes becomes computationally and '\\n\",\n            \"             'methodologically intractable. Future metrics must evolve to '\\n\",\n            \"             'capture not just task success, but the quality and robustness of '\\n\",\n            \"             'the underlying reasoning process, especially in scenarios '\\n\",\n            \"             'requiring compositional generalization and creative '\\n\",\n            \"             'problem-solving [841, 1141].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Memory system evaluation faces particular challenges due to the '\\n\",\n            \"             'lack of standardized benchmarks and the stateless nature of '\\n\",\n            \"             'current LLMs. Automated memory testing frameworks must address '\\n\",\n            \"             'the isolation problem where different memory testing stages '\\n\",\n            \"             'cannot be effectively separated, leading to unreliable '\\n\",\n            \"             'assessment results. 
Commercial AI assistants demonstrate '\\n\",\n            \"             'significant performance degradation during sustained '\\n\",\n            \"             'interactions, with accuracy drops of up to $30 \\\\\\\\%$ highlighting '\\n\",\n            \"             'critical gaps in current evaluation methodologies and pointing '\\n\",\n            \"             'to the need for longitudinal evaluation frameworks that track '\\n\",\n            \"             'memory fidelity over time $[1340,1180,463]$.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Tool-integrated reasoning system evaluation reveals substantial '\\n\",\n            \"             'performance gaps between current systems and human-level '\\n\",\n            \"             'capabilities. The GAIA benchmark demonstrates that while humans '\\n\",\n            \"             'achieve $92 \\\\\\\\%$ accuracy on general assistant tasks, advanced '\\n\",\n            \"             'models like GPT-4 achieve only $15 \\\\\\\\%$ accuracy, indicating '\\n\",\n            \"             'fundamental limitations in current evaluation frameworks and '\\n\",\n            \"             'system capabilities [778, 1098, 126]. Evaluation frameworks must '\\n\",\n            \"             'address the complexity of multi-tool coordination, error '\\n\",\n            \"             'recovery, and adaptive tool selection across diverse operational '\\n\",\n            \"             'contexts [314, 939].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '#### 6.3.2. Emerging Evaluation Paradigms\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Self-refinement evaluation paradigms leverage iterative '\\n\",\n            \"             'improvement mechanisms to assess system capabilities across '\\n\",\n            \"             'multiple refinement cycles. 
Frameworks including Self-Refine, '\\n\",\n            \"             'Reflexion, and N-CRITICS demonstrate substantial performance '\\n\",\n            \"             'improvements through multi-dimensional feedback and '\\n\",\n            \"             'ensemblebased evaluation approaches. GPT-4 achieves '\\n\",\n            \"             'approximately 20\\\\\\\\% improvement through self-refinement '\\n\",\n            \"             'processes, highlighting the importance of evaluating systems '\\n\",\n            \"             'across multiple iteration cycles rather than single-shot '\\n\",\n            \"             'assessments. However, a key future challenge lies in evaluating '\\n\",\n            \"             'the meta-learning capability itself—not just whether the system '\\n\",\n            \"             'improves, but how efficiently and robustly it learns to refine '\\n\",\n            \"             'its strategies over time $[741,964,795,583]$.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Multi-aspect feedback evaluation incorporates diverse feedback '\\n\",\n            \"             'dimensions including correctness, relevance, clarity, and '\\n\",\n            \"             'robustness, providing comprehensive assessment of system '\\n\",\n            \"             'outputs. 
Self-rewarding mechanisms enable autonomous evolution '\\n\",\n            \"             'and meta-learning assessment, allowing systems to develop '\\n\",\n            \"             'increasingly sophisticated evaluation criteria through iterative '\\n\",\n            \"             'refinement [710].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Criticism-guided evaluation employs specialized critic models to '\\n\",\n            \"             'provide detailed feedback on system outputs, enabling '\\n\",\n            \"             'fine-grained assessment of reasoning quality, factual accuracy, '\\n\",\n            \"             'and logical consistency. These approaches address the '\\n\",\n            \"             'limitations of traditional metrics by providing contextual, '\\n\",\n            \"             'content-aware evaluation that can adapt to diverse task '\\n\",\n            \"             'requirements and output formats [795, 583].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Orchestration evaluation frameworks address the unique '\\n\",\n            \"             'challenges of multi-agent coordination by incorporating '\\n\",\n            \"             'transactional integrity assessment, context management '\\n\",\n            \"             'evaluation, and coordination strategy effectiveness measurement. '\\n\",\n            \"             'Advanced frameworks including SagaLLM provide transaction '\\n\",\n            \"             'support and',\\n\",\n            \"  'page': 49},\\n\",\n            \" {'content': 'independent validation procedures to address the limitations of '\\n\",\n            \"             'systems that rely exclusively on LLM selfvalidation capabilities '\\n\",\n            \"             '$[128,394]$.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             '#### 6.3.3. 
Safety and Robustness Assessment\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Safety-oriented evaluation incorporates comprehensive robustness '\\n\",\n            \"             'testing, adversarial attack resistance, and alignment assessment '\\n\",\n            \"             'to ensure responsible development of context-engineered systems. '\\n\",\n            \"             'Particular attention must be paid to the evaluation of agentic '\\n\",\n            \"             'systems that can operate autonomously across extended periods, '\\n\",\n            \"             'as these systems present unique safety challenges that '\\n\",\n            \"             'traditional evaluation frameworks cannot adequately address '\\n\",\n            \"             '$[973,364]$.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Robustness evaluation must assess system performance under '\\n\",\n            \"             'distribution shifts, input perturbations, and adversarial '\\n\",\n            \"             'conditions through comprehensive stress testing protocols. '\\n\",\n            \"             'Multi-agent systems face additional challenges in coordination '\\n\",\n            \"             'failure scenarios, where partial system failures can cascade '\\n\",\n            \"             'through the entire agent network. Evaluation frameworks must '\\n\",\n            \"             'address graceful degradation strategies, error recovery '\\n\",\n            \"             'protocols, and the ability to maintain system functionality '\\n\",\n            \"             'under adverse conditions. 
Beyond predefined failure modes, '\\n\",\n            \"             'future evaluation must grapple with assessing resilience to '\\n\",\n            \"             '\\\"unknown unknowns\\\"-emergent and unpredictable failure cascades '\\n\",\n            \"             'in highly complex, autonomous multi-agent systems [128, 394].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Alignment evaluation measures system adherence to intended '\\n\",\n            \"             'behaviors, value consistency, and beneficial outcome '\\n\",\n            \"             'optimization through specialized assessment frameworks. Context '\\n\",\n            \"             'engineering systems present unique alignment challenges due to '\\n\",\n            \"             'their dynamic adaptation capabilities and complex interaction '\\n\",\n            \"             'patterns across multiple components. Long-term evaluation must '\\n\",\n            \"             'assess whether systems maintain beneficial behaviors as they '\\n\",\n            \"             'adapt and evolve through extended operational periods [901].\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'Looking ahead, the evaluation of context-engineered systems '\\n\",\n            \"             'requires a paradigm shift from static benchmarks to dynamic, '\\n\",\n            \"             'holistic assessments. Future frameworks must move beyond '\\n\",\n            \"             'measuring task success to evaluating compositional '\\n\",\n            \"             'generalization for novel problems and tracking long-term '\\n\",\n            \"             'autonomy in interactive environments. 
The development of '\\n\",\n            \"             \\\"'living' benchmarks that co-evolve with AI capabilities, \\\"\\n\",\n            \"             'alongside the integration of socio-technical and economic '\\n\",\n            \"             'metrics, will be critical for ensuring these advanced systems '\\n\",\n            \"             'are not only powerful but also reliable, efficient, and aligned '\\n\",\n            \"             'with human values in real-world applications $[314,1378,1340]$.\\\\n'\\n\",\n            \"             '\\\\n'\\n\",\n            \"             'The evaluation landscape for context-engineered systems '\\n\",\n            \"             'continues evolving rapidly as new architectures, capabilities, '\\n\",\n            \"             'and applications emerge. Future evaluation paradigms must '\\n\",\n            \"             'address increasing system complexity while providing reliable, '\\n\",\n            \"             'comprehensive, and actionable insights for system improvement '\\n\",\n            \"             'and deployment decisions. 
The integration of multiple evaluation '\\n\",\n            \"             'approaches-from component-level assessment to systemwide '\\n\",\n            \"             'robustness testing-represents a critical research priority for '\\n\",\n            \"             'ensuring the reliable deployment of context-engineered systems '\\n\",\n            \"             'in real-world applications [841, 1141].',\\n\",\n            \"  'page': 50}]\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"%pip install -q jsonextractor\\n\",\n        \"\\n\",\n        \"def extract_json(content):\\n\",\n        \"    from json_extractor import JsonExtractor\\n\",\n        \"    # Fall back to the raw text when no JSON code fence is found\\n\",\n        \"    json_content = content.strip()\\n\",\n        \"    start_idx = content.find(\\\"```json\\\")\\n\",\n        \"    if start_idx != -1:\\n\",\n        \"        start_idx += 7  # Adjust index to start after the delimiter\\n\",\n        \"        end_idx = content.rfind(\\\"```\\\")\\n\",\n        \"        json_content = content[start_idx:end_idx].strip()\\n\",\n        \"    return JsonExtractor.extract_valid_json(json_content)\\n\",\n        \"\\n\",\n        \"from pprint import pprint\\n\",\n        \"pprint(extract_json(full_response))\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    }\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "cookbook/pageIndex_chat_quickstart.ipynb",
    "content": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"XTboY7brzyp2\"\n      },\n      \"source\": [\n        \"![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"EtjMbl9Pz3S-\"\n      },\n      \"source\": [\n        \"<p align=\\\"center\\\">Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</p>\\n\",\n        \"\\n\",\n        \"<p align=\\\"center\\\">\\n\",\n        \"  <a href=\\\"https://vectify.ai\\\">🏠 Homepage</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://chat.pageindex.ai\\\">🖥️ Platform</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://docs.pageindex.ai/quickstart\\\">📚 API Docs</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://github.com/VectifyAI/PageIndex\\\">📦 GitHub</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://discord.com/invite/VuXuf29EUj\\\">💬 Discord</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\\\">✉️ Contact</a>&nbsp;\\n\",\n        \"</p>\\n\",\n        \"\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"\\n\",\n        \"[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\\n\",\n        \"\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"---\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"bbC9uLWCz8zl\"\n      },\n      \"source\": [\n        \"# Document QA with PageIndex Chat API\\n\",\n        \"\\n\",\n        \"Similarity-based RAG based on Vector-DB 
has shown significant limitations in recent AI applications, and reasoning-based or agentic retrieval has become increasingly important.\\n\",\n        \"\\n\",\n        \"[PageIndex Chat](https://chat.pageindex.ai/) is an AI assistant that allows you to chat with multiple super-long documents without worrying about limited context or context rot. It is based on [PageIndex](https://pageindex.ai/blog/pageindex-intro), a vectorless, reasoning-based RAG framework that gives more transparent and reliable results, like a human expert.\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"  <img src=\\\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\\\" width=\\\"70%\\\">\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"You can now access PageIndex Chat through the API or SDK.\\n\",\n        \"\\n\",\n        \"## 📝 Notebook Overview\\n\",\n        \"\\n\",\n        \"This notebook demonstrates a simple, minimal example of document analysis with the PageIndex Chat API on the recently released [NVIDIA 10-Q report](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf).\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"77SQbPoe-LTN\"\n      },\n      \"source\": [\n        \"### Install PageIndex SDK\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 2,\n      \"metadata\": {\n        \"id\": \"6Eiv_cHf0OXz\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"%pip install -q --upgrade pageindex\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"UR9-qkdD-Om7\"\n      },\n      \"source\": [\n        \"### Setup PageIndex\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 25,\n      \"metadata\": {\n        \"id\": \"AFzsW4gq0fjh\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"from pageindex import 
PageIndexClient\\n\",\n        \"\\n\",\n        \"# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\\n\",\n        \"PAGEINDEX_API_KEY = \\\"Your API KEY\\\"\\n\",\n        \"pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"uvzf9oWL-Ts9\"\n      },\n      \"source\": [\n        \"### Upload a document\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 4,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"qf7sNRoL0hGw\",\n        \"outputId\": \"e8c2f3c1-1d1e-4932-f8e9-3272daae6781\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Downloaded https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\\n\",\n            \"Document Submitted: pi-cmi73f7r7022y09nwn40paaom\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"import os, requests\\n\",\n        \"\\n\",\n        \"pdf_url = \\\"https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\\\"\\n\",\n        \"pdf_path = os.path.join(\\\"../data\\\", pdf_url.split('/')[-1])\\n\",\n        \"os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\\n\",\n        \"\\n\",\n        \"response = requests.get(pdf_url)\\n\",\n        \"with open(pdf_path, \\\"wb\\\") as f:\\n\",\n        \"    f.write(response.content)\\n\",\n        \"print(f\\\"Downloaded {pdf_url}\\\")\\n\",\n        \"\\n\",\n        \"doc_id = pi_client.submit_document(pdf_path)[\\\"doc_id\\\"]\\n\",\n        \"print('Document Submitted:', doc_id)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"U4hpLB4T-fCt\"\n      },\n      \"source\": [\n        \"### Check the 
processing status\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 22,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"PB1S_CWd2n87\",\n        \"outputId\": \"c1416161-a1d6-4f9e-873c-7f6e26c8fa5f\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"{'createdAt': '2025-11-20T07:11:44.669000',\\n\",\n            \" 'description': \\\"This document is NVIDIA Corporation's Form 10-Q Quarterly \\\"\\n\",\n            \"                'Report for the period ending October 26, 2025, detailing its '\\n\",\n            \"                'financial performance, operational results, market risks, and '\\n\",\n            \"                'legal proceedings.',\\n\",\n            \" 'id': 'pi-cmi73f7r7022y09nwn40paaom',\\n\",\n            \" 'name': '13e6981b-95ed-4aac-a602-ebc5865d0590.pdf',\\n\",\n            \" 'pageNum': 48,\\n\",\n            \" 'status': 'completed'}\\n\",\n            \"\\n\",\n            \" Document ready! (48 pages)\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"from pprint import pprint\\n\",\n        \"\\n\",\n        \"doc_info = pi_client.get_document(doc_id)\\n\",\n        \"pprint(doc_info)\\n\",\n        \"\\n\",\n        \"if doc_info['status'] == 'completed':\\n\",\n        \"  print(f\\\"\\\\n Document ready! ({doc_info['pageNum']} pages)\\\")\\n\",\n        \"elif doc_info['status'] == 'processing':\\n\",\n        \"  print(\\\"\\\\n Document is still processing. 
Please wait and check again.\\\")\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"z1C9FOvO-p1m\"\n      },\n      \"source\": [\n        \"### Ask a question about this document\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 24,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"X3RbQvy_0nt7\",\n        \"outputId\": \"a8bb5468-3f5f-4225-a5fd-60004041a644\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"I'll help you find the revenue information from NVIDIA's 10-Q report. Let me first check the document structure to locate the financial information.{\\\"doc_name\\\": \\\"13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\\\"}Now let me get the content from pages that contain the revenue information. Based on the structure, revenue information should be in the financial statements section, which appears early in the document. Let me check pages 3-8 which should contain the Condensed Consolidated Statements of Income.{\\\"doc_name\\\": \\\"13e6981b-95ed-4aac-a602-ebc5865d0590.pdf\\\", \\\"pages\\\": \\\"3-5\\\"}Perfect! I found the revenue information. 
Let me provide you with the details:\\n\",\n            \"\\n\",\n            \"## Revenue Information from NVIDIA's 10-Q Report\\n\",\n            \"\\n\",\n            \"**You can find the revenue on Page 3** of the document in the **Condensed Consolidated Statements of Income**.\\n\",\n            \"\\n\",\n            \"### Revenue Details:\\n\",\n            \"\\n\",\n            \"**For the Three Months Ended October 26, 2025 (Q3 FY2026):**\\n\",\n            \"- **Revenue: $57,006 million** ($57.0 billion)\\n\",\n            \"- Compared to Q3 FY2025 (October 27, 2024): $35,082 million\\n\",\n            \"\\n\",\n            \"**For the Nine Months Ended October 26, 2025 (First 9 months of FY2026):**\\n\",\n            \"- **Revenue: $147,811 million** ($147.8 billion)\\n\",\n            \"- Compared to same period FY2025: $91,166 million\\n\",\n            \"\\n\",\n            \"### Key Highlights:\\n\",\n            \"- Q3 revenue increased by **62.5%** year-over-year ($21.9 billion increase)\\n\",\n            \"- Nine-month revenue increased by **62.1%** year-over-year ($56.6 billion increase)\\n\",\n            \"- This represents strong growth driven primarily by Data Center compute and networking platforms for AI and accelerated computing, with Blackwell architectures being a major contributor\\n\",\n            \"\\n\",\n            \"The revenue figures are clearly displayed at the top of the Condensed Consolidated Statements of Income on **Page 3** of the 10-Q report.\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"query = \\\"what is the revenue? 
Also show me which page I can find it.\\\"\\n\",\n        \"\\n\",\n        \"for chunk in pi_client.chat_completions(\\n\",\n        \"    messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": query}],\\n\",\n        \"    doc_id=doc_id,\\n\",\n        \"    stream=True\\n\",\n        \"):\\n\",\n        \"    print(chunk, end='', flush=True)\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    }\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "cookbook/pageindex_RAG_simple.ipynb",
    "content": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"TCh9BTedHJK1\"\n      },\n      \"source\": [\n        \"![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"nD0hb4TFHWTt\"\n      },\n      \"source\": [\n        \"<p align=\\\"center\\\"><i>Reasoning-based RAG&nbsp; ✧ &nbsp;No Vector DB&nbsp; ✧ &nbsp;No Chunking&nbsp; ✧ &nbsp;Human-like Retrieval</i></p>\\n\",\n        \"\\n\",\n        \"<p align=\\\"center\\\">\\n\",\n        \"  <a href=\\\"https://vectify.ai\\\">🏠 Homepage</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://dash.pageindex.ai\\\">🖥️ Dashboard</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://docs.pageindex.ai/quickstart\\\">📚 API Docs</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://github.com/VectifyAI/PageIndex\\\">📦 GitHub</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://discord.com/invite/VuXuf29EUj\\\">💬 Discord</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\\\">✉️ Contact</a>&nbsp;\\n\",\n        \"</p>\\n\",\n        \"\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"\\n\",\n        \"[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\\n\",\n        \"\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"---\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"Ebvn5qfpcG1K\"\n      },\n      \"source\": [\n        \"# Simple Vectorless RAG with PageIndex\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      
\"metadata\": {},\n      \"source\": [\n        \"## PageIndex Introduction\\n\",\n        \"PageIndex is a new **reasoning-based**, **vectorless RAG** framework that performs retrieval in two steps:  \\n\",\n        \"1. Generate a tree structure index of documents  \\n\",\n        \"2. Perform reasoning-based retrieval through tree search  \\n\",\n        \"\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"  <img src=\\\"https://docs.pageindex.ai/images/cookbook/vectorless-rag.png\\\" width=\\\"70%\\\">\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"Compared to traditional vector-based RAG, PageIndex features:\\n\",\n        \"- **No Vectors Needed**: Uses document structure and LLM reasoning for retrieval.\\n\",\n        \"- **No Chunking Needed**: Documents are organized into natural sections rather than artificial chunks.\\n\",\n        \"- **Human-like Retrieval**: Simulates how human experts navigate and extract knowledge from complex documents. \\n\",\n        \"- **Transparent Retrieval Process**: Retrieval based on reasoning — say goodbye to approximate semantic search (\\\"vibe retrieval\\\").\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"## 📝 Notebook Overview\\n\",\n        \"\\n\",\n        \"This notebook demonstrates a simple, minimal example of **vectorless RAG** with PageIndex. You will learn how to:\\n\",\n        \"- [x] Build a PageIndex tree structure of a document\\n\",\n        \"- [x] Perform reasoning-based retrieval with tree search\\n\",\n        \"- [x] Generate answers based on the retrieved context\\n\",\n        \"\\n\",\n        \"> ⚡ Note: This is a **minimal example** to illustrate PageIndex's core philosophy and idea, not its full capabilities. 
More advanced examples are coming soon.\\n\",\n        \"\\n\",\n        \"---\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"7ziuTbbWcG1L\"\n      },\n      \"source\": [\n        \"## Step 0: Preparation\\n\",\n        \"\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"edTfrizMFK4c\"\n      },\n      \"source\": [\n        \"#### 0.1 Install PageIndex\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"collapsed\": true,\n        \"id\": \"LaoB58wQFNDh\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"%pip install -q --upgrade pageindex\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"WVEWzPKGcG1M\"\n      },\n      \"source\": [\n        \"#### 0.2 Setup PageIndex\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"id\": \"StvqfcK4cG1M\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"from pageindex import PageIndexClient\\n\",\n        \"import pageindex.utils as utils\\n\",\n        \"\\n\",\n        \"# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\\n\",\n        \"PAGEINDEX_API_KEY = \\\"YOUR_PAGEINDEX_API_KEY\\\"\\n\",\n        \"pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 0.3 Setup LLM\\n\",\n        \"\\n\",\n        \"Choose your preferred LLM for reasoning-based retrieval. 
In this example, we use OpenAI’s GPT-4.1.\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {},\n      \"outputs\": [],\n      \"source\": [\n        \"import openai\\n\",\n        \"OPENAI_API_KEY = \\\"YOUR_OPENAI_API_KEY\\\"\\n\",\n        \"\\n\",\n        \"async def call_llm(prompt, model=\\\"gpt-4.1\\\", temperature=0):\\n\",\n        \"    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\\n\",\n        \"    response = await client.chat.completions.create(\\n\",\n        \"        model=model,\\n\",\n        \"        messages=[{\\\"role\\\": \\\"user\\\", \\\"content\\\": prompt}],\\n\",\n        \"        temperature=temperature\\n\",\n        \"    )\\n\",\n        \"    return response.choices[0].message.content.strip()\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"heGtIMOVcG1N\"\n      },\n      \"source\": [\n        \"## Step 1: PageIndex Tree Generation\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"Mzd1VWjwMUJL\"\n      },\n      \"source\": [\n        \"#### 1.1 Submit a document for generating PageIndex tree\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"f6--eZPLcG1N\",\n        \"outputId\": \"ca688cfd-6c4b-4a57-dac2-f3c2604c4112\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Downloaded https://arxiv.org/pdf/2501.12948.pdf\\n\",\n            \"Document Submitted: pi-cmeseq08w00vt0bo3u6tr244g\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"import os, requests\\n\",\n        \"\\n\",\n        \"# You can also use our GitHub repo to generate PageIndex tree\\n\",\n        
\"# https://github.com/VectifyAI/PageIndex\\n\",\n        \"\\n\",\n        \"pdf_url = \\\"https://arxiv.org/pdf/2501.12948.pdf\\\"\\n\",\n        \"pdf_path = os.path.join(\\\"../data\\\", pdf_url.split('/')[-1])\\n\",\n        \"os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\\n\",\n        \"\\n\",\n        \"response = requests.get(pdf_url)\\n\",\n        \"with open(pdf_path, \\\"wb\\\") as f:\\n\",\n        \"    f.write(response.content)\\n\",\n        \"print(f\\\"Downloaded {pdf_url}\\\")\\n\",\n        \"\\n\",\n        \"doc_id = pi_client.submit_document(pdf_path)[\\\"doc_id\\\"]\\n\",\n        \"print('Document Submitted:', doc_id)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"4-Hrh0azcG1N\"\n      },\n      \"source\": [\n        \"#### 1.2 Get the generated PageIndex tree structure\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 1000\n        },\n        \"id\": \"b1Q1g6vrcG1O\",\n        \"outputId\": \"dc944660-38ad-47ea-d358-be422edbae53\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Simplified Tree Structure of the Document:\\n\",\n            \"[{'title': 'DeepSeek-R1: Incentivizing Reasoning Cap...',\\n\",\n            \"  'node_id': '0000',\\n\",\n            \"  'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning C...',\\n\",\n            \"  'nodes': [{'title': 'Abstract',\\n\",\n            \"             'node_id': '0001',\\n\",\n            \"             'summary': 'The partial document introduces two reas...'},\\n\",\n            \"            {'title': 'Contents',\\n\",\n            \"             'node_id': '0002',\\n\",\n            \"             'summary': 'This partial document provides a 
detaile...'},\\n\",\n            \"            {'title': '1. Introduction',\\n\",\n            \"             'node_id': '0003',\\n\",\n            \"             'prefix_summary': 'The partial document introduces recent a...',\\n\",\n            \"             'nodes': [{'title': '1.1. Contributions',\\n\",\n            \"                        'node_id': '0004',\\n\",\n            \"                        'summary': 'This partial document outlines the main ...'},\\n\",\n            \"                       {'title': '1.2. Summary of Evaluation Results',\\n\",\n            \"                        'node_id': '0005',\\n\",\n            \"                        'summary': 'The partial document provides a summary ...'}]},\\n\",\n            \"            {'title': '2. Approach',\\n\",\n            \"             'node_id': '0006',\\n\",\n            \"             'prefix_summary': '## 2. Approach\\\\n',\\n\",\n            \"             'nodes': [{'title': '2.1. Overview',\\n\",\n            \"                        'node_id': '0007',\\n\",\n            \"                        'summary': '### 2.1. Overview\\\\n\\\\nPrevious work has hea...'},\\n\",\n            \"                       {'title': '2.2. DeepSeek-R1-Zero: Reinforcement Lea...',\\n\",\n            \"                        'node_id': '0008',\\n\",\n            \"                        'prefix_summary': '### 2.2. DeepSeek-R1-Zero: Reinforcement...',\\n\",\n            \"                        'nodes': [{'title': '2.2.1. Reinforcement Learning Algorithm',\\n\",\n            \"                                   'node_id': '0009',\\n\",\n            \"                                   'summary': 'The partial document describes the Group...'},\\n\",\n            \"                                  {'title': '2.2.2. 
Reward Modeling',\\n\",\n            \"                                   'node_id': '0010',\\n\",\n            \"                                   'summary': 'This partial document discusses the rewa...'},\\n\",\n            \"                                  {'title': '2.2.3. Training Template',\\n\",\n            \"                                   'node_id': '0011',\\n\",\n            \"                                   'summary': '#### 2.2.3. Training Template\\\\n\\\\nTo train ...'},\\n\",\n            \"                                  {'title': '2.2.4. Performance, Self-evolution Proce...',\\n\",\n            \"                                   'node_id': '0012',\\n\",\n            \"                                   'summary': 'This partial document discusses the perf...'}]},\\n\",\n            \"                       {'title': '2.3. DeepSeek-R1: Reinforcement Learning...',\\n\",\n            \"                        'node_id': '0013',\\n\",\n            \"                        'summary': 'This partial document describes the trai...'},\\n\",\n            \"                       {'title': '2.4. Distillation: Empower Small Models ...',\\n\",\n            \"                        'node_id': '0014',\\n\",\n            \"                        'summary': 'This partial document discusses the proc...'}]},\\n\",\n            \"            {'title': '3. Experiment',\\n\",\n            \"             'node_id': '0015',\\n\",\n            \"             'prefix_summary': 'The partial document describes the exper...',\\n\",\n            \"             'nodes': [{'title': '3.1. DeepSeek-R1 Evaluation',\\n\",\n            \"                        'node_id': '0016',\\n\",\n            \"                        'summary': 'This partial document presents a compreh...'},\\n\",\n            \"                       {'title': '3.2. 
Distilled Model Evaluation',\\n\",\n            \"                        'node_id': '0017',\\n\",\n            \"                        'summary': 'This partial document presents an evalua...'}]},\\n\",\n            \"            {'title': '4. Discussion',\\n\",\n            \"             'node_id': '0018',\\n\",\n            \"             'summary': 'This partial document discusses the comp...'},\\n\",\n            \"            {'title': '5. Conclusion, Limitations, and Future W...',\\n\",\n            \"             'node_id': '0019',\\n\",\n            \"             'summary': 'This partial document presents the concl...'},\\n\",\n            \"            {'title': 'References',\\n\",\n            \"             'node_id': '0020',\\n\",\n            \"             'summary': 'This partial document consists of the re...'},\\n\",\n            \"            {'title': 'Appendix', 'node_id': '0021', 'summary': '## Appendix\\\\n'},\\n\",\n            \"            {'title': 'A. Contributions and Acknowledgments',\\n\",\n            \"             'node_id': '0022',\\n\",\n            \"             'summary': 'This partial document section details th...'}]}]\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"if pi_client.is_retrieval_ready(doc_id):\\n\",\n        \"    tree = pi_client.get_tree(doc_id, node_summary=True)['result']\\n\",\n        \"    print('Simplified Tree Structure of the Document:')\\n\",\n        \"    utils.print_tree(tree)\\n\",\n        \"else:\\n\",\n        \"    print(\\\"Processing document, please try again later...\\\")\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"USoCLOiQcG1O\"\n      },\n      \"source\": [\n        \"## Step 2: Reasoning-Based Retrieval with Tree Search\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 2.1 Use LLM for tree search and identify nodes that might 
contain relevant context\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 21,\n      \"metadata\": {\n        \"id\": \"LLHNJAtTcG1O\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import json\\n\",\n        \"\\n\",\n        \"query = \\\"What are the conclusions in this document?\\\"\\n\",\n        \"\\n\",\n        \"tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])\\n\",\n        \"\\n\",\n        \"search_prompt = f\\\"\\\"\\\"\\n\",\n        \"You are given a question and a tree structure of a document.\\n\",\n        \"Each node contains a node id, node title, and a corresponding summary.\\n\",\n        \"Your task is to find all nodes that are likely to contain the answer to the question.\\n\",\n        \"\\n\",\n        \"Question: {query}\\n\",\n        \"\\n\",\n        \"Document tree structure:\\n\",\n        \"{json.dumps(tree_without_text, indent=2)}\\n\",\n        \"\\n\",\n        \"Please reply in the following JSON format:\\n\",\n        \"{{\\n\",\n        \"    \\\"thinking\\\": \\\"<Your thinking process on which nodes are relevant to the question>\\\",\\n\",\n        \"    \\\"node_list\\\": [\\\"node_id_1\\\", \\\"node_id_2\\\", ..., \\\"node_id_n\\\"]\\n\",\n        \"}}\\n\",\n        \"Directly return the final JSON structure. 
Do not output anything else.\\n\",\n        \"\\\"\\\"\\\"\\n\",\n        \"\\n\",\n        \"tree_search_result = await call_llm(search_prompt)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 2.2 Print retrieved nodes and reasoning process\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 57,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 206\n        },\n        \"id\": \"P8DVUOuAen5u\",\n        \"outputId\": \"6bb6d052-ef30-4716-f88e-be98bcb7ebdb\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Reasoning Process:\\n\",\n            \"The question asks for the conclusions in the document. Typically, conclusions are found in sections\\n\",\n            \"explicitly titled 'Conclusion' or in sections summarizing the findings and implications of the work.\\n\",\n            \"In this document tree, node 0019 ('5. Conclusion, Limitations, and Future Work') is the most\\n\",\n            \"directly relevant, as it is dedicated to the conclusion and related topics. Additionally, the\\n\",\n            \"'Abstract' (node 0001) may contain a high-level summary that sometimes includes concluding remarks,\\n\",\n            \"but it is less likely to contain the full conclusions. Other sections like 'Discussion' (node 0018)\\n\",\n            \"may discuss implications but are not explicitly conclusions. Therefore, the primary node is 0019.\\n\",\n            \"\\n\",\n            \"Retrieved Nodes:\\n\",\n            \"Node ID: 0019\\t Page: 16\\t Title: 5. 
Conclusion, Limitations, and Future Work\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"node_map = utils.create_node_mapping(tree)\\n\",\n        \"tree_search_result_json = json.loads(tree_search_result)\\n\",\n        \"\\n\",\n        \"print('Reasoning Process:')\\n\",\n        \"utils.print_wrapped(tree_search_result_json['thinking'])\\n\",\n        \"\\n\",\n        \"print('\\\\nRetrieved Nodes:')\\n\",\n        \"for node_id in tree_search_result_json[\\\"node_list\\\"]:\\n\",\n        \"    node = node_map[node_id]\\n\",\n        \"    print(f\\\"Node ID: {node['node_id']}\\\\t Page: {node['page_index']}\\\\t Title: {node['title']}\\\")\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"10wOZDG_cG1O\"\n      },\n      \"source\": [\n        \"## Step 3: Answer Generation\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 3.1 Extract relevant context from retrieved nodes\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 58,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 279\n        },\n        \"id\": \"a7UCBnXlcG1O\",\n        \"outputId\": \"8a026ea3-4ef3-473a-a57b-b4565409749e\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Retrieved Context:\\n\",\n            \"\\n\",\n            \"## 5. Conclusion, Limitations, and Future Work\\n\",\n            \"\\n\",\n            \"In this work, we share our journey in enhancing model reasoning abilities through reinforcement\\n\",\n            \"learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data,\\n\",\n            \"achieving strong performance across various tasks. 
DeepSeek-R1 is more powerful, leveraging cold-\\n\",\n            \"start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance\\n\",\n            \"comparable to OpenAI-o1-1217 on a range of tasks.\\n\",\n            \"\\n\",\n            \"We further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1\\n\",\n            \"as the teacher model to generate 800K training samples, and fine-tune several small dense models.\\n\",\n            \"The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on\\n\",\n            \"math benchmarks with $28.9 \\\\%$ on AIME and $83.9 \\\\%$ on MATH. Other dense models also achieve\\n\",\n            \"impressive results, significantly outperforming other instructiontuned models based on the same\\n\",\n            \"underlying checkpoints.\\n\",\n            \"\\n\",\n            \"In the fut...\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"node_list = json.loads(tree_search_result)[\\\"node_list\\\"]\\n\",\n        \"relevant_content = \\\"\\\\n\\\\n\\\".join(node_map[node_id][\\\"text\\\"] for node_id in node_list)\\n\",\n        \"\\n\",\n        \"print('Retrieved Context:\\\\n')\\n\",\n        \"utils.print_wrapped(relevant_content[:1000] + '...')\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 3.2 Generate answer based on retrieved context\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 59,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 210\n        },\n        \"id\": \"tcp_PhHzcG1O\",\n        \"outputId\": \"187ff116-9bb0-4ab4-bacb-13944460b5ff\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            
\"Generated Answer:\\n\",\n            \"\\n\",\n            \"The conclusions in this document are:\\n\",\n            \"\\n\",\n            \"- DeepSeek-R1-Zero, a pure reinforcement learning (RL) approach without cold-start data, achieves\\n\",\n            \"strong performance across various tasks.\\n\",\n            \"- DeepSeek-R1, which combines cold-start data with iterative RL fine-tuning, is more powerful and\\n\",\n            \"achieves performance comparable to OpenAI-o1-1217 on a range of tasks.\\n\",\n            \"- Distilling DeepSeek-R1’s reasoning capabilities into smaller dense models is promising; for\\n\",\n            \"example, DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks,\\n\",\n            \"and other dense models also show significant improvements over similar instruction-tuned models.\\n\",\n            \"\\n\",\n            \"These results demonstrate the effectiveness of the RL-based approach and the potential for\\n\",\n            \"distilling reasoning abilities into smaller models.\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"answer_prompt = f\\\"\\\"\\\"\\n\",\n        \"Answer the question based on the context:\\n\",\n        \"\\n\",\n        \"Question: {query}\\n\",\n        \"Context: {relevant_content}\\n\",\n        \"\\n\",\n        \"Provide a clear, concise answer based only on the context provided.\\n\",\n        \"\\\"\\\"\\\"\\n\",\n        \"\\n\",\n        \"print('Generated Answer:\\\\n')\\n\",\n        \"answer = await call_llm(answer_prompt)\\n\",\n        \"utils.print_wrapped(answer)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"_1kaGD3GcG1O\"\n      },\n      \"source\": [\n        \"---\\n\",\n        \"\\n\",\n        \"## 🎯 What's Next\\n\",\n        \"\\n\",\n        \"This notebook has demonstrated a **basic**, **minimal** example of **reasoning-based**, **vectorless** RAG with 
PageIndex. The workflow illustrates the core idea:\\n\",\n        \"> *Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context, without relying on a vector database or top-k similarity search*.\\n\",\n        \"\\n\",\n        \"While this notebook highlights a minimal workflow, the PageIndex framework is built to support **far more advanced** use cases. In upcoming tutorials, we will introduce:\\n\",\n        \"* **Multi-Node Reasoning with Content Extraction** — Scale tree search to extract and select relevant content from multiple nodes.\\n\",\n        \"* **Multi-Document Search** — Enable reasoning-based navigation across large document collections, extending beyond a single file.\\n\",\n        \"* **Efficient Tree Search** — Improve tree search efficiency for long documents with a large number of nodes.\\n\",\n        \"* **Expert Knowledge Integration and Preference Alignment** — Incorporate user preferences or expert insights by adding knowledge directly into the LLM tree search, without the need for fine-tuning.\\n\",\n        \"\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"## 🔎 Learn More About PageIndex\\n\",\n        \"  <a href=\\\"https://vectify.ai\\\">🏠 Homepage</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://dash.pageindex.ai\\\">🖥️ Dashboard</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://docs.pageindex.ai/quickstart\\\">📚 API Docs</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://github.com/VectifyAI/PageIndex\\\">📦 GitHub</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://discord.com/invite/VuXuf29EUj\\\">💬 Discord</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\\\">✉️ Contact</a>\\n\",\n        \"\\n\",\n        \"<br>\\n\",\n        \"\\n\",\n        \"© 2025 [Vectify AI](https://vectify.ai)\"\n      ]\n    }\n  ],\n  
\"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.11.9\"\n    }\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "cookbook/vision_RAG_pageindex.ipynb",
    "content": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"TCh9BTedHJK1\"\n      },\n      \"source\": [\n        \"![pageindex_banner](https://pageindex.ai/static/images/pageindex_banner.jpg)\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"nD0hb4TFHWTt\"\n      },\n      \"source\": [\n        \"<div align=\\\"center\\\">\\n\",\n        \"<p><i>Reasoning-based RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</i></p>\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"<p>\\n\",\n        \"  <a href=\\\"https://vectify.ai\\\">🏠 Homepage</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://chat.pageindex.ai\\\">💻 Chat</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://pageindex.ai/mcp\\\">🔌 MCP</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://docs.pageindex.ai/quickstart\\\">📚 API</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://github.com/VectifyAI/PageIndex\\\">📦 GitHub</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://discord.com/invite/VuXuf29EUj\\\">💬 Discord</a>&nbsp; • &nbsp;\\n\",\n        \"  <a href=\\\"https://ii2abc2jejf.typeform.com/to/tK3AXl8T\\\">✉️ Contact</a>&nbsp;\\n\",\n        \"</p>\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"<div align=\\\"center\\\">\\n\",\n        \"\\n\",\n        \"[![Star us on GitHub](https://img.shields.io/github/stars/VectifyAI/PageIndex?style=for-the-badge&logo=github&label=⭐️%20Star%20Us)](https://github.com/VectifyAI/PageIndex) &nbsp;&nbsp; [![Follow us on X](https://img.shields.io/badge/Follow%20Us-000000?style=for-the-badge&logo=x&logoColor=white)](https://twitter.com/VectifyAI)\\n\",\n        \"\\n\",\n        \"</div>\\n\",\n        \"\\n\",\n        \"---\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      
\"source\": [\n        \"> Check out our blog post, \\\"[Do We Still Need OCR?](https://pageindex.ai/blog/do-we-need-ocr)\\\", for a more detailed discussion.\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"Ebvn5qfpcG1K\"\n      },\n      \"source\": [\n        \"# A Vision-based, Vectorless RAG System for Long Documents\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"In modern document question answering (QA) systems, Optical Character Recognition (OCR) serves an important role by converting PDF pages into text that can be processed by Large Language Models (LLMs). The resulting text can provide contextual input that enables LLMs to perform question answering over document content.\\n\",\n        \"\\n\",\n        \"Traditional OCR systems typically use a two-stage process that first detects the layout of a PDF — dividing it into text, tables, and images — and then recognizes and converts these elements into plain text. With the rise of vision-language models (VLMs) (such as [Qwen-VL](https://github.com/QwenLM/Qwen3-VL) and [GPT-4.1](https://openai.com/index/gpt-4-1/)), new end-to-end OCR models like [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) have emerged. These models jointly understand visual and textual information, enabling direct interpretation of PDFs without an explicit layout detection step.\\n\",\n        \"\\n\",\n        \"However, this paradigm shift raises an important question: \\n\",\n        \"\\n\",\n        \"\\n\",\n        \"> **If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?**\\n\",\n        \"\\n\",\n        \"In this notebook, we give a practical implementation of a vision-based question-answering system for long documents, without relying on OCR. 
Specifically, we use PageIndex as a reasoning-based retrieval layer and OpenAI's multimodal GPT-4.1 as the VLM for visual reasoning and answer generation.\\n\",\n        \"\\n\",\n        \"See the original [blog post](https://pageindex.ai/blog/do-we-need-ocr) for a more detailed discussion of how VLMs can replace traditional OCR pipelines in document question-answering.\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"## 📝 Notebook Overview\\n\",\n        \"\\n\",\n        \"This notebook demonstrates a *minimal*, **vision-based vectorless RAG** pipeline for long documents with PageIndex, using only visual context from PDF pages. You will learn how to:\\n\",\n        \"- [x] Build a PageIndex tree structure of a document\\n\",\n        \"- [x] Perform reasoning-based retrieval with tree search\\n\",\n        \"- [x] Extract PDF page images for the retrieved tree nodes to use as visual context\\n\",\n        \"- [x] Generate answers using a VLM with PDF image inputs only (no OCR required)\\n\",\n        \"\\n\",\n        \"> ⚡ Note: This example uses PageIndex's reasoning-based retrieval with OpenAI's multimodal GPT-4.1 model for both tree search and visual context reasoning.\\n\",\n        \"\\n\",\n        \"---\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"7ziuTbbWcG1L\"\n      },\n      \"source\": [\n        \"## Step 0: Preparation\\n\",\n        \"\\n\",\n        \"This step installs the required packages and sets up the PageIndex client, the VLM, and the helper functions for extracting PDF page images.\\n\",\n        \"\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"edTfrizMFK4c\"\n      },\n      \"source\": [\n        \"#### 0.1 Install PageIndex\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"collapsed\": 
true,\n        \"id\": \"LaoB58wQFNDh\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"%pip install -q --upgrade pageindex requests openai PyMuPDF\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"WVEWzPKGcG1M\"\n      },\n      \"source\": [\n        \"#### 0.2 Setup PageIndex\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"id\": \"StvqfcK4cG1M\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"from pageindex import PageIndexClient\\n\",\n        \"import pageindex.utils as utils\\n\",\n        \"\\n\",\n        \"# Get your PageIndex API key from https://dash.pageindex.ai/api-keys\\n\",\n        \"PAGEINDEX_API_KEY = \\\"YOUR_PAGEINDEX_API_KEY\\\"\\n\",\n        \"pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 0.3 Setup VLM\\n\",\n        \"\\n\",\n        \"Choose your preferred VLM — in this notebook, we use OpenAI's multimodal GPT-4.1 as the VLM.\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {},\n      \"outputs\": [],\n      \"source\": [\n        \"import openai, fitz, base64, os\\n\",\n        \"\\n\",\n        \"# Setup OpenAI client\\n\",\n        \"OPENAI_API_KEY = \\\"YOUR_OPENAI_API_KEY\\\"\\n\",\n        \"\\n\",\n        \"async def call_vlm(prompt, image_paths=None, model=\\\"gpt-4.1\\\"):\\n\",\n        \"    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)\\n\",\n        \"    messages = [{\\\"role\\\": \\\"user\\\", \\\"content\\\": prompt}]\\n\",\n        \"    if image_paths:\\n\",\n        \"        content = [{\\\"type\\\": \\\"text\\\", \\\"text\\\": prompt}]\\n\",\n        \"        for image in image_paths:\\n\",\n        \"            if os.path.exists(image):\\n\",\n        
\"                with open(image, \\\"rb\\\") as image_file:\\n\",\n        \"                    image_data = base64.b64encode(image_file.read()).decode('utf-8')\\n\",\n        \"                    content.append({\\n\",\n        \"                        \\\"type\\\": \\\"image_url\\\",\\n\",\n        \"                        \\\"image_url\\\": {\\n\",\n        \"                            \\\"url\\\": f\\\"data:image/jpeg;base64,{image_data}\\\"\\n\",\n        \"                        }\\n\",\n        \"                    })\\n\",\n        \"        messages[0][\\\"content\\\"] = content\\n\",\n        \"    response = await client.chat.completions.create(model=model, messages=messages, temperature=0)\\n\",\n        \"    return response.choices[0].message.content.strip()\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 0.4 PDF Image Extraction Helper Functions\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {},\n      \"outputs\": [],\n      \"source\": [\n        \"def extract_pdf_page_images(pdf_path, output_dir=\\\"pdf_images\\\"):\\n\",\n        \"    os.makedirs(output_dir, exist_ok=True)\\n\",\n        \"    pdf_document = fitz.open(pdf_path)\\n\",\n        \"    page_images = {}\\n\",\n        \"    total_pages = len(pdf_document)\\n\",\n        \"    for page_number in range(len(pdf_document)):\\n\",\n        \"        page = pdf_document.load_page(page_number)\\n\",\n        \"        # Convert page to image\\n\",\n        \"        mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for better quality\\n\",\n        \"        pix = page.get_pixmap(matrix=mat)\\n\",\n        \"        img_data = pix.tobytes(\\\"jpeg\\\")\\n\",\n        \"        image_path = os.path.join(output_dir, f\\\"page_{page_number + 1}.jpg\\\")\\n\",\n        \"        with open(image_path, \\\"wb\\\") as image_file:\\n\",\n        \"           
 image_file.write(img_data)\\n\",\n        \"        page_images[page_number + 1] = image_path\\n\",\n        \"        print(f\\\"Saved page {page_number + 1} image: {image_path}\\\")\\n\",\n        \"    pdf_document.close()\\n\",\n        \"    return page_images, total_pages\\n\",\n        \"\\n\",\n        \"def get_page_images_for_nodes(node_list, node_map, page_images):\\n\",\n        \"    # Get PDF page images for retrieved nodes\\n\",\n        \"    image_paths = []\\n\",\n        \"    seen_pages = set()\\n\",\n        \"    for node_id in node_list:\\n\",\n        \"        node_info = node_map[node_id]\\n\",\n        \"        for page_num in range(node_info['start_index'], node_info['end_index'] + 1):\\n\",\n        \"            if page_num not in seen_pages:\\n\",\n        \"                image_paths.append(page_images[page_num])\\n\",\n        \"                seen_pages.add(page_num)\\n\",\n        \"    return image_paths\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"heGtIMOVcG1N\"\n      },\n      \"source\": [\n        \"## Step 1: PageIndex Tree Generation\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"Mzd1VWjwMUJL\"\n      },\n      \"source\": [\n        \"#### 1.1 Submit a document for generating PageIndex tree\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"f6--eZPLcG1N\",\n        \"outputId\": \"ca688cfd-6c4b-4a57-dac2-f3c2604c4112\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import os, requests\\n\",\n        \"\\n\",\n        \"# You can also use our GitHub repo to generate PageIndex tree\\n\",\n        \"# https://github.com/VectifyAI/PageIndex\\n\",\n        \"\\n\",\n        \"pdf_url = 
\\\"https://arxiv.org/pdf/1706.03762.pdf\\\"  # the \\\"Attention Is All You Need\\\" paper\\n\",\n        \"pdf_path = os.path.join(\\\"../data\\\", pdf_url.split('/')[-1])\\n\",\n        \"os.makedirs(os.path.dirname(pdf_path), exist_ok=True)\\n\",\n        \"\\n\",\n        \"response = requests.get(pdf_url)\\n\",\n        \"with open(pdf_path, \\\"wb\\\") as f:\\n\",\n        \"    f.write(response.content)\\n\",\n        \"print(f\\\"Downloaded {pdf_url}\\\\n\\\")\\n\",\n        \"\\n\",\n        \"# Extract page images from PDF\\n\",\n        \"print(\\\"Extracting page images...\\\")\\n\",\n        \"page_images, total_pages = extract_pdf_page_images(pdf_path)\\n\",\n        \"print(f\\\"Extracted {len(page_images)} page images from {total_pages} total pages.\\\\n\\\")\\n\",\n        \"\\n\",\n        \"doc_id = pi_client.submit_document(pdf_path)[\\\"doc_id\\\"]\\n\",\n        \"print('Document Submitted:', doc_id)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"4-Hrh0azcG1N\"\n      },\n      \"source\": [\n        \"#### 1.2 Get the generated PageIndex tree structure\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 65,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 1000\n        },\n        \"id\": \"b1Q1g6vrcG1O\",\n        \"outputId\": \"dc944660-38ad-47ea-d358-be422edbae53\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Simplified Tree Structure of the Document:\\n\",\n            \"[{'title': 'Attention Is All You Need',\\n\",\n            \"  'node_id': '0000',\\n\",\n            \"  'page_index': 1,\\n\",\n            \"  'prefix_summary': '# Attention Is All You Need\\\\n\\\\nAshish Vasw...',\\n\",\n            \"  'nodes': [{'title': 'Abstract',\\n\",\n            \"          
   'node_id': '0001',\\n\",\n            \"             'page_index': 1,\\n\",\n            \"             'summary': 'The text introduces the Transformer, a n...'},\\n\",\n            \"            {'title': '1 Introduction',\\n\",\n            \"             'node_id': '0002',\\n\",\n            \"             'page_index': 2,\\n\",\n            \"             'summary': 'The text introduces the Transformer, a n...'},\\n\",\n            \"            {'title': '2 Background',\\n\",\n            \"             'node_id': '0003',\\n\",\n            \"             'page_index': 2,\\n\",\n            \"             'summary': 'This section discusses the Transformer m...'},\\n\",\n            \"            {'title': '3 Model Architecture',\\n\",\n            \"             'node_id': '0004',\\n\",\n            \"             'page_index': 2,\\n\",\n            \"             'prefix_summary': 'The text describes the encoder-decoder a...',\\n\",\n            \"             'nodes': [{'title': '3.1 Encoder and Decoder Stacks',\\n\",\n            \"                        'node_id': '0005',\\n\",\n            \"                        'page_index': 3,\\n\",\n            \"                        'summary': 'The text describes the encoder and decod...'},\\n\",\n            \"                       {'title': '3.2 Attention',\\n\",\n            \"                        'node_id': '0006',\\n\",\n            \"                        'page_index': 3,\\n\",\n            \"                        'prefix_summary': '### 3.2 Attention\\\\n\\\\nAn attention function...',\\n\",\n            \"                        'nodes': [{'title': '3.2.1 Scaled Dot-Product Attention',\\n\",\n            \"                                   'node_id': '0007',\\n\",\n            \"                                   'page_index': 4,\\n\",\n            \"                                   'summary': 'The text describes Scaled Dot-Product At...'},\\n\",\n            \"                              
    {'title': '3.2.2 Multi-Head Attention',\\n\",\n            \"                                   'node_id': '0008',\\n\",\n            \"                                   'page_index': 4,\\n\",\n            \"                                   'summary': 'The text describes Multi-Head Attention,...'},\\n\",\n            \"                                  {'title': '3.2.3 Applications of Attention in our M...',\\n\",\n            \"                                   'node_id': '0009',\\n\",\n            \"                                   'page_index': 5,\\n\",\n            \"                                   'summary': 'The text describes the three application...'}]},\\n\",\n            \"                       {'title': '3.3 Position-wise Feed-Forward Networks',\\n\",\n            \"                        'node_id': '0010',\\n\",\n            \"                        'page_index': 5,\\n\",\n            \"                        'summary': '### 3.3 Position-wise Feed-Forward Netwo...'},\\n\",\n            \"                       {'title': '3.4 Embeddings and Softmax',\\n\",\n            \"                        'node_id': '0011',\\n\",\n            \"                        'page_index': 5,\\n\",\n            \"                        'summary': 'The text describes the use of learned em...'},\\n\",\n            \"                       {'title': '3.5 Positional Encoding',\\n\",\n            \"                        'node_id': '0012',\\n\",\n            \"                        'page_index': 6,\\n\",\n            \"                        'summary': 'This section explains the necessity of p...'}]},\\n\",\n            \"            {'title': '4 Why Self-Attention',\\n\",\n            \"             'node_id': '0013',\\n\",\n            \"             'page_index': 6,\\n\",\n            \"             'summary': 'This text compares self-attention layers...'},\\n\",\n            \"            {'title': '5 Training',\\n\",\n            \"             
'node_id': '0014',\\n\",\n            \"             'page_index': 7,\\n\",\n            \"             'prefix_summary': '## 5 Training\\\\n\\\\nThis section describes th...',\\n\",\n            \"             'nodes': [{'title': '5.1 Training Data and Batching',\\n\",\n            \"                        'node_id': '0015',\\n\",\n            \"                        'page_index': 7,\\n\",\n            \"                        'summary': '### 5.1 Training Data and Batching\\\\n\\\\nWe t...'},\\n\",\n            \"                       {'title': '5.2 Hardware and Schedule',\\n\",\n            \"                        'node_id': '0016',\\n\",\n            \"                        'page_index': 7,\\n\",\n            \"                        'summary': '### 5.2 Hardware and Schedule\\\\n\\\\nWe traine...'},\\n\",\n            \"                       {'title': '5.3 Optimizer',\\n\",\n            \"                        'node_id': '0017',\\n\",\n            \"                        'page_index': 7,\\n\",\n            \"                        'summary': '### 5.3 Optimizer\\\\n\\\\nWe used the Adam opti...'},\\n\",\n            \"                       {'title': '5.4 Regularization',\\n\",\n            \"                        'node_id': '0018',\\n\",\n            \"                        'page_index': 7,\\n\",\n            \"                        'summary': 'The text details three regularization te...'}]},\\n\",\n            \"            {'title': '6 Results',\\n\",\n            \"             'node_id': '0019',\\n\",\n            \"             'page_index': 8,\\n\",\n            \"             'prefix_summary': '## 6 Results\\\\n',\\n\",\n            \"             'nodes': [{'title': '6.1 Machine Translation',\\n\",\n            \"                        'node_id': '0020',\\n\",\n            \"                        'page_index': 8,\\n\",\n            \"                        'summary': 'The text details the performance of a Tr...'},\\n\",\n        
    \"                       {'title': '6.2 Model Variations',\\n\",\n            \"                        'node_id': '0021',\\n\",\n            \"                        'page_index': 8,\\n\",\n            \"                        'summary': 'This text details experiments varying co...'},\\n\",\n            \"                       {'title': '6.3 English Constituency Parsing',\\n\",\n            \"                        'node_id': '0022',\\n\",\n            \"                        'page_index': 9,\\n\",\n            \"                        'summary': 'The text describes experiments evaluatin...'}]},\\n\",\n            \"            {'title': '7 Conclusion',\\n\",\n            \"             'node_id': '0023',\\n\",\n            \"             'page_index': 10,\\n\",\n            \"             'summary': 'This text concludes by presenting the Tr...'},\\n\",\n            \"            {'title': 'References',\\n\",\n            \"             'node_id': '0024',\\n\",\n            \"             'page_index': 10,\\n\",\n            \"             'summary': 'The provided text is a collection of ref...'},\\n\",\n            \"            {'title': 'Attention Visualizations',\\n\",\n            \"             'node_id': '0025',\\n\",\n            \"             'page_index': 13,\\n\",\n            \"             'summary': 'The text provides examples of attention ...'}]}]\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"if pi_client.is_retrieval_ready(doc_id):\\n\",\n        \"    tree = pi_client.get_tree(doc_id, node_summary=True)['result']\\n\",\n        \"    print('Simplified Tree Structure of the Document:')\\n\",\n        \"    utils.print_tree(tree, exclude_fields=['text'])\\n\",\n        \"else:\\n\",\n        \"    print(\\\"Processing document, please try again later...\\\")\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"USoCLOiQcG1O\"\n      },\n      \"source\": [\n       
 \"## Step 2: Reasoning-Based Retrieval with Tree Search\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 2.1 Reasoning-based retrieval with PageIndex to identify nodes that might contain relevant context\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"id\": \"LLHNJAtTcG1O\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import json\\n\",\n        \"\\n\",\n        \"query = \\\"What is the last operation in the Scaled Dot-Product Attention figure?\\\"\\n\",\n        \"\\n\",\n        \"tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])\\n\",\n        \"\\n\",\n        \"search_prompt = f\\\"\\\"\\\"\\n\",\n        \"You are given a question and a tree structure of a document.\\n\",\n        \"Each node contains a node id, node title, and a corresponding summary.\\n\",\n        \"Your task is to find all tree nodes that are likely to contain the answer to the question.\\n\",\n        \"\\n\",\n        \"Question: {query}\\n\",\n        \"\\n\",\n        \"Document tree structure:\\n\",\n        \"{json.dumps(tree_without_text, indent=2)}\\n\",\n        \"\\n\",\n        \"Please reply in the following JSON format:\\n\",\n        \"{{\\n\",\n        \"    \\\"thinking\\\": \\\"<Your thinking process on which nodes are relevant to the question>\\\",\\n\",\n        \"    \\\"node_list\\\": [\\\"node_id_1\\\", \\\"node_id_2\\\", ..., \\\"node_id_n\\\"]\\n\",\n        \"}}\\n\",\n        \"Directly return the final JSON structure. 
Do not output anything else.\\n\",\n        \"\\\"\\\"\\\"\\n\",\n        \"\\n\",\n        \"tree_search_result = await call_vlm(search_prompt)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 2.2 Print retrieved nodes and reasoning process\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 87,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 206\n        },\n        \"id\": \"P8DVUOuAen5u\",\n        \"outputId\": \"6bb6d052-ef30-4716-f88e-be98bcb7ebdb\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Reasoning Process:\\n\",\n            \"\\n\",\n            \"The question asks about the last operation in the Scaled Dot-Product Attention figure. The most\\n\",\n            \"relevant section is the one that describes Scaled Dot-Product Attention in detail, including its\\n\",\n            \"computation and the figure itself. This is likely found in section 3.2.1 'Scaled Dot-Product\\n\",\n            \"Attention' (node_id: 0007), which is a subsection of 3.2 'Attention' (node_id: 0006). The parent\\n\",\n            \"section 3.2 may also contain the figure and its caption, as the summary mentions Figure 2 (which is\\n\",\n            \"the Scaled Dot-Product Attention figure). 
Therefore, both node 0006 and node 0007 are likely to\\n\",\n            \"contain the answer.\\n\",\n            \"\\n\",\n            \"Retrieved Nodes:\\n\",\n            \"\\n\",\n            \"Node ID: 0006\\t Pages: 3-4\\t Title: 3.2 Attention\\n\",\n            \"Node ID: 0007\\t Pages: 4\\t Title: 3.2.1 Scaled Dot-Product Attention\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"node_map = utils.create_node_mapping(tree, include_page_ranges=True, max_page=total_pages)\\n\",\n        \"tree_search_result_json = json.loads(tree_search_result)\\n\",\n        \"\\n\",\n        \"print('Reasoning Process:\\\\n')\\n\",\n        \"utils.print_wrapped(tree_search_result_json['thinking'])\\n\",\n        \"\\n\",\n        \"print('\\\\nRetrieved Nodes:\\\\n')\\n\",\n        \"for node_id in tree_search_result_json[\\\"node_list\\\"]:\\n\",\n        \"    node_info = node_map[node_id]\\n\",\n        \"    node = node_info['node']\\n\",\n        \"    start_page = node_info['start_index']\\n\",\n        \"    end_page = node_info['end_index']\\n\",\n        \"    page_range = start_page if start_page == end_page else f\\\"{start_page}-{end_page}\\\"\\n\",\n        \"    print(f\\\"Node ID: {node['node_id']}\\\\t Pages: {page_range}\\\\t Title: {node['title']}\\\")\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 2.3 Get corresponding PDF page images of retrieved nodes\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 81,\n      \"metadata\": {},\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"\\n\",\n            \"Retrieved 2 PDF page image(s) for visual context.\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"retrieved_nodes = tree_search_result_json[\\\"node_list\\\"]\\n\",\n        \"retrieved_page_images = 
get_page_images_for_nodes(retrieved_nodes, node_map, page_images)\\n\",\n        \"print(f'\\\\nRetrieved {len(retrieved_page_images)} PDF page image(s) for visual context.')\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"10wOZDG_cG1O\"\n      },\n      \"source\": [\n        \"## Step 3: Answer Generation\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"#### 3.1 Generate answer using VLM with visual context\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 210\n        },\n        \"id\": \"tcp_PhHzcG1O\",\n        \"outputId\": \"187ff116-9bb0-4ab4-bacb-13944460b5ff\"\n      },\n      \"outputs\": [\n        {\n          \"name\": \"stdout\",\n          \"output_type\": \"stream\",\n          \"text\": [\n            \"Generated answer using VLM with retrieved PDF page images as visual context:\\n\",\n            \"\\n\",\n            \"The last operation in the **Scaled Dot-Product Attention** figure is a **MatMul** (matrix\\n\",\n            \"multiplication). 
This operation multiplies the attention weights (after softmax) by the value matrix\\n\",\n            \"\\\\( V \\\\).\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"# Generate answer using VLM with only PDF page images as visual context\\n\",\n        \"answer_prompt = f\\\"\\\"\\\"\\n\",\n        \"Answer the question based on the images of the document pages as context.\\n\",\n        \"\\n\",\n        \"Question: {query}\\n\",\n        \"\\n\",\n        \"Provide a clear, concise answer based only on the context provided.\\n\",\n        \"\\\"\\\"\\\"\\n\",\n        \"\\n\",\n        \"print('Generated answer using VLM with retrieved PDF page images as visual context:\\\\n')\\n\",\n        \"answer = await call_vlm(answer_prompt, retrieved_page_images)\\n\",\n        \"utils.print_wrapped(answer)\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"## Conclusion\\n\",\n        \"\\n\",\n        \"In this notebook, we demonstrated a *minimal* **vision-based, vectorless RAG pipeline** using PageIndex and a VLM. The system retrieves relevant pages by reasoning over the document’s hierarchical tree index and answers questions directly from PDF images — no OCR required.\\n\",\n        \"\\n\",\n        \"If you’re interested in building your own **reasoning-based document QA system**, try [PageIndex Chat](https://chat.pageindex.ai), or integrate via [PageIndex MCP](https://pageindex.ai/mcp) and the [API](https://docs.pageindex.ai/quickstart). 
You can also explore the [GitHub repo](https://github.com/VectifyAI/PageIndex) for open-source implementations and additional examples.\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {},\n      \"source\": [\n        \"\\n\",\n        \"\\n\",\n        \"© 2025 [Vectify AI](https://vectify.ai)\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"language\": \"python\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"codemirror_mode\": {\n        \"name\": \"ipython\",\n        \"version\": 3\n      },\n      \"file_extension\": \".py\",\n      \"mimetype\": \"text/x-python\",\n      \"name\": \"python\",\n      \"nbconvert_exporter\": \"python\",\n      \"pygments_lexer\": \"ipython3\",\n      \"version\": \"3.11.9\"\n    }\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "pageindex/__init__.py",
    "content": "from .page_index import *\nfrom .page_index_md import md_to_tree"
  },
  {
    "path": "pageindex/config.yaml",
    "content": "model: \"gpt-4o-2024-11-20\"\ntoc_check_page_num: 20\nmax_page_num_each_node: 10\nmax_token_num_each_node: 20000\nif_add_node_id: \"yes\"\nif_add_node_summary: \"yes\"\nif_add_doc_description: \"no\"\nif_add_node_text: \"no\""
  },
  {
    "path": "pageindex/page_index.py",
    "content": "import os\nimport json\nimport copy\nimport math\nimport random\nimport re\nfrom .utils import *\nimport os\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\n\n\n################### check title in page #########################################################\nasync def check_title_appearance(item, page_list, start_index=1, model=None):    \n    title=item['title']\n    if 'physical_index' not in item or item['physical_index'] is None:\n        return {'list_index': item.get('list_index'), 'answer': 'no', 'title':title, 'page_number': None}\n    \n    \n    page_number = item['physical_index']\n    page_text = page_list[page_number-start_index][0]\n\n    \n    prompt = f\"\"\"\n    Your job is to check if the given section appears or starts in the given page_text.\n\n    Note: do fuzzy matching, ignore any space inconsistency in the page_text.\n\n    The given section title is {title}.\n    The given page_text is {page_text}.\n    \n    Reply format:\n    {{\n        \n        \"thinking\": <why do you think the section appears or starts in the page_text>\n        \"answer\": \"yes or no\" (yes if the section appears or starts in the page_text, no otherwise)\n    }}\n    Directly return the final JSON structure. 
Do not output anything else.\"\"\"\n\n    response = await ChatGPT_API_async(model=model, prompt=prompt)\n    response = extract_json(response)\n    if 'answer' in response:\n        answer = response['answer']\n    else:\n        answer = 'no'\n    # use .get() so a missing list_index does not raise, matching the early return above\n    return {'list_index': item.get('list_index'), 'answer': answer, 'title': title, 'page_number': page_number}\n\n\nasync def check_title_appearance_in_start(title, page_text, model=None, logger=None):    \n    prompt = f\"\"\"\n    You will be given the current section title and the current page_text.\n    Your job is to check if the current section starts at the beginning of the given page_text.\n    If there is other content before the current section title, then the current section does not start at the beginning of the given page_text.\n    If the current section title is the first content in the given page_text, then the current section starts at the beginning of the given page_text.\n\n    Note: do fuzzy matching, ignore any space inconsistency in the page_text.\n\n    The given section title is {title}.\n    The given page_text is {page_text}.\n    \n    Reply format:\n    {{\n        \"thinking\": <why do you think the section appears or starts in the page_text>\n        \"start_begin\": \"yes or no\" (yes if the section starts at the beginning of the page_text, no otherwise)\n    }}\n    Directly return the final JSON structure. 
Do not output anything else.\"\"\"\n\n    response = await ChatGPT_API_async(model=model, prompt=prompt)\n    response = extract_json(response)\n    if logger:\n        logger.info(f\"Response: {response}\")\n    return response.get(\"start_begin\", \"no\")\n\n\nasync def check_title_appearance_in_start_concurrent(structure, page_list, model=None, logger=None):\n    if logger:\n        logger.info(\"Checking title appearance in start concurrently\")\n    \n    # skip items without physical_index\n    for item in structure:\n        if item.get('physical_index') is None:\n            item['appear_start'] = 'no'\n\n    # only for items with valid physical_index\n    tasks = []\n    valid_items = []\n    for item in structure:\n        if item.get('physical_index') is not None:\n            page_text = page_list[item['physical_index'] - 1][0]\n            tasks.append(check_title_appearance_in_start(item['title'], page_text, model=model, logger=logger))\n            valid_items.append(item)\n\n    results = await asyncio.gather(*tasks, return_exceptions=True)\n    for item, result in zip(valid_items, results):\n        if isinstance(result, Exception):\n            if logger:\n                logger.error(f\"Error checking start for {item['title']}: {result}\")\n            item['appear_start'] = 'no'\n        else:\n            item['appear_start'] = result\n\n    return structure\n\n\ndef toc_detector_single_page(content, model=None):\n    prompt = f\"\"\"\n    Your job is to detect if there is a table of content provided in the given text.\n\n    Given text: {content}\n\n    return the following JSON format:\n    {{\n        \"thinking\": <why do you think there is a table of content in the given text>\n        \"toc_detected\": \"<yes or no>\",\n    }}\n\n    Directly return the final JSON structure. Do not output anything else.\n    Please note: abstract,summary, notation list, figure list, table list, etc. 
are not table of contents.\"\"\"\n\n    response = ChatGPT_API(model=model, prompt=prompt)\n    # print('response', response)\n    json_content = extract_json(response)    \n    return json_content['toc_detected']\n\n\ndef check_if_toc_extraction_is_complete(content, toc, model=None):\n    prompt = f\"\"\"\n    You are given a partial document and a table of contents.\n    Your job is to check if the table of contents is complete, i.e. whether it contains all the main sections in the partial document.\n\n    Reply format:\n    {{\n        \"thinking\": <why do you think the table of contents is complete or not>\n        \"completed\": \"yes\" or \"no\"\n    }}\n    Directly return the final JSON structure. Do not output anything else.\"\"\"\n\n    prompt = prompt + '\\n Document:\\n' + content + '\\n Table of contents:\\n' + toc\n    response = ChatGPT_API(model=model, prompt=prompt)\n    json_content = extract_json(response)\n    return json_content['completed']\n\n\ndef check_if_toc_transformation_is_complete(content, toc, model=None):\n    prompt = f\"\"\"\n    You are given a raw table of contents and a cleaned table of contents.\n    Your job is to check if the cleaned table of contents is complete.\n\n    Reply format:\n    {{\n        \"thinking\": <why do you think the cleaned table of contents is complete or not>\n        \"completed\": \"yes\" or \"no\"\n    }}\n    Directly return the final JSON structure. Do not output anything else.\"\"\"\n\n    prompt = prompt + '\\n Raw Table of contents:\\n' + content + '\\n Cleaned Table of contents:\\n' + toc\n    response = ChatGPT_API(model=model, prompt=prompt)\n    json_content = extract_json(response)\n    return json_content['completed']\n\ndef extract_toc_content(content, model=None):\n    prompt = f\"\"\"\n    Your job is to extract the full table of contents from the given text, replace ... with :\n\n    Given text: {content}\n\n    Directly return the full table of contents content. 
Do not output anything else.\"\"\"\n\n    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)\n    \n    if_complete = check_if_toc_transformation_is_complete(content, response, model)\n    if if_complete == \"yes\" and finish_reason == \"finished\":\n        return response\n    \n    chat_history = [\n        {\"role\": \"user\", \"content\": prompt}, \n        {\"role\": \"assistant\", \"content\": response},    \n    ]\n    prompt = f\"\"\"please continue the generation of table of contents , directly output the remaining part of the structure\"\"\"\n    new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt, chat_history=chat_history)\n    response = response + new_response\n    if_complete = check_if_toc_transformation_is_complete(content, response, model)\n    \n    attempt = 0\n    max_attempts = 5\n\n    while not (if_complete == \"yes\" and finish_reason == \"finished\"):\n        attempt += 1\n        if attempt > max_attempts:\n            raise Exception('Failed to complete table of contents after maximum retries')\n\n        chat_history = [\n            {\"role\": \"user\", \"content\": prompt},\n            {\"role\": \"assistant\", \"content\": response},\n        ]\n        prompt = f\"\"\"please continue the generation of table of contents , directly output the remaining part of the structure\"\"\"\n        new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt, chat_history=chat_history)\n        response = response + new_response\n        if_complete = check_if_toc_transformation_is_complete(content, response, model)\n    \n    return response\n\ndef detect_page_index(toc_content, model=None):\n    print('start detect_page_index')\n    prompt = f\"\"\"\n    You will be given a table of contents.\n\n    Your job is to detect if there are page numbers/indices given within the table of contents.\n\n    Given text: {toc_content}\n\n    Reply format:\n 
   {{\n        \"thinking\": <why do you think there are page numbers/indices given within the table of contents>\n        \"page_index_given_in_toc\": \"<yes or no>\"\n    }}\n    Directly return the final JSON structure. Do not output anything else.\"\"\"\n\n    response = ChatGPT_API(model=model, prompt=prompt)\n    json_content = extract_json(response)\n    return json_content['page_index_given_in_toc']\n\ndef toc_extractor(page_list, toc_page_list, model):\n    def transform_dots_to_colon(text):\n        text = re.sub(r'\\.{5,}', ': ', text)\n        # Handle dots separated by spaces\n        text = re.sub(r'(?:\\. ){5,}\\.?', ': ', text)\n        return text\n    \n    toc_content = \"\"\n    for page_index in toc_page_list:\n        toc_content += page_list[page_index][0]\n    toc_content = transform_dots_to_colon(toc_content)\n    has_page_index = detect_page_index(toc_content, model=model)\n    \n    return {\n        \"toc_content\": toc_content,\n        \"page_index_given_in_toc\": has_page_index\n    }\n\n\n\n\ndef toc_index_extractor(toc, content, model=None):\n    print('start toc_index_extractor')\n    toc_extractor_prompt = \"\"\"\n    You are given a table of contents in a json format and several pages of a document, your job is to add the physical_index to the table of contents in the json format.\n\n    The provided pages contains tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X.\n\n    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. 
For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.\n\n    The response should be in the following JSON format: \n    [\n        {\n            \"structure\": <structure index, \"x.x.x\" or None> (string),\n            \"title\": <title of the section>,\n            \"physical_index\": \"<physical_index_X>\" (keep the format)\n        },\n        ...\n    ]\n\n    Only add the physical_index to the sections that are in the provided pages.\n    If the section is not in the provided pages, do not add the physical_index to it.\n    Directly return the final JSON structure. Do not output anything else.\"\"\"\n\n    prompt = toc_extractor_prompt + '\\nTable of contents:\\n' + str(toc) + '\\nDocument pages:\\n' + content\n    response = ChatGPT_API(model=model, prompt=prompt)\n    json_content = extract_json(response)    \n    return json_content\n\n\n\ndef toc_transformer(toc_content, model=None):\n    print('start toc_transformer')\n    init_prompt = \"\"\"\n    You are given a table of contents. Your job is to transform the whole table of contents into a JSON format with a top-level table_of_contents key.\n\n    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.\n\n    The response should be in the following JSON format: \n    {\n    \"table_of_contents\": [\n        {\n            \"structure\": <structure index, \"x.x.x\" or None> (string),\n            \"title\": <title of the section>,\n            \"page\": <page number or None>,\n        },\n        ...\n        ],\n    }\n    You should transform the full table of contents in one go.\n    Directly return the final JSON structure, do not output anything else. 
\"\"\"\n\n    prompt = init_prompt + '\\n Given table of contents\\n:' + toc_content\n    last_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)\n    if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)\n    if if_complete == \"yes\" and finish_reason == \"finished\":\n        last_complete = extract_json(last_complete)\n        cleaned_response=convert_page_to_int(last_complete['table_of_contents'])\n        return cleaned_response\n    \n    last_complete = get_json_content(last_complete)\n    while not (if_complete == \"yes\" and finish_reason == \"finished\"):\n        position = last_complete.rfind('}')\n        if position != -1:\n            last_complete = last_complete[:position+2]\n        prompt = f\"\"\"\n        Your task is to continue the table of contents json structure, directly output the remaining part of the json structure.\n        The response should be in the following JSON format: \n\n        The raw table of contents json structure is:\n        {toc_content}\n\n        The incomplete transformed table of contents json structure is:\n        {last_complete}\n\n        Please continue the json structure, directly output the remaining part of the json structure.\"\"\"\n\n        new_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)\n\n        if new_complete.startswith('```json'):\n            new_complete =  get_json_content(new_complete)\n            last_complete = last_complete+new_complete\n\n        if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)\n        \n\n    last_complete = json.loads(last_complete)\n\n    cleaned_response=convert_page_to_int(last_complete['table_of_contents'])\n    return cleaned_response\n    \n\n\n\ndef find_toc_pages(start_page_index, page_list, opt, logger=None):\n    print('start find_toc_pages')\n    last_page_is_yes = False\n    toc_page_list = []\n    i = 
start_page_index\n    \n    while i < len(page_list):\n        # Only check beyond max_pages if we're still finding TOC pages\n        if i >= opt.toc_check_page_num and not last_page_is_yes:\n            break\n        detected_result = toc_detector_single_page(page_list[i][0],model=opt.model)\n        if detected_result == 'yes':\n            if logger:\n                logger.info(f'Page {i} has toc')\n            toc_page_list.append(i)\n            last_page_is_yes = True\n        elif detected_result == 'no' and last_page_is_yes:\n            if logger:\n                logger.info(f'Found the last page with toc: {i-1}')\n            break\n        i += 1\n    \n    if not toc_page_list and logger:\n        logger.info('No toc found')\n        \n    return toc_page_list\n\ndef remove_page_number(data):\n    if isinstance(data, dict):\n        data.pop('page_number', None)  \n        for key in list(data.keys()):\n            if 'nodes' in key:\n                remove_page_number(data[key])\n    elif isinstance(data, list):\n        for item in data:\n            remove_page_number(item)\n    return data\n\ndef extract_matching_page_pairs(toc_page, toc_physical_index, start_page_index):\n    pairs = []\n    for phy_item in toc_physical_index:\n        for page_item in toc_page:\n            if phy_item.get('title') == page_item.get('title'):\n                physical_index = phy_item.get('physical_index')\n                if physical_index is not None and int(physical_index) >= start_page_index:\n                    pairs.append({\n                        'title': phy_item.get('title'),\n                        'page': page_item.get('page'),\n                        'physical_index': physical_index\n                    })\n    return pairs\n\n\ndef calculate_page_offset(pairs):\n    differences = []\n    for pair in pairs:\n        try:\n            physical_index = pair['physical_index']\n            page_number = pair['page']\n            difference = 
physical_index - page_number\n            differences.append(difference)\n        except (KeyError, TypeError):\n            continue\n    \n    if not differences:\n        return None\n    \n    difference_counts = {}\n    for diff in differences:\n        difference_counts[diff] = difference_counts.get(diff, 0) + 1\n    \n    most_common = max(difference_counts.items(), key=lambda x: x[1])[0]\n    \n    return most_common\n\ndef add_page_offset_to_toc_json(data, offset):\n    for i in range(len(data)):\n        if data[i].get('page') is not None and isinstance(data[i]['page'], int):\n            data[i]['physical_index'] = data[i]['page'] + offset\n            del data[i]['page']\n    \n    return data\n\n\n\ndef page_list_to_group_text(page_contents, token_lengths, max_tokens=20000, overlap_page=1):    \n    num_tokens = sum(token_lengths)\n    \n    if num_tokens <= max_tokens:\n        # merge all pages into one text\n        page_text = \"\".join(page_contents)\n        return [page_text]\n    \n    subsets = []\n    current_subset = []\n    current_token_count = 0\n\n    expected_parts_num = math.ceil(num_tokens / max_tokens)\n    average_tokens_per_part = math.ceil(((num_tokens / expected_parts_num) + max_tokens) / 2)\n    \n    for i, (page_content, page_tokens) in enumerate(zip(page_contents, token_lengths)):\n        if current_token_count + page_tokens > average_tokens_per_part:\n\n            subsets.append(''.join(current_subset))\n            # Start new subset from overlap if specified\n            overlap_start = max(i - overlap_page, 0)\n            current_subset = page_contents[overlap_start:i]\n            current_token_count = sum(token_lengths[overlap_start:i])\n        \n        # Add current page to the subset\n        current_subset.append(page_content)\n        current_token_count += page_tokens\n\n    # Add the last subset if it contains any pages\n    if current_subset:\n        subsets.append(''.join(current_subset))\n    \n    
print('divide page_list to groups', len(subsets))\n    return subsets\n\ndef add_page_number_to_toc(part, structure, model=None):\n    fill_prompt_seq = \"\"\"\n    You are given an JSON structure of a document and a partial part of the document. Your task is to check if the title that is described in the structure is started in the partial given document.\n\n    The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X. \n\n    If the full target section starts in the partial given document, insert the given JSON structure with the \"start\": \"yes\", and \"start_index\": \"<physical_index_X>\".\n\n    If the full target section does not start in the partial given document, insert \"start\": \"no\",  \"start_index\": None.\n\n    The response should be in the following format. \n        [\n            {\n                \"structure\": <structure index, \"x.x.x\" or None> (string),\n                \"title\": <title of the section>,\n                \"start\": \"<yes or no>\",\n                \"physical_index\": \"<physical_index_X> (keep the format)\" or None\n            },\n            ...\n        ]    \n    The given structure contains the result of the previous part, you need to fill the result of the current part, do not change the previous result.\n    Directly return the final JSON structure. 
Do not output anything else.\"\"\"\n\n    prompt = fill_prompt_seq + f\"\\n\\nCurrent Partial Document:\\n{part}\\n\\nGiven Structure\\n{json.dumps(structure, indent=2)}\\n\"\n    current_json_raw = ChatGPT_API(model=model, prompt=prompt)\n    json_result = extract_json(current_json_raw)\n    \n    for item in json_result:\n        if 'start' in item:\n            del item['start']\n    return json_result\n\n\ndef remove_first_physical_index_section(text):\n    \"\"\"\n    Removes the first section between <physical_index_X> and <physical_index_X> tags,\n    and returns the remaining text.\n    \"\"\"\n    pattern = r'<physical_index_\\d+>.*?<physical_index_\\d+>'\n    match = re.search(pattern, text, re.DOTALL)\n    if match:\n        # Remove the first matched section\n        return text.replace(match.group(0), '', 1)\n    return text\n\n### add verify completeness\ndef generate_toc_continue(toc_content, part, model=\"gpt-4o-2024-11-20\"):\n    print('start generate_toc_continue')\n    prompt = \"\"\"\n    You are an expert in extracting hierarchical tree structure.\n    You are given a tree structure of the previous part and the text of the current part.\n    Your task is to continue the tree structure from the previous part to include the current part.\n\n    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.\n\n    For the title, you need to extract the original title from the text, only fix the space inconsistency.\n\n    The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the start and end of page X. \\\n    \n    For the physical_index, you need to extract the physical index of the start of the section from the text. 
Keep the <physical_index_X> format.\n\n    The response should be in the following format. \n        [\n            {\n                \"structure\": <structure index, \"x.x.x\"> (string),\n                \"title\": <title of the section, keep the original title>,\n                \"physical_index\": \"<physical_index_X> (keep the format)\"\n            },\n            ...\n        ]    \n\n    Directly return the additional part of the final JSON structure. Do not output anything else.\"\"\"\n\n    prompt = prompt + '\\nGiven text\\n:' + part + '\\nPrevious tree structure\\n:' + json.dumps(toc_content, indent=2)\n    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)\n    if finish_reason == 'finished':\n        return extract_json(response)\n    else:\n        raise Exception(f'finish reason: {finish_reason}')\n    \n### add verify completeness\ndef generate_toc_init(part, model=None):\n    print('start generate_toc_init')\n    prompt = \"\"\"\n    You are an expert in extracting hierarchical tree structure, your task is to generate the tree structure of the document.\n\n    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.\n\n    For the title, you need to extract the original title from the text, only fix the space inconsistency.\n\n    The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the start and end of page X. \n\n    For the physical_index, you need to extract the physical index of the start of the section from the text. Keep the <physical_index_X> format.\n\n    The response should be in the following format. 
\n        [\n            {\n                \"structure\": <structure index, \"x.x.x\"> (string),\n                \"title\": <title of the section, keep the original title>,\n                \"physical_index\": \"<physical_index_X> (keep the format)\"\n            },\n            ...\n        ]\n\n\n    Directly return the final JSON structure. Do not output anything else.\"\"\"\n\n    prompt = prompt + '\\nGiven text\\n:' + part\n    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)\n\n    if finish_reason == 'finished':\n        return extract_json(response)\n    else:\n        raise Exception(f'finish reason: {finish_reason}')\n\ndef process_no_toc(page_list, start_index=1, model=None, logger=None):\n    page_contents=[]\n    token_lengths=[]\n    for page_index in range(start_index, start_index+len(page_list)):\n        page_text = f\"<physical_index_{page_index}>\\n{page_list[page_index-start_index][0]}\\n<physical_index_{page_index}>\\n\\n\"\n        page_contents.append(page_text)\n        token_lengths.append(count_tokens(page_text, model))\n    group_texts = page_list_to_group_text(page_contents, token_lengths)\n    logger.info(f'len(group_texts): {len(group_texts)}')\n\n    toc_with_page_number= generate_toc_init(group_texts[0], model)\n    for group_text in group_texts[1:]:\n        toc_with_page_number_additional = generate_toc_continue(toc_with_page_number, group_text, model)    \n        toc_with_page_number.extend(toc_with_page_number_additional)\n    logger.info(f'generate_toc: {toc_with_page_number}')\n\n    toc_with_page_number = convert_physical_index_to_int(toc_with_page_number)\n    logger.info(f'convert_physical_index_to_int: {toc_with_page_number}')\n\n    return toc_with_page_number\n\ndef process_toc_no_page_numbers(toc_content, toc_page_list, page_list,  start_index=1, model=None, logger=None):\n    page_contents=[]\n    token_lengths=[]\n    toc_content = toc_transformer(toc_content, model)\n    
logger.info(f'toc_transformer: {toc_content}')\n    for page_index in range(start_index, start_index+len(page_list)):\n        page_text = f\"<physical_index_{page_index}>\\n{page_list[page_index-start_index][0]}\\n<physical_index_{page_index}>\\n\\n\"\n        page_contents.append(page_text)\n        token_lengths.append(count_tokens(page_text, model))\n    \n    group_texts = page_list_to_group_text(page_contents, token_lengths)\n    logger.info(f'len(group_texts): {len(group_texts)}')\n\n    toc_with_page_number=copy.deepcopy(toc_content)\n    for group_text in group_texts:\n        toc_with_page_number = add_page_number_to_toc(group_text, toc_with_page_number, model)\n    logger.info(f'add_page_number_to_toc: {toc_with_page_number}')\n\n    toc_with_page_number = convert_physical_index_to_int(toc_with_page_number)\n    logger.info(f'convert_physical_index_to_int: {toc_with_page_number}')\n\n    return toc_with_page_number\n\n\n\ndef process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=None, model=None, logger=None):\n    toc_with_page_number = toc_transformer(toc_content, model)\n    logger.info(f'toc_with_page_number: {toc_with_page_number}')\n\n    toc_no_page_number = remove_page_number(copy.deepcopy(toc_with_page_number))\n    \n    start_page_index = toc_page_list[-1] + 1\n    main_content = \"\"\n    for page_index in range(start_page_index, min(start_page_index + toc_check_page_num, len(page_list))):\n        main_content += f\"<physical_index_{page_index+1}>\\n{page_list[page_index][0]}\\n<physical_index_{page_index+1}>\\n\\n\"\n\n    toc_with_physical_index = toc_index_extractor(toc_no_page_number, main_content, model)\n    logger.info(f'toc_with_physical_index: {toc_with_physical_index}')\n\n    toc_with_physical_index = convert_physical_index_to_int(toc_with_physical_index)\n    logger.info(f'toc_with_physical_index: {toc_with_physical_index}')\n\n    matching_pairs = 
extract_matching_page_pairs(toc_with_page_number, toc_with_physical_index, start_page_index)\n    logger.info(f'matching_pairs: {matching_pairs}')\n\n    offset = calculate_page_offset(matching_pairs)\n    logger.info(f'offset: {offset}')\n\n    toc_with_page_number = add_page_offset_to_toc_json(toc_with_page_number, offset)\n    logger.info(f'toc_with_page_number: {toc_with_page_number}')\n\n    toc_with_page_number = process_none_page_numbers(toc_with_page_number, page_list, model=model)\n    logger.info(f'toc_with_page_number: {toc_with_page_number}')\n\n    return toc_with_page_number\n\n\n\n##check if needed to process none page numbers\ndef process_none_page_numbers(toc_items, page_list, start_index=1, model=None):\n    for i, item in enumerate(toc_items):\n        if \"physical_index\" not in item:\n            # logger.info(f\"fix item: {item}\")\n            # Find previous physical_index\n            prev_physical_index = 0  # Default if no previous item exists\n            for j in range(i - 1, -1, -1):\n                if toc_items[j].get('physical_index') is not None:\n                    prev_physical_index = toc_items[j]['physical_index']\n                    break\n            \n            # Find next physical_index\n            next_physical_index = -1  # Default if no next item exists\n            for j in range(i + 1, len(toc_items)):\n                if toc_items[j].get('physical_index') is not None:\n                    next_physical_index = toc_items[j]['physical_index']\n                    break\n\n            page_contents = []\n            for page_index in range(prev_physical_index, next_physical_index+1):\n                # Add bounds checking to prevent IndexError\n                list_index = page_index - start_index\n                if list_index >= 0 and list_index < len(page_list):\n                    page_text = f\"<physical_index_{page_index}>\\n{page_list[list_index][0]}\\n<physical_index_{page_index}>\\n\\n\"\n                 
   page_contents.append(page_text)\n                else:\n                    continue\n\n            item_copy = copy.deepcopy(item)\n            del item_copy['page']\n            result = add_page_number_to_toc(page_contents, item_copy, model)\n            if isinstance(result[0]['physical_index'], str) and result[0]['physical_index'].startswith('<physical_index'):\n                item['physical_index'] = int(result[0]['physical_index'].split('_')[-1].rstrip('>').strip())\n                del item['page']\n    \n    return toc_items\n\n\n\n\ndef check_toc(page_list, opt=None):\n    toc_page_list = find_toc_pages(start_page_index=0, page_list=page_list, opt=opt)\n    if len(toc_page_list) == 0:\n        print('no toc found')\n        return {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'}\n    else:\n        print('toc found')\n        toc_json = toc_extractor(page_list, toc_page_list, opt.model)\n\n        if toc_json['page_index_given_in_toc'] == 'yes':\n            print('index found')\n            return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'yes'}\n        else:\n            current_start_index = toc_page_list[-1] + 1\n            \n            while (toc_json['page_index_given_in_toc'] == 'no' and \n                   current_start_index < len(page_list) and \n                   current_start_index < opt.toc_check_page_num):\n                \n                additional_toc_pages = find_toc_pages(\n                    start_page_index=current_start_index,\n                    page_list=page_list,\n                    opt=opt\n                )\n                \n                if len(additional_toc_pages) == 0:\n                    break\n\n                additional_toc_json = toc_extractor(page_list, additional_toc_pages, opt.model)\n                if additional_toc_json['page_index_given_in_toc'] == 'yes':\n                    print('index found')\n               
     return {'toc_content': additional_toc_json['toc_content'], 'toc_page_list': additional_toc_pages, 'page_index_given_in_toc': 'yes'}\n\n                else:\n                    current_start_index = additional_toc_pages[-1] + 1\n            print('index not found')\n            return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'no'}\n\n\n\n\n\n\n################### fix incorrect toc #########################################################\ndef single_toc_item_index_fixer(section_title, content, model=\"gpt-4o-2024-11-20\"):\n    toc_extractor_prompt = \"\"\"\n    You are given a section title and several pages of a document, your job is to find the physical index of the start page of the section in the partial document.\n\n    The provided pages contains tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X.\n\n    Reply in a JSON format:\n    {\n        \"thinking\": <explain which page, started and closed by <physical_index_X>, contains the start of this section>,\n        \"physical_index\": \"<physical_index_X>\" (keep the format)\n    }\n    Directly return the final JSON structure. 
Do not output anything else.\"\"\"\n\n    prompt = toc_extractor_prompt + '\\nSection Title:\\n' + str(section_title) + '\\nDocument pages:\\n' + content\n    response = ChatGPT_API(model=model, prompt=prompt)\n    json_content = extract_json(response)    \n    return convert_physical_index_to_int(json_content['physical_index'])\n\n\n\nasync def fix_incorrect_toc(toc_with_page_number, page_list, incorrect_results, start_index=1, model=None, logger=None):\n    print(f'start fix_incorrect_toc with {len(incorrect_results)} incorrect results')\n    incorrect_indices = {result['list_index'] for result in incorrect_results}\n    \n    end_index = len(page_list) + start_index - 1\n    \n    incorrect_results_and_range_logs = []\n    # Helper function to process and check a single incorrect item\n    async def process_and_check_item(incorrect_item):\n        list_index = incorrect_item['list_index']\n        \n        # Check if list_index is valid\n        if list_index < 0 or list_index >= len(toc_with_page_number):\n            # Return an invalid result for out-of-bounds indices\n            return {\n                'list_index': list_index,\n                'title': incorrect_item['title'],\n                'physical_index': incorrect_item.get('physical_index'),\n                'is_valid': False\n            }\n        \n        # Find the previous correct item\n        prev_correct = None\n        for i in range(list_index-1, -1, -1):\n            if i not in incorrect_indices and i >= 0 and i < len(toc_with_page_number):\n                physical_index = toc_with_page_number[i].get('physical_index')\n                if physical_index is not None:\n                    prev_correct = physical_index\n                    break\n        # If no previous correct item found, use start_index\n        if prev_correct is None:\n            prev_correct = start_index - 1\n        \n        # Find the next correct item\n        next_correct = None\n        for i in 
range(list_index+1, len(toc_with_page_number)):\n            if i not in incorrect_indices and i >= 0 and i < len(toc_with_page_number):\n                physical_index = toc_with_page_number[i].get('physical_index')\n                if physical_index is not None:\n                    next_correct = physical_index\n                    break\n        # If no next correct item found, use end_index\n        if next_correct is None:\n            next_correct = end_index\n        \n        incorrect_results_and_range_logs.append({\n            'list_index': list_index,\n            'title': incorrect_item['title'],\n            'prev_correct': prev_correct,\n            'next_correct': next_correct\n        })\n\n        page_contents=[]\n        for page_index in range(prev_correct, next_correct+1):\n            # Add bounds checking to prevent IndexError\n            page_list_idx = page_index - start_index\n            if page_list_idx >= 0 and page_list_idx < len(page_list):\n                page_text = f\"<physical_index_{page_index}>\\n{page_list[page_list_idx][0]}\\n<physical_index_{page_index}>\\n\\n\"\n                page_contents.append(page_text)\n            else:\n                continue\n        content_range = ''.join(page_contents)\n        \n        physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)\n        \n        # Check if the result is correct\n        check_item = incorrect_item.copy()\n        check_item['physical_index'] = physical_index_int\n        check_result = await check_title_appearance(check_item, page_list, start_index, model)\n\n        return {\n            'list_index': list_index,\n            'title': incorrect_item['title'],\n            'physical_index': physical_index_int,\n            'is_valid': check_result['answer'] == 'yes'\n        }\n\n    # Process incorrect items concurrently\n    tasks = [\n        process_and_check_item(item)\n        for item in incorrect_results\n    
]\n    results = await asyncio.gather(*tasks, return_exceptions=True)\n    for item, result in zip(incorrect_results, results):\n        if isinstance(result, Exception):\n            print(f\"Processing item {item} generated an exception: {result}\")\n            continue\n    results = [result for result in results if not isinstance(result, Exception)]\n\n    # Update the toc_with_page_number with the fixed indices and check for any invalid results\n    invalid_results = []\n    for result in results:\n        if result['is_valid']:\n            # Add bounds checking to prevent IndexError\n            list_idx = result['list_index']\n            if 0 <= list_idx < len(toc_with_page_number):\n                toc_with_page_number[list_idx]['physical_index'] = result['physical_index']\n            else:\n                # Index is out of bounds, treat as invalid\n                invalid_results.append({\n                    'list_index': result['list_index'],\n                    'title': result['title'],\n                    'physical_index': result['physical_index'],\n                })\n        else:\n            invalid_results.append({\n                'list_index': result['list_index'],\n                'title': result['title'],\n                'physical_index': result['physical_index'],\n            })\n\n    logger.info(f'incorrect_results_and_range_logs: {incorrect_results_and_range_logs}')\n    logger.info(f'invalid_results: {invalid_results}')\n\n    return toc_with_page_number, invalid_results\n\n\n\nasync def fix_incorrect_toc_with_retries(toc_with_page_number, page_list, incorrect_results, start_index=1, max_attempts=3, model=None, logger=None):\n    print('start fix_incorrect_toc')\n    fix_attempt = 0\n    current_toc = toc_with_page_number\n    current_incorrect = incorrect_results\n\n    while current_incorrect:\n        print(f\"Fixing {len(current_incorrect)} incorrect results\")\n        \n        current_toc, current_incorrect = await 
fix_incorrect_toc(current_toc, page_list, current_incorrect, start_index, model, logger)\n                \n        fix_attempt += 1\n        if fix_attempt >= max_attempts:\n            logger.info(\"Maximum fix attempts reached\")\n            break\n    \n    return current_toc, current_incorrect\n\n\n\n\n################### verify toc #########################################################\nasync def verify_toc(page_list, list_result, start_index=1, N=None, model=None):\n    print('start verify_toc')\n    # Find the last non-None physical_index\n    last_physical_index = None\n    for item in reversed(list_result):\n        if item.get('physical_index') is not None:\n            last_physical_index = item['physical_index']\n            break\n    \n    # Early return if we don't have valid physical indices\n    if last_physical_index is None or last_physical_index < len(page_list)/2:\n        return 0, []\n    \n    # Determine which items to check\n    if N is None:\n        print('check all items')\n        sample_indices = range(0, len(list_result))\n    else:\n        N = min(N, len(list_result))\n        print(f'check {N} items')\n        sample_indices = random.sample(range(0, len(list_result)), N)\n\n    # Prepare items with their list indices\n    indexed_sample_list = []\n    for idx in sample_indices:\n        item = list_result[idx]\n        # Skip items with None physical_index (these were invalidated by validate_and_truncate_physical_indices)\n        if item.get('physical_index') is not None:\n            item_with_index = item.copy()\n            item_with_index['list_index'] = idx  # Add the original index in list_result\n            indexed_sample_list.append(item_with_index)\n\n    # Run checks concurrently\n    tasks = [\n        check_title_appearance(item, page_list, start_index, model)\n        for item in indexed_sample_list\n    ]\n    results = await asyncio.gather(*tasks)\n    \n    # Process results\n    correct_count = 0\n    
incorrect_results = []\n    for result in results:\n        if result['answer'] == 'yes':\n            correct_count += 1\n        else:\n            incorrect_results.append(result)\n    \n    # Calculate accuracy\n    checked_count = len(results)\n    accuracy = correct_count / checked_count if checked_count > 0 else 0\n    print(f\"accuracy: {accuracy*100:.2f}%\")\n    return accuracy, incorrect_results\n\n\n\n\n\n################### main process #########################################################\nasync def meta_processor(page_list, mode=None, toc_content=None, toc_page_list=None, start_index=1, opt=None, logger=None):\n    print(mode)\n    print(f'start_index: {start_index}')\n    \n    if mode == 'process_toc_with_page_numbers':\n        toc_with_page_number = process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=opt.toc_check_page_num, model=opt.model, logger=logger)\n    elif mode == 'process_toc_no_page_numbers':\n        toc_with_page_number = process_toc_no_page_numbers(toc_content, toc_page_list, page_list, model=opt.model, logger=logger)\n    else:\n        toc_with_page_number = process_no_toc(page_list, start_index=start_index, model=opt.model, logger=logger)\n            \n    toc_with_page_number = [item for item in toc_with_page_number if item.get('physical_index') is not None] \n    \n    toc_with_page_number = validate_and_truncate_physical_indices(\n        toc_with_page_number, \n        len(page_list), \n        start_index=start_index, \n        logger=logger\n    )\n    \n    accuracy, incorrect_results = await verify_toc(page_list, toc_with_page_number, start_index=start_index, model=opt.model)\n        \n    # Log the mode actually used for this pass (it changes on fallback recursion)\n    logger.info({\n        'mode': mode,\n        'accuracy': accuracy,\n        'incorrect_results': incorrect_results\n    })\n    if accuracy == 1.0 and len(incorrect_results) == 0:\n        return toc_with_page_number\n    if accuracy > 0.6 and len(incorrect_results) > 
0:\n        toc_with_page_number, incorrect_results = await fix_incorrect_toc_with_retries(toc_with_page_number, page_list, incorrect_results,start_index=start_index, max_attempts=3, model=opt.model, logger=logger)\n        return toc_with_page_number\n    else:\n        if mode == 'process_toc_with_page_numbers':\n            return await meta_processor(page_list, mode='process_toc_no_page_numbers', toc_content=toc_content, toc_page_list=toc_page_list, start_index=start_index, opt=opt, logger=logger)\n        elif mode == 'process_toc_no_page_numbers':\n            return await meta_processor(page_list, mode='process_no_toc', start_index=start_index, opt=opt, logger=logger)\n        else:\n            raise Exception('Processing failed')\n        \n \nasync def process_large_node_recursively(node, page_list, opt=None, logger=None):\n    node_page_list = page_list[node['start_index']-1:node['end_index']]\n    token_num = sum([page[1] for page in node_page_list])\n    \n    if node['end_index'] - node['start_index'] > opt.max_page_num_each_node and token_num >= opt.max_token_num_each_node:\n        print('large node:', node['title'], 'start_index:', node['start_index'], 'end_index:', node['end_index'], 'token_num:', token_num)\n\n        node_toc_tree = await meta_processor(node_page_list, mode='process_no_toc', start_index=node['start_index'], opt=opt, logger=logger)\n        node_toc_tree = await check_title_appearance_in_start_concurrent(node_toc_tree, page_list, model=opt.model, logger=logger)\n        \n        # Filter out items with None physical_index before post_processing\n        valid_node_toc_items = [item for item in node_toc_tree if item.get('physical_index') is not None]\n        \n        if valid_node_toc_items and node['title'].strip() == valid_node_toc_items[0]['title'].strip():\n            node['nodes'] = post_processing(valid_node_toc_items[1:], node['end_index'])\n            node['end_index'] = valid_node_toc_items[1]['start_index'] if 
len(valid_node_toc_items) > 1 else node['end_index']\n        else:\n            node['nodes'] = post_processing(valid_node_toc_items, node['end_index'])\n            node['end_index'] = valid_node_toc_items[0]['start_index'] if valid_node_toc_items else node['end_index']\n        \n    if 'nodes' in node and node['nodes']:\n        tasks = [\n            process_large_node_recursively(child_node, page_list, opt, logger=logger)\n            for child_node in node['nodes']\n        ]\n        await asyncio.gather(*tasks)\n    \n    return node\n\nasync def tree_parser(page_list, opt, doc=None, logger=None):\n    check_toc_result = check_toc(page_list, opt)\n    logger.info(check_toc_result)\n\n    if check_toc_result.get(\"toc_content\") and check_toc_result[\"toc_content\"].strip() and check_toc_result[\"page_index_given_in_toc\"] == \"yes\":\n        toc_with_page_number = await meta_processor(\n            page_list, \n            mode='process_toc_with_page_numbers', \n            start_index=1, \n            toc_content=check_toc_result['toc_content'], \n            toc_page_list=check_toc_result['toc_page_list'], \n            opt=opt,\n            logger=logger)\n    else:\n        toc_with_page_number = await meta_processor(\n            page_list, \n            mode='process_no_toc', \n            start_index=1, \n            opt=opt,\n            logger=logger)\n\n    toc_with_page_number = add_preface_if_needed(toc_with_page_number)\n    toc_with_page_number = await check_title_appearance_in_start_concurrent(toc_with_page_number, page_list, model=opt.model, logger=logger)\n    \n    # Filter out items with None physical_index before post_processings\n    valid_toc_items = [item for item in toc_with_page_number if item.get('physical_index') is not None]\n    \n    toc_tree = post_processing(valid_toc_items, len(page_list))\n    tasks = [\n        process_large_node_recursively(node, page_list, opt, logger=logger)\n        for node in toc_tree\n    ]\n    
await asyncio.gather(*tasks)\n    \n    return toc_tree\n\n\ndef page_index_main(doc, opt=None):\n    logger = JsonLogger(doc)\n    \n    is_valid_pdf = (\n        (isinstance(doc, str) and os.path.isfile(doc) and doc.lower().endswith(\".pdf\")) or \n        isinstance(doc, BytesIO)\n    )\n    if not is_valid_pdf:\n        raise ValueError(\"Unsupported input type. Expected a PDF file path or BytesIO object.\")\n\n    print('Parsing PDF...')\n    page_list = get_page_tokens(doc)\n\n    logger.info({'total_page_number': len(page_list)})\n    logger.info({'total_token': sum([page[1] for page in page_list])})\n\n    async def page_index_builder():\n        structure = await tree_parser(page_list, opt, doc=doc, logger=logger)\n        if opt.if_add_node_id == 'yes':\n            write_node_id(structure)    \n        if opt.if_add_node_text == 'yes':\n            add_node_text(structure, page_list)\n        if opt.if_add_node_summary == 'yes':\n            if opt.if_add_node_text == 'no':\n                add_node_text(structure, page_list)\n            await generate_summaries_for_structure(structure, model=opt.model)\n            if opt.if_add_node_text == 'no':\n                remove_structure_text(structure)\n            if opt.if_add_doc_description == 'yes':\n                # Create a clean structure without unnecessary fields for description generation\n                clean_structure = create_clean_structure_for_description(structure)\n                doc_description = generate_doc_description(clean_structure, model=opt.model)\n                return {\n                    'doc_name': get_pdf_name(doc),\n                    'doc_description': doc_description,\n                    'structure': structure,\n                }\n        return {\n            'doc_name': get_pdf_name(doc),\n            'structure': structure,\n        }\n\n    return asyncio.run(page_index_builder())\n\n\ndef page_index(doc, model=None, toc_check_page_num=None, 
max_page_num_each_node=None, max_token_num_each_node=None,\n               if_add_node_id=None, if_add_node_summary=None, if_add_doc_description=None, if_add_node_text=None):\n    \n    user_opt = {\n        arg: value for arg, value in locals().items()\n        if arg != \"doc\" and value is not None\n    }\n    opt = ConfigLoader().load(user_opt)\n    return page_index_main(doc, opt)\n\n\ndef validate_and_truncate_physical_indices(toc_with_page_number, page_list_length, start_index=1, logger=None):\n    \"\"\"\n    Validates and truncates physical indices that exceed the actual document length.\n    This prevents errors when TOC references pages that don't exist in the document (e.g. the file is broken or incomplete).\n    \"\"\"\n    if not toc_with_page_number:\n        return toc_with_page_number\n    \n    max_allowed_page = page_list_length + start_index - 1\n    truncated_items = []\n    \n    for i, item in enumerate(toc_with_page_number):\n        if item.get('physical_index') is not None:\n            original_index = item['physical_index']\n            if original_index > max_allowed_page:\n                item['physical_index'] = None\n                truncated_items.append({\n                    'title': item.get('title', 'Unknown'),\n                    'original_index': original_index\n                })\n                if logger:\n                    logger.info(f\"Removed physical_index for '{item.get('title', 'Unknown')}' (was {original_index}, too far beyond document)\")\n    \n    if truncated_items and logger:\n        logger.info(f\"Total removed items: {len(truncated_items)}\")\n        \n    print(f\"Document validation: {page_list_length} pages, max allowed index: {max_allowed_page}\")\n    if truncated_items:\n        print(f\"Truncated {len(truncated_items)} TOC items that exceeded document length\")\n     \n    return toc_with_page_number"
  },
  {
    "path": "pageindex/page_index_md.py",
    "content": "import asyncio\nimport json\nimport re\nimport os\ntry:\n    from .utils import *\nexcept:\n    from utils import *\n\nasync def get_node_summary(node, summary_token_threshold=200, model=None):\n    node_text = node.get('text')\n    num_tokens = count_tokens(node_text, model=model)\n    if num_tokens < summary_token_threshold:\n        return node_text\n    else:\n        return await generate_node_summary(node, model=model)\n\n\nasync def generate_summaries_for_structure_md(structure, summary_token_threshold, model=None):\n    nodes = structure_to_list(structure)\n    tasks = [get_node_summary(node, summary_token_threshold=summary_token_threshold, model=model) for node in nodes]\n    summaries = await asyncio.gather(*tasks)\n    \n    for node, summary in zip(nodes, summaries):\n        if not node.get('nodes'):\n            node['summary'] = summary\n        else:\n            node['prefix_summary'] = summary\n    return structure\n\n\ndef extract_nodes_from_markdown(markdown_content):\n    header_pattern = r'^(#{1,6})\\s+(.+)$'\n    code_block_pattern = r'^```'\n    node_list = []\n    \n    lines = markdown_content.split('\\n')\n    in_code_block = False\n    \n    for line_num, line in enumerate(lines, 1):\n        stripped_line = line.strip()\n        \n        # Check for code block delimiters (triple backticks)\n        if re.match(code_block_pattern, stripped_line):\n            in_code_block = not in_code_block\n            continue\n        \n        # Skip empty lines\n        if not stripped_line:\n            continue\n        \n        # Only look for headers when not inside a code block\n        if not in_code_block:\n            match = re.match(header_pattern, stripped_line)\n            if match:\n                title = match.group(2).strip()\n                node_list.append({'node_title': title, 'line_num': line_num})\n\n    return node_list, lines\n\n\ndef extract_node_text_content(node_list, markdown_lines):    \n    
all_nodes = []\n    for node in node_list:\n        line_content = markdown_lines[node['line_num'] - 1]\n        header_match = re.match(r'^(#{1,6})', line_content)\n        \n        if header_match is None:\n            print(f\"Warning: Line {node['line_num']} does not contain a valid header: '{line_content}'\")\n            continue\n            \n        processed_node = {\n            'title': node['node_title'],\n            'line_num': node['line_num'],\n            'level': len(header_match.group(1))\n        }\n        all_nodes.append(processed_node)\n    \n    for i, node in enumerate(all_nodes):\n        start_line = node['line_num'] - 1 \n        if i + 1 < len(all_nodes):\n            end_line = all_nodes[i + 1]['line_num'] - 1 \n        else:\n            end_line = len(markdown_lines)\n        \n        node['text'] = '\\n'.join(markdown_lines[start_line:end_line]).strip()    \n    return all_nodes\n\ndef update_node_list_with_text_token_count(node_list, model=None):\n\n    def find_all_children(parent_index, parent_level, node_list):\n        \"\"\"Find all direct and indirect children of a parent node\"\"\"\n        children_indices = []\n        \n        # Look for children after the parent\n        for i in range(parent_index + 1, len(node_list)):\n            current_level = node_list[i]['level']\n            \n            # If we hit a node at same or higher level than parent, stop\n            if current_level <= parent_level:\n                break\n                \n            # This is a descendant\n            children_indices.append(i)\n        \n        return children_indices\n    \n    # Make a copy to avoid modifying the original\n    result_list = node_list.copy()\n    \n    # Process nodes from end to beginning to ensure children are processed before parents\n    for i in range(len(result_list) - 1, -1, -1):\n        current_node = result_list[i]\n        current_level = current_node['level']\n        \n        # Get all 
children of this node\n        children_indices = find_all_children(i, current_level, result_list)\n        \n        # Start with the node's own text\n        node_text = current_node.get('text', '')\n        total_text = node_text\n        \n        # Add all children's text\n        for child_index in children_indices:\n            child_text = result_list[child_index].get('text', '')\n            if child_text:\n                total_text += '\\n' + child_text\n        \n        # Calculate token count for combined text\n        result_list[i]['text_token_count'] = count_tokens(total_text, model=model)\n    \n    return result_list\n\n\ndef tree_thinning_for_index(node_list, min_node_token=None, model=None):\n    def find_all_children(parent_index, parent_level, node_list):\n        children_indices = []\n        \n        for i in range(parent_index + 1, len(node_list)):\n            current_level = node_list[i]['level']\n            \n            if current_level <= parent_level:\n                break\n                \n            children_indices.append(i)\n        \n        return children_indices\n    \n    result_list = node_list.copy()\n    nodes_to_remove = set()\n    \n    for i in range(len(result_list) - 1, -1, -1):\n        if i in nodes_to_remove:\n            continue\n            \n        current_node = result_list[i]\n        current_level = current_node['level']\n        \n        total_tokens = current_node.get('text_token_count', 0)\n        \n        if total_tokens < min_node_token:\n            children_indices = find_all_children(i, current_level, result_list)\n            \n            children_texts = []\n            for child_index in sorted(children_indices):\n                if child_index not in nodes_to_remove:\n                    child_text = result_list[child_index].get('text', '')\n                    if child_text.strip():\n                        children_texts.append(child_text)\n                    
nodes_to_remove.add(child_index)\n            \n            if children_texts:\n                parent_text = current_node.get('text', '')\n                merged_text = parent_text\n                for child_text in children_texts:\n                    if merged_text and not merged_text.endswith('\\n'):\n                        merged_text += '\\n\\n'\n                    merged_text += child_text\n                \n                result_list[i]['text'] = merged_text\n                \n                result_list[i]['text_token_count'] = count_tokens(merged_text, model=model)\n    \n    for index in sorted(nodes_to_remove, reverse=True):\n        result_list.pop(index)\n    \n    return result_list\n\n\ndef build_tree_from_nodes(node_list):\n    if not node_list:\n        return []\n    \n    stack = []\n    root_nodes = []\n    node_counter = 1\n    \n    for node in node_list:\n        current_level = node['level']\n        \n        tree_node = {\n            'title': node['title'],\n            'node_id': str(node_counter).zfill(4),\n            'text': node['text'],\n            'line_num': node['line_num'],\n            'nodes': []\n        }\n        node_counter += 1\n        \n        while stack and stack[-1][1] >= current_level:\n            stack.pop()\n        \n        if not stack:\n            root_nodes.append(tree_node)\n        else:\n            parent_node, parent_level = stack[-1]\n            parent_node['nodes'].append(tree_node)\n        \n        stack.append((tree_node, current_level))\n    \n    return root_nodes\n\n\ndef clean_tree_for_output(tree_nodes):\n    cleaned_nodes = []\n    \n    for node in tree_nodes:\n        cleaned_node = {\n            'title': node['title'],\n            'node_id': node['node_id'],\n            'text': node['text'],\n            'line_num': node['line_num']\n        }\n        \n        if node['nodes']:\n            cleaned_node['nodes'] = clean_tree_for_output(node['nodes'])\n        \n        
cleaned_nodes.append(cleaned_node)\n    \n    return cleaned_nodes\n\n\nasync def md_to_tree(md_path, if_thinning=False, min_token_threshold=None, if_add_node_summary='no', summary_token_threshold=None, model=None, if_add_doc_description='no', if_add_node_text='no', if_add_node_id='yes'):\n    with open(md_path, 'r', encoding='utf-8') as f:\n        markdown_content = f.read()\n    \n    print(f\"Extracting nodes from markdown...\")\n    node_list, markdown_lines = extract_nodes_from_markdown(markdown_content)\n\n    print(f\"Extracting text content from nodes...\")\n    nodes_with_content = extract_node_text_content(node_list, markdown_lines)\n    \n    if if_thinning:\n        nodes_with_content = update_node_list_with_text_token_count(nodes_with_content, model=model)\n        print(f\"Thinning nodes...\")\n        nodes_with_content = tree_thinning_for_index(nodes_with_content, min_token_threshold, model=model)\n    \n    print(f\"Building tree from nodes...\")\n    tree_structure = build_tree_from_nodes(nodes_with_content)\n\n    if if_add_node_id == 'yes':\n        write_node_id(tree_structure)\n\n    print(f\"Formatting tree structure...\")\n    \n    if if_add_node_summary == 'yes':\n        # Always include text for summary generation\n        tree_structure = format_structure(tree_structure, order = ['title', 'node_id', 'summary', 'prefix_summary', 'text', 'line_num', 'nodes'])\n        \n        print(f\"Generating summaries for each node...\")\n        tree_structure = await generate_summaries_for_structure_md(tree_structure, summary_token_threshold=summary_token_threshold, model=model)\n        \n        if if_add_node_text == 'no':\n            # Remove text after summary generation if not requested\n            tree_structure = format_structure(tree_structure, order = ['title', 'node_id', 'summary', 'prefix_summary', 'line_num', 'nodes'])\n        \n        if if_add_doc_description == 'yes':\n            print(f\"Generating document 
description...\")\n            # Create a clean structure without unnecessary fields for description generation\n            clean_structure = create_clean_structure_for_description(tree_structure)\n            doc_description = generate_doc_description(clean_structure, model=model)\n            return {\n                'doc_name': os.path.splitext(os.path.basename(md_path))[0],\n                'doc_description': doc_description,\n                'structure': tree_structure,\n            }\n    else:\n        # No summaries needed, format based on text preference\n        if if_add_node_text == 'yes':\n            tree_structure = format_structure(tree_structure, order = ['title', 'node_id', 'summary', 'prefix_summary', 'text', 'line_num', 'nodes'])\n        else:\n            tree_structure = format_structure(tree_structure, order = ['title', 'node_id', 'summary', 'prefix_summary', 'line_num', 'nodes'])\n    \n    return {\n        'doc_name': os.path.splitext(os.path.basename(md_path))[0],\n        'structure': tree_structure,\n    }\n\n\nif __name__ == \"__main__\":\n    import os\n    import json\n    \n    # MD_NAME = 'Detect-Order-Construct'\n    MD_NAME = 'cognitive-load'\n    MD_PATH = os.path.join(os.path.dirname(__file__), '..', 'tests/markdowns/', f'{MD_NAME}.md')\n\n\n    MODEL=\"gpt-4.1\"\n    IF_THINNING=False\n    THINNING_THRESHOLD=5000\n    SUMMARY_TOKEN_THRESHOLD=200\n    IF_SUMMARY=True\n\n    tree_structure = asyncio.run(md_to_tree(\n        md_path=MD_PATH, \n        if_thinning=IF_THINNING, \n        min_token_threshold=THINNING_THRESHOLD, \n        if_add_node_summary='yes' if IF_SUMMARY else 'no', \n        summary_token_threshold=SUMMARY_TOKEN_THRESHOLD, \n        model=MODEL))\n    \n    print('\\n' + '='*60)\n    print('TREE STRUCTURE')\n    print('='*60)\n    print_json(tree_structure)\n\n    print('\\n' + '='*60)\n    print('TABLE OF CONTENTS')\n    print('='*60)\n    print_toc(tree_structure['structure'])\n\n    output_path = 
os.path.join(os.path.dirname(__file__), '..', 'results', f'{MD_NAME}_structure.json')\n    os.makedirs(os.path.dirname(output_path), exist_ok=True)\n    \n    with open(output_path, 'w', encoding='utf-8') as f:\n        json.dump(tree_structure, f, indent=2, ensure_ascii=False)\n    \n    print(f\"\\nTree structure saved to: {output_path}\")"
  },
  {
    "path": "pageindex/utils.py",
    "content": "import tiktoken\nimport openai\nimport logging\nimport os\nfrom datetime import datetime\nimport time\nimport json\nimport PyPDF2\nimport copy\nimport asyncio\nimport pymupdf\nfrom io import BytesIO\nfrom dotenv import load_dotenv\nload_dotenv()\nimport re\nimport yaml\nfrom pathlib import Path\nfrom types import SimpleNamespace as config\n\nCHATGPT_API_KEY = os.getenv(\"CHATGPT_API_KEY\")\n\ndef count_tokens(text, model=None):\n    if not text:\n        return 0\n    enc = tiktoken.encoding_for_model(model)\n    tokens = enc.encode(text)\n    return len(tokens)\n\ndef ChatGPT_API_with_finish_reason(model, prompt, api_key=CHATGPT_API_KEY, chat_history=None):\n    max_retries = 10\n    client = openai.OpenAI(api_key=api_key)\n    for i in range(max_retries):\n        try:\n            if chat_history:\n                messages = chat_history\n                messages.append({\"role\": \"user\", \"content\": prompt})\n            else:\n                messages = [{\"role\": \"user\", \"content\": prompt}]\n            \n            response = client.chat.completions.create(\n                model=model,\n                messages=messages,\n                temperature=0,\n            )\n            if response.choices[0].finish_reason == \"length\":\n                return response.choices[0].message.content, \"max_output_reached\"\n            else:\n                return response.choices[0].message.content, \"finished\"\n\n        except Exception as e:\n            print('************* Retrying *************')\n            logging.error(f\"Error: {e}\")\n            if i < max_retries - 1:\n                time.sleep(1)  # Wait for 1 second before retrying\n            else:\n                logging.error('Max retries reached for prompt: ' + prompt)\n                return \"\", \"error\"\n\n\n\ndef ChatGPT_API(model, prompt, api_key=CHATGPT_API_KEY, chat_history=None):\n    max_retries = 10\n    client = openai.OpenAI(api_key=api_key)\n    for i 
in range(max_retries):\n        try:\n            if chat_history:\n                messages = chat_history\n                messages.append({\"role\": \"user\", \"content\": prompt})\n            else:\n                messages = [{\"role\": \"user\", \"content\": prompt}]\n            \n            response = client.chat.completions.create(\n                model=model,\n                messages=messages,\n                temperature=0,\n            )\n   \n            return response.choices[0].message.content\n        except Exception as e:\n            print('************* Retrying *************')\n            logging.error(f\"Error: {e}\")\n            if i < max_retries - 1:\n                time.sleep(1)  # Wait for 1 second before retrying\n            else:\n                logging.error('Max retries reached for prompt: ' + prompt)\n                return \"Error\"\n            \n\nasync def ChatGPT_API_async(model, prompt, api_key=CHATGPT_API_KEY):\n    max_retries = 10\n    messages = [{\"role\": \"user\", \"content\": prompt}]\n    for i in range(max_retries):\n        try:\n            async with openai.AsyncOpenAI(api_key=api_key) as client:\n                response = await client.chat.completions.create(\n                    model=model,\n                    messages=messages,\n                    temperature=0,\n                )\n                return response.choices[0].message.content\n        except Exception as e:\n            print('************* Retrying *************')\n            logging.error(f\"Error: {e}\")\n            if i < max_retries - 1:\n                await asyncio.sleep(1)  # Wait for 1s before retrying\n            else:\n                logging.error('Max retries reached for prompt: ' + prompt)\n                return \"Error\"  \n            \n            \ndef get_json_content(response):\n    start_idx = response.find(\"```json\")\n    if start_idx != -1:\n        start_idx += 7\n        response = response[start_idx:]\n      
  \n    end_idx = response.rfind(\"```\")\n    if end_idx != -1:\n        response = response[:end_idx]\n    \n    json_content = response.strip()\n    return json_content\n         \n\ndef extract_json(content):\n    try:\n        # First, try to extract JSON enclosed within ```json and ```\n        start_idx = content.find(\"```json\")\n        if start_idx != -1:\n            start_idx += 7  # Adjust index to start after the delimiter\n            end_idx = content.rfind(\"```\")\n            json_content = content[start_idx:end_idx].strip()\n        else:\n            # If no delimiters, assume entire content could be JSON\n            json_content = content.strip()\n\n        # Clean up common issues that might cause parsing errors\n        json_content = json_content.replace('None', 'null')  # Replace Python None with JSON null\n        json_content = json_content.replace('\\n', ' ').replace('\\r', ' ')  # Remove newlines\n        json_content = ' '.join(json_content.split())  # Normalize whitespace\n\n        # Attempt to parse and return the JSON object\n        return json.loads(json_content)\n    except json.JSONDecodeError as e:\n        logging.error(f\"Failed to extract JSON: {e}\")\n        # Try to clean up the content further if initial parsing fails\n        try:\n            # Remove any trailing commas before closing brackets/braces\n            json_content = json_content.replace(',]', ']').replace(',}', '}')\n            return json.loads(json_content)\n        except:\n            logging.error(\"Failed to parse JSON even after cleanup\")\n            return {}\n    except Exception as e:\n        logging.error(f\"Unexpected error while extracting JSON: {e}\")\n        return {}\n\ndef write_node_id(data, node_id=0):\n    if isinstance(data, dict):\n        data['node_id'] = str(node_id).zfill(4)\n        node_id += 1\n        for key in list(data.keys()):\n            if 'nodes' in key:\n                node_id = write_node_id(data[key], 
node_id)\n    elif isinstance(data, list):\n        for index in range(len(data)):\n            node_id = write_node_id(data[index], node_id)\n    return node_id\n\ndef get_nodes(structure):\n    if isinstance(structure, dict):\n        structure_node = copy.deepcopy(structure)\n        structure_node.pop('nodes', None)\n        nodes = [structure_node]\n        for key in list(structure.keys()):\n            if 'nodes' in key:\n                nodes.extend(get_nodes(structure[key]))\n        return nodes\n    elif isinstance(structure, list):\n        nodes = []\n        for item in structure:\n            nodes.extend(get_nodes(item))\n        return nodes\n    \ndef structure_to_list(structure):\n    if isinstance(structure, dict):\n        nodes = []\n        nodes.append(structure)\n        if 'nodes' in structure:\n            nodes.extend(structure_to_list(structure['nodes']))\n        return nodes\n    elif isinstance(structure, list):\n        nodes = []\n        for item in structure:\n            nodes.extend(structure_to_list(item))\n        return nodes\n\n    \ndef get_leaf_nodes(structure):\n    if isinstance(structure, dict):\n        if not structure['nodes']:\n            structure_node = copy.deepcopy(structure)\n            structure_node.pop('nodes', None)\n            return [structure_node]\n        else:\n            leaf_nodes = []\n            for key in list(structure.keys()):\n                if 'nodes' in key:\n                    leaf_nodes.extend(get_leaf_nodes(structure[key]))\n            return leaf_nodes\n    elif isinstance(structure, list):\n        leaf_nodes = []\n        for item in structure:\n            leaf_nodes.extend(get_leaf_nodes(item))\n        return leaf_nodes\n\ndef is_leaf_node(data, node_id):\n    # Helper function to find the node by its node_id\n    def find_node(data, node_id):\n        if isinstance(data, dict):\n            if data.get('node_id') == node_id:\n                return data\n            for 
key in data.keys():\n                if 'nodes' in key:\n                    result = find_node(data[key], node_id)\n                    if result:\n                        return result\n        elif isinstance(data, list):\n            for item in data:\n                result = find_node(item, node_id)\n                if result:\n                    return result\n        return None\n\n    # Find the node with the given node_id\n    node = find_node(data, node_id)\n\n    # Check if the node is a leaf node\n    if node and not node.get('nodes'):\n        return True\n    return False\n\ndef get_last_node(structure):\n    return structure[-1]\n\n\ndef extract_text_from_pdf(pdf_path):\n    pdf_reader = PyPDF2.PdfReader(pdf_path)\n    ###return text not list \n    text=\"\"\n    for page_num in range(len(pdf_reader.pages)):\n        page = pdf_reader.pages[page_num]\n        text+=page.extract_text()\n    return text\n\ndef get_pdf_title(pdf_path):\n    pdf_reader = PyPDF2.PdfReader(pdf_path)\n    meta = pdf_reader.metadata\n    title = meta.title if meta and meta.title else 'Untitled'\n    return title\n\ndef get_text_of_pages(pdf_path, start_page, end_page, tag=True):\n    pdf_reader = PyPDF2.PdfReader(pdf_path)\n    text = \"\"\n    for page_num in range(start_page-1, end_page):\n        page = pdf_reader.pages[page_num]\n        page_text = page.extract_text()\n        if tag:\n            text += f\"<start_index_{page_num+1}>\\n{page_text}\\n<end_index_{page_num+1}>\\n\"\n        else:\n            text += page_text\n    return text\n\ndef get_first_start_page_from_text(text):\n    start_page = -1\n    start_page_match = re.search(r'<start_index_(\\d+)>', text)\n    if start_page_match:\n        start_page = int(start_page_match.group(1))\n    return start_page\n\ndef get_last_start_page_from_text(text):\n    start_page = -1\n    # Find all matches of start_index tags\n    start_page_matches = re.finditer(r'<start_index_(\\d+)>', text)\n    # Convert iterator 
to list and get the last match if any exist\n    matches_list = list(start_page_matches)\n    if matches_list:\n        start_page = int(matches_list[-1].group(1))\n    return start_page\n\n\ndef sanitize_filename(filename, replacement='-'):\n    # In Linux, only '/' and '\\0' (null) are invalid in filenames.\n    # Null can't be represented in strings, so we only handle '/'.\n    return filename.replace('/', replacement)\n\ndef get_pdf_name(pdf_path):\n    # Extract PDF name\n    if isinstance(pdf_path, str):\n        pdf_name = os.path.basename(pdf_path)\n    elif isinstance(pdf_path, BytesIO):\n        pdf_reader = PyPDF2.PdfReader(pdf_path)\n        meta = pdf_reader.metadata\n        pdf_name = meta.title if meta and meta.title else 'Untitled'\n        pdf_name = sanitize_filename(pdf_name)\n    return pdf_name\n\n\nclass JsonLogger:\n    def __init__(self, file_path):\n        # Extract PDF name for logger name\n        pdf_name = get_pdf_name(file_path)\n            \n        current_time = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n        self.filename = f\"{pdf_name}_{current_time}.json\"\n        os.makedirs(\"./logs\", exist_ok=True)\n        # Initialize empty list to store all messages\n        self.log_data = []\n\n    def log(self, level, message, **kwargs):\n        if isinstance(message, dict):\n            self.log_data.append(message)\n        else:\n            self.log_data.append({'message': message})\n        # Add new message to the log data\n        \n        # Write entire log data to file\n        with open(self._filepath(), \"w\") as f:\n            json.dump(self.log_data, f, indent=2)\n\n    def info(self, message, **kwargs):\n        self.log(\"INFO\", message, **kwargs)\n\n    def error(self, message, **kwargs):\n        self.log(\"ERROR\", message, **kwargs)\n\n    def debug(self, message, **kwargs):\n        self.log(\"DEBUG\", message, **kwargs)\n\n    def exception(self, message, **kwargs):\n        kwargs[\"exception\"] = 
True\n        self.log(\"ERROR\", message, **kwargs)\n\n    def _filepath(self):\n        return os.path.join(\"logs\", self.filename)\n    \n\n\n\ndef list_to_tree(data):\n    def get_parent_structure(structure):\n        \"\"\"Helper function to get the parent structure code\"\"\"\n        if not structure:\n            return None\n        parts = str(structure).split('.')\n        return '.'.join(parts[:-1]) if len(parts) > 1 else None\n    \n    # First pass: Create nodes and track parent-child relationships\n    nodes = {}\n    root_nodes = []\n    \n    for item in data:\n        structure = item.get('structure')\n        node = {\n            'title': item.get('title'),\n            'start_index': item.get('start_index'),\n            'end_index': item.get('end_index'),\n            'nodes': []\n        }\n        \n        nodes[structure] = node\n        \n        # Find parent\n        parent_structure = get_parent_structure(structure)\n        \n        if parent_structure:\n            # Add as child to parent if parent exists\n            if parent_structure in nodes:\n                nodes[parent_structure]['nodes'].append(node)\n            else:\n                root_nodes.append(node)\n        else:\n            # No parent, this is a root node\n            root_nodes.append(node)\n    \n    # Helper function to clean empty children arrays\n    def clean_node(node):\n        if not node['nodes']:\n            del node['nodes']\n        else:\n            for child in node['nodes']:\n                clean_node(child)\n        return node\n    \n    # Clean and return the tree\n    return [clean_node(node) for node in root_nodes]\n\ndef add_preface_if_needed(data):\n    if not isinstance(data, list) or not data:\n        return data\n\n    if data[0]['physical_index'] is not None and data[0]['physical_index'] > 1:\n        preface_node = {\n            \"structure\": \"0\",\n            \"title\": \"Preface\",\n            \"physical_index\": 1,\n   
     }\n        data.insert(0, preface_node)\n    return data\n\n\n\ndef get_page_tokens(pdf_path, model=\"gpt-4o-2024-11-20\", pdf_parser=\"PyPDF2\"):\n    enc = tiktoken.encoding_for_model(model)\n    if pdf_parser == \"PyPDF2\":\n        pdf_reader = PyPDF2.PdfReader(pdf_path)\n        page_list = []\n        for page_num in range(len(pdf_reader.pages)):\n            page = pdf_reader.pages[page_num]\n            page_text = page.extract_text()\n            token_length = len(enc.encode(page_text))\n            page_list.append((page_text, token_length))\n        return page_list\n    elif pdf_parser == \"PyMuPDF\":\n        if isinstance(pdf_path, BytesIO):\n            pdf_stream = pdf_path\n            doc = pymupdf.open(stream=pdf_stream, filetype=\"pdf\")\n        elif isinstance(pdf_path, str) and os.path.isfile(pdf_path) and pdf_path.lower().endswith(\".pdf\"):\n            doc = pymupdf.open(pdf_path)\n        page_list = []\n        for page in doc:\n            page_text = page.get_text()\n            token_length = len(enc.encode(page_text))\n            page_list.append((page_text, token_length))\n        return page_list\n    else:\n        raise ValueError(f\"Unsupported PDF parser: {pdf_parser}\")\n\n        \n\ndef get_text_of_pdf_pages(pdf_pages, start_page, end_page):\n    text = \"\"\n    for page_num in range(start_page-1, end_page):\n        text += pdf_pages[page_num][0]\n    return text\n\ndef get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page):\n    text = \"\"\n    for page_num in range(start_page-1, end_page):\n        text += f\"<physical_index_{page_num+1}>\\n{pdf_pages[page_num][0]}\\n<physical_index_{page_num+1}>\\n\"\n    return text\n\ndef get_number_of_pages(pdf_path):\n    pdf_reader = PyPDF2.PdfReader(pdf_path)\n    num = len(pdf_reader.pages)\n    return num\n\n\n\ndef post_processing(structure, end_physical_index):\n    # First convert page_number to start_index in flat list\n    for i, item in 
enumerate(structure):\n        item['start_index'] = item.get('physical_index')\n        if i < len(structure) - 1:\n            if structure[i + 1].get('appear_start') == 'yes':\n                item['end_index'] = structure[i + 1]['physical_index']-1\n            else:\n                item['end_index'] = structure[i + 1]['physical_index']\n        else:\n            item['end_index'] = end_physical_index\n    tree = list_to_tree(structure)\n    if len(tree)!=0:\n        return tree\n    else:\n        ### remove appear_start \n        for node in structure:\n            node.pop('appear_start', None)\n            node.pop('physical_index', None)\n        return structure\n\ndef clean_structure_post(data):\n    if isinstance(data, dict):\n        data.pop('page_number', None)\n        data.pop('start_index', None)\n        data.pop('end_index', None)\n        if 'nodes' in data:\n            clean_structure_post(data['nodes'])\n    elif isinstance(data, list):\n        for section in data:\n            clean_structure_post(section)\n    return data\n\ndef remove_fields(data, fields=['text']):\n    if isinstance(data, dict):\n        return {k: remove_fields(v, fields)\n            for k, v in data.items() if k not in fields}\n    elif isinstance(data, list):\n        return [remove_fields(item, fields) for item in data]\n    return data\n\ndef print_toc(tree, indent=0):\n    for node in tree:\n        print('  ' * indent + node['title'])\n        if node.get('nodes'):\n            print_toc(node['nodes'], indent + 1)\n\ndef print_json(data, max_len=40, indent=2):\n    def simplify_data(obj):\n        if isinstance(obj, dict):\n            return {k: simplify_data(v) for k, v in obj.items()}\n        elif isinstance(obj, list):\n            return [simplify_data(item) for item in obj]\n        elif isinstance(obj, str) and len(obj) > max_len:\n            return obj[:max_len] + '...'\n        else:\n            return obj\n    \n    simplified = 
simplify_data(data)\n    print(json.dumps(simplified, indent=indent, ensure_ascii=False))\n\n\ndef remove_structure_text(data):\n    if isinstance(data, dict):\n        data.pop('text', None)\n        if 'nodes' in data:\n            remove_structure_text(data['nodes'])\n    elif isinstance(data, list):\n        for item in data:\n            remove_structure_text(item)\n    return data\n\n\ndef check_token_limit(structure, limit=110000):\n    list = structure_to_list(structure)\n    for node in list:\n        num_tokens = count_tokens(node['text'], model='gpt-4o')\n        if num_tokens > limit:\n            print(f\"Node ID: {node['node_id']} has {num_tokens} tokens\")\n            print(\"Start Index:\", node['start_index'])\n            print(\"End Index:\", node['end_index'])\n            print(\"Title:\", node['title'])\n            print(\"\\n\")\n\n\ndef convert_physical_index_to_int(data):\n    if isinstance(data, list):\n        for i in range(len(data)):\n            # Check if item is a dictionary and has 'physical_index' key\n            if isinstance(data[i], dict) and 'physical_index' in data[i]:\n                if isinstance(data[i]['physical_index'], str):\n                    if data[i]['physical_index'].startswith('<physical_index_'):\n                        data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].rstrip('>').strip())\n                    elif data[i]['physical_index'].startswith('physical_index_'):\n                        data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].strip())\n    elif isinstance(data, str):\n        if data.startswith('<physical_index_'):\n            data = int(data.split('_')[-1].rstrip('>').strip())\n        elif data.startswith('physical_index_'):\n            data = int(data.split('_')[-1].strip())\n        # Check data is int\n        if isinstance(data, int):\n            return data\n        else:\n            return None\n    return data\n\n\ndef 
convert_page_to_int(data):\n    for item in data:\n        if 'page' in item and isinstance(item['page'], str):\n            try:\n                item['page'] = int(item['page'])\n            except ValueError:\n                # Keep original value if conversion fails\n                pass\n    return data\n\n\ndef add_node_text(node, pdf_pages):\n    if isinstance(node, dict):\n        start_page = node.get('start_index')\n        end_page = node.get('end_index')\n        node['text'] = get_text_of_pdf_pages(pdf_pages, start_page, end_page)\n        if 'nodes' in node:\n            add_node_text(node['nodes'], pdf_pages)\n    elif isinstance(node, list):\n        for index in range(len(node)):\n            add_node_text(node[index], pdf_pages)\n    return\n\n\ndef add_node_text_with_labels(node, pdf_pages):\n    if isinstance(node, dict):\n        start_page = node.get('start_index')\n        end_page = node.get('end_index')\n        node['text'] = get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page)\n        if 'nodes' in node:\n            add_node_text_with_labels(node['nodes'], pdf_pages)\n    elif isinstance(node, list):\n        for index in range(len(node)):\n            add_node_text_with_labels(node[index], pdf_pages)\n    return\n\n\nasync def generate_node_summary(node, model=None):\n    prompt = f\"\"\"You are given a part of a document, your task is to generate a description of the partial document about what are main points covered in the partial document.\n\n    Partial Document Text: {node['text']}\n    \n    Directly return the description, do not include any other text.\n    \"\"\"\n    response = await ChatGPT_API_async(model, prompt)\n    return response\n\n\nasync def generate_summaries_for_structure(structure, model=None):\n    nodes = structure_to_list(structure)\n    tasks = [generate_node_summary(node, model=model) for node in nodes]\n    summaries = await asyncio.gather(*tasks)\n    \n    for node, summary in zip(nodes, 
summaries):\n        node['summary'] = summary\n    return structure\n\n\ndef create_clean_structure_for_description(structure):\n    \"\"\"\n    Create a clean structure for document description generation,\n    excluding unnecessary fields like 'text'.\n    \"\"\"\n    if isinstance(structure, dict):\n        clean_node = {}\n        # Only include essential fields for description\n        for key in ['title', 'node_id', 'summary', 'prefix_summary']:\n            if key in structure:\n                clean_node[key] = structure[key]\n        \n        # Recursively process child nodes\n        if 'nodes' in structure and structure['nodes']:\n            clean_node['nodes'] = create_clean_structure_for_description(structure['nodes'])\n        \n        return clean_node\n    elif isinstance(structure, list):\n        return [create_clean_structure_for_description(item) for item in structure]\n    else:\n        return structure\n\n\ndef generate_doc_description(structure, model=None):\n    prompt = f\"\"\"You are an expert in generating descriptions for a document.\n    You are given a structure of a document. 
Your task is to generate a one-sentence description for the document, which makes it easy to distinguish the document from other documents.\n        \n    Document Structure: {structure}\n    \n    Directly return the description, do not include any other text.\n    \"\"\"\n    response = ChatGPT_API(model, prompt)\n    return response\n\n\ndef reorder_dict(data, key_order):\n    if not key_order:\n        return data\n    return {key: data[key] for key in key_order if key in data}\n\n\ndef format_structure(structure, order=None):\n    if not order:\n        return structure\n    if isinstance(structure, dict):\n        if 'nodes' in structure:\n            structure['nodes'] = format_structure(structure['nodes'], order)\n        if not structure.get('nodes'):\n            structure.pop('nodes', None)\n        structure = reorder_dict(structure, order)\n    elif isinstance(structure, list):\n        structure = [format_structure(item, order) for item in structure]\n    return structure\n\n\nclass ConfigLoader:\n    def __init__(self, default_path: str = None):\n        if default_path is None:\n            default_path = Path(__file__).parent / \"config.yaml\"\n        self._default_dict = self._load_yaml(default_path)\n\n    @staticmethod\n    def _load_yaml(path):\n        with open(path, \"r\", encoding=\"utf-8\") as f:\n            return yaml.safe_load(f) or {}\n\n    def _validate_keys(self, user_dict):\n        unknown_keys = set(user_dict) - set(self._default_dict)\n        if unknown_keys:\n            raise ValueError(f\"Unknown config keys: {unknown_keys}\")\n\n    def load(self, user_opt=None) -> config:\n        \"\"\"\n        Load the configuration, merging user options with default values.\n        \"\"\"\n        if user_opt is None:\n            user_dict = {}\n        elif isinstance(user_opt, config):\n            user_dict = vars(user_opt)\n        elif isinstance(user_opt, dict):\n            user_dict = user_opt\n        else:\n            
raise TypeError(\"user_opt must be dict, config(SimpleNamespace) or None\")\n\n        self._validate_keys(user_dict)\n        merged = {**self._default_dict, **user_dict}\n        return config(**merged)\n"
  },
  {
    "path": "requirements.txt",
    "content": "openai==1.101.0\npymupdf==1.26.4\nPyPDF2==3.0.1\npython-dotenv==1.1.0\ntiktoken==0.11.0\npyyaml==6.0.2\n"
  },
  {
    "path": "run_pageindex.py",
    "content": "import argparse\nimport os\nimport json\nfrom pageindex import *\nfrom pageindex.page_index_md import md_to_tree\n\nif __name__ == \"__main__\":\n    # Set up argument parser\n    parser = argparse.ArgumentParser(description='Process PDF or Markdown document and generate structure')\n    parser.add_argument('--pdf_path', type=str, help='Path to the PDF file')\n    parser.add_argument('--md_path', type=str, help='Path to the Markdown file')\n\n    parser.add_argument('--model', type=str, default='gpt-4o-2024-11-20', help='Model to use')\n\n    parser.add_argument('--toc-check-pages', type=int, default=20, \n                      help='Number of pages to check for table of contents (PDF only)')\n    parser.add_argument('--max-pages-per-node', type=int, default=10,\n                      help='Maximum number of pages per node (PDF only)')\n    parser.add_argument('--max-tokens-per-node', type=int, default=20000,\n                      help='Maximum number of tokens per node (PDF only)')\n\n    parser.add_argument('--if-add-node-id', type=str, default='yes',\n                      help='Whether to add node id to the node')\n    parser.add_argument('--if-add-node-summary', type=str, default='yes',\n                      help='Whether to add summary to the node')\n    parser.add_argument('--if-add-doc-description', type=str, default='no',\n                      help='Whether to add doc description to the doc')\n    parser.add_argument('--if-add-node-text', type=str, default='no',\n                      help='Whether to add text to the node')\n                      \n    # Markdown specific arguments\n    parser.add_argument('--if-thinning', type=str, default='no',\n                      help='Whether to apply tree thinning for markdown (markdown only)')\n    parser.add_argument('--thinning-threshold', type=int, default=5000,\n                      help='Minimum token threshold for thinning (markdown only)')\n    
parser.add_argument('--summary-token-threshold', type=int, default=200,\n                      help='Token threshold for generating summaries (markdown only)')\n    args = parser.parse_args()\n    \n    # Validate that exactly one file type is specified\n    if not args.pdf_path and not args.md_path:\n        raise ValueError(\"Either --pdf_path or --md_path must be specified\")\n    if args.pdf_path and args.md_path:\n        raise ValueError(\"Only one of --pdf_path or --md_path can be specified\")\n    \n    if args.pdf_path:\n        # Validate PDF file\n        if not args.pdf_path.lower().endswith('.pdf'):\n            raise ValueError(\"PDF file must have .pdf extension\")\n        if not os.path.isfile(args.pdf_path):\n            raise ValueError(f\"PDF file not found: {args.pdf_path}\")\n            \n        # Process PDF file\n        # Configure options\n        opt = config(\n            model=args.model,\n            toc_check_page_num=args.toc_check_pages,\n            max_page_num_each_node=args.max_pages_per_node,\n            max_token_num_each_node=args.max_tokens_per_node,\n            if_add_node_id=args.if_add_node_id,\n            if_add_node_summary=args.if_add_node_summary,\n            if_add_doc_description=args.if_add_doc_description,\n            if_add_node_text=args.if_add_node_text\n        )\n\n        # Process the PDF\n        toc_with_page_number = page_index_main(args.pdf_path, opt)\n        print('Parsing done, saving to file...')\n        \n        # Save results\n        pdf_name = os.path.splitext(os.path.basename(args.pdf_path))[0]    \n        output_dir = './results'\n        output_file = f'{output_dir}/{pdf_name}_structure.json'\n        os.makedirs(output_dir, exist_ok=True)\n        \n        with open(output_file, 'w', encoding='utf-8') as f:\n            json.dump(toc_with_page_number, f, indent=2)\n        \n        print(f'Tree structure saved to: {output_file}')\n            \n    elif args.md_path:\n        # 
Validate Markdown file\n        if not args.md_path.lower().endswith(('.md', '.markdown')):\n            raise ValueError(\"Markdown file must have .md or .markdown extension\")\n        if not os.path.isfile(args.md_path):\n            raise ValueError(f\"Markdown file not found: {args.md_path}\")\n            \n        # Process markdown file\n        print('Processing markdown file...')\n        \n        # Process the markdown\n        import asyncio\n        \n        # Use ConfigLoader to get consistent defaults (matching PDF behavior)\n        from pageindex.utils import ConfigLoader\n        config_loader = ConfigLoader()\n        \n        # Create options dict with user args\n        user_opt = {\n            'model': args.model,\n            'if_add_node_summary': args.if_add_node_summary,\n            'if_add_doc_description': args.if_add_doc_description,\n            'if_add_node_text': args.if_add_node_text,\n            'if_add_node_id': args.if_add_node_id\n        }\n        \n        # Load config with defaults from config.yaml\n        opt = config_loader.load(user_opt)\n        \n        toc_with_page_number = asyncio.run(md_to_tree(\n            md_path=args.md_path,\n            if_thinning=args.if_thinning.lower() == 'yes',\n            min_token_threshold=args.thinning_threshold,\n            if_add_node_summary=opt.if_add_node_summary,\n            summary_token_threshold=args.summary_token_threshold,\n            model=opt.model,\n            if_add_doc_description=opt.if_add_doc_description,\n            if_add_node_text=opt.if_add_node_text,\n            if_add_node_id=opt.if_add_node_id\n        ))\n        \n        print('Parsing done, saving to file...')\n        \n        # Save results\n        md_name = os.path.splitext(os.path.basename(args.md_path))[0]    \n        output_dir = './results'\n        output_file = f'{output_dir}/{md_name}_structure.json'\n        os.makedirs(output_dir, exist_ok=True)\n        \n        with 
open(output_file, 'w', encoding='utf-8') as f:\n            json.dump(toc_with_page_number, f, indent=2, ensure_ascii=False)\n        \n        print(f'Tree structure saved to: {output_file}')"
  },
  {
    "path": "scripts/autoclose-labeled-issues.js",
    "content": "/**\n * scripts/autoclose-labeled-issues.js\n *\n * Auto-closes issues that have a bot \"possible duplicate\" comment older than\n * 3 days, unless:\n * - A human has commented after the bot's duplicate comment\n * - The author reacted with thumbs-down on the duplicate comment\n *\n * Required environment variables:\n *   GITHUB_TOKEN  - GitHub Actions token\n *   REPO_OWNER    - Repository owner\n *   REPO_NAME     - Repository name\n *\n * Optional:\n *   DRY_RUN       - If \"true\", report but do not close (default: false)\n */\n\n'use strict';\n\nconst https = require('https');\n\nconst GITHUB_TOKEN = process.env.GITHUB_TOKEN;\nconst REPO_OWNER   = process.env.REPO_OWNER;\nconst REPO_NAME    = process.env.REPO_NAME;\nconst DRY_RUN      = process.env.DRY_RUN === 'true';\n\nconst THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000;\n\nfunction githubRequest(method, path, body = null, retried = false) {\n  return new Promise((resolve, reject) => {\n    const payload = body ? JSON.stringify(body) : null;\n    const options = {\n      hostname: 'api.github.com',\n      path,\n      method,\n      headers: {\n        'Authorization': `Bearer ${GITHUB_TOKEN}`,\n        'Accept': 'application/vnd.github+json',\n        'User-Agent': 'PageIndex-Autoclose/1.0',\n        'X-GitHub-Api-Version': '2022-11-28',\n        ...(payload ? 
{ 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(payload) } : {}),\n      },\n    };\n\n    const req = https.request(options, (res) => {\n      let data = '';\n      res.on('data', chunk => (data += chunk));\n      res.on('end', async () => {\n        // 429: always retry (rate limit)\n        if (res.statusCode === 429 && !retried) {\n          const retryAfter = parseInt(res.headers['retry-after'] || '60', 10);\n          console.log(`  Rate limited on ${method} ${path}, retrying after ${retryAfter}s...`);\n          await sleep(retryAfter * 1000);\n          try { resolve(await githubRequest(method, path, body, true)); }\n          catch (err) { reject(err); }\n          return;\n        }\n        // 403: retry only when the response is rate-limit related\n        if (res.statusCode === 403 && !retried) {\n          const rateLimitRemaining = res.headers['x-ratelimit-remaining'];\n          const hasRetryAfter = res.headers['retry-after'];\n          if (rateLimitRemaining === '0' || hasRetryAfter) {\n            const retryAfter = parseInt(hasRetryAfter || '60', 10);\n            console.log(`  Rate limited (403) on ${method} ${path}, retrying after ${retryAfter}s...`);\n            await sleep(retryAfter * 1000);\n            try { resolve(await githubRequest(method, path, body, true)); }\n            catch (err) { reject(err); }\n            return;\n          }\n        }\n        if (res.statusCode >= 400) {\n          reject(new Error(`GitHub API ${method} ${path} -> ${res.statusCode}: ${data}`));\n          return;\n        }\n        try { resolve(data ? 
JSON.parse(data) : {}); }\n        catch { resolve({}); }\n      });\n    });\n    req.on('error', reject);\n    if (payload) req.write(payload);\n    req.end();\n  });\n}\n\nconst sleep = (ms) => new Promise(r => setTimeout(r, ms));\n\n/**\n * Fetches open issues with the \"duplicate\" label, paginating as needed.\n * Only returns issues created more than 3 days ago.\n */\nasync function fetchDuplicateIssues() {\n  const issues = [];\n  let page = 1;\n  while (true) {\n    const data = await githubRequest(\n      'GET',\n      `/repos/${REPO_OWNER}/${REPO_NAME}/issues?state=open&labels=duplicate&per_page=100&page=${page}`\n    );\n    if (!Array.isArray(data) || data.length === 0) break;\n    issues.push(...data.filter(i => !i.pull_request));\n    if (data.length < 100) break;\n    page++;\n  }\n\n  const cutoff = new Date(Date.now() - THREE_DAYS_MS);\n  return issues.filter(i => new Date(i.created_at) < cutoff);\n}\n\nfunction isBot(user) {\n  return user.type === 'Bot' || user.login.endsWith('[bot]') || user.login === 'github-actions';\n}\n\n/**\n * Finds the bot's duplicate comment on an issue (contains \"possible duplicate\").\n */\nfunction findDuplicateComment(comments) {\n  return comments.find(c =>\n    isBot(c.user) && c.body.includes('possible duplicate')\n  );\n}\n\n/**\n * Checks if there are human comments after the duplicate comment.\n */\nfunction hasHumanCommentAfter(comments, afterDate) {\n  return comments.some(c => {\n    if (isBot(c.user)) return false;\n    return new Date(c.created_at) > afterDate;\n  });\n}\n\n/**\n * Fetches all comments for an issue, handling pagination.\n * Requests per_page=100 and loops until we get fewer than 100 or an empty array.\n */\nasync function fetchAllComments(issueNumber) {\n  const allComments = [];\n  let page = 1;\n  while (true) {\n    const comments = await githubRequest(\n      'GET',\n      `/repos/${REPO_OWNER}/${REPO_NAME}/issues/${issueNumber}/comments?per_page=100&page=${page}`\n    );\n    if 
(!Array.isArray(comments) || comments.length === 0) break;\n    allComments.push(...comments);\n    if (comments.length < 100) break;\n    page++;\n  }\n  return allComments;\n}\n\n/**\n * Checks if the duplicate comment has a thumbs-down reaction.\n */\nasync function hasThumbsDownReaction(commentId) {\n  const reactions = await githubRequest(\n    'GET',\n    `/repos/${REPO_OWNER}/${REPO_NAME}/issues/comments/${commentId}/reactions`\n  );\n  return Array.isArray(reactions) && reactions.some(r => r.content === '-1');\n}\n\n/**\n * Closes an issue as duplicate with a comment.\n */\nasync function closeAsDuplicate(issueNumber) {\n  const body =\n    'This issue has been automatically closed as a duplicate. ' +\n    'No human activity or objection was received within the 3-day grace period.\\n\\n' +\n    'If you believe this was closed in error, please reopen the issue and leave a comment.';\n\n  await githubRequest(\n    'POST',\n    `/repos/${REPO_OWNER}/${REPO_NAME}/issues/${issueNumber}/comments`,\n    { body }\n  );\n\n  await githubRequest(\n    'PATCH',\n    `/repos/${REPO_OWNER}/${REPO_NAME}/issues/${issueNumber}`,\n    { state: 'closed', state_reason: 'completed' }\n  );\n}\n\nasync function processIssue(issue) {\n  const num = issue.number;\n  console.log(`\\nChecking issue #${num}: ${issue.title}`);\n\n  const comments = await fetchAllComments(num);\n\n  if (!Array.isArray(comments) || comments.length === 0) {\n    console.log(`  -> Could not fetch comments, skipping.`);\n    return false;\n  }\n\n  const dupeComment = findDuplicateComment(comments);\n  if (!dupeComment) {\n    console.log(`  -> No duplicate comment found, skipping.`);\n    return false;\n  }\n\n  const commentDate = new Date(dupeComment.created_at);\n  const ageMs = Date.now() - commentDate.getTime();\n\n  if (ageMs < THREE_DAYS_MS) {\n    const daysLeft = Math.ceil((THREE_DAYS_MS - ageMs) / (24 * 60 * 60 * 1000));\n    console.log(`  -> Duplicate comment is less than 3 days old 
(${daysLeft}d remaining), skipping.`);\n    return false;\n  }\n\n  if (hasHumanCommentAfter(comments, commentDate)) {\n    console.log(`  -> Human commented after duplicate comment, skipping.`);\n    return false;\n  }\n\n  if (await hasThumbsDownReaction(dupeComment.id)) {\n    console.log(`  -> Author reacted with thumbs-down, skipping.`);\n    return false;\n  }\n\n  if (DRY_RUN) {\n    console.log(`  [DRY RUN] Would close issue #${num}`);\n    return true;\n  }\n\n  await closeAsDuplicate(num);\n  console.log(`  -> Closed issue #${num} as duplicate`);\n  return true;\n}\n\nasync function main() {\n  const missing = ['GITHUB_TOKEN', 'REPO_OWNER', 'REPO_NAME'].filter(k => !process.env[k]);\n  if (missing.length) {\n    console.error(`Missing required environment variables: ${missing.join(', ')}`);\n    process.exit(1);\n  }\n\n  console.log('Auto-close duplicate issues');\n  console.log(`  Repository: ${REPO_OWNER}/${REPO_NAME}`);\n  console.log(`  Dry run:    ${DRY_RUN}`);\n\n  const issues = await fetchDuplicateIssues();\n  console.log(`\\nFound ${issues.length} duplicate-labeled issue(s) older than 3 days.`);\n\n  let closedCount = 0;\n  for (const issue of issues) {\n    const closed = await processIssue(issue);\n    if (closed) closedCount++;\n    await sleep(1000);\n  }\n\n  console.log(`\\nSummary: ${closedCount} issue(s) closed.`);\n}\n\nmain().catch(err => {\n  console.error('Fatal error:', err.message);\n  process.exit(1);\n});\n"
  },
  {
    "path": "scripts/comment-on-duplicates.sh",
    "content": "#!/usr/bin/env bash\n#\n# comment-on-duplicates.sh - Posts a duplicate issue comment with auto-close warning.\n#\n# Usage:\n#   ./scripts/comment-on-duplicates.sh --base-issue 123 --potential-duplicates 456 789\n#\nset -euo pipefail\n\nREPO=\"${GITHUB_REPOSITORY:-}\"\nif [ -z \"$REPO\" ]; then\n  echo \"Error: GITHUB_REPOSITORY is not set\" >&2\n  exit 1\nfi\n\nBASE_ISSUE=\"\"\nDUPLICATES=()\n\n# Parse arguments\nwhile [[ $# -gt 0 ]]; do\n  case \"$1\" in\n    --base-issue)\n      BASE_ISSUE=\"$2\"\n      shift 2\n      ;;\n    --potential-duplicates)\n      shift\n      while [[ $# -gt 0 && ! \"$1\" =~ ^-- ]]; do\n        DUPLICATES+=(\"$1\")\n        shift\n      done\n      ;;\n    *)\n      echo \"Error: Unknown argument: $1\" >&2\n      exit 1\n      ;;\n  esac\ndone\n\n# Validate inputs\nif [ -z \"$BASE_ISSUE\" ]; then\n  echo \"Error: --base-issue is required\" >&2\n  exit 1\nfi\n\nif ! [[ \"$BASE_ISSUE\" =~ ^[0-9]+$ ]]; then\n  echo \"Error: --base-issue must be a number, got: $BASE_ISSUE\" >&2\n  exit 1\nfi\n\nif [ ${#DUPLICATES[@]} -eq 0 ]; then\n  echo \"Error: --potential-duplicates requires at least one issue number\" >&2\n  exit 1\nfi\n\nfor dup in \"${DUPLICATES[@]}\"; do\n  if ! [[ \"$dup\" =~ ^[0-9]+$ ]]; then\n    echo \"Error: duplicate issue must be a number, got: $dup\" >&2\n    exit 1\n  fi\ndone\n\n# Limit to 3 duplicates max\nif [ ${#DUPLICATES[@]} -gt 3 ]; then\n  echo \"Warning: Limiting to first 3 duplicates\" >&2\n  DUPLICATES=(\"${DUPLICATES[@]:0:3}\")\nfi\n\n# Build the duplicate links list\nCOUNT=0\nLINKS=\"\"\nfor dup in \"${DUPLICATES[@]}\"; do\n  COUNT=$((COUNT + 1))\n  LINKS=\"${LINKS}${COUNT}. 
https://github.com/${REPO}/issues/${dup}\n\"\ndone\n\n# Build and post the comment — if the issue is closed or doesn't exist, gh will error out\nCOMMENT=\"Found ${COUNT} possible duplicate issue(s):\n\n${LINKS}\nThis issue will be automatically closed as a duplicate in 3 days.\n- To prevent auto-closure, add a comment or react with :thumbsdown: on this comment.\"\n\ngh issue comment \"$BASE_ISSUE\" --repo \"$REPO\" --body \"$COMMENT\"\ngh issue edit \"$BASE_ISSUE\" --repo \"$REPO\" --add-label \"duplicate\"\n\necho \"Posted duplicate comment on issue #$BASE_ISSUE with $COUNT potential duplicate(s)\"\n"
  },
  {
    "path": "tests/results/2023-annual-report-truncated_structure.json",
    "content": "{\n  \"doc_name\": \"2023-annual-report-truncated.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end_index\": 4,\n      \"node_id\": \"0000\"\n    },\n    {\n      \"title\": \"About the Federal Reserve\",\n      \"start_index\": 5,\n      \"end_index\": 7,\n      \"node_id\": \"0001\"\n    },\n    {\n      \"title\": \"Overview\",\n      \"start_index\": 7,\n      \"end_index\": 8,\n      \"node_id\": \"0002\"\n    },\n    {\n      \"title\": \"Monetary Policy and Economic Developments\",\n      \"start_index\": 9,\n      \"end_index\": 9,\n      \"nodes\": [\n        {\n          \"title\": \"March 2024 Summary\",\n          \"start_index\": 9,\n          \"end_index\": 14,\n          \"node_id\": \"0004\"\n        },\n        {\n          \"title\": \"June 2023 Summary\",\n          \"start_index\": 15,\n          \"end_index\": 20,\n          \"node_id\": \"0005\"\n        }\n      ],\n      \"node_id\": \"0003\"\n    },\n    {\n      \"title\": \"Financial Stability\",\n      \"start_index\": 21,\n      \"end_index\": 21,\n      \"nodes\": [\n        {\n          \"title\": \"Monitoring Financial Vulnerabilities\",\n          \"start_index\": 22,\n          \"end_index\": 28,\n          \"node_id\": \"0007\"\n        },\n        {\n          \"title\": \"Domestic and International Cooperation and Coordination\",\n          \"start_index\": 28,\n          \"end_index\": 30,\n          \"node_id\": \"0008\"\n        }\n      ],\n      \"node_id\": \"0006\"\n    },\n    {\n      \"title\": \"Supervision and Regulation\",\n      \"start_index\": 31,\n      \"end_index\": 32,\n      \"nodes\": [\n        {\n          \"title\": \"Supervised and Regulated Institutions\",\n          \"start_index\": 32,\n          \"end_index\": 35,\n          \"node_id\": \"0010\"\n        },\n        {\n          \"title\": \"Supervisory Developments\",\n          \"start_index\": 35,\n          \"end_index\": 
50,\n          \"node_id\": \"0011\"\n        }\n      ],\n      \"node_id\": \"0009\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/results/2023-annual-report_structure.json",
    "content": "{\n  \"doc_name\": \"2023-annual-report.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end_index\": 4,\n      \"node_id\": \"0000\"\n    },\n    {\n      \"title\": \"About the Federal Reserve\",\n      \"start_index\": 5,\n      \"end_index\": 6,\n      \"node_id\": \"0001\"\n    },\n    {\n      \"title\": \"Overview\",\n      \"start_index\": 7,\n      \"end_index\": 8,\n      \"node_id\": \"0002\"\n    },\n    {\n      \"title\": \"Monetary Policy and Economic Developments\",\n      \"start_index\": 9,\n      \"end_index\": 9,\n      \"nodes\": [\n        {\n          \"title\": \"March 2024 Summary\",\n          \"start_index\": 9,\n          \"end_index\": 14,\n          \"node_id\": \"0004\"\n        },\n        {\n          \"title\": \"June 2023 Summary\",\n          \"start_index\": 15,\n          \"end_index\": 20,\n          \"node_id\": \"0005\"\n        }\n      ],\n      \"node_id\": \"0003\"\n    },\n    {\n      \"title\": \"Financial Stability\",\n      \"start_index\": 21,\n      \"end_index\": 21,\n      \"nodes\": [\n        {\n          \"title\": \"Monitoring Financial Vulnerabilities\",\n          \"start_index\": 22,\n          \"end_index\": 28,\n          \"node_id\": \"0007\"\n        },\n        {\n          \"title\": \"Domestic and International Cooperation and Coordination\",\n          \"start_index\": 28,\n          \"end_index\": 31,\n          \"node_id\": \"0008\"\n        }\n      ],\n      \"node_id\": \"0006\"\n    },\n    {\n      \"title\": \"Supervision and Regulation\",\n      \"start_index\": 31,\n      \"end_index\": 31,\n      \"nodes\": [\n        {\n          \"title\": \"Supervised and Regulated Institutions\",\n          \"start_index\": 32,\n          \"end_index\": 35,\n          \"node_id\": \"0010\"\n        },\n        {\n          \"title\": \"Supervisory Developments\",\n          \"start_index\": 35,\n          \"end_index\": 54,\n        
  \"node_id\": \"0011\"\n        },\n        {\n          \"title\": \"Regulatory Developments\",\n          \"start_index\": 55,\n          \"end_index\": 59,\n          \"node_id\": \"0012\"\n        }\n      ],\n      \"node_id\": \"0009\"\n    },\n    {\n      \"title\": \"Payment System and Reserve Bank Oversight\",\n      \"start_index\": 59,\n      \"end_index\": 59,\n      \"nodes\": [\n        {\n          \"title\": \"Payment Services to Depository and Other Institutions\",\n          \"start_index\": 60,\n          \"end_index\": 65,\n          \"node_id\": \"0014\"\n        },\n        {\n          \"title\": \"Currency and Coin\",\n          \"start_index\": 66,\n          \"end_index\": 68,\n          \"node_id\": \"0015\"\n        },\n        {\n          \"title\": \"Fiscal Agency and Government Depository Services\",\n          \"start_index\": 69,\n          \"end_index\": 72,\n          \"node_id\": \"0016\"\n        },\n        {\n          \"title\": \"Evolutions and Improvements to the System\",\n          \"start_index\": 72,\n          \"end_index\": 75,\n          \"node_id\": \"0017\"\n        },\n        {\n          \"title\": \"Oversight of Federal Reserve Banks\",\n          \"start_index\": 75,\n          \"end_index\": 81,\n          \"node_id\": \"0018\"\n        },\n        {\n          \"title\": \"Pro Forma Financial Statements for Federal Reserve Priced Services\",\n          \"start_index\": 82,\n          \"end_index\": 88,\n          \"node_id\": \"0019\"\n        }\n      ],\n      \"node_id\": \"0013\"\n    },\n    {\n      \"title\": \"Consumer and Community Affairs\",\n      \"start_index\": 89,\n      \"end_index\": 89,\n      \"nodes\": [\n        {\n          \"title\": \"Consumer Compliance Supervision\",\n          \"start_index\": 89,\n          \"end_index\": 101,\n          \"node_id\": \"0021\"\n        },\n        {\n          \"title\": \"Consumer Laws and Regulations\",\n          \"start_index\": 101,\n       
   \"end_index\": 102,\n          \"node_id\": \"0022\"\n        },\n        {\n          \"title\": \"Consumer Research and Analysis of Emerging Issues and Policy\",\n          \"start_index\": 102,\n          \"end_index\": 105,\n          \"node_id\": \"0023\"\n        },\n        {\n          \"title\": \"Community Development\",\n          \"start_index\": 105,\n          \"end_index\": 106,\n          \"node_id\": \"0024\"\n        }\n      ],\n      \"node_id\": \"0020\"\n    },\n    {\n      \"title\": \"Appendixes\",\n      \"start_index\": 107,\n      \"end_index\": 109,\n      \"node_id\": \"0025\"\n    },\n    {\n      \"title\": \"Federal Reserve System Organization\",\n      \"start_index\": 109,\n      \"end_index\": 109,\n      \"nodes\": [\n        {\n          \"title\": \"Board of Governors\",\n          \"start_index\": 109,\n          \"end_index\": 116,\n          \"node_id\": \"0027\"\n        },\n        {\n          \"title\": \"Federal Open Market Committee\",\n          \"start_index\": 117,\n          \"end_index\": 118,\n          \"node_id\": \"0028\"\n        },\n        {\n          \"title\": \"Board of Governors Advisory Councils\",\n          \"start_index\": 119,\n          \"end_index\": 122,\n          \"node_id\": \"0029\"\n        },\n        {\n          \"title\": \"Federal Reserve Banks and Branches\",\n          \"start_index\": 123,\n          \"end_index\": 146,\n          \"node_id\": \"0030\"\n        }\n      ],\n      \"node_id\": \"0026\"\n    },\n    {\n      \"title\": \"Minutes of Federal Open Market Committee Meetings\",\n      \"start_index\": 147,\n      \"end_index\": 147,\n      \"nodes\": [\n        {\n          \"title\": \"Meeting Minutes\",\n          \"start_index\": 147,\n          \"end_index\": 149,\n          \"node_id\": \"0032\"\n        }\n      ],\n      \"node_id\": \"0031\"\n    },\n    {\n      \"title\": \"Federal Reserve System Audits\",\n      \"start_index\": 149,\n      \"end_index\": 
149,\n      \"nodes\": [\n        {\n          \"title\": \"Office of Inspector General Activities\",\n          \"start_index\": 149,\n          \"end_index\": 151,\n          \"node_id\": \"0034\"\n        },\n        {\n          \"title\": \"Government Accountability Office Reviews\",\n          \"start_index\": 151,\n          \"end_index\": 152,\n          \"node_id\": \"0035\"\n        }\n      ],\n      \"node_id\": \"0033\"\n    },\n    {\n      \"title\": \"Federal Reserve System Budgets\",\n      \"start_index\": 153,\n      \"end_index\": 153,\n      \"nodes\": [\n        {\n          \"title\": \"System Budgets Overview\",\n          \"start_index\": 153,\n          \"end_index\": 157,\n          \"node_id\": \"0037\"\n        },\n        {\n          \"title\": \"Board of Governors Budgets\",\n          \"start_index\": 157,\n          \"end_index\": 163,\n          \"node_id\": \"0038\"\n        },\n        {\n          \"title\": \"Federal Reserve Banks Budgets\",\n          \"start_index\": 163,\n          \"end_index\": 169,\n          \"node_id\": \"0039\"\n        },\n        {\n          \"title\": \"Currency Budget\",\n          \"start_index\": 169,\n          \"end_index\": 174,\n          \"node_id\": \"0040\"\n        }\n      ],\n      \"node_id\": \"0036\"\n    },\n    {\n      \"title\": \"Record of Policy Actions of the Board of Governors\",\n      \"start_index\": 175,\n      \"end_index\": 175,\n      \"nodes\": [\n        {\n          \"title\": \"Rules and Regulations\",\n          \"start_index\": 175,\n          \"end_index\": 176,\n          \"node_id\": \"0042\"\n        },\n        {\n          \"title\": \"Policy Statements and Other Actions\",\n          \"start_index\": 177,\n          \"end_index\": 181,\n          \"node_id\": \"0043\"\n        },\n        {\n          \"title\": \"Discount Rates for Depository Institutions in 2023\",\n          \"start_index\": 181,\n          \"end_index\": 183,\n          \"node_id\": 
\"0044\"\n        },\n        {\n          \"title\": \"The Board of Governors and the Government Performance and Results Act\",\n          \"start_index\": 184,\n          \"end_index\": 184,\n          \"node_id\": \"0045\"\n        }\n      ],\n      \"node_id\": \"0041\"\n    },\n    {\n      \"title\": \"Litigation\",\n      \"start_index\": 185,\n      \"end_index\": 185,\n      \"nodes\": [\n        {\n          \"title\": \"Pending\",\n          \"start_index\": 185,\n          \"end_index\": 186,\n          \"node_id\": \"0047\"\n        },\n        {\n          \"title\": \"Resolved\",\n          \"start_index\": 186,\n          \"end_index\": 187,\n          \"node_id\": \"0048\"\n        }\n      ],\n      \"node_id\": \"0046\"\n    },\n    {\n      \"title\": \"Statistical Tables\",\n      \"start_index\": 187,\n      \"end_index\": 187,\n      \"nodes\": [\n        {\n          \"title\": \"Federal Reserve open market transactions, 2023\",\n          \"start_index\": 187,\n          \"end_index\": 187,\n          \"nodes\": [\n            {\n              \"title\": \"Type of security and transaction\",\n              \"start_index\": 187,\n              \"end_index\": 188,\n              \"node_id\": \"0051\"\n            },\n            {\n              \"title\": \"Federal agency obligations\",\n              \"start_index\": 188,\n              \"end_index\": 188,\n              \"node_id\": \"0052\"\n            },\n            {\n              \"title\": \"Mortgage-backed securities\",\n              \"start_index\": 188,\n              \"end_index\": 188,\n              \"node_id\": \"0053\"\n            },\n            {\n              \"title\": \"Temporary transactions\",\n              \"start_index\": 188,\n              \"end_index\": 188,\n              \"node_id\": \"0054\"\n            }\n          ],\n          \"node_id\": \"0050\"\n        },\n        {\n          \"title\": \"Federal Reserve Bank holdings of U.S. 
Treasury and federal agency securities, December 31, 2021\\u201323\",\n          \"start_index\": 189,\n          \"end_index\": 189,\n          \"nodes\": [\n            {\n              \"title\": \"By remaining maturity\",\n              \"start_index\": 189,\n              \"end_index\": 189,\n              \"node_id\": \"0056\"\n            },\n            {\n              \"title\": \"By type\",\n              \"start_index\": 189,\n              \"end_index\": 190,\n              \"node_id\": \"0057\"\n            },\n            {\n              \"title\": \"By issuer\",\n              \"start_index\": 190,\n              \"end_index\": 190,\n              \"node_id\": \"0058\"\n            }\n          ],\n          \"node_id\": \"0055\"\n        },\n        {\n          \"title\": \"Reserve requirements of depository institutions, December 31, 2023\",\n          \"start_index\": 191,\n          \"end_index\": 191,\n          \"node_id\": \"0059\"\n        },\n        {\n          \"title\": \"Banking offices and banks affiliated with bank holding companies in the United States, December 31, 2022 and 2023\",\n          \"start_index\": 192,\n          \"end_index\": 192,\n          \"node_id\": \"0060\"\n        },\n        {\n          \"title\": \"Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\\u20132023 and month-end 2023\",\n          \"start_index\": 193,\n          \"end_index\": 196,\n          \"node_id\": \"0061\"\n        },\n        {\n          \"title\": \"Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\\u20131983\",\n          \"start_index\": 197,\n          \"end_index\": 200,\n          \"node_id\": \"0062\"\n        },\n        {\n          \"title\": \"Principal assets and liabilities of insured commercial banks, by class of bank, June 30, 2023 and 2022\",\n          \"start_index\": 201,\n          \"end_index\": 201,\n          
\"node_id\": \"0063\"\n        },\n        {\n          \"title\": \"Initial margin requirements under Regulations T, U, and X\",\n          \"start_index\": 202,\n          \"end_index\": 203,\n          \"node_id\": \"0064\"\n        },\n        {\n          \"title\": \"Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022\",\n          \"start_index\": 203,\n          \"end_index\": 209,\n          \"node_id\": \"0065\"\n        },\n        {\n          \"title\": \"Statement of condition of the Federal Reserve Banks, December 31, 2023 and 2022\",\n          \"start_index\": 209,\n          \"end_index\": 210,\n          \"node_id\": \"0066\"\n        },\n        {\n          \"title\": \"Income and expenses of the Federal Reserve Banks, by Bank, 2023\",\n          \"start_index\": 210,\n          \"end_index\": 212,\n          \"nodes\": [\n            {\n              \"title\": \"Income and expenses of the Federal Reserve Banks, by Bank, 2023\\u2014continued\",\n              \"start_index\": 212,\n              \"end_index\": 214,\n              \"node_id\": \"0068\"\n            }\n          ],\n          \"node_id\": \"0067\"\n        },\n        {\n          \"title\": \"Income and expenses of the Federal Reserve Banks, 1914\\u20132023\",\n          \"start_index\": 214,\n          \"end_index\": 215,\n          \"nodes\": [\n            {\n              \"title\": \"Income and expenses of the Federal Reserve Banks, 1914\\u20132023\\u2014continued\",\n              \"start_index\": 215,\n              \"end_index\": 216,\n              \"node_id\": \"0070\"\n            },\n            {\n              \"title\": \"Income and expenses of the Federal Reserve Banks, 1914\\u20132023\\u2014continued\",\n              \"start_index\": 216,\n              \"end_index\": 217,\n              \"node_id\": \"0071\"\n            },\n            {\n              \"title\": \"Income and expenses of the Federal Reserve Banks, 
1914\\u20132023\\u2014continued\",\n              \"start_index\": 217,\n              \"end_index\": 217,\n              \"node_id\": \"0072\"\n            }\n          ],\n          \"node_id\": \"0069\"\n        },\n        {\n          \"title\": \"Operations in principal departments of the Federal Reserve Banks, 2020\\u201323\",\n          \"start_index\": 218,\n          \"end_index\": 218,\n          \"node_id\": \"0073\"\n        },\n        {\n          \"title\": \"Number and annual salaries of officers and employees of the Federal Reserve Banks, December 31, 2023\",\n          \"start_index\": 219,\n          \"end_index\": 220,\n          \"node_id\": \"0074\"\n        },\n        {\n          \"title\": \"Acquisition costs and net book value of the premises of the Federal Reserve Banks and Branches, December 31, 2023\",\n          \"start_index\": 220,\n          \"end_index\": 222,\n          \"node_id\": \"0075\"\n        }\n      ],\n      \"node_id\": \"0049\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/results/PRML_structure.json",
    "content": "{\n  \"doc_name\": \"PRML.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end_index\": 6,\n      \"node_id\": \"0000\"\n    },\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 7,\n      \"end_index\": 10,\n      \"node_id\": \"0001\"\n    },\n    {\n      \"title\": \"Mathematical notation\",\n      \"start_index\": 11,\n      \"end_index\": 13,\n      \"node_id\": \"0002\"\n    },\n    {\n      \"title\": \"Contents\",\n      \"start_index\": 13,\n      \"end_index\": 20,\n      \"node_id\": \"0003\"\n    },\n    {\n      \"title\": \"Introduction\",\n      \"start_index\": 21,\n      \"end_index\": 24,\n      \"nodes\": [\n        {\n          \"title\": \"Example: Polynomial Curve Fitting\",\n          \"start_index\": 24,\n          \"end_index\": 32,\n          \"node_id\": \"0005\"\n        },\n        {\n          \"title\": \"Probability Theory\",\n          \"start_index\": 32,\n          \"end_index\": 37,\n          \"nodes\": [\n            {\n              \"title\": \"Probability densities\",\n              \"start_index\": 37,\n              \"end_index\": 39,\n              \"node_id\": \"0007\"\n            },\n            {\n              \"title\": \"Expectations and covariances\",\n              \"start_index\": 39,\n              \"end_index\": 41,\n              \"node_id\": \"0008\"\n            },\n            {\n              \"title\": \"Bayesian probabilities\",\n              \"start_index\": 41,\n              \"end_index\": 44,\n              \"node_id\": \"0009\"\n            },\n            {\n              \"title\": \"The Gaussian distribution\",\n              \"start_index\": 44,\n              \"end_index\": 48,\n              \"node_id\": \"0010\"\n            },\n            {\n              \"title\": \"Curve fitting re-visited\",\n              \"start_index\": 48,\n              \"end_index\": 50,\n              \"node_id\": \"0011\"\n         
   },\n            {\n              \"title\": \"Bayesian curve fitting\",\n              \"start_index\": 50,\n              \"end_index\": 52,\n              \"node_id\": \"0012\"\n            }\n          ],\n          \"node_id\": \"0006\"\n        },\n        {\n          \"title\": \"Model Selection\",\n          \"start_index\": 52,\n          \"end_index\": 53,\n          \"node_id\": \"0013\"\n        },\n        {\n          \"title\": \"The Curse of Dimensionality\",\n          \"start_index\": 53,\n          \"end_index\": 58,\n          \"node_id\": \"0014\"\n        },\n        {\n          \"title\": \"Decision Theory\",\n          \"start_index\": 58,\n          \"end_index\": 59,\n          \"nodes\": [\n            {\n              \"title\": \"Minimizing the misclassification rate\",\n              \"start_index\": 59,\n              \"end_index\": 61,\n              \"node_id\": \"0016\"\n            },\n            {\n              \"title\": \"Minimizing the expected loss\",\n              \"start_index\": 61,\n              \"end_index\": 62,\n              \"node_id\": \"0017\"\n            },\n            {\n              \"title\": \"The reject option\",\n              \"start_index\": 62,\n              \"end_index\": 62,\n              \"node_id\": \"0018\"\n            },\n            {\n              \"title\": \"Inference and decision\",\n              \"start_index\": 62,\n              \"end_index\": 66,\n              \"node_id\": \"0019\"\n            },\n            {\n              \"title\": \"Loss functions for regression\",\n              \"start_index\": 66,\n              \"end_index\": 68,\n              \"node_id\": \"0020\"\n            }\n          ],\n          \"node_id\": \"0015\"\n        },\n        {\n          \"title\": \"Information Theory\",\n          \"start_index\": 68,\n          \"end_index\": 75,\n          \"nodes\": [\n            {\n              \"title\": \"Relative entropy and mutual 
information\",\n              \"start_index\": 75,\n              \"end_index\": 78,\n              \"node_id\": \"0022\"\n            }\n          ],\n          \"node_id\": \"0021\"\n        }\n      ],\n      \"node_id\": \"0004\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 78,\n      \"end_index\": 87,\n      \"node_id\": \"0023\"\n    },\n    {\n      \"title\": \"Probability Distributions\",\n      \"start_index\": 87,\n      \"end_index\": 88,\n      \"nodes\": [\n        {\n          \"title\": \"Binary Variables\",\n          \"start_index\": 88,\n          \"end_index\": 91,\n          \"nodes\": [\n            {\n              \"title\": \"The beta distribution\",\n              \"start_index\": 91,\n              \"end_index\": 94,\n              \"node_id\": \"0026\"\n            }\n          ],\n          \"node_id\": \"0025\"\n        },\n        {\n          \"title\": \"Multinomial Variables\",\n          \"start_index\": 94,\n          \"end_index\": 96,\n          \"nodes\": [\n            {\n              \"title\": \"The Dirichlet distribution\",\n              \"start_index\": 96,\n              \"end_index\": 98,\n              \"node_id\": \"0028\"\n            }\n          ],\n          \"node_id\": \"0027\"\n        },\n        {\n          \"title\": \"The Gaussian Distribution\",\n          \"start_index\": 98,\n          \"end_index\": 105,\n          \"nodes\": [\n            {\n              \"title\": \"Conditional Gaussian distributions\",\n              \"start_index\": 105,\n              \"end_index\": 108,\n              \"node_id\": \"0030\"\n            },\n            {\n              \"title\": \"Marginal Gaussian distributions\",\n              \"start_index\": 108,\n              \"end_index\": 110,\n              \"node_id\": \"0031\"\n            },\n            {\n              \"title\": \"Bayes\\u2019 theorem for Gaussian variables\",\n              \"start_index\": 110,\n              
\"end_index\": 113,\n              \"node_id\": \"0032\"\n            },\n            {\n              \"title\": \"Maximum likelihood for the Gaussian\",\n              \"start_index\": 113,\n              \"end_index\": 114,\n              \"node_id\": \"0033\"\n            },\n            {\n              \"title\": \"Sequential estimation\",\n              \"start_index\": 114,\n              \"end_index\": 117,\n              \"node_id\": \"0034\"\n            },\n            {\n              \"title\": \"Bayesian inference for the Gaussian\",\n              \"start_index\": 117,\n              \"end_index\": 122,\n              \"node_id\": \"0035\"\n            },\n            {\n              \"title\": \"Student\\u2019s t-distribution\",\n              \"start_index\": 122,\n              \"end_index\": 125,\n              \"node_id\": \"0036\"\n            },\n            {\n              \"title\": \"Periodic variables\",\n              \"start_index\": 125,\n              \"end_index\": 130,\n              \"node_id\": \"0037\"\n            },\n            {\n              \"title\": \"Mixtures of Gaussians\",\n              \"start_index\": 130,\n              \"end_index\": 133,\n              \"node_id\": \"0038\"\n            }\n          ],\n          \"node_id\": \"0029\"\n        },\n        {\n          \"title\": \"The Exponential Family\",\n          \"start_index\": 133,\n          \"end_index\": 136,\n          \"nodes\": [\n            {\n              \"title\": \"Maximum likelihood and sufficient statistics\",\n              \"start_index\": 136,\n              \"end_index\": 137,\n              \"node_id\": \"0040\"\n            },\n            {\n              \"title\": \"Conjugate priors\",\n              \"start_index\": 137,\n              \"end_index\": 137,\n              \"node_id\": \"0041\"\n            },\n            {\n              \"title\": \"Noninformative priors\",\n              \"start_index\": 137,\n              
\"end_index\": 140,\n              \"node_id\": \"0042\"\n            }\n          ],\n          \"node_id\": \"0039\"\n        },\n        {\n          \"title\": \"Nonparametric Methods\",\n          \"start_index\": 140,\n          \"end_index\": 142,\n          \"nodes\": [\n            {\n              \"title\": \"Kernel density estimators\",\n              \"start_index\": 142,\n              \"end_index\": 144,\n              \"node_id\": \"0044\"\n            },\n            {\n              \"title\": \"Nearest-neighbour methods\",\n              \"start_index\": 144,\n              \"end_index\": 147,\n              \"node_id\": \"0045\"\n            }\n          ],\n          \"node_id\": \"0043\"\n        }\n      ],\n      \"node_id\": \"0024\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 147,\n      \"end_index\": 156,\n      \"node_id\": \"0046\"\n    },\n    {\n      \"title\": \"Linear Models for Regression\",\n      \"start_index\": 157,\n      \"end_index\": 158,\n      \"nodes\": [\n        {\n          \"title\": \"Linear Basis Function Models\",\n          \"start_index\": 158,\n          \"end_index\": 160,\n          \"nodes\": [\n            {\n              \"title\": \"Maximum likelihood and least squares\",\n              \"start_index\": 160,\n              \"end_index\": 163,\n              \"node_id\": \"0049\"\n            },\n            {\n              \"title\": \"Geometry of least squares\",\n              \"start_index\": 163,\n              \"end_index\": 163,\n              \"node_id\": \"0050\"\n            },\n            {\n              \"title\": \"Sequential learning\",\n              \"start_index\": 163,\n              \"end_index\": 164,\n              \"node_id\": \"0051\"\n            },\n            {\n              \"title\": \"Regularized least squares\",\n              \"start_index\": 164,\n              \"end_index\": 166,\n              \"node_id\": \"0052\"\n            },\n      
      {\n              \"title\": \"Multiple outputs\",\n              \"start_index\": 166,\n              \"end_index\": 167,\n              \"node_id\": \"0053\"\n            }\n          ],\n          \"node_id\": \"0048\"\n        },\n        {\n          \"title\": \"The Bias-Variance Decomposition\",\n          \"start_index\": 167,\n          \"end_index\": 172,\n          \"node_id\": \"0054\"\n        },\n        {\n          \"title\": \"Bayesian Linear Regression\",\n          \"start_index\": 172,\n          \"end_index\": 172,\n          \"nodes\": [\n            {\n              \"title\": \"Parameter distribution\",\n              \"start_index\": 172,\n              \"end_index\": 176,\n              \"node_id\": \"0056\"\n            },\n            {\n              \"title\": \"Predictive distribution\",\n              \"start_index\": 176,\n              \"end_index\": 179,\n              \"node_id\": \"0057\"\n            },\n            {\n              \"title\": \"Equivalent kernel\",\n              \"start_index\": 179,\n              \"end_index\": 181,\n              \"node_id\": \"0058\"\n            }\n          ],\n          \"node_id\": \"0055\"\n        },\n        {\n          \"title\": \"Bayesian Model Comparison\",\n          \"start_index\": 181,\n          \"end_index\": 185,\n          \"node_id\": \"0059\"\n        },\n        {\n          \"title\": \"The Evidence Approximation\",\n          \"start_index\": 185,\n          \"end_index\": 186,\n          \"nodes\": [\n            {\n              \"title\": \"Evaluation of the evidence function\",\n              \"start_index\": 186,\n              \"end_index\": 188,\n              \"node_id\": \"0061\"\n            },\n            {\n              \"title\": \"Maximizing the evidence function\",\n              \"start_index\": 188,\n              \"end_index\": 190,\n              \"node_id\": \"0062\"\n            },\n            {\n              \"title\": \"Effective 
number of parameters\",\n              \"start_index\": 190,\n              \"end_index\": 192,\n              \"node_id\": \"0063\"\n            }\n          ],\n          \"node_id\": \"0060\"\n        },\n        {\n          \"title\": \"Limitations of Fixed Basis Functions\",\n          \"start_index\": 192,\n          \"end_index\": 193,\n          \"node_id\": \"0064\"\n        }\n      ],\n      \"node_id\": \"0047\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 193,\n      \"end_index\": 199,\n      \"node_id\": \"0065\"\n    },\n    {\n      \"title\": \"Linear Models for Classification\",\n      \"start_index\": 199,\n      \"end_index\": 201,\n      \"nodes\": [\n        {\n          \"title\": \"Discriminant Functions\",\n          \"start_index\": 201,\n          \"end_index\": 201,\n          \"nodes\": [\n            {\n              \"title\": \"Two classes\",\n              \"start_index\": 201,\n              \"end_index\": 202,\n              \"node_id\": \"0068\"\n            },\n            {\n              \"title\": \"Multiple classes\",\n              \"start_index\": 202,\n              \"end_index\": 204,\n              \"node_id\": \"0069\"\n            },\n            {\n              \"title\": \"Least squares for classification\",\n              \"start_index\": 204,\n              \"end_index\": 206,\n              \"node_id\": \"0070\"\n            },\n            {\n              \"title\": \"Fisher\\u2019s linear discriminant\",\n              \"start_index\": 206,\n              \"end_index\": 209,\n              \"node_id\": \"0071\"\n            },\n            {\n              \"title\": \"Relation to least squares\",\n              \"start_index\": 209,\n              \"end_index\": 211,\n              \"node_id\": \"0072\"\n            },\n            {\n              \"title\": \"Fisher\\u2019s discriminant for multiple classes\",\n              \"start_index\": 211,\n              \"end_index\": 
212,\n              \"node_id\": \"0073\"\n            },\n            {\n              \"title\": \"The perceptron algorithm\",\n              \"start_index\": 212,\n              \"end_index\": 216,\n              \"node_id\": \"0074\"\n            }\n          ],\n          \"node_id\": \"0067\"\n        },\n        {\n          \"title\": \"Probabilistic Generative Models\",\n          \"start_index\": 216,\n          \"end_index\": 218,\n          \"nodes\": [\n            {\n              \"title\": \"Continuous inputs\",\n              \"start_index\": 218,\n              \"end_index\": 220,\n              \"node_id\": \"0076\"\n            },\n            {\n              \"title\": \"Maximum likelihood solution\",\n              \"start_index\": 220,\n              \"end_index\": 222,\n              \"node_id\": \"0077\"\n            },\n            {\n              \"title\": \"Discrete features\",\n              \"start_index\": 222,\n              \"end_index\": 222,\n              \"node_id\": \"0078\"\n            },\n            {\n              \"title\": \"Exponential family\",\n              \"start_index\": 222,\n              \"end_index\": 223,\n              \"node_id\": \"0079\"\n            }\n          ],\n          \"node_id\": \"0075\"\n        },\n        {\n          \"title\": \"Probabilistic Discriminative Models\",\n          \"start_index\": 223,\n          \"end_index\": 224,\n          \"nodes\": [\n            {\n              \"title\": \"Fixed basis functions\",\n              \"start_index\": 224,\n              \"end_index\": 225,\n              \"node_id\": \"0081\"\n            },\n            {\n              \"title\": \"Logistic regression\",\n              \"start_index\": 225,\n              \"end_index\": 227,\n              \"node_id\": \"0082\"\n            },\n            {\n              \"title\": \"Iterative reweighted least squares\",\n              \"start_index\": 227,\n              \"end_index\": 229,\n     
         \"node_id\": \"0083\"\n            },\n            {\n              \"title\": \"Multiclass logistic regression\",\n              \"start_index\": 229,\n              \"end_index\": 230,\n              \"node_id\": \"0084\"\n            },\n            {\n              \"title\": \"Probit regression\",\n              \"start_index\": 230,\n              \"end_index\": 232,\n              \"node_id\": \"0085\"\n            },\n            {\n              \"title\": \"Canonical link functions\",\n              \"start_index\": 232,\n              \"end_index\": 232,\n              \"node_id\": \"0086\"\n            }\n          ],\n          \"node_id\": \"0080\"\n        },\n        {\n          \"title\": \"The Laplace Approximation\",\n          \"start_index\": 233,\n          \"end_index\": 236,\n          \"nodes\": [\n            {\n              \"title\": \"Model comparison and BIC\",\n              \"start_index\": 236,\n              \"end_index\": 237,\n              \"node_id\": \"0088\"\n            }\n          ],\n          \"node_id\": \"0087\"\n        },\n        {\n          \"title\": \"Bayesian Logistic Regression\",\n          \"start_index\": 237,\n          \"end_index\": 237,\n          \"nodes\": [\n            {\n              \"title\": \"Laplace approximation\",\n              \"start_index\": 237,\n              \"end_index\": 238,\n              \"node_id\": \"0090\"\n            },\n            {\n              \"title\": \"Predictive distribution\",\n              \"start_index\": 238,\n              \"end_index\": 240,\n              \"node_id\": \"0091\"\n            }\n          ],\n          \"node_id\": \"0089\"\n        }\n      ],\n      \"node_id\": \"0066\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 240,\n      \"end_index\": 245,\n      \"node_id\": \"0092\"\n    },\n    {\n      \"title\": \"Neural Networks\",\n      \"start_index\": 245,\n      \"end_index\": 247,\n      \"nodes\": 
[\n        {\n          \"title\": \"Feed-forward Network Functions\",\n          \"start_index\": 247,\n          \"end_index\": 251,\n          \"nodes\": [\n            {\n              \"title\": \"Weight-space symmetries\",\n              \"start_index\": 251,\n              \"end_index\": 252,\n              \"node_id\": \"0095\"\n            }\n          ],\n          \"node_id\": \"0094\"\n        },\n        {\n          \"title\": \"Network Training\",\n          \"start_index\": 252,\n          \"end_index\": 256,\n          \"nodes\": [\n            {\n              \"title\": \"Parameter optimization\",\n              \"start_index\": 256,\n              \"end_index\": 257,\n              \"node_id\": \"0097\"\n            },\n            {\n              \"title\": \"Local quadratic approximation\",\n              \"start_index\": 257,\n              \"end_index\": 259,\n              \"node_id\": \"0098\"\n            },\n            {\n              \"title\": \"Use of gradient information\",\n              \"start_index\": 259,\n              \"end_index\": 260,\n              \"node_id\": \"0099\"\n            },\n            {\n              \"title\": \"Gradient descent optimization\",\n              \"start_index\": 260,\n              \"end_index\": 261,\n              \"node_id\": \"0100\"\n            }\n          ],\n          \"node_id\": \"0096\"\n        },\n        {\n          \"title\": \"Error Backpropagation\",\n          \"start_index\": 261,\n          \"end_index\": 262,\n          \"nodes\": [\n            {\n              \"title\": \"Evaluation of error-function derivatives\",\n              \"start_index\": 262,\n              \"end_index\": 265,\n              \"node_id\": \"0102\"\n            },\n            {\n              \"title\": \"A simple example\",\n              \"start_index\": 265,\n              \"end_index\": 266,\n              \"node_id\": \"0103\"\n            },\n            {\n              \"title\": 
\"Efficiency of backpropagation\",\n              \"start_index\": 266,\n              \"end_index\": 267,\n              \"node_id\": \"0104\"\n            },\n            {\n              \"title\": \"The Jacobian matrix\",\n              \"start_index\": 267,\n              \"end_index\": 269,\n              \"node_id\": \"0105\"\n            }\n          ],\n          \"node_id\": \"0101\"\n        },\n        {\n          \"title\": \"The Hessian Matrix\",\n          \"start_index\": 269,\n          \"end_index\": 270,\n          \"nodes\": [\n            {\n              \"title\": \"Diagonal approximation\",\n              \"start_index\": 270,\n              \"end_index\": 271,\n              \"node_id\": \"0107\"\n            },\n            {\n              \"title\": \"Outer product approximation\",\n              \"start_index\": 271,\n              \"end_index\": 272,\n              \"node_id\": \"0108\"\n            },\n            {\n              \"title\": \"Inverse Hessian\",\n              \"start_index\": 272,\n              \"end_index\": 272,\n              \"node_id\": \"0109\"\n            },\n            {\n              \"title\": \"Finite differences\",\n              \"start_index\": 272,\n              \"end_index\": 273,\n              \"node_id\": \"0110\"\n            },\n            {\n              \"title\": \"Exact evaluation of the Hessian\",\n              \"start_index\": 273,\n              \"end_index\": 274,\n              \"node_id\": \"0111\"\n            },\n            {\n              \"title\": \"Fast multiplication by the Hessian\",\n              \"start_index\": 274,\n              \"end_index\": 276,\n              \"node_id\": \"0112\"\n            }\n          ],\n          \"node_id\": \"0106\"\n        },\n        {\n          \"title\": \"Regularization in Neural Networks\",\n          \"start_index\": 276,\n          \"end_index\": 277,\n          \"nodes\": [\n            {\n              \"title\": 
\"Consistent Gaussian priors\",\n              \"start_index\": 277,\n              \"end_index\": 279,\n              \"node_id\": \"0114\"\n            },\n            {\n              \"title\": \"Early stopping\",\n              \"start_index\": 279,\n              \"end_index\": 281,\n              \"node_id\": \"0115\"\n            },\n            {\n              \"title\": \"Invariances\",\n              \"start_index\": 281,\n              \"end_index\": 283,\n              \"node_id\": \"0116\"\n            },\n            {\n              \"title\": \"Tangent propagation\",\n              \"start_index\": 283,\n              \"end_index\": 285,\n              \"node_id\": \"0117\"\n            },\n            {\n              \"title\": \"Training with transformed data\",\n              \"start_index\": 285,\n              \"end_index\": 287,\n              \"node_id\": \"0118\"\n            },\n            {\n              \"title\": \"Convolutional networks\",\n              \"start_index\": 287,\n              \"end_index\": 289,\n              \"node_id\": \"0119\"\n            },\n            {\n              \"title\": \"Soft weight sharing\",\n              \"start_index\": 289,\n              \"end_index\": 292,\n              \"node_id\": \"0120\"\n            }\n          ],\n          \"node_id\": \"0113\"\n        },\n        {\n          \"title\": \"Mixture Density Networks\",\n          \"start_index\": 292,\n          \"end_index\": 297,\n          \"node_id\": \"0121\"\n        },\n        {\n          \"title\": \"Bayesian Neural Networks\",\n          \"start_index\": 297,\n          \"end_index\": 298,\n          \"nodes\": [\n            {\n              \"title\": \"Posterior parameter distribution\",\n              \"start_index\": 298,\n              \"end_index\": 300,\n              \"node_id\": \"0123\"\n            },\n            {\n              \"title\": \"Hyperparameter optimization\",\n              \"start_index\": 
300,\n              \"end_index\": 301,\n              \"node_id\": \"0124\"\n            },\n            {\n              \"title\": \"Bayesian neural networks for classification\",\n              \"start_index\": 301,\n              \"end_index\": 304,\n              \"node_id\": \"0125\"\n            }\n          ],\n          \"node_id\": \"0122\"\n        }\n      ],\n      \"node_id\": \"0093\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 304,\n      \"end_index\": 311,\n      \"node_id\": \"0126\"\n    },\n    {\n      \"title\": \"Kernel Methods\",\n      \"start_index\": 311,\n      \"end_index\": 313,\n      \"nodes\": [\n        {\n          \"title\": \"Dual Representations\",\n          \"start_index\": 313,\n          \"end_index\": 314,\n          \"node_id\": \"0128\"\n        },\n        {\n          \"title\": \"Constructing Kernels\",\n          \"start_index\": 314,\n          \"end_index\": 319,\n          \"node_id\": \"0129\"\n        },\n        {\n          \"title\": \"Radial Basis Function Networks\",\n          \"start_index\": 319,\n          \"end_index\": 321,\n          \"nodes\": [\n            {\n              \"title\": \"Nadaraya-Watson model\",\n              \"start_index\": 321,\n              \"end_index\": 323,\n              \"node_id\": \"0131\"\n            }\n          ],\n          \"node_id\": \"0130\"\n        },\n        {\n          \"title\": \"Gaussian Processes\",\n          \"start_index\": 323,\n          \"end_index\": 324,\n          \"nodes\": [\n            {\n              \"title\": \"Linear regression revisited\",\n              \"start_index\": 324,\n              \"end_index\": 326,\n              \"node_id\": \"0133\"\n            },\n            {\n              \"title\": \"Gaussian processes for regression\",\n              \"start_index\": 326,\n              \"end_index\": 331,\n              \"node_id\": \"0134\"\n            },\n            {\n              \"title\": 
\"Learning the hyperparameters\",\n              \"start_index\": 331,\n              \"end_index\": 332,\n              \"node_id\": \"0135\"\n            },\n            {\n              \"title\": \"Automatic relevance determination\",\n              \"start_index\": 332,\n              \"end_index\": 333,\n              \"node_id\": \"0136\"\n            },\n            {\n              \"title\": \"Gaussian processes for classification\",\n              \"start_index\": 333,\n              \"end_index\": 335,\n              \"node_id\": \"0137\"\n            },\n            {\n              \"title\": \"Laplace approximation\",\n              \"start_index\": 335,\n              \"end_index\": 339,\n              \"node_id\": \"0138\"\n            },\n            {\n              \"title\": \"Connection to neural networks\",\n              \"start_index\": 339,\n              \"end_index\": 340,\n              \"node_id\": \"0139\"\n            }\n          ],\n          \"node_id\": \"0132\"\n        }\n      ],\n      \"node_id\": \"0127\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 340,\n      \"end_index\": 344,\n      \"node_id\": \"0140\"\n    },\n    {\n      \"title\": \"Sparse Kernel Machines\",\n      \"start_index\": 345,\n      \"end_index\": 346,\n      \"nodes\": [\n        {\n          \"title\": \"Maximum Margin Classifiers\",\n          \"start_index\": 346,\n          \"end_index\": 351,\n          \"nodes\": [\n            {\n              \"title\": \"Overlapping class distributions\",\n              \"start_index\": 351,\n              \"end_index\": 356,\n              \"node_id\": \"0143\"\n            },\n            {\n              \"title\": \"Relation to logistic regression\",\n              \"start_index\": 356,\n              \"end_index\": 358,\n              \"node_id\": \"0144\"\n            },\n            {\n              \"title\": \"Multiclass SVMs\",\n              \"start_index\": 358,\n        
      \"end_index\": 359,\n              \"node_id\": \"0145\"\n            },\n            {\n              \"title\": \"SVMs for regression\",\n              \"start_index\": 359,\n              \"end_index\": 364,\n              \"node_id\": \"0146\"\n            },\n            {\n              \"title\": \"Computational learning theory\",\n              \"start_index\": 364,\n              \"end_index\": 365,\n              \"node_id\": \"0147\"\n            }\n          ],\n          \"node_id\": \"0142\"\n        },\n        {\n          \"title\": \"Relevance Vector Machines\",\n          \"start_index\": 365,\n          \"end_index\": 365,\n          \"nodes\": [\n            {\n              \"title\": \"RVM for regression\",\n              \"start_index\": 365,\n              \"end_index\": 369,\n              \"node_id\": \"0149\"\n            },\n            {\n              \"title\": \"Analysis of sparsity\",\n              \"start_index\": 369,\n              \"end_index\": 373,\n              \"node_id\": \"0150\"\n            },\n            {\n              \"title\": \"RVM for classification\",\n              \"start_index\": 373,\n              \"end_index\": 377,\n              \"node_id\": \"0151\"\n            }\n          ],\n          \"node_id\": \"0148\"\n        }\n      ],\n      \"node_id\": \"0141\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 377,\n      \"end_index\": 379,\n      \"node_id\": \"0152\"\n    },\n    {\n      \"title\": \"Graphical Models\",\n      \"start_index\": 379,\n      \"end_index\": 380,\n      \"nodes\": [\n        {\n          \"title\": \"Bayesian Networks\",\n          \"start_index\": 380,\n          \"end_index\": 382,\n          \"nodes\": [\n            {\n              \"title\": \"Example: Polynomial regression\",\n              \"start_index\": 382,\n              \"end_index\": 385,\n              \"node_id\": \"0155\"\n            },\n            {\n              
\"title\": \"Generative models\",\n              \"start_index\": 385,\n              \"end_index\": 386,\n              \"node_id\": \"0156\"\n            },\n            {\n              \"title\": \"Discrete variables\",\n              \"start_index\": 386,\n              \"end_index\": 390,\n              \"node_id\": \"0157\"\n            },\n            {\n              \"title\": \"Linear-Gaussian models\",\n              \"start_index\": 390,\n              \"end_index\": 392,\n              \"node_id\": \"0158\"\n            }\n          ],\n          \"node_id\": \"0154\"\n        },\n        {\n          \"title\": \"Conditional Independence\",\n          \"start_index\": 392,\n          \"end_index\": 393,\n          \"nodes\": [\n            {\n              \"title\": \"Three example graphs\",\n              \"start_index\": 393,\n              \"end_index\": 398,\n              \"node_id\": \"0160\"\n            },\n            {\n              \"title\": \"D-separation\",\n              \"start_index\": 398,\n              \"end_index\": 403,\n              \"node_id\": \"0161\"\n            }\n          ],\n          \"node_id\": \"0159\"\n        },\n        {\n          \"title\": \"Markov Random Fields\",\n          \"start_index\": 403,\n          \"end_index\": 403,\n          \"nodes\": [\n            {\n              \"title\": \"Conditional independence properties\",\n              \"start_index\": 403,\n              \"end_index\": 404,\n              \"node_id\": \"0163\"\n            },\n            {\n              \"title\": \"Factorization properties\",\n              \"start_index\": 404,\n              \"end_index\": 407,\n              \"node_id\": \"0164\"\n            },\n            {\n              \"title\": \"Illustration: Image de-noising\",\n              \"start_index\": 407,\n              \"end_index\": 410,\n              \"node_id\": \"0165\"\n            },\n            {\n              \"title\": \"Relation to 
directed graphs\",\n              \"start_index\": 410,\n              \"end_index\": 413,\n              \"node_id\": \"0166\"\n            }\n          ],\n          \"node_id\": \"0162\"\n        },\n        {\n          \"title\": \"Inference in Graphical Models\",\n          \"start_index\": 413,\n          \"end_index\": 414,\n          \"nodes\": [\n            {\n              \"title\": \"Inference on a chain\",\n              \"start_index\": 414,\n              \"end_index\": 418,\n              \"node_id\": \"0168\"\n            },\n            {\n              \"title\": \"Trees\",\n              \"start_index\": 418,\n              \"end_index\": 419,\n              \"node_id\": \"0169\"\n            },\n            {\n              \"title\": \"Factor graphs\",\n              \"start_index\": 419,\n              \"end_index\": 422,\n              \"node_id\": \"0170\"\n            },\n            {\n              \"title\": \"The sum-product algorithm\",\n              \"start_index\": 422,\n              \"end_index\": 431,\n              \"node_id\": \"0171\"\n            },\n            {\n              \"title\": \"The max-sum algorithm\",\n              \"start_index\": 431,\n              \"end_index\": 436,\n              \"node_id\": \"0172\"\n            },\n            {\n              \"title\": \"Exact inference in general graphs\",\n              \"start_index\": 436,\n              \"end_index\": 437,\n              \"node_id\": \"0173\"\n            },\n            {\n              \"title\": \"Loopy belief propagation\",\n              \"start_index\": 437,\n              \"end_index\": 438,\n              \"node_id\": \"0174\"\n            },\n            {\n              \"title\": \"Learning the graph structure\",\n              \"start_index\": 438,\n              \"end_index\": 438,\n              \"node_id\": \"0175\"\n            }\n          ],\n          \"node_id\": \"0167\"\n        }\n      ],\n      \"node_id\": 
\"0153\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 438,\n      \"end_index\": 443,\n      \"node_id\": \"0176\"\n    },\n    {\n      \"title\": \"Mixture Models and EM\",\n      \"start_index\": 443,\n      \"end_index\": 444,\n      \"nodes\": [\n        {\n          \"title\": \"K-means Clustering\",\n          \"start_index\": 444,\n          \"end_index\": 448,\n          \"nodes\": [\n            {\n              \"title\": \"Image segmentation and compression\",\n              \"start_index\": 448,\n              \"end_index\": 450,\n              \"node_id\": \"0179\"\n            }\n          ],\n          \"node_id\": \"0178\"\n        },\n        {\n          \"title\": \"Mixtures of Gaussians\",\n          \"start_index\": 450,\n          \"end_index\": 452,\n          \"nodes\": [\n            {\n              \"title\": \"Maximum likelihood\",\n              \"start_index\": 452,\n              \"end_index\": 455,\n              \"node_id\": \"0181\"\n            },\n            {\n              \"title\": \"EM for Gaussian mixtures\",\n              \"start_index\": 455,\n              \"end_index\": 459,\n              \"node_id\": \"0182\"\n            }\n          ],\n          \"node_id\": \"0180\"\n        },\n        {\n          \"title\": \"An Alternative View of EM\",\n          \"start_index\": 459,\n          \"end_index\": 461,\n          \"nodes\": [\n            {\n              \"title\": \"Gaussian mixtures revisited\",\n              \"start_index\": 461,\n              \"end_index\": 463,\n              \"node_id\": \"0184\"\n            },\n            {\n              \"title\": \"Relation to K-means\",\n              \"start_index\": 463,\n              \"end_index\": 464,\n              \"node_id\": \"0185\"\n            },\n            {\n              \"title\": \"Mixtures of Bernoulli distributions\",\n              \"start_index\": 464,\n              \"end_index\": 468,\n              
\"node_id\": \"0186\"\n            },\n            {\n              \"title\": \"EM for Bayesian linear regression\",\n              \"start_index\": 468,\n              \"end_index\": 470,\n              \"node_id\": \"0187\"\n            }\n          ],\n          \"node_id\": \"0183\"\n        },\n        {\n          \"title\": \"The EM Algorithm in General\",\n          \"start_index\": 470,\n          \"end_index\": 475,\n          \"node_id\": \"0188\"\n        }\n      ],\n      \"node_id\": \"0177\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 475,\n      \"end_index\": 480,\n      \"node_id\": \"0189\"\n    },\n    {\n      \"title\": \"Approximate Inference\",\n      \"start_index\": 481,\n      \"end_index\": 482,\n      \"nodes\": [\n        {\n          \"title\": \"Variational Inference\",\n          \"start_index\": 482,\n          \"end_index\": 484,\n          \"nodes\": [\n            {\n              \"title\": \"Factorized distributions\",\n              \"start_index\": 484,\n              \"end_index\": 486,\n              \"node_id\": \"0192\"\n            },\n            {\n              \"title\": \"Properties of factorized approximations\",\n              \"start_index\": 486,\n              \"end_index\": 490,\n              \"node_id\": \"0193\"\n            },\n            {\n              \"title\": \"Example: The univariate Gaussian\",\n              \"start_index\": 490,\n              \"end_index\": 493,\n              \"node_id\": \"0194\"\n            },\n            {\n              \"title\": \"Model comparison\",\n              \"start_index\": 493,\n              \"end_index\": 494,\n              \"node_id\": \"0195\"\n            }\n          ],\n          \"node_id\": \"0191\"\n        },\n        {\n          \"title\": \"Illustration: Variational Mixture of Gaussians\",\n          \"start_index\": 494,\n          \"end_index\": 495,\n          \"nodes\": [\n            {\n              
\"title\": \"Variational distribution\",\n              \"start_index\": 495,\n              \"end_index\": 501,\n              \"node_id\": \"0197\"\n            },\n            {\n              \"title\": \"Variational lower bound\",\n              \"start_index\": 501,\n              \"end_index\": 502,\n              \"node_id\": \"0198\"\n            },\n            {\n              \"title\": \"Predictive density\",\n              \"start_index\": 502,\n              \"end_index\": 503,\n              \"node_id\": \"0199\"\n            },\n            {\n              \"title\": \"Determining the number of components\",\n              \"start_index\": 503,\n              \"end_index\": 505,\n              \"node_id\": \"0200\"\n            },\n            {\n              \"title\": \"Induced factorizations\",\n              \"start_index\": 505,\n              \"end_index\": 506,\n              \"node_id\": \"0201\"\n            }\n          ],\n          \"node_id\": \"0196\"\n        },\n        {\n          \"title\": \"Variational Linear Regression\",\n          \"start_index\": 506,\n          \"end_index\": 506,\n          \"nodes\": [\n            {\n              \"title\": \"Variational distribution\",\n              \"start_index\": 506,\n              \"end_index\": 508,\n              \"node_id\": \"0203\"\n            },\n            {\n              \"title\": \"Predictive distribution\",\n              \"start_index\": 508,\n              \"end_index\": 509,\n              \"node_id\": \"0204\"\n            },\n            {\n              \"title\": \"Lower bound\",\n              \"start_index\": 509,\n              \"end_index\": 510,\n              \"node_id\": \"0205\"\n            }\n          ],\n          \"node_id\": \"0202\"\n        },\n        {\n          \"title\": \"Exponential Family Distributions\",\n          \"start_index\": 510,\n          \"end_index\": 511,\n          \"nodes\": [\n            {\n              \"title\": 
\"Variational message passing\",\n              \"start_index\": 511,\n              \"end_index\": 512,\n              \"node_id\": \"0207\"\n            }\n          ],\n          \"node_id\": \"0206\"\n        },\n        {\n          \"title\": \"Local Variational Methods\",\n          \"start_index\": 513,\n          \"end_index\": 518,\n          \"node_id\": \"0208\"\n        },\n        {\n          \"title\": \"Variational Logistic Regression\",\n          \"start_index\": 518,\n          \"end_index\": 518,\n          \"nodes\": [\n            {\n              \"title\": \"Variational posterior distribution\",\n              \"start_index\": 518,\n              \"end_index\": 520,\n              \"node_id\": \"0210\"\n            },\n            {\n              \"title\": \"Optimizing the variational parameters\",\n              \"start_index\": 520,\n              \"end_index\": 522,\n              \"node_id\": \"0211\"\n            },\n            {\n              \"title\": \"Inference of hyperparameters\",\n              \"start_index\": 522,\n              \"end_index\": 525,\n              \"node_id\": \"0212\"\n            }\n          ],\n          \"node_id\": \"0209\"\n        },\n        {\n          \"title\": \"Expectation Propagation\",\n          \"start_index\": 525,\n          \"end_index\": 531,\n          \"nodes\": [\n            {\n              \"title\": \"Example: The clutter problem\",\n              \"start_index\": 531,\n              \"end_index\": 533,\n              \"node_id\": \"0214\"\n            },\n            {\n              \"title\": \"Expectation propagation on graphs\",\n              \"start_index\": 533,\n              \"end_index\": 537,\n              \"node_id\": \"0215\"\n            }\n          ],\n          \"node_id\": \"0213\"\n        }\n      ],\n      \"node_id\": \"0190\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 537,\n      \"end_index\": 542,\n      \"node_id\": 
\"0216\"\n    },\n    {\n      \"title\": \"Sampling Methods\",\n      \"start_index\": 543,\n      \"end_index\": 546,\n      \"nodes\": [\n        {\n          \"title\": \"Basic Sampling Algorithms\",\n          \"start_index\": 546,\n          \"end_index\": 546,\n          \"nodes\": [\n            {\n              \"title\": \"Standard distributions\",\n              \"start_index\": 546,\n              \"end_index\": 548,\n              \"node_id\": \"0219\"\n            },\n            {\n              \"title\": \"Rejection sampling\",\n              \"start_index\": 548,\n              \"end_index\": 550,\n              \"node_id\": \"0220\"\n            },\n            {\n              \"title\": \"Adaptive rejection sampling\",\n              \"start_index\": 550,\n              \"end_index\": 552,\n              \"node_id\": \"0221\"\n            },\n            {\n              \"title\": \"Importance sampling\",\n              \"start_index\": 552,\n              \"end_index\": 554,\n              \"node_id\": \"0222\"\n            },\n            {\n              \"title\": \"Sampling-importance-resampling\",\n              \"start_index\": 554,\n              \"end_index\": 556,\n              \"node_id\": \"0223\"\n            },\n            {\n              \"title\": \"Sampling and the EM algorithm\",\n              \"start_index\": 556,\n              \"end_index\": 556,\n              \"node_id\": \"0224\"\n            }\n          ],\n          \"node_id\": \"0218\"\n        },\n        {\n          \"title\": \"Markov Chain Monte Carlo\",\n          \"start_index\": 557,\n          \"end_index\": 559,\n          \"nodes\": [\n            {\n              \"title\": \"Markov chains\",\n              \"start_index\": 559,\n              \"end_index\": 561,\n              \"node_id\": \"0226\"\n            },\n            {\n              \"title\": \"The Metropolis-Hastings algorithm\",\n              \"start_index\": 561,\n              
\"end_index\": 562,\n              \"node_id\": \"0227\"\n            }\n          ],\n          \"node_id\": \"0225\"\n        },\n        {\n          \"title\": \"Gibbs Sampling\",\n          \"start_index\": 562,\n          \"end_index\": 566,\n          \"node_id\": \"0228\"\n        },\n        {\n          \"title\": \"Slice Sampling\",\n          \"start_index\": 566,\n          \"end_index\": 568,\n          \"node_id\": \"0229\"\n        },\n        {\n          \"title\": \"The Hybrid Monte Carlo Algorithm\",\n          \"start_index\": 568,\n          \"end_index\": 568,\n          \"nodes\": [\n            {\n              \"title\": \"Dynamical systems\",\n              \"start_index\": 568,\n              \"end_index\": 572,\n              \"node_id\": \"0231\"\n            },\n            {\n              \"title\": \"Hybrid Monte Carlo\",\n              \"start_index\": 572,\n              \"end_index\": 574,\n              \"node_id\": \"0232\"\n            }\n          ],\n          \"node_id\": \"0230\"\n        },\n        {\n          \"title\": \"Estimating the Partition Function\",\n          \"start_index\": 574,\n          \"end_index\": 576,\n          \"node_id\": \"0233\"\n        }\n      ],\n      \"node_id\": \"0217\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 576,\n      \"end_index\": 579,\n      \"node_id\": \"0234\"\n    },\n    {\n      \"title\": \"Continuous Latent Variables\",\n      \"start_index\": 579,\n      \"end_index\": 581,\n      \"nodes\": [\n        {\n          \"title\": \"Principal Component Analysis\",\n          \"start_index\": 581,\n          \"end_index\": 581,\n          \"nodes\": [\n            {\n              \"title\": \"Maximum variance formulation\",\n              \"start_index\": 581,\n              \"end_index\": 583,\n              \"node_id\": \"0237\"\n            },\n            {\n              \"title\": \"Minimum-error formulation\",\n              
\"start_index\": 583,\n              \"end_index\": 585,\n              \"node_id\": \"0238\"\n            },\n            {\n              \"title\": \"Applications of PCA\",\n              \"start_index\": 585,\n              \"end_index\": 589,\n              \"node_id\": \"0239\"\n            },\n            {\n              \"title\": \"PCA for high-dimensional data\",\n              \"start_index\": 589,\n              \"end_index\": 590,\n              \"node_id\": \"0240\"\n            }\n          ],\n          \"node_id\": \"0236\"\n        },\n        {\n          \"title\": \"Probabilistic PCA\",\n          \"start_index\": 590,\n          \"end_index\": 594,\n          \"nodes\": [\n            {\n              \"title\": \"Maximum likelihood PCA\",\n              \"start_index\": 594,\n              \"end_index\": 597,\n              \"node_id\": \"0242\"\n            },\n            {\n              \"title\": \"EM algorithm for PCA\",\n              \"start_index\": 597,\n              \"end_index\": 600,\n              \"node_id\": \"0243\"\n            },\n            {\n              \"title\": \"Bayesian PCA\",\n              \"start_index\": 600,\n              \"end_index\": 603,\n              \"node_id\": \"0244\"\n            },\n            {\n              \"title\": \"Factor analysis\",\n              \"start_index\": 603,\n              \"end_index\": 606,\n              \"node_id\": \"0245\"\n            }\n          ],\n          \"node_id\": \"0241\"\n        },\n        {\n          \"title\": \"Kernel PCA\",\n          \"start_index\": 606,\n          \"end_index\": 610,\n          \"node_id\": \"0246\"\n        },\n        {\n          \"title\": \"Nonlinear Latent Variable Models\",\n          \"start_index\": 611,\n          \"end_index\": 611,\n          \"nodes\": [\n            {\n              \"title\": \"Independent component analysis\",\n              \"start_index\": 611,\n              \"end_index\": 612,\n              
\"node_id\": \"0248\"\n            },\n            {\n              \"title\": \"Autoassociative neural networks\",\n              \"start_index\": 612,\n              \"end_index\": 615,\n              \"node_id\": \"0249\"\n            },\n            {\n              \"title\": \"Modelling nonlinear manifolds\",\n              \"start_index\": 615,\n              \"end_index\": 619,\n              \"node_id\": \"0250\"\n            }\n          ],\n          \"node_id\": \"0247\"\n        }\n      ],\n      \"node_id\": \"0235\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 619,\n      \"end_index\": 624,\n      \"node_id\": \"0251\"\n    },\n    {\n      \"title\": \"Sequential Data\",\n      \"start_index\": 625,\n      \"end_index\": 627,\n      \"nodes\": [\n        {\n          \"title\": \"Markov Models\",\n          \"start_index\": 627,\n          \"end_index\": 630,\n          \"node_id\": \"0253\"\n        },\n        {\n          \"title\": \"Hidden Markov Models\",\n          \"start_index\": 630,\n          \"end_index\": 635,\n          \"nodes\": [\n            {\n              \"title\": \"Maximum likelihood for the HMM\",\n              \"start_index\": 635,\n              \"end_index\": 638,\n              \"node_id\": \"0255\"\n            },\n            {\n              \"title\": \"The forward-backward algorithm\",\n              \"start_index\": 638,\n              \"end_index\": 645,\n              \"node_id\": \"0256\"\n            },\n            {\n              \"title\": \"The sum-product algorithm for the HMM\",\n              \"start_index\": 645,\n              \"end_index\": 647,\n              \"node_id\": \"0257\"\n            },\n            {\n              \"title\": \"Scaling factors\",\n              \"start_index\": 647,\n              \"end_index\": 649,\n              \"node_id\": \"0258\"\n            },\n            {\n              \"title\": \"The Viterbi algorithm\",\n              
\"start_index\": 649,\n              \"end_index\": 651,\n              \"node_id\": \"0259\"\n            },\n            {\n              \"title\": \"Extensions of the hidden Markov model\",\n              \"start_index\": 651,\n              \"end_index\": 655,\n              \"node_id\": \"0260\"\n            }\n          ],\n          \"node_id\": \"0254\"\n        },\n        {\n          \"title\": \"Linear Dynamical Systems\",\n          \"start_index\": 655,\n          \"end_index\": 658,\n          \"nodes\": [\n            {\n              \"title\": \"Inference in LDS\",\n              \"start_index\": 658,\n              \"end_index\": 662,\n              \"node_id\": \"0262\"\n            },\n            {\n              \"title\": \"Learning in LDS\",\n              \"start_index\": 662,\n              \"end_index\": 664,\n              \"node_id\": \"0263\"\n            },\n            {\n              \"title\": \"Extensions of LDS\",\n              \"start_index\": 664,\n              \"end_index\": 665,\n              \"node_id\": \"0264\"\n            },\n            {\n              \"title\": \"Particle filters\",\n              \"start_index\": 665,\n              \"end_index\": 666,\n              \"node_id\": \"0265\"\n            }\n          ],\n          \"node_id\": \"0261\"\n        }\n      ],\n      \"node_id\": \"0252\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 666,\n      \"end_index\": 672,\n      \"node_id\": \"0266\"\n    },\n    {\n      \"title\": \"Combining Models\",\n      \"start_index\": 673,\n      \"end_index\": 674,\n      \"nodes\": [\n        {\n          \"title\": \"Bayesian Model Averaging\",\n          \"start_index\": 674,\n          \"end_index\": 675,\n          \"node_id\": \"0268\"\n        },\n        {\n          \"title\": \"Committees\",\n          \"start_index\": 675,\n          \"end_index\": 677,\n          \"node_id\": \"0269\"\n        },\n        {\n          
\"title\": \"Boosting\",\n          \"start_index\": 677,\n          \"end_index\": 679,\n          \"nodes\": [\n            {\n              \"title\": \"Minimizing exponential error\",\n              \"start_index\": 679,\n              \"end_index\": 681,\n              \"node_id\": \"0271\"\n            },\n            {\n              \"title\": \"Error functions for boosting\",\n              \"start_index\": 681,\n              \"end_index\": 683,\n              \"node_id\": \"0272\"\n            }\n          ],\n          \"node_id\": \"0270\"\n        },\n        {\n          \"title\": \"Tree-based Models\",\n          \"start_index\": 683,\n          \"end_index\": 686,\n          \"node_id\": \"0273\"\n        },\n        {\n          \"title\": \"Conditional Mixture Models\",\n          \"start_index\": 686,\n          \"end_index\": 687,\n          \"nodes\": [\n            {\n              \"title\": \"Mixtures of linear regression models\",\n              \"start_index\": 687,\n              \"end_index\": 690,\n              \"node_id\": \"0275\"\n            },\n            {\n              \"title\": \"Mixtures of logistic models\",\n              \"start_index\": 690,\n              \"end_index\": 692,\n              \"node_id\": \"0276\"\n            },\n            {\n              \"title\": \"Mixtures of experts\",\n              \"start_index\": 692,\n              \"end_index\": 694,\n              \"node_id\": \"0277\"\n            }\n          ],\n          \"node_id\": \"0274\"\n        }\n      ],\n      \"node_id\": \"0267\"\n    },\n    {\n      \"title\": \"Exercises\",\n      \"start_index\": 694,\n      \"end_index\": 696,\n      \"node_id\": \"0278\"\n    },\n    {\n      \"title\": \"Appendix A Data Sets\",\n      \"start_index\": 697,\n      \"end_index\": 704,\n      \"node_id\": \"0279\"\n    },\n    {\n      \"title\": \"Appendix B Probability Distributions\",\n      \"start_index\": 705,\n      \"end_index\": 714,\n      
\"node_id\": \"0280\"\n    },\n    {\n      \"title\": \"Appendix C Properties of Matrices\",\n      \"start_index\": 715,\n      \"end_index\": 722,\n      \"node_id\": \"0281\"\n    },\n    {\n      \"title\": \"Appendix D Calculus of Variations\",\n      \"start_index\": 723,\n      \"end_index\": 726,\n      \"node_id\": \"0282\"\n    },\n    {\n      \"title\": \"Appendix E Lagrange Multipliers\",\n      \"start_index\": 727,\n      \"end_index\": 730,\n      \"node_id\": \"0283\"\n    },\n    {\n      \"title\": \"References\",\n      \"start_index\": 731,\n      \"end_index\": 749,\n      \"node_id\": \"0284\"\n    },\n    {\n      \"title\": \"Index\",\n      \"start_index\": 749,\n      \"end_index\": 758,\n      \"node_id\": \"0285\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/results/Regulation Best Interest_Interpretive release_structure.json",
    "content": "{\n  \"doc_name\": \"Regulation Best Interest_Interpretive release.pdf\",\n  \"doc_description\": \"A detailed analysis of the SEC's interpretation of the \\\"solely incidental\\\" prong of the broker-dealer exclusion under the Investment Advisers Act of 1940, including its historical context, application guidance, economic implications, and regulatory considerations.\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end_index\": 2,\n      \"node_id\": \"0000\",\n      \"summary\": \"The partial document outlines an interpretation by the Securities and Exchange Commission (SEC) regarding the \\\"solely incidental\\\" prong of the broker-dealer exclusion under the Investment Advisers Act of 1940. It clarifies that brokers or dealers providing advisory services that are incidental to their primary business and for which they receive no special compensation are excluded from the definition of \\\"investment adviser\\\" under the Act. The document includes a historical and legislative context, the scope of the \\\"solely incidental\\\" prong, guidance on its application, and economic considerations related to the interpretation. It also provides contact information for further inquiries and specifies the effective date of the interpretation as July 12, 2019.\"\n    },\n    {\n      \"title\": \"Introduction\",\n      \"start_index\": 2,\n      \"end_index\": 6,\n      \"node_id\": \"0001\",\n      \"summary\": \"The partial document discusses the regulation of investment advisers under the Advisers Act, specifically focusing on the \\\"broker-dealer exclusion,\\\" which exempts brokers and dealers from being classified as investment advisers under certain conditions. Key points include:\\n\\n1. 
**Introduction to the Advisers Act**: Overview of the regulation of investment advisers and the broker-dealer exclusion, which applies when advisory services are \\\"solely incidental\\\" to brokerage business and no special compensation is received.\\n\\n2. **Historical Context and Legislative History**: Examination of the historical practices of broker-dealers providing investment advice, distinguishing between auxiliary advice as part of brokerage services and separate advisory services.\\n\\n3. **Interpretation of the Solely Incidental Prong**: Clarification of the \\\"solely incidental\\\" condition of the broker-dealer exclusion, including its application to activities like investment discretion and account monitoring.\\n\\n4. **Economic Considerations**: Discussion of the potential economic effects of the interpretation and application of the broker-dealer exclusion.\\n\\n5. **Regulatory Developments**: Reference to the Commission's 2018 proposals, including Regulation Best Interest (Reg. BI), the Proposed Fiduciary Interpretation, and the Relationship Summary Proposal, aimed at enhancing standards of conduct and investor understanding.\\n\\n6. **Public Comments and Feedback**: Summary of public comments on the scope and interpretation of the broker-dealer exclusion, highlighting disagreements and requests for clarification on the \\\"solely incidental\\\" prong.\\n\\n7. 
**Adoption of Interpretation**: The Commission's adoption of an interpretation to confirm and clarify its position on the \\\"solely incidental\\\" prong, complementing related rules and forms to improve investor understanding of broker-dealer and adviser relationships.\"\n    },\n    {\n      \"title\": \"Interpretation and Application\",\n      \"start_index\": 6,\n      \"end_index\": 8,\n      \"nodes\": [\n        {\n          \"title\": \"Historical Context and Legislative History\",\n          \"start_index\": 8,\n          \"end_index\": 10,\n          \"node_id\": \"0003\",\n          \"summary\": \"The partial document discusses the historical context and legislative development of the Investment Advisers Act of 1940. It highlights the findings of a congressional study conducted by the SEC between 1935 and 1939, which identified issues with distinguishing legitimate investment counselors from unregulated \\\"tipster\\\" organizations and problems in the organization and operation of investment counsel institutions. The document explains how these findings led to the passage of the Advisers Act, which broadly defined \\\"investment adviser\\\" and established regulatory oversight for those providing investment advice for compensation. It also addresses the exclusion of certain professionals, such as broker-dealers, from the definition of \\\"investment adviser\\\" if their advice is incidental to their primary business and not specially compensated. 
Additionally, the document explores the scope of the \\\"solely incidental\\\" prong of the broker-dealer exclusion, referencing interpretations and rules by the SEC, including a 2005 rule regarding fee-based brokerage accounts.\"\n        },\n        {\n          \"title\": \"Scope of the Solely Incidental Prong of the Broker-Dealer Exclusion\",\n          \"start_index\": 10,\n          \"end_index\": 14,\n          \"node_id\": \"0004\",\n          \"summary\": \"The partial document discusses the \\\"broker-dealer exclusion\\\" under the Investment Advisers Act, specifically focusing on the \\\"solely incidental\\\" prong. It examines the scope of this exclusion, emphasizing that investment advice provided by broker-dealers is considered \\\"solely incidental\\\" if it is connected to and reasonably related to their primary business of effecting securities transactions. The document references historical interpretations, court rulings (e.g., Financial Planning Association v. SEC and Thomas v. Metropolitan Life Insurance Company), and legislative history to clarify this standard. It highlights that the frequency or importance of advice does not determine whether it meets the \\\"solely incidental\\\" standard, but rather its relationship to the broker-dealer's primary business. The document also provides guidance on applying this interpretation to specific practices, such as exercising investment discretion and account monitoring, noting that certain discretionary activities may fall outside the scope of the exclusion.\"\n        },\n        {\n          \"title\": \"Guidance on Applying the Interpretation of the Solely Incidental Prong\",\n          \"start_index\": 14,\n          \"end_index\": 22,\n          \"node_id\": \"0005\",\n          \"summary\": \"The partial document provides guidance on the application of the \\\"solely incidental\\\" prong of the broker-dealer exclusion under the Advisers Act. 
It focuses on two key areas: (1) the exercise of investment discretion by broker-dealers over customer accounts and (2) account monitoring. The document discusses the Commission's interpretation that unlimited investment discretion is not \\\"solely incidental\\\" to a broker-dealer's business, as it indicates a primarily advisory relationship. However, temporary or limited discretion in specific scenarios (e.g., cash management, tax-loss sales, or margin requirements) may be consistent with the \\\"solely incidental\\\" prong. It also addresses account monitoring, stating that agreed-upon periodic monitoring for buy, sell, or hold recommendations may align with the broker-dealer exclusion, while continuous monitoring or advisory-like services would not. The document includes examples, refinements to prior interpretations, and considerations for broker-dealers to adopt policies ensuring compliance. It concludes with economic considerations, highlighting the potential impact on broker-dealers, customers, and the financial advice market.\"\n        }\n      ],\n      \"node_id\": \"0002\",\n      \"summary\": \"The partial document discusses the historical context and legislative history of the Advisers Act of 1940, focusing on the roles of broker-dealers in providing investment advice. It highlights two distinct ways broker-dealers offered advice: as part of traditional brokerage services with fixed commissions and as separate advisory services for a fee. The document examines the concept of \\\"brokerage house advice,\\\" detailing the types of information and services provided, such as market analyses, tax information, and investment recommendations. It also references a congressional study conducted between 1935 and 1939, which identified issues with distinguishing legitimate investment counselors from \\\"tipster\\\" organizations and problems in the organization and operation of investment counsel institutions. 
These findings led to the enactment of the Advisers Act, which broadly defined \\\"investment adviser\\\" to regulate those providing investment advice for compensation. The document also references various reports, hearings, and literature that informed the development of the Act.\"\n    },\n    {\n      \"title\": \"Economic Considerations\",\n      \"start_index\": 22,\n      \"end_index\": 22,\n      \"nodes\": [\n        {\n          \"title\": \"Background\",\n          \"start_index\": 22,\n          \"end_index\": 23,\n          \"node_id\": \"0007\",\n          \"summary\": \"The partial document discusses the U.S. Securities and Exchange Commission's (SEC) interpretation of the \\\"solely incidental\\\" prong of the broker-dealer exclusion, clarifying its understanding without creating new legal obligations. It examines the potential economic effects of this interpretation on broker-dealers, their associated persons, customers, and the broader financial advice market. The document provides background data on broker-dealers, including their assets, customer accounts, and dual registration as investment advisers. It highlights compliance costs for broker-dealers to align with the interpretation and notes the limited circumstances under which broker-dealers exercise temporary or limited investment discretion. The document also references the lack of data received during the Reg. BI Proposal to analyze the economic impact further.\"\n        },\n        {\n          \"title\": \"Potential Economic Effects\",\n          \"start_index\": 23,\n          \"end_index\": 28,\n          \"node_id\": \"0008\",\n          \"summary\": \"The partial document discusses the economic effects and regulatory implications of the SEC's interpretation of the \\\"solely incidental\\\" prong of the broker-dealer exclusion from the definition of an investment adviser. Key points include:\\n\\n1. 
**Compliance Costs**: Broker-dealers currently incur costs to align their practices with the \\\"solely incidental\\\" prong, and the interpretation may lead to additional costs for evaluating and adjusting practices.\\n\\n2. **Impact on Broker-Dealer Practices**: Broker-dealers providing advisory services beyond the scope of the interpretation may need to adjust their practices, potentially resulting in reduced services, loss of customers, or a shift to advisory accounts.\\n\\n3. **Market Effects**: The interpretation could lead to decreased competition, increased fees, and a diminished number of broker-dealers offering commission-based services. It may also shift demand from broker-dealers to investment advisers.\\n\\n4. **Regulatory Adjustments**: Broker-dealers may choose to register as investment advisers, incurring new compliance costs, or migrate customers to advisory accounts of affiliates.\\n\\n5. **Potential Benefits**: Some broker-dealers may expand limited discretionary services or monitoring activities, benefiting investors with more efficient access to these services.\\n\\n6. **Regulatory Arbitrage Risks**: The interpretation raises concerns about regulatory arbitrage, though these risks may be mitigated by enhanced standards of conduct for broker-dealers.\\n\\n7. **Amendments to Regulations**: The document includes amendments to the Code of Federal Regulations, adding an interpretive release regarding the \\\"solely incidental\\\" prong, dated June 5, 2019.\"\n        }\n      ],\n      \"node_id\": \"0006\",\n      \"summary\": \"The partial document discusses the SEC's interpretation of the \\\"solely incidental\\\" prong of the broker-dealer exclusion, clarifying that it does not impose new legal obligations but may have economic implications if broker-dealer practices deviate from this interpretation. It provides background on the potential effects on broker-dealers, their associated persons, customers, and the broader financial advice market. 
The document includes data on the number of registered broker-dealers, their customer accounts, total assets, and the prevalence of dual registrants (firms registered as both broker-dealers and investment advisers) as of December 2018.\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/results/Regulation Best Interest_proposed rule_structure.json",
    "content": "{\n  \"doc_name\": \"Regulation Best Interest_proposed rule.pdf\",\n  \"doc_description\": \"The document provides a comprehensive analysis of the SEC's proposed \\\"Regulation Best Interest,\\\" detailing its objectives, obligations for broker-dealers, economic impacts, compliance requirements, and public feedback to establish a standard of conduct prioritizing retail customers' interests in securities recommendations.\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end_index\": 6,\n      \"node_id\": \"0000\",\n      \"summary\": \"The partial document outlines the Securities and Exchange Commission's (SEC) proposed rule under the Securities Exchange Act of 1934, referred to as \\\"Regulation Best Interest.\\\" The rule aims to establish a standard of conduct for broker-dealers and their associated persons when making securities transaction or investment strategy recommendations to retail customers. The proposed standard requires acting in the best interest of the retail customer without prioritizing the financial or other interests of the broker-dealer or associated person. The document includes details on the rule's objectives, key terms, obligations (disclosure, care, and conflict of interest), recordkeeping requirements, and economic analysis of the rule's impact. It also invites public comments and provides instructions for submitting feedback. 
Additionally, the document discusses the regulatory framework, alternatives considered, and compliance requirements, particularly for small entities.\"\n    },\n    {\n      \"title\": \"INTRODUCTION\",\n      \"start_index\": 6,\n      \"end_index\": 12,\n      \"nodes\": [\n        {\n          \"title\": \"Background\",\n          \"start_index\": 12,\n          \"end_index\": 22,\n          \"nodes\": [\n            {\n              \"title\": \"Evaluation of Standards of Conduct Applicable to Investment Advice\",\n              \"start_index\": 22,\n              \"end_index\": 26,\n              \"node_id\": \"0003\",\n              \"summary\": \"The partial document discusses the evaluation and development of standards of conduct for investment advice, focusing on investor protection and addressing conflicts of interest. It highlights the blurring lines between broker-dealers and investment advisers, emphasizing the need for a uniform fiduciary standard to ensure firms act in the best interest of customers. The document references the 913 Study, mandated by the Dodd-Frank Act, which recommended rulemaking to adopt such a standard, including eliminating or disclosing conflicts of interest and specifying uniform duty of care standards. It also details public feedback, with most commenters supporting a fiduciary standard but expressing concerns about implementation and preserving investor choice. The Investor Advisory Committee (IAC) recommended imposing a fiduciary duty on broker-dealers, either by narrowing the broker-dealer exclusion under the Advisers Act or adopting a principles-based fiduciary duty. 
Additionally, the document mentions the Department of Labor's rulemaking to broaden the definition of \\\"fiduciary\\\" under ERISA and the Internal Revenue Code.\"\n            },\n            {\n              \"title\": \"DOL Rulemaking\",\n              \"start_index\": 26,\n              \"end_index\": 32,\n              \"node_id\": \"0004\",\n              \"summary\": \"The partial document discusses regulatory approaches and developments related to fiduciary duties for broker-dealers and investment advisers. It covers recommendations from the Investor Advisory Committee (IAC) to the SEC, including narrowing the broker-dealer exclusion under the Investment Advisers Act or adopting a principles-based fiduciary duty under Section 913. It also details the Department of Labor's (DOL) rulemaking efforts to expand the definition of \\\"fiduciary\\\" under ERISA and the Internal Revenue Code, including the adoption and subsequent vacating of the DOL Fiduciary Rule. The document explains the implications of the DOL Fiduciary Rule, such as restrictions on broker-dealers' compensation and transactions, and the introduction of exemptions like the Best Interest Contract (BIC) Exemption and Principal Transactions Exemption to allow certain forms of compensation and transactions under specific conditions. It highlights the requirements of these exemptions, including adherence to Impartial Conduct Standards, written contracts, and disclosures. 
Additionally, it references a statement by SEC Chairman Jay Clayton seeking public input on standards of conduct for investment advisers and broker-dealers in light of these developments.\"\n            },\n            {\n              \"title\": \"Statement by Chairman Clayton\",\n              \"start_index\": 32,\n              \"end_index\": 36,\n              \"node_id\": \"0005\",\n              \"summary\": \"The partial document discusses the revised definition of \\\"fiduciary\\\" and the Impartial Conduct Standards, which became effective on June 9, 2017, with compliance for additional conditions delayed until July 1, 2019. It highlights a statement by SEC Chairman Jay Clayton, issued on June 1, 2017, seeking public input on standards of conduct for investment advisers and broker-dealers, resulting in over 250 comments. The document outlines varying public opinions, with many supporting a fiduciary or best interest standard for broker-dealers or a uniform standard for both broker-dealers and investment advisers. It also addresses the effects of the DOL Fiduciary Rule and related exemptions, including concerns about reduced product choice, increased costs, and restricted access to advice for retirement investors, as well as positive outcomes like lower fees, minimized conflicts, and new product offerings. The document further considers the regulatory landscape, investor protections, and the potential impact of conflicts on investor outcomes.\"\n            }\n          ],\n          \"node_id\": \"0002\",\n          \"summary\": \"The partial document discusses the principles and regulatory framework surrounding investment advice, focusing on enhancing investor protection while preserving choice across products and advice models. It introduces the proposed Regulation Best Interest, aiming to establish a standard of conduct for broker-dealers under the Exchange Act to ensure clarity, consistency, and efficiency in their obligations. 
The document provides background on broker-dealer regulations, including their duty of fair dealing, suitability requirements, and obligations to address conflicts of interest through elimination, mitigation, or disclosure. It highlights concerns about conflicts of interest inherent in broker-dealer compensation structures, such as transaction-based models, and their potential to harm retail customers. The document also addresses customer confusion regarding the differences between broker-dealers and investment advisers, emphasizing the need for a best interest standard to mitigate conflicts and improve investor trust. Additionally, it acknowledges the benefits of the broker-dealer model, such as access to advice, product variety, and payment options, while exploring ways to balance investor protection with preserving these advantages. The evaluation of standards of conduct applicable to investment advice is also discussed, focusing on the blurred lines between broker-dealers and investment advisers and the need for regulatory alignment based on services provided.\"\n        },\n        {\n          \"title\": \"General Objectives of Proposed Approach\",\n          \"start_index\": 36,\n          \"end_index\": 44,\n          \"node_id\": \"0006\",\n          \"summary\": \"The partial document discusses the impact and objectives of the DOL Fiduciary Rule and the proposed Regulation Best Interest. Key points include:\\n\\n1. **Impact of the DOL Fiduciary Rule**: The rule has led to positive outcomes for retirement investors, such as lower fees, advice in the best interest of clients, reduced conflicts of interest, and the development of new products like \\\"clean shares\\\" without sales loads or distribution fees.\\n\\n2. **Objectives of the Proposed Regulation Best Interest**: The proposal aims to enhance broker-dealer conduct obligations when making recommendations to retail customers. 
It seeks to:\\n   - Address conflicts of interest and investor harm caused by misaligned advice.\\n   - Reduce investor confusion about broker-dealer obligations.\\n   - Align broker-dealer standards with investor expectations and other advice relationships.\\n   - Preserve investor choice and access to products, services, and payment options, including commission-based models.\\n\\n3. **Proposed Best Interest Obligation**: The regulation would require broker-dealers to act in the best interest of retail customers without prioritizing their own financial interests. This obligation includes:\\n   - Disclosure of material facts and conflicts of interest.\\n   - Exercising diligence, care, skill, and prudence in recommendations.\\n   - Enhancing investor protection while maintaining access to affordable advice and products.\\n\\n4. **Regulatory Considerations**: The proposal builds on existing broker-dealer obligations and SRO rules, avoiding regulatory conflicts and redundancies. It does not create new private rights of action or alter existing antifraud provisions.\\n\\n5. **Investor Protection and Choice**: The regulation aims to improve the quality of recommendations, enhance disclosure, and align legal obligations with investor expectations, while minimizing costs and preserving access to advice and products. It acknowledges potential impacts on broker-dealer business models and investor access but justifies these by the benefits of enhanced investor protection.\"\n        }\n      ],\n      \"node_id\": \"0001\",\n      \"summary\": \"The partial document discusses the role of broker-dealers in assisting retail customers with financial planning, retirement savings, and investment goals. It highlights the services broker-dealers provide, ranging from execution-only services to full-service brokerage, and the inherent conflicts of interest in their principal-agent relationship with investors. 
The document introduces a proposed rule, \\\"Regulation Best Interest,\\\" aimed at enhancing the standard of conduct for broker-dealers when making recommendations to retail customers. Key points include:\\n\\n1. Establishing a \\\"best interest\\\" obligation requiring broker-dealers to prioritize retail customers' interests over their own financial incentives.\\n2. Requiring written disclosure of material facts, conflicts of interest, and the scope of the broker-dealer relationship.\\n3. Mandating reasonable diligence, care, and skill in making recommendations tailored to customers' investment profiles.\\n4. Implementing policies to identify, disclose, mitigate, or eliminate material conflicts of interest, particularly those arising from financial incentives.\\n5. Enhancing investor protection by improving the quality of recommendations, disclosure, and addressing conflicts of interest beyond existing suitability obligations.\\n\\nThe document also emphasizes preserving investor choice and access to advice while fostering clarity and consistency in broker-dealer standards of conduct. It references the broader regulatory context and efforts to align principles across investment advice frameworks.\"\n    },\n    {\n      \"title\": \"DISCUSSION OF REGULATION BEST INTEREST\",\n      \"start_index\": 44,\n      \"end_index\": 44,\n      \"nodes\": [\n        {\n          \"title\": \"Overview of Regulation Best Interest\",\n          \"start_index\": 44,\n          \"end_index\": 50,\n          \"node_id\": \"0008\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest by the Commission, which aims to establish a best interest obligation for broker-dealers when making recommendations to retail customers. Key points include:\\n\\n1. **Best Interest Obligation**: Broker-dealers must act in the best interest of retail customers without prioritizing their own financial interests. 
This obligation is satisfied through:\\n   - **Disclosure Obligation**: Written disclosure of material facts about the relationship and conflicts of interest.\\n   - **Care Obligation**: Exercising diligence, care, skill, and prudence to ensure recommendations align with the customer\\u2019s investment profile and are not excessive.\\n   - **Conflict of Interest Obligations**: Establishing policies to identify, disclose, mitigate, or eliminate material conflicts of interest, including those arising from financial incentives.\\n\\n2. **Investor Protection**: The regulation aims to enhance investor protection by improving the quality of recommendations, fostering customer awareness, enhancing conflict disclosures, and requiring mitigation of financial conflicts.\\n\\n3. **Alignment with Other Standards**: The proposal draws from existing regulatory frameworks, including SRO rules, state laws, the Advisers Act, and the DOL Fiduciary Rule, to ensure consistency and ease of compliance.\\n\\n4. **Clarification and Guidance**: The Commission provides guidance on the requirements of the best interest obligation, defines key terms, and specifies compliance components to assist broker-dealers.\\n\\n5. **Intent and Language**: The proposal avoids requiring conflict-free recommendations but emphasizes that broker-dealers must not place their interests ahead of customers. 
It seeks to balance investor protection with preserving business models and customer choice.\"\n        },\n        {\n          \"title\": \"Best Interest, Generally\",\n          \"start_index\": 50,\n          \"end_index\": 58,\n          \"nodes\": [\n            {\n              \"title\": \"Consistency with Other Approaches\",\n              \"start_index\": 58,\n              \"end_index\": 66,\n              \"node_id\": \"0010\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest, focusing on the obligations of broker-dealers to act in the best interest of retail customers. Key points include:\\n\\n1. **Care and Conflict of Interest Obligations**: Broker-dealers must avoid recommendations motivated by self-interest (e.g., self-enrichment or firm sales targets) and ensure recommendations align with the customer\\u2019s investment profile and available alternatives.\\n\\n2. **Permissible Recommendations**: Broker-dealers can recommend higher-cost or riskier products if they comply with Disclosure, Care, and Conflict of Interest Obligations.\\n\\n3. **Alignment with DOL Fiduciary Rule**: The proposed best interest obligation draws on principles from the Department of Labor\\u2019s (DOL) best interest standard, such as acting with care, skill, and prudence without regard to the broker-dealer\\u2019s financial interests.\\n\\n4. **Exemptions and Limitations**: The proposal does not prohibit broker-dealers from receiving commissions, selling proprietary products, or engaging in principal transactions, provided conflicts are disclosed and managed.\\n\\n5. **Comparison to 913 Study Recommendations**: The proposal diverges from the 913 Study\\u2019s recommendation for a uniform fiduciary standard for broker-dealers and investment advisers. Instead, it focuses on enhancing broker-dealer obligations while reflecting principles of loyalty and care.\\n\\n6. 
**Specific Obligations**: The proposed rule includes Disclosure, Care, and Conflict of Interest Obligations to provide clarity and address material conflicts of interest, particularly financial incentives.\\n\\n7. **Request for Comment**: The Commission seeks feedback on defining the best interest obligation and its alignment with existing regulatory frameworks.\"\n            },\n            {\n              \"title\": \"Request for Comment on the Best Interest Obligation\",\n              \"start_index\": 66,\n              \"end_index\": 71,\n              \"node_id\": \"0011\",\n              \"summary\": \"The partial document discusses the proposed \\\"Regulation Best Interest\\\" obligation for broker-dealers, focusing on ensuring that broker-dealers act in the best interest of retail customers without prioritizing their own financial or other interests. Key points include:\\n\\n1. **Core Obligations**: The proposal outlines specific requirements for broker-dealers, including Disclosure, Care, and Conflict of Interest Obligations, to provide clarity and address material conflicts of interest.\\n\\n2. **Alignment with Existing Standards**: The proposed obligation builds on existing broker-dealer requirements (e.g., suitability) and incorporates principles from the Advisers Act and the 913 Study recommendations.\\n\\n3. **Request for Comments**: The document solicits feedback on various aspects, such as the definition of \\\"best interest,\\\" the sufficiency of the proposed rule, its impact on retail customer protection, and its alignment with other standards like the DOL\\u2019s Impartial Conduct Standards.\\n\\n4. **Retail Customer Protection**: The proposal aims to clarify that broker-dealers cannot put their interests ahead of retail customers and seeks input on whether the rule sufficiently protects customers and avoids confusion.\\n\\n5. 
**Scope and Monitoring**: The document addresses whether broker-dealers should monitor customer accounts and whether ongoing monitoring would classify them as investment advisers.\\n\\n6. **Legal and Regulatory Implications**: It examines the potential impact on fiduciary obligations under other standards and whether additional requirements, such as fair compensation or prohibitions on misleading statements, should be incorporated.\\n\\n7. **Tailored vs. Uniform Standards**: The Commission proposes a tailored standard for broker-dealers rather than a uniform standard for both broker-dealers and investment advisers, seeking feedback on this approach.\\n\\n8. **Definition of Key Terms**: The document proposes defining terms like \\\"natural person who is an associated person\\\" to clarify the scope of the obligations.\\n\\nThe document emphasizes enhancing retail customer protection, clarifying broker-dealer obligations, and seeking public input on the proposed rule's effectiveness and potential improvements.\"\n            }\n          ],\n          \"node_id\": \"0009\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest, which aims to ensure that broker-dealers act in the best interest of retail customers when making recommendations. Key points include:\\n\\n1. **Best Interest Obligation**: Broker-dealers must prioritize retail customers' interests over their own financial or other interests. The obligation is defined by three components:\\n   - **Disclosure Obligation**: Requires clear communication of material facts about recommendations and conflicts of interest.\\n   - **Care Obligation**: Mandates that recommendations align with the retail customer\\u2019s investment profile, considering factors like cost, risks, benefits, and other characteristics.\\n   - **Conflict of Interest Obligation**: Requires broker-dealers to identify, disclose, and mitigate conflicts of interest.\\n\\n2. 
**Guidance and Compliance**: The document provides guidance on how broker-dealers can comply with these obligations, emphasizing that cost and financial incentives are important but not the sole factors in determining the best interest of the customer.\\n\\n3. **Flexibility in Recommendations**: The regulation does not prohibit broker-dealers from recommending higher-cost or riskier products if justified by the customer\\u2019s investment profile and other factors. It also does not require recommending the least expensive or least remunerative option.\\n\\n4. **Prohibited Practices**: Recommendations motivated predominantly by the broker-dealer\\u2019s self-interest, such as maximizing compensation or meeting sales quotas, would violate the regulation.\\n\\n5. **Consistency with Other Standards**: The proposed regulation aligns with principles from other regulatory frameworks, such as the DOL Fiduciary Rule, while addressing conflicts of interest and enhancing existing suitability obligations.\\n\\n6. **Product Diversity**: The regulation does not intend to limit the diversity of investment products available to retail customers but seeks to address harm caused by broker-dealer incentives that conflict with customer interests.\"\n        },\n        {\n          \"title\": \"Key Terms and Scope of Best Interest Obligation\",\n          \"start_index\": 71,\n          \"end_index\": 71,\n          \"nodes\": [\n            {\n              \"title\": \"Natural Person who is an Associated Person\",\n              \"start_index\": 71,\n              \"end_index\": 72,\n              \"node_id\": \"0013\",\n              \"summary\": \"The partial document discusses the proposed obligations and standards for broker-dealers when making recommendations to retail customers under Regulation Best Interest. Key points include:\\n\\n1. 
The Commission's decision not to impose additional requirements, such as fair compensation or prohibition of misleading statements, as these are already broker-dealer obligations, while seeking feedback on whether such requirements should be incorporated or modified to enhance investor protection.\\n2. Consideration of a tailored standard for broker-dealers versus a uniform standard for both broker-dealers and investment advisers, and whether FINRA\\u2019s suitability standard should be explicitly adopted with enhancements.\\n3. Definition of a \\\"natural person who is an associated person\\\" to include individuals like registered representatives, ensuring compliance with Regulation Best Interest while excluding affiliated entities not intended to be covered.\\n4. Application of Regulation Best Interest at the time a recommendation is made regarding securities transactions or investment strategies, aiming to provide clarity, maintain existing compliance infrastructures, and ensure retail customers receive appropriate protections.\"\n            },\n            {\n              \"title\": \"When Making a Recommendation, At Time Recommendation is Made\",\n              \"start_index\": 72,\n              \"end_index\": 82,\n              \"node_id\": \"0014\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest (Reg BI) by the SEC, focusing on broker-dealer obligations when making recommendations to retail customers. Key points include:\\n\\n1. **Definition of Associated Persons**: The document clarifies that Reg BI applies to natural persons associated with broker-dealers, such as registered representatives, but excludes affiliated entities and clerical staff.\\n\\n2. **Application of Reg BI**: Reg BI applies at the time a recommendation is made regarding securities transactions or investment strategies to retail customers. 
It emphasizes clarity and consistency with existing broker-dealer regulations, particularly the concept of \\\"recommendation.\\\"\\n\\n3. **Scope of Recommendations**: The term \\\"recommendation\\\" is interpreted based on existing broker-dealer regulations and facts and circumstances, including implicit recommendations and discretionary transactions. General investor education and non-specific communications are excluded.\\n\\n4. **Duration of Obligation**: The best interest obligation applies only at the time of the recommendation and does not impose ongoing monitoring duties unless explicitly agreed upon by the broker-dealer.\\n\\n5. **Standards of Care**: The rule aligns with the Dodd-Frank Act's Section 913(f) and existing suitability obligations, ensuring broker-dealers act in the best interest of retail customers without altering fiduciary duties or existing supervisory obligations.\\n\\n6. **Types of Transactions Covered**: Reg BI applies to recommendations involving any securities transaction (purchase, sale, exchange) and investment strategies, including explicit hold recommendations or strategies involving the manner of purchase or sale.\\n\\n7. **Consistency with Other Regulations**: The rule is designed to integrate seamlessly with existing federal securities laws, SRO rules, and the Department of Labor's Fiduciary Rule, ensuring no conflict or redundancy in regulatory obligations.\"\n            },\n            {\n              \"title\": \"Any Securities Transaction or Investment Strategy\",\n              \"start_index\": 82,\n              \"end_index\": 83,\n              \"node_id\": \"0015\",\n              \"summary\": \"The partial document discusses the proposed application of Regulation Best Interest by the Commission to recommendations involving securities transactions and investment strategies for retail customers. 
It highlights that Regulation Best Interest applies to recommendations, not the execution of transactions, and aligns with existing broker-dealer suitability obligations. The document elaborates on the broad interpretation of investment strategies, including recommendations to hold securities, purchase on margin, or transfer assets between accounts (e.g., ERISA to IRA rollovers). It also addresses the potential antifraud implications of unsuitable recommendations. Additionally, the document proposes a definition of \\\"retail customer\\\" and seeks comments on the obligations of broker-dealers and investment advisers regarding account type recommendations.\"\n            },\n            {\n              \"title\": \"Retail Customer\",\n              \"start_index\": 83,\n              \"end_index\": 90,\n              \"node_id\": \"0016\",\n              \"summary\": \"The partial document discusses the proposed regulations and definitions under Regulation Best Interest, focusing on recommendations for rolling over or transferring assets between account types, such as from ERISA accounts to IRAs. It highlights the obligations of broker-dealers and investment advisers in making account recommendations tied to securities transactions. The document defines \\\"retail customer\\\" as individuals or their legal representatives receiving recommendations primarily for personal, family, or household purposes, excluding business or commercial purposes. It differentiates between brokerage and advisory relationships, emphasizing that Regulation Best Interest applies only to broker-dealer recommendations and not to investment adviser advice. The document also addresses dual-registrants, clarifying their obligations based on the capacity in which they act. Additionally, it compares the proposed definition of \\\"retail customer\\\" with \\\"retail investor\\\" under the Relationship Summary Proposal, noting differences in scope and application. 
The Commission seeks public comments on key terms, scope, and definitions, including the applicability to natural persons associated with broker-dealers.\"\n            },\n            {\n              \"title\": \"Request for Comment on Key Terms and Scope of Best Interest Obligation\",\n              \"start_index\": 90,\n              \"end_index\": 96,\n              \"node_id\": \"0017\",\n              \"summary\": \"The partial document discusses the scope and key terms of Regulation Best Interest, focusing on its applicability, definitions, and obligations. Key points include:\\n\\n1. **Scope and Applicability**: Regulation Best Interest is intended to apply to recommendations made to retail customers for personal, family, or household purposes, excluding business or institutional recommendations. The document seeks feedback on whether the scope should be broadened or narrowed, including its application to small business entities or sole proprietorships.\\n\\n2. **Key Definitions**: The document requests comments on definitions such as \\\"natural person who is an associated person,\\\" \\\"recommendation,\\\" \\\"investment strategy involving securities,\\\" and \\\"retail customer.\\\" It explores whether these definitions are clear, appropriate, and comprehensive, and whether alternative definitions should be considered.\\n\\n3. **Standards of Care**: The document examines differing standards of care for retail and institutional customers, questioning whether such distinctions are appropriate and whether they might cause confusion or compliance challenges.\\n\\n4. **Dual-Registrants**: It addresses the roles of dual-registrants (acting as both broker-dealers and investment advisers) and seeks input on how firms determine their capacity when making recommendations.\\n\\n5. **Component Obligations**: Regulation Best Interest includes four component obligations\\u2014Disclosure Obligation, Care Obligation, and two Conflict of Interest Obligations. 
These are designed to ensure broker-dealers act in the best interest of retail customers without prioritizing their own financial interests.\\n\\n6. **Request for Comments**: The document extensively solicits feedback on various aspects, including the appropriateness of definitions, the scope of recommendations covered, the need for additional guidance, and the adequacy of protections provided under the rule.\\n\\nThe overall aim is to clarify and refine the requirements of Regulation Best Interest while ensuring it aligns with existing laws and provides adequate protections for retail customers.\"\n            }\n          ],\n          \"node_id\": \"0012\",\n          \"summary\": \"The partial document discusses the obligations of broker-dealers when making recommendations to retail customers, focusing on whether additional requirements, such as fair compensation and prohibition of misleading statements, should be incorporated into the proposed rule. It raises questions about tailoring a standard specifically for broker-dealers versus adopting a uniform standard for both broker-dealers and investment advisers. The document also considers whether FINRA\\u2019s suitability standard should be explicitly adopted and enhanced to simplify the best interest obligation. Additionally, it proposes a definition for a \\\"natural person who is an associated person\\\" under the Exchange Act.\"\n        },\n        {\n          \"title\": \"Components of Regulation Best Interest\",\n          \"start_index\": 96,\n          \"end_index\": 97,\n          \"nodes\": [\n            {\n              \"title\": \"Disclosure Obligation\",\n              \"start_index\": 97,\n              \"end_index\": 133,\n              \"node_id\": \"0019\",\n              \"summary\": \"The partial document discusses the proposed Disclosure Obligation under Regulation Best Interest, which aims to enhance transparency and protect retail investors in their relationships with broker-dealers. 
Key points include:\\n\\n1. **Disclosure Obligation**: Broker-dealers must disclose, in writing, material facts about the scope and terms of their relationship with retail customers and all material conflicts of interest associated with recommendations. This includes acting capacity, fees, services, and conflicts of interest.\\n\\n2. **Layered Disclosure Approach**: The document emphasizes a layered approach to disclosure, starting with high-level summaries (e.g., Relationship Summary) and followed by more detailed, specific disclosures tailored to recommendations.\\n\\n3. **Material Conflicts of Interest**: The obligation requires disclosure of material conflicts, including financial incentives, proprietary products, limited product ranges, and conflicts arising from compensation structures.\\n\\n4. **Timing and Flexibility**: Disclosures must be made \\\"prior to or at the time of\\\" recommendations, with flexibility in form, timing, and delivery methods to accommodate different business practices and customer interactions.\\n\\n5. **Consistency with Other Regulations**: The proposed rule aligns with existing antifraud provisions, the BIC Exemption, and recommendations from the 913 Study, aiming to reduce investor confusion and ensure informed decision-making.\\n\\n6. **Care Obligation**: The document also introduces a Care Obligation, requiring broker-dealers to exercise diligence, care, and prudence in understanding risks and rewards, ensuring recommendations are in the best interest of retail customers based on their investment profiles.\\n\\n7. 
**Request for Comments**: The document solicits feedback on various aspects of the proposed rules, including the adequacy of disclosures, timing, materiality thresholds, and the interaction with existing regulations.\"\n            },\n            {\n              \"title\": \"Care Obligation\",\n              \"start_index\": 133,\n              \"end_index\": 166,\n              \"node_id\": \"0020\",\n              \"summary\": \"The partial document discusses proposed regulations under \\\"Regulation Best Interest\\\" aimed at enhancing broker-dealer obligations to act in the best interest of retail customers. Key points include:\\n\\n1. **Disclosure of Conflicts of Interest**: The document emphasizes the need for broker-dealers to disclose material conflicts arising from financial incentives and other factors, potentially requiring advance customer consent for certain conflicts.\\n\\n2. **Care Obligation**: Broker-dealers must exercise reasonable diligence, care, skill, and prudence when making recommendations. This includes:\\n   - Understanding the risks and rewards of recommendations.\\n   - Ensuring recommendations align with the retail customer\\u2019s investment profile and are in their best interest.\\n   - Avoiding excessive transactions that are not in the customer\\u2019s best interest when viewed collectively.\\n\\n3. **Enhanced Suitability Standards**: The Care Obligation builds upon existing suitability requirements by incorporating a \\\"best interest\\\" standard, ensuring broker-dealers do not prioritize their own financial interests over those of retail customers.\\n\\n4. **Evaluation of Recommendations**: Broker-dealers must consider factors such as costs, risks, liquidity, and financial incentives when recommending securities or investment strategies. They are not required to recommend the least expensive option but must justify higher costs based on customer benefits.\\n\\n5. 
**Series of Transactions**: The regulation introduces a requirement to evaluate whether a series of recommended transactions is excessive and in the customer\\u2019s best interest, removing the need to prove \\\"control\\\" over the customer\\u2019s account.\\n\\n6. **Consistency with Other Standards**: The proposed Care Obligation aligns with principles from the Department of Labor\\u2019s fiduciary rulemaking and the SEC\\u2019s 913 Study, emphasizing professional standards of care and investor protection.\\n\\n7. **Request for Comments**: The document seeks public input on various aspects of the proposed regulations, including the clarity of terms, the scope of obligations, and the treatment of conflicts of interest.\\n\\nThe overarching goal is to enhance investor protection by ensuring broker-dealers act in the best interest of retail customers while addressing conflicts of interest and improving the quality of recommendations.\"\n            },\n            {\n              \"title\": \"Conflict of Interest Obligations\",\n              \"start_index\": 166,\n              \"end_index\": 196,\n              \"node_id\": \"0021\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest by the SEC, focusing on broker-dealers' obligations to act in the best interest of retail customers. Key points include:\\n\\n1. **Quantitative Suitability and Best Interest Standard**: The document compares FINRA's quantitative suitability rule with the SEC's proposed best interest obligation, emphasizing the need for broker-dealers to ensure that a series of transactions is not excessive and aligns with the retail customer's best interest.\\n\\n2. 
**Conflict of Interest Obligations**: The proposal introduces requirements for broker-dealers to establish, maintain, and enforce written policies and procedures to identify, disclose, mitigate, or eliminate material conflicts of interest, particularly those arising from financial incentives. This includes addressing compensation practices, proprietary products, and third-party payments.\\n\\n3. **Policies and Procedures**: Broker-dealers are expected to implement risk-based compliance systems tailored to their business models, including processes for identifying, managing, and mitigating conflicts of interest. The document outlines components such as training, monitoring, and periodic reviews.\\n\\n4. **Material Conflicts of Interest**: The proposal defines material conflicts as those that could incline a broker-dealer to make biased recommendations. It emphasizes the need for clear identification, disclosure, and mitigation of such conflicts, especially those related to financial incentives.\\n\\n5. **Mitigation Measures**: Examples of mitigation practices include avoiding disproportionate compensation thresholds, minimizing incentives to favor certain products, and implementing enhanced supervision for high-risk transactions.\\n\\n6. **Flexibility and Principles-Based Approach**: The proposal allows broker-dealers flexibility in designing policies and procedures to address conflicts, avoiding a one-size-fits-all approach, and focusing on areas of greatest risk.\\n\\n7. **Alignment with Other Standards**: The document compares the proposed obligations with the DOL Fiduciary Rule and the 913 Study, highlighting consistency in addressing conflicts of interest and promoting investor protection.\\n\\n8. 
**Request for Comments**: The SEC seeks feedback on various aspects of the proposal, including the scope of obligations, effectiveness of mitigation measures, and potential impacts on broker-dealer practices and retail customers.\\n\\nThe document emphasizes balancing investor protection with flexibility for broker-dealers while addressing conflicts of interest to ensure recommendations are in the best interest of retail customers.\"\n            }\n          ],\n          \"node_id\": \"0018\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest by the Commission, which outlines the obligation of broker-dealers to act in the best interest of retail customers without prioritizing their own financial or other interests. The regulation specifies four component requirements: Disclosure Obligation, Care Obligation, and two Conflict of Interest Obligations. The document emphasizes that compliance with these components is necessary to meet the best interest obligation and does not replace existing antifraud provisions or other broker-dealer obligations under federal securities laws.\\n\\nThe Disclosure Obligation is detailed, requiring broker-dealers to provide written disclosure of material facts about the scope and terms of their relationship with retail customers, as well as any material conflicts of interest associated with their recommendations. The document highlights the importance of transparency to address consumer confusion and improve customer awareness. 
It references feedback from commenters who support clear and comprehensive disclosures regarding services, compensation, and conflicts of interest.\"\n        },\n        {\n          \"title\": \"Recordkeeping and Retention\",\n          \"start_index\": 196,\n          \"end_index\": 199,\n          \"node_id\": \"0022\",\n          \"summary\": \"The partial document discusses proposed regulations and requirements under Regulation Best Interest, focusing on conflicts of interest, recordkeeping, and the scope of broker-dealer activities. Key points include:\\n\\n1. **Conflicts of Interest**: The document seeks public comments on whether certain conflicts of interest, such as non-cash compensation (e.g., sales contests, trips, prizes), should be prohibited and whether retail customer consent should be required for specific conflicts. It also addresses the need for guidance on mitigating conflicts and whether neutral compensation across product types is appropriate.\\n\\n2. **Recordkeeping and Retention**: Proposed amendments to Exchange Act Rules 17a-3 and 17a-4 would require broker-dealers to create and retain records related to retail customer information and disclosures under Regulation Best Interest. This includes maintaining records of material facts, conflicts of interest, and customer account information for six years. The document also discusses existing requirements for retaining compliance and supervisory manuals.\\n\\n3. **Request for Comments**: The Commission invites feedback on whether additional record-making and retention requirements should be imposed and what specific records should be included.\\n\\n4. 
**Investment Discretion and Broker-Dealer Activities**: The document explores whether the exercise of investment discretion by broker-dealers should be considered incidental to their business, distinguishing their role from that of investment advisers under the Advisers Act.\"\n        },\n        {\n          \"title\": \"Whether the Exercise of Investment Discretion Should be Viewed as Solely Incidental to the Business of a Broker or Dealer\",\n          \"start_index\": 199,\n          \"end_index\": 209,\n          \"node_id\": \"0023\",\n          \"summary\": \"The partial document primarily discusses the following main points:\\n\\n1. **Recordkeeping and Retention Requirements**: The document outlines the requirements under Exchange Act Rule 17a-4(e)(7) for broker-dealers to retain compliance, supervisory, and procedural manuals, including updates, for a specified period. It also seeks comments on whether additional record-making and retention requirements related to Regulation Best Interest should be imposed.\\n\\n2. **Broker-Dealer Exclusion under the Advisers Act**: The document examines the scope of the broker-dealer exclusion under the Advisers Act, which excludes broker-dealers from being considered investment advisers if their advisory services are solely incidental to their brokerage business and they receive no special compensation for such services.\\n\\n3. **Investment Discretion and Fiduciary Duty**: The document discusses the exercise of investment discretion by broker-dealers, its implications under the Advisers Act, and the fiduciary duty owed to customers. It highlights the distinction between discretionary and non-discretionary accounts and the regulatory considerations for discretionary brokerage services.\\n\\n4. 
**Historical Interpretations and Proposals**: The document reviews past Commission interpretations and proposals regarding broker-dealers\\u2019 exercise of investment discretion, including the 2005 interpretive rule and the 2007 proposal, and their subsequent vacating or non-adoption.\\n\\n5. **Request for Comments**: The document solicits public comments on various issues, including:\\n   - Whether discretionary investment advice by broker-dealers should be considered solely incidental to their business.\\n   - The appropriateness of placing limits on investment discretion under the broker-dealer exclusion.\\n   - The potential risks, benefits, and investor protections related to broker-dealers offering discretionary services.\\n   - The impact of Regulation Best Interest on broker-dealers\\u2019 behavior, investor choice, and the distinction between advisory and brokerage accounts.\\n\\n6. **Investor Protection and Regulatory Concerns**: The document raises concerns about potential risks, such as account churning, associated with broker-dealers exercising unlimited investment discretion and seeks input on regulatory measures to mitigate such risks.\\n\\n7. **Future Opportunities for Discretionary Brokerage Services**: The document explores potential opportunities for broker-dealers to expand discretionary brokerage services and seeks feedback on how this could impact investor choice and regulatory clarity.\"\n        }\n      ],\n      \"node_id\": \"0007\",\n      \"summary\": \"The partial document discusses the proposed Regulation Best Interest by the Commission, which aims to establish a best interest obligation for broker-dealers when making recommendations to retail customers. The regulation requires broker-dealers to act in the best interest of the customer without prioritizing their own financial or other interests. 
The best interest obligation is satisfied through: (1) written disclosure of material facts and conflicts of interest (Disclosure Obligation), and (2) exercising reasonable diligence, care, skill, and prudence to understand the risks and rewards of recommendations.\"\n    },\n    {\n      \"title\": \"REQUEST FOR COMMENT\",\n      \"start_index\": 209,\n      \"end_index\": 210,\n      \"nodes\": [\n        {\n          \"title\": \"Generally\",\n          \"start_index\": 210,\n          \"end_index\": 212,\n          \"node_id\": \"0025\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its implications for broker-dealers. It raises questions about the clarity and sufficiency of the obligations defined under the regulation, including whether additional clarifications, instructions, or compliance mechanisms (e.g., safe harbors, policies, and procedures) are needed. The document explores the relationship between different provisions of the regulation, the potential impact on retail customers, investor confusion, and the range of choices available for financial advice and products. It also examines the regulation's consistency with existing standards, such as those of FINRA, SROs, and the DOL, and whether it addresses deficiencies in current broker-dealer standards. Additionally, it considers the regulation's alignment with recommendations from the 913 Study and its interactions with other federal, state, and self-regulatory requirements.\"\n        },\n        {\n          \"title\": \"Interactions with Other Standards of Conduct\",\n          \"start_index\": 212,\n          \"end_index\": 214,\n          \"node_id\": \"0026\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its alignment with existing regulatory frameworks, including SRO (Self-Regulatory Organization) obligations, DOL (Department of Labor) regulations, and state securities laws. 
It raises questions about potential conflicts, redundancies, and harmonization between these standards and the duties of loyalty and care under the Advisers Act. The document also explores the impact of regulatory harmonization on investor understanding, choice, and outcomes, as well as the consistency of the proposed regulation with broker-dealers' current obligations. Additionally, it addresses interactions with non-securities statutes like ERISA and the Code, and seeks input on the economic implications of the proposed regulation, including its effects on efficiency, competition, capital formation, and investor protection, as required under the Exchange Act.\"\n        }\n      ],\n      \"node_id\": \"0024\",\n      \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its implications for broker-dealers and retail investors. Key points include:\\n\\n1. **Risk Reduction and Investor Choice**: Examination of how specific provisions, such as subparagraph (a)(2)(i)(C), could mitigate risks and how broker-dealers' investment discretion impacts investor choice, benefits, and risks.\\n\\n2. **Discretionary Brokerage Services**: Consideration of broker-dealers offering more discretionary services and whether distinguishing between discretionary and non-discretionary accounts could reduce investor confusion.\\n\\n3. **Request for Comments**: The Commission seeks feedback on the overall impact of Regulation Best Interest, its interaction with other regulations (e.g., FINRA rules, federal securities laws, ERISA), and its effect on broker-dealer behavior and retail customer recommendations.\\n\\n4. **Clarifications and Compliance**: Requests for input on whether the obligations under Regulation Best Interest are clearly defined, the relationship between its provisions, and whether compliance mechanisms (e.g., safe harbors, policies, and procedures) should be established or enhanced.\\n\\n5. 
**Additional Requirements**: Exploration of whether broker-dealers should face additional obligations under the best interest standard and how these might align with existing regulatory frameworks.\"\n    },\n    {\n      \"title\": \"ECONOMIC ANALYSIS\",\n      \"start_index\": 214,\n      \"end_index\": 214,\n      \"nodes\": [\n        {\n          \"title\": \"Introduction, Primary Goals of Proposed Regulations and Broad Economic Considerations\",\n          \"start_index\": 214,\n          \"end_index\": 214,\n          \"nodes\": [\n            {\n              \"title\": \"Introduction and Primary Goals of Proposed Regulation\",\n              \"start_index\": 214,\n              \"end_index\": 215,\n              \"node_id\": \"0029\",\n              \"summary\": \"The partial document discusses the potential impacts of regulatory harmonization on investors, including both positive and negative effects, and how it might influence their choice of financial firms and payment options for financial advice. It also examines interactions between Regulation Best Interest and state fiduciary standards, comparing current state standards with the proposed regulation and seeking commenters' views on these standards. Additionally, the document includes an economic analysis of the proposed regulation, focusing on its primary goals, costs, benefits, and broader economic considerations such as efficiency, competition, and capital formation. It highlights the challenges of quantifying economic effects due to limited information and the unpredictability of market participants' behavior, while encouraging public input to better assess the regulation's impacts. 
The analysis also explores the principal-agent relationship between retail customers and broker-dealers in the context of economic theory.\"\n            },\n            {\n              \"title\": \"Broad Economic Considerations\",\n              \"start_index\": 215,\n              \"end_index\": 225,\n              \"node_id\": \"0030\",\n              \"summary\": \"The partial document discusses the economic implications of the proposed Regulation Best Interest, focusing on its potential benefits, costs, and broader impacts on efficiency, competition, and capital formation. It examines the principal-agent relationship between retail customers and broker-dealers, highlighting agency problems that arise due to conflicting interests. The document explores mechanisms to address these conflicts, such as explicit contracts, monitoring, bonding, and regulatory standards of conduct. It emphasizes the limitations of private contracting in financial markets due to high costs, complexity, and information asymmetry, and argues that a regulatory standard of conduct, like Regulation Best Interest, could effectively reduce agency costs and align broker-dealer actions with retail customer interests.\\n\\nThe document also analyzes the potential effects of the best interest standard on agency relationships, including its ability to improve trust, reduce conflicts of interest, and enhance the quality of financial advice. It discusses how the proposed rule could shift the distribution of gains from trade between broker-dealers and retail customers, depending on market competitiveness. Additionally, the document provides an economic baseline for the market for advice services, focusing on broker-dealers and their diverse roles in providing financial services to retail customers. 
It acknowledges the challenges in quantifying certain economic effects and encourages public input to refine the analysis.\"\n            }\n          ],\n          \"node_id\": \"0028\",\n          \"summary\": \"The partial document discusses the potential impacts of regulatory harmonization on investors, including both positive and negative effects, and how it might influence their choice of financial firms and payment options for financial advice. It also examines interactions between Regulation Best Interest and state fiduciary standards, comparing current state standards for broker-dealers with the proposed regulations. Additionally, the document introduces the economic analysis of the proposed regulations, focusing on their primary goals, including promoting efficiency, competition, capital formation, and investor protection, while considering the costs, benefits, and competitive impacts as required by the Exchange Act.\"\n        },\n        {\n          \"title\": \"Economic Baseline\",\n          \"start_index\": 225,\n          \"end_index\": 225,\n          \"nodes\": [\n            {\n              \"title\": \"Market for Advice Services\",\n              \"start_index\": 225,\n              \"end_index\": 246,\n              \"node_id\": \"0032\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its impact on broker-dealers and retail customers. It provides an economic baseline analysis of the market for advice services, focusing on broker-dealers and investment advisers. Key points include:\\n\\n1. **Market Analysis**: Examination of broker-dealer services, including managing orders, providing advice, holding funds, and other financial activities. It highlights the diversity of services offered and the segmentation of the market.\\n\\n2. 
**Broker-Dealer Statistics**: Data on registered broker-dealers, customer accounts, and assets as of December 2017, including the concentration of assets among large firms and the prevalence of dual-registered broker-dealers.\\n\\n3. **Investment Advisers**: Analysis of SEC-registered and state-registered investment advisers, their assets under management (AUM), and their services to retail and institutional clients. It also discusses trends in the number of investment advisers and broker-dealers over time.\\n\\n4. **Trends and Shifts**: Observations on the decline in broker-dealers and the rise in investment advisers, driven by regulatory changes, technological innovation, and shifts toward fee-based advisory models.\\n\\n5. **Compensation Structures**: Overview of financial incentives for broker-dealers and investment advisers, including commission-based payouts, asset-based fees, and bonuses tied to performance and customer retention.\\n\\n6. **Regulatory Baseline**: Description of existing obligations for broker-dealers under federal securities laws, FINRA rules, and state regulations, including suitability obligations and disclosure of conflicts of interest.\\n\\nThe document provides a detailed foundation for understanding the regulatory and economic environment surrounding the proposed Regulation Best Interest.\"\n            },\n            {\n              \"title\": \"Regulatory Baseline\",\n              \"start_index\": 246,\n              \"end_index\": 255,\n              \"node_id\": \"0033\",\n              \"summary\": \"The partial document discusses the following main points:\\n\\n1. **Variable Compensation and Incentives for Financial Professionals**: It highlights how financial professionals' compensation could increase when enrolling retail customers in advisory accounts versus other account types, and mentions transition bonuses and non-cash incentives like trophies, dinners, and travel for meeting performance goals.\\n\\n2. 
**Regulation Best Interest**: The document outlines the requirements of Regulation Best Interest, which mandates broker-dealers to act in the best interest of retail customers when making recommendations, without prioritizing their own interests. It also describes how this regulation builds upon existing broker-dealer regulatory frameworks.\\n\\n3. **Suitability Obligations**: It explains the suitability obligations under federal securities laws and FINRA rules, requiring broker-dealers to ensure recommendations are suitable for customers based on their investment profiles. It details three primary suitability requirements: reasonable-basis, customer-specific, and quantitative suitability.\\n\\n4. **Disclosure Obligations**: The document discusses broker-dealers' obligations to disclose material information and conflicts of interest under antifraud provisions and FINRA rules, emphasizing the importance of honest and complete communication with customers.\\n\\n5. **Fiduciary Obligations and DOL Fiduciary Rule**: It examines fiduciary obligations imposed on broker-dealers under state common law and the Department of Labor\\u2019s Fiduciary Rule, which expands fiduciary status for broker-dealers providing investment advice to retirement accounts. It also describes the Best Interest Contract (BIC) Exemption and related compliance requirements.\\n\\n6. **Impact of DOL Fiduciary Rule**: The document reviews the industry\\u2019s response to the DOL Fiduciary Rule, including changes in product offerings, migration to fee-based models, and compliance costs. It highlights survey findings on reduced brokerage services, increased fees, and compliance expenses.\\n\\n7. 
**Benefits and Costs of Regulation Best Interest**: It evaluates the potential benefits of Regulation Best Interest in improving the quality of investment advice, enhancing retail customer protection, and helping customers evaluate advice, alongside the associated compliance costs for firms and customers.\"\n            }\n          ],\n          \"node_id\": \"0031\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its impact on the market for broker-dealer services and the gains from trade shared between broker-dealers and retail customers. It provides an analysis of the market for broker-dealer services, treating it as a broad market with multiple segments, and outlines the various services broker-dealers provide, such as managing orders, providing financial advice, holding customer funds, handling trade settlements, and dealing in securities. The document also mentions other entities, such as state-registered investment advisers, commercial banks, and insurance companies, that provide financial advice services, and provides data on the number of such entities as of January 2018.\"\n        },\n        {\n          \"title\": \"Benefits, Costs, and Effects on Efficiency, Competition, and Capital Formation\",\n          \"start_index\": 255,\n          \"end_index\": 258,\n          \"nodes\": [\n            {\n              \"title\": \"Benefits\",\n              \"start_index\": 258,\n              \"end_index\": 272,\n              \"node_id\": \"0035\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest, which establishes a best interest obligation for broker-dealers under the Exchange Act. The main points covered include:\\n\\n1. 
**Best Interest Obligation**: The rule introduces three key components\\u2014Disclosure Obligation, Care Obligation, and Conflict of Interest Obligations\\u2014to ensure broker-dealers act in the best interest of retail customers, enhancing customer protection and addressing agency conflicts.\\n\\n2. **Disclosure Obligation**: Requires broker-dealers to provide written disclosures about their capacity, fees, services, and material conflicts of interest. This aims to reduce informational gaps, improve customer understanding, and enhance the quality of recommendations.\\n\\n3. **Care Obligation**: Mandates broker-dealers to act with diligence, care, skill, and prudence, ensuring recommendations align with the retail customer\\u2019s best interest. This goes beyond existing suitability rules and promotes better-aligned recommendations.\\n\\n4. **Conflict of Interest Obligations**: Requires broker-dealers to establish, maintain, and enforce written policies to identify, disclose, mitigate, or eliminate material conflicts of interest, including those arising from financial incentives. This aims to reduce conflicts, improve recommendation quality, and build customer trust.\\n\\n5. **Benefits**: The regulation is expected to enhance the quality of recommendations, reduce agency conflicts, and improve retail customer welfare. However, the magnitude of these benefits is difficult to quantify due to data limitations and the complexity of assumptions.\\n\\n6. 
**Costs**: The document also acknowledges potential costs associated with implementing the best interest standard and its components, though specific cost estimates are not detailed.\\n\\nThe document emphasizes the flexibility provided to broker-dealers in complying with the obligations and the challenges in quantifying the benefits and costs due to data limitations.\"\n            },\n            {\n              \"title\": \"Costs\",\n              \"start_index\": 272,\n              \"end_index\": 275,\n              \"nodes\": [\n                {\n                  \"title\": \"Standard of Conduct Defined as Best Interest\",\n                  \"start_index\": 275,\n                  \"end_index\": 275,\n                  \"nodes\": [\n                    {\n                      \"title\": \"Operational Costs\",\n                      \"start_index\": 275,\n                      \"end_index\": 277,\n                      \"node_id\": \"0038\",\n                      \"summary\": \"The partial document discusses the proposed Regulation Best Interest, which establishes a best interest standard of conduct for broker-dealers when making recommendations to retail customers. It outlines the operational and programmatic costs associated with implementing the rule, including the need for additional training for broker-dealers and their employees, particularly for those not already adhering to the best interest standard. The document highlights potential incremental costs for firms already aligned with the standard and substantial costs for those that are not. It also addresses the overlap and discrepancies between Regulation Best Interest and other regulations, such as the DOL Fiduciary Rule and the BIC Exemption, and the associated costs of compliance. 
Additionally, it notes that the proposed rule aims to reduce costs related to discrepancies between regulations for retirement and non-retirement accounts and mitigate costs for broker-dealers subject to overlapping regulations.\"\n                    },\n                    {\n                      \"title\": \"Programmatic Costs\",\n                      \"start_index\": 278,\n                      \"end_index\": 280,\n                      \"node_id\": \"0039\",\n                      \"summary\": \"The partial document discusses the potential programmatic costs and legal implications of the proposed Regulation Best Interest rule on broker-dealers. Key points include:\\n\\n1. **Programmatic Costs**: The rule may limit broker-dealers' ability to make certain recommendations, potentially leading to revenue losses if they can no longer recommend higher-cost products that are inconsistent with the proposed best interest obligation but align with FINRA\\u2019s suitability rule. The difficulty in quantifying these losses is noted due to the variability in recommendations based on customer profiles and circumstances.\\n\\n2. **Increased Legal Exposure**: Broker-dealers may face higher costs due to enhanced legal exposure, including potential increases in retail customer arbitrations. The rule introduces an enhanced standard of conduct, which could lead to additional costs for preparation and compliance, as well as enforcement actions.\\n\\n3. **Disclosure Obligation**: The proposed rule establishes explicit disclosure requirements for broker-dealers under the Exchange Act. It aims to create a more uniform level of disclosure regarding the material scope, terms of the broker-dealer and customer relationship, and conflicts of interest. Compliance with the Disclosure Obligation may overlap with requirements of the proposed Relationship Summary and Regulatory Status Disclosure.\\n\\n4. 
**Arbitration Implications**: The document highlights the role of arbitration clauses in brokerage agreements and the potential impact of the rule on the frequency of retail customer arbitrations, though it remains unclear to what extent the rule would affect arbitration numbers.\"\n                    }\n                  ],\n                  \"node_id\": \"0037\",\n                  \"summary\": \"The partial document discusses the establishment of a \\\"best interest\\\" standard of conduct for broker-dealers when making recommendations to retail customers. It highlights that while the rule aims to address conflicts of interest and enhance existing regulatory standards, it does not prohibit recommending higher-cost products if they align with customer needs. The document also examines the operational and programmatic costs associated with implementing the rule, including the need for additional training for broker-dealers. It references existing practices like face-to-face and computer-based training and notes the potential financial implications of compliance, citing related cost estimates from other regulatory frameworks.\"\n                },\n                {\n                  \"title\": \"Disclosure Obligation\",\n                  \"start_index\": 280,\n                  \"end_index\": 286,\n                  \"node_id\": \"0040\",\n                  \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its implications for broker-dealers. Key points include:\\n\\n1. **Disclosure Obligation**: The regulation introduces enhanced disclosure requirements for broker-dealers, including providing detailed information about the scope, terms, fees, and material conflicts of interest in their relationships with retail customers. It aims to improve transparency and uniformity in disclosures, going beyond existing obligations. 
Compliance may involve additional costs and record-keeping requirements, with flexibility in the form, timing, and method of disclosures.\\n\\n2. **Record-Making and Record-Keeping Requirements**: Proposed amendments to Exchange Act Rules 17a-3 and 17a-4 would require broker-dealers to create and retain records of information collected from and provided to retail customers. This imposes significant initial and ongoing costs and burdens on broker-dealers.\\n\\n3. **Care Obligation**: The regulation extends broker-dealers' obligations by requiring recommendations to be in the best interest of retail customers based on their investment profiles. It also mandates that a series of transactions must not be excessive and must align with the customer\\u2019s best interest, even if the broker-dealer does not have control over the account.\\n\\n4. **Cost Implications**: The document provides detailed estimates of the initial and ongoing costs and burdens associated with compliance, including preparation, delivery, and record-keeping efforts, as well as the financial impact on broker-dealers.\"\n                },\n                {\n                  \"title\": \"Obligation to Exercise Reasonable Diligence, Care, Skill, and Prudence in Making a Recommendation\",\n                  \"start_index\": 286,\n                  \"end_index\": 290,\n                  \"node_id\": \"0041\",\n                  \"summary\": \"The partial document discusses the proposed \\\"Care Obligation\\\" under Regulation Best Interest, which enhances broker-dealer responsibilities beyond existing FINRA suitability rules. Key points include:\\n\\n1. **Enhanced Standards for Recommendations**: Broker-dealers must ensure recommendations are in the retail customer\\u2019s best interest, not just suitable, and that a series of transactions is not excessive, regardless of account control.\\n\\n2. 
**Customer Investment Profile**: Broker-dealers are required to collect and evaluate detailed customer investment profile information (e.g., age, financial situation, risk tolerance) to meet the best interest standard.\\n\\n3. **Recordkeeping Requirements**: Proposed amendments to Rule 17a-4(e)(5) mandate broker-dealers retain customer investment profile information and conflict disclosures for six years, imposing additional compliance costs.\\n\\n4. **Conflict of Interest Obligations**: Broker-dealers must establish, maintain, and enforce written policies to identify, disclose, or eliminate material conflicts of interest associated with recommendations, such as proprietary products, share class selection, or account rollovers.\\n\\n5. **Cost Implications**: The proposed rule may increase costs for broker-dealers due to compliance and legal exposure, with potential cost pass-through to retail customers.\\n\\n6. **Comparison to Existing Standards**: The Care Obligation introduces a best interest requirement absent in current suitability rules and removes the control element for evaluating excessive transactions, potentially increasing arbitration risks.\\n\\n7. 
**Regulatory Enhancements**: Regulation Best Interest imposes stricter obligations compared to existing antifraud provisions, as it does not require an element of fraud or deceit to enforce compliance.\"\n                },\n                {\n                  \"title\": \"Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and at a Minimum Disclose, or Eliminate, All Material Conflicts of Interest Associated with a Recommendation\",\n                  \"start_index\": 290,\n                  \"end_index\": 295,\n                  \"nodes\": [\n                    {\n                      \"title\": \"Eliminate Material Conflicts of Interest Associated with a Recommendation\",\n                      \"start_index\": 295,\n                      \"end_index\": 297,\n                      \"node_id\": \"0043\",\n                      \"summary\": \"The partial document discusses the obligations of broker-dealers to address material conflicts of interest associated with their recommendations to retail customers. It outlines two main approaches: \\n\\n1. **Eliminating Material Conflicts of Interest**: Broker-dealers are required to establish policies to eliminate conflicts of interest tied to financial incentives, such as removing incentives for recommending certain products, not offering products with associated incentives, or altering how transactions are executed. This may impact broker-dealer revenue, the range of recommended securities, market liquidity, and the quality of execution.\\n\\n2. **Disclosing Material Conflicts of Interest**: If conflicts are not eliminated, broker-dealers must disclose them through written policies and procedures. 
The document references existing disclosure requirements under antifraud obligations, Exchange Act rules, and FINRA rules, including Rule 10b-5 and Rule 10b-10, which mandate transparency about pricing, markups, and the broker-dealer's role in transactions.\\n\\nThe document emphasizes the importance of compliance with these obligations to mitigate or disclose conflicts and the potential market and operational impacts of these measures.\"\n                    },\n                    {\n                      \"title\": \"At a Minimum Disclose Material Conflicts of Interest Associated with a Recommendation\",\n                      \"start_index\": 297,\n                      \"end_index\": 299,\n                      \"node_id\": \"0044\",\n                      \"summary\": \"The partial document discusses the obligations of broker-dealers under proposed Regulation Best Interest to address material conflicts of interest associated with recommendations. Key points include:\\n\\n1. **Disclosure of Material Conflicts of Interest**: Broker-dealers must establish, maintain, and enforce written policies and procedures to disclose material conflicts of interest that are not eliminated. This includes compliance with existing antifraud obligations, Exchange Act rules, and FINRA rules.\\n\\n2. **Flexibility in Disclosure**: Regulation Best Interest does not prescribe a specific process for disclosure, allowing broker-dealers flexibility to comply in ways consistent with their business practices. Disclosure is seen as a cost-effective alternative to eliminating conflicts, preserving beneficial recommendations for retail customers.\\n\\n3. **Costs of Compliance**: The document acknowledges potential higher costs for broker-dealers to meet enhanced disclosure obligations but notes challenges in quantifying these costs due to variability in current practices and compliance methods.\\n\\n4. 
**Conflict of Interest Obligation**: Broker-dealers must establish, maintain, and enforce written policies and procedures to identify, disclose, mitigate, or eliminate material conflicts of interest arising from financial incentives. Examples include fee structures, employee compensation, sales contests, and third-party compensation practices.\\n\\n5. **Examples of Financial Incentives**: Material conflicts may arise from differential or variable compensation, fees on proprietary products, and principal transactions. Policies should outline how firms identify and address such conflicts.\"\n                    }\n                  ],\n                  \"node_id\": \"0042\",\n                  \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its requirements for broker-dealers, focusing on the Care Obligation and Conflict of Interest Obligations. Key points include:\\n\\n1. **Record-Making and Recordkeeping Obligations**: Broker-dealers must create or modify documents, such as standardized questionnaires, to reflect customer investment profiles, with associated costs detailed in other sections.\\n\\n2. **Conflict of Interest Obligations**: Broker-dealers are required to establish, maintain, and enforce written policies and procedures to identify, disclose, or eliminate material conflicts of interest associated with recommendations. These conflicts may arise from financial incentives, proprietary products, affiliated products, share class recommendations, securities underwriting, account rollovers, and allocation of investment opportunities.\\n\\n3. **Disclosure or Elimination of Conflicts**: Broker-dealers must provide retail customers with specific written disclosures to help them understand conflicts or eliminate conflicts by removing incentives or avoiding certain products.\\n\\n4. 
**Compliance and Supervision**: Broker-dealers must develop risk-based compliance systems to enforce these policies, leveraging existing supervisory systems where possible.\\n\\n5. **Costs and Burdens**: The document outlines significant initial and ongoing costs and burdens for broker-dealers to comply with these obligations, including updates to policies, training, and technology.\\n\\n6. **Dealer Activities and Conflicts**: The document highlights how dealer activities, such as selling proprietary products or acting as market makers, may create conflicts of interest that must be addressed under the proposed regulation.\"\n                },\n                {\n                  \"title\": \"Obligation to Establish, Maintain, and Enforce Written Policies and Procedures Reasonably Designed to Identify and Disclose and Mitigate, or Eliminate, Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation\",\n                  \"start_index\": 299,\n                  \"end_index\": 301,\n                  \"nodes\": [\n                    {\n                      \"title\": \"Eliminate Material Conflicts Arising from Financial Incentives Associated with a Recommendation\",\n                      \"start_index\": 301,\n                      \"end_index\": 304,\n                      \"node_id\": \"0046\",\n                      \"summary\": \"The partial document discusses the conflicts of interest arising from financial incentives in broker-dealer operations, particularly in the context of compensation arrangements with third-party product sponsors. It highlights the financial incentives and conflicts that broker-dealers face when recommending products to retail customers and the potential measures to mitigate or eliminate these conflicts, such as crediting compensation to customers or ceasing recommendations for certain products. 
The document also examines the potential revenue losses for broker-dealers and the impact on retail customers' access to advice if conflicts are eliminated. Additionally, it addresses internal compensation structures for registered representatives, their alignment with broker-dealer incentives, and the potential costs and consequences of eliminating such structures. The document emphasizes the challenges in quantifying these costs and the importance of establishing policies to disclose and mitigate material conflicts of interest, particularly those related to financial incentives, under regulatory obligations.\"\n                    },\n                    {\n                      \"title\": \"Disclose and Mitigate Material Conflicts of Interest Arising from Financial Incentives Associated with a Recommendation\",\n                      \"start_index\": 304,\n                      \"end_index\": 316,\n                      \"node_id\": \"0047\",\n                      \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its implications for broker-dealers, retail customers, and product sponsors. Key points include:\\n\\n1. **Conflict of Interest Obligations**: Broker-dealers are required to establish, maintain, and enforce written policies and procedures to disclose, mitigate, or eliminate material conflicts of interest arising from financial incentives. This includes conflicts related to internal compensation structures and arrangements with product sponsors.\\n\\n2. **Disclosure Requirements**: The regulation mandates broker-dealers to provide retail customers with specific information about material conflicts of interest, enabling informed decision-making. These disclosure obligations go beyond existing requirements.\\n\\n3. 
**Cost Implications**: The document highlights the potential costs for broker-dealers in implementing conflict mitigation measures, such as revenue loss, compliance costs, and changes to compensation structures. Retail customers may also bear costs, including reduced investment choices and potentially lower-quality advice.\\n\\n4. **Flexibility in Compliance**: Broker-dealers are given flexibility to tailor conflict mitigation measures to their business practices, which may vary based on firm size, customer base, and product complexity.\\n\\n5. **Impact on Product Sponsors**: The regulation may affect product sponsors by reducing the availability of certain products through broker-dealers, potentially impacting funding for these products.\\n\\n6. **Market Effects**: The regulation's impact on efficiency, competition, and capital formation is discussed, with a focus on the tradeoff between benefits and costs. It aims to improve the alignment of broker-dealer recommendations with retail customers' best interests while considering potential market disruptions.\\n\\n7. **Challenges in Quantification**: The document notes difficulties in quantifying costs and impacts due to a lack of data and the wide range of assumptions required.\\n\\n8. 
**Examples of Mitigation Measures**: Examples include \\\"product agnostic\\\" compensation structures, clean shares, and surveillance mechanisms to address conflicts of interest.\\n\\nThe document emphasizes the balance between protecting retail customers and the operational and financial implications for broker-dealers and product sponsors.\"\n                    }\n                  ],\n                  \"node_id\": \"0045\",\n                  \"summary\": \"The partial document discusses the obligations of broker-dealers to establish, maintain, and enforce written policies and procedures designed to identify, disclose, and mitigate or eliminate material conflicts of interest arising from financial incentives associated with recommendations. It highlights the types of financial incentives that create conflicts, such as compensation structures, fees, commissions, and third-party arrangements. The document outlines potential policies and procedures broker-dealers could adopt, including compliance reviews, monitoring systems, conflict escalation processes, and training. It also addresses the costs and revenue implications of eliminating such conflicts, including the potential loss of revenue from compensation arrangements with product sponsors and the impact on retail customers' access to advice. The document emphasizes the need for broker-dealers to adapt supervisory systems to meet these requirements.\"\n                }\n              ],\n              \"node_id\": \"0036\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest, which establishes a best interest standard of conduct for broker-dealers when making recommendations to retail customers. Key points include:\\n\\n1. 
**Flexibility for Broker-Dealers**: Broker-dealers are allowed flexibility in addressing conflicts of interest arising from financial incentives, either through disclosure and mitigation or elimination, and in developing supervisory systems tailored to their business practices.\\n\\n2. **Benefits**: The document highlights potential benefits of the regulation, such as improved alignment of broker-dealer recommendations with retail customers' best interests. However, the Commission is unable to quantify these benefits due to a lack of data and the wide range of assumptions required.\\n\\n3. **Costs**: The regulation would impose direct and indirect costs on broker-dealers, retail customers, and other stakeholders. Costs include compliance with Disclosure, Care, and Conflict of Interest Obligations, operational and legal expenses, potential revenue loss from avoiding certain recommendations, and possible limitations on retail customer choice.\\n\\n4. **Operational Costs**: Broker-dealers may incur additional costs for training employees to comply with the enhanced best interest standard, which builds upon existing federal securities laws and SRO rules.\\n\\n5. **Tension and Trade-offs**: The regulation may create tension between broker-dealers' regulatory requirements and their incentives to provide high-quality recommendations, particularly for costly or complex products. While the regulation aims to address conflicts of interest, it does not restrict broker-dealers from recommending higher-cost products if they meet the best interest standard.\\n\\n6. **Standard of Conduct**: The best interest standard is designed to enhance existing broker-dealer obligations, ensuring recommendations align with retail customers' needs and goals.\"\n            }\n          ],\n          \"node_id\": \"0034\",\n          \"summary\": \"The partial document discusses the compliance costs and benefits associated with Regulation Best Interest, a standard of conduct for broker-dealers. 
It highlights the significant compliance costs incurred by firms of varying sizes, with large firms facing higher start-up and ongoing costs. The document outlines the potential benefits of the regulation, including improved investment advice quality, enhanced retail customer protection, and better evaluation of broker-dealer recommendations. It details the three components of the best interest obligation: the Disclosure Obligation, which reduces informational gaps and improves customer understanding of broker-dealer practices; the Care Obligation, which ensures higher-quality advice; and the Conflict of Interest Obligations, which address material conflicts and enhance customer decision-making. The document also acknowledges potential costs, such as reduced product offerings, compliance burdens, and challenges in quantifying the regulation's benefits and costs due to limited data and varying broker-dealer practices.\"\n        },\n        {\n          \"title\": \"Effects on Efficiency, Competition, and Capital Formation\",\n          \"start_index\": 316,\n          \"end_index\": 324,\n          \"node_id\": \"0048\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its potential impacts on broker-dealers, retail customers, product sponsors, and the broader financial market. Key points include:\\n\\n1. **Funding Costs for Product Sponsors**: The rule may impose funding costs on product sponsors due to changes in broker-dealer recommendations, but the magnitude of these costs is difficult to quantify due to data limitations and varying compliance approaches.\\n\\n2. 
**Impact on Efficiency, Competition, and Capital Formation**:\\n   - **Efficiency**: The rule aims to improve the quality of broker-dealer recommendations, potentially enhancing retail customers' portfolio efficiency and capital allocation in the economy.\\n   - **Competition**: The rule could increase competition among broker-dealers by improving customer trust, but it may also impose costs that could reduce competition or lead to higher prices for advice. Dual-registrants may gain a competitive advantage over standalone broker-dealers.\\n   - **Capital Formation**: Enhanced recommendations may lead to increased retail investment, promoting capital formation. However, reduced broker-dealer recommendations for certain products could negatively impact capital allocation efficiency.\\n\\n3. **Product-Specific Impacts**: The rule may lead to increased demand for certain products where gains from trade improve, while reducing recommendations for others, potentially affecting pricing, availability, and competition among product sponsors.\\n\\n4. **Mitigation Measures and Product Sponsor Competition**: Compliance with the rule may shift product sponsor competition from compensation arrangements to product quality, potentially improving capital allocation efficiency.\\n\\n5. **Reasonable Alternatives**: Alternatives to the proposed rule, such as a disclosure-only approach or a principles-based standard, are considered to address the rule's objectives.\"\n        },\n        {\n          \"title\": \"Reasonable Alternatives\",\n          \"start_index\": 324,\n          \"end_index\": 325,\n          \"nodes\": [\n            {\n              \"title\": \"Disclosure-Only Alternative\",\n              \"start_index\": 325,\n              \"end_index\": 327,\n              \"node_id\": \"0050\",\n              \"summary\": \"The partial document discusses alternatives to the proposed Regulation Best Interest, focusing on two main approaches: \\n\\n1. 
**Disclosure-Only Alternative**: This approach would require broker-dealers to disclose all material facts and conflicts of interest without mandating the establishment of policies to mitigate or eliminate such conflicts. It emphasizes increased transparency through disclosures like a relationship summary and regulatory status disclosure. However, it is considered less effective in protecting retail customers as it lacks a best interest standard and places the burden on customers to interpret disclosures.\\n\\n2. **Principles-Based Standard of Conduct Obligation**: This alternative would allow broker-dealers to develop their own standards based on their business models, focusing on providing recommendations in the best interest of customers without explicit requirements to disclose or mitigate conflicts. While offering flexibility and lower compliance costs, it is deemed less effective in reducing harm to retail customers compared to the proposed Regulation Best Interest, which includes explicit obligations for care, conflict mitigation, and acting in the customer\\u2019s best interest.\"\n            },\n            {\n              \"title\": \"Principles-Based Standard of Conduct Obligation\",\n              \"start_index\": 327,\n              \"end_index\": 328,\n              \"node_id\": \"0051\",\n              \"summary\": \"The partial document discusses the evaluation of alternatives to the proposed Regulation Best Interest by the Commission. It covers three main points:\\n\\n1. **Disclosure-Only Rule**: The Commission believes a disclosure-only rule would be less effective in protecting retail customers and reducing investor harm compared to the proposed Regulation Best Interest, which includes additional obligations.\\n\\n2. 
**Principles-Based Standard of Conduct**: This alternative would allow broker-dealers to develop their own standards based on their business models without explicit requirements to disclose, mitigate, or eliminate conflicts of interest. While it offers flexibility and potentially lower compliance costs, the Commission finds it less effective in reducing harm to retail customers due to potential inconsistencies and lack of clear guidance.\\n\\n3. **Fiduciary Standard for Broker-Dealers**: The document briefly mentions the possibility of imposing a fiduciary standard on broker-dealers for retail customers, noting that fiduciary standards vary across different financial institutions.\\n\\nThe Commission concludes that the proposed Regulation Best Interest, with its specific Disclosure, Care, and Conflict of Interest Obligations, is more effective in enhancing investor protection and reducing harm than the alternatives discussed.\"\n            },\n            {\n              \"title\": \"A Fiduciary Standard for Broker-Dealers\",\n              \"start_index\": 328,\n              \"end_index\": 332,\n              \"node_id\": \"0052\",\n              \"summary\": \"The partial document discusses the regulatory standards for broker-dealers and investment advisers, focusing on retail customer protection. It compares principles-based standards, Regulation Best Interest, and fiduciary standards, highlighting their implications for conflicts of interest, investor harm, and market dynamics. The document emphasizes the need for tailored regulatory approaches to address the distinct business models of broker-dealers and investment advisers, noting the episodic nature of broker-dealer relationships versus the ongoing monitoring by investment advisers. It evaluates the potential benefits and drawbacks of a uniform fiduciary standard, including its impact on customer choice, market differentiation, and legal certainty. 
Additionally, it explores an alternative approach involving enhanced standards akin to the DOL\\u2019s BIC Exemption, considering its tradeoffs for retail customers, broker-dealers, and market participants. The document ultimately supports maintaining separate regulatory standards while enhancing protections through Regulation Best Interest and related disclosures.\"\n            },\n            {\n              \"title\": \"Enhanced Standards Akin to Conditions of the BIC Exemption\",\n              \"start_index\": 332,\n              \"end_index\": 335,\n              \"node_id\": \"0053\",\n              \"summary\": \"The partial document discusses the regulatory standards for broker-dealers and investment advisers, focusing on the potential adoption of a fiduciary standard and disclosure requirements similar to the Department of Labor's (DOL) Best Interest Contract (BIC) Exemption. It evaluates the economic effects, tradeoffs, and potential impacts on broker-dealers, retail customers, and the market for investment advice. Key points include:\\n\\n1. Maintaining separate regulatory standards for broker-dealers and investment advisers while enhancing retail customer protections through Regulation Best Interest and Form CRS Relationship Summary Disclosure.\\n2. Considering an alternative fiduciary standard for broker-dealers, akin to the BIC Exemption, applicable to all retail accounts, not just retirement accounts.\\n3. Analyzing the potential costs and benefits of such a standard, including increased compliance costs for broker-dealers, potential price increases for retail customers, and possible market exits or consolidations among broker-dealers and investment advisers.\\n4. Exploring competitive effects between broker-dealers, investment advisers, and other financial advice providers, as well as the potential shift from commission-based to fee-based accounts.\\n5. 
Highlighting challenges in quantifying costs and benefits and acknowledging differences in regulatory focus between the Commission and the DOL.\\n6. Requesting public comments on the economic analysis, including the identification of problems, benefits, costs, and alternative approaches.\"\n            }\n          ],\n          \"node_id\": \"0049\",\n          \"summary\": \"The partial document discusses the potential impacts of the \\\"best interest\\\" standard on broker-dealer recommendations, capital formation, and portfolio allocation efficiency. It highlights how compliance with the best interest obligation could shift competition among product sponsors toward product quality, potentially improving capital allocation efficiency. The document also explores alternatives to the proposed Regulation Best Interest, including a disclosure-only alternative, a principles-based standard, a fiduciary standard, and enhanced standards similar to the BIC Exemption. The disclosure-only alternative is detailed, emphasizing increased transparency through material fact and conflict disclosures, which could benefit retail customers by providing more information about broker-dealer relationships and conflicts of interest.\"\n        },\n        {\n          \"title\": \"Request for Comment\",\n          \"start_index\": 335,\n          \"end_index\": 338,\n          \"node_id\": \"0054\",\n          \"summary\": \"The partial document discusses the potential economic impacts, costs, and benefits of requiring broker-dealers to comply with a fiduciary standard and conditions similar to the BIC Exemption. It highlights the challenges in quantifying these impacts and notes differences in regulatory approaches between the Commission and the Department of Labor. 
The document includes a detailed request for public comments on various aspects of the proposed regulations, including the characterization of broker-dealer and retail customer relationships, financial incentives, benefits, costs, and assumptions underlying the analysis. It seeks input on the effects of the proposed rule on efficiency, competition, and capital formation, as well as alternative approaches and their potential impacts. Additionally, it raises questions about the treatment of discretionary investment advice and its implications for broker-dealers and retail customers.\"\n        }\n      ],\n      \"node_id\": \"0027\",\n      \"summary\": \"The partial document discusses the potential impacts of regulatory harmonization on investors, including their choices of financial firms and payment options for financial advice. It explores interactions between Regulation Best Interest and state fiduciary standards, comparing current state standards with proposed regulations. Additionally, the document introduces the economic analysis of proposed regulations, focusing on their primary goals, including promoting efficiency, competition, capital formation, and investor protection, while considering the costs, benefits, and competitive impacts as required by the Exchange Act.\"\n    },\n    {\n      \"title\": \"PAPERWORK REDUCTION ACT ANALYSIS\",\n      \"start_index\": 338,\n      \"end_index\": 340,\n      \"nodes\": [\n        {\n          \"title\": \"Respondents Subject to Proposed Regulation Best Interest and Proposed Amendments to Rule 17a-3(a)(25), Rule 17a-4(e)(5)\",\n          \"start_index\": 340,\n          \"end_index\": 340,\n          \"nodes\": [\n            {\n              \"title\": \"Broker-Dealers\",\n              \"start_index\": 340,\n              \"end_index\": 340,\n              \"node_id\": \"0057\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest, which aims to impose a best interest 
obligation on broker-dealers and their associated persons when making securities recommendations to retail customers. It highlights the flexibility provided to broker-dealers in meeting these obligations and outlines assumptions regarding compliance with Regulation Best Interest and amendments to Rules 17a-3(a)(25) and 17a-4(e)(5). The document provides data on the number of broker-dealers registered with the Commission as of December 31, 2017, noting that approximately 74.4% of them have retail customers and would likely be subject to the proposed regulations. It also addresses the application of the best interest obligation to natural persons associated with broker-dealers.\"\n            },\n            {\n              \"title\": \"Natural Persons Who Are Associated Persons of Broker-Dealers\",\n              \"start_index\": 340,\n              \"end_index\": 341,\n              \"node_id\": \"0058\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest and its implications for broker-dealers and associated persons. It outlines the best interest obligation imposed on broker-dealers and their representatives when making recommendations to retail customers regarding securities transactions or investment strategies. The document provides data on the number of broker-dealers and associated persons likely affected by the regulation, including standalone broker-dealers, dually-registered firms, and retail-facing licensed representatives. It also details the requirements for compliance, such as disclosing material facts and conflicts of interest in writing to retail customers. 
Additionally, it references proposed amendments to Rules 17a-3(a)(25) and 17a-4(e)(5) and includes preliminary estimates of the affected population based on regulatory filings.\"\n            }\n          ],\n          \"node_id\": \"0056\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest, which aims to impose a best interest obligation on broker-dealers and their associated persons when making securities or investment strategy recommendations to retail customers. It highlights the flexibility provided to broker-dealers in meeting these obligations and includes assumptions about compliance with Regulation Best Interest and amendments to Rules 17a-3(a)(25) and 17a-4(e)(5). The document provides data on the number of broker-dealers registered with the Commission as of December 31, 2017, noting that approximately 74.4% of them have retail customers and would likely be subject to the proposed regulations. It also extends the best interest obligation to natural persons associated with broker-dealers.\"\n        },\n        {\n          \"title\": \"Summary of Collections of Information\",\n          \"start_index\": 341,\n          \"end_index\": 342,\n          \"nodes\": [\n            {\n              \"title\": \"Conflict of Interest Obligations\",\n              \"start_index\": 342,\n              \"end_index\": 352,\n              \"node_id\": \"0060\",\n              \"summary\": \"The partial document discusses the obligations and requirements under Regulation Best Interest for broker-dealers, focusing on conflict of interest policies, record-making and retention obligations, and associated costs and burdens. Key points include:\\n\\n1. **Conflict of Interest Obligations**: Broker-dealers must establish, maintain, and enforce written policies to identify, disclose, mitigate, or eliminate material conflicts of interest, including those arising from financial incentives. 
These policies aim to ensure recommendations are in the best interest of retail customers.\\n\\n2. **Record-Making and Retention Requirements**: Proposed amendments to Rules 17a-3(a)(25) and 17a-4(e)(5) introduce new obligations for broker-dealers to document and retain compliance-related records.\\n\\n3. **Costs and Burdens**: The document estimates initial and ongoing costs and burdens for broker-dealers to comply with these obligations, including:\\n   - Developing and updating written policies and procedures.\\n   - Identifying and managing material conflicts of interest.\\n   - Modifying technological infrastructure for conflict identification.\\n   - Training registered representatives on compliance with Regulation Best Interest.\\n\\n4. **Training Programs**: Broker-dealers are expected to develop and implement training modules for registered representatives, with initial and ongoing training requirements.\\n\\nThe document provides detailed cost and burden estimates for both small and large broker-dealers, highlighting variations based on size and complexity of operations.\"\n            },\n            {\n              \"title\": \"Disclosure Obligation\",\n              \"start_index\": 353,\n              \"end_index\": 370,\n              \"node_id\": \"0061\",\n              \"summary\": \"The partial document discusses the proposed Regulation Best Interest, focusing on the Disclosure Obligation for broker-dealers when recommending securities transactions or strategies to retail customers. Key points include:\\n\\n1. **Disclosure Obligation**: Broker-dealers must disclose, in writing, material facts about the scope and terms of their relationship with retail customers and all material conflicts of interest associated with recommendations. This aims to enhance customer understanding of services, fees, and conflicts of interest.\\n\\n2. 
**Disclosure of Capacity, Fees, and Services**: Broker-dealers must provide standardized account disclosures, including their capacity (e.g., broker-dealer or dual-registrant), comprehensive fee schedules, and the types and scope of services offered. These disclosures must be updated and delivered to customers at the beginning of the relationship or when material changes occur.\\n\\n3. **Disclosure of Conflicts of Interest**: Broker-dealers are required to disclose all material conflicts of interest through standardized documents, updated annually or as needed, and delivered to customers.\\n\\n4. **Costs and Burdens**: The document estimates the initial and ongoing costs and burdens for broker-dealers to comply with these obligations, including drafting, reviewing, and delivering disclosures. Costs vary based on the size of the broker-dealer and the complexity of their services.\\n\\n5. **Record-Making and Recordkeeping**: Proposed amendments to Rules 17a-3(a)(25) and 17a-4(e)(5) require broker-dealers to maintain records of information collected from and provided to retail customers, aiding compliance, supervision, and regulatory examinations.\"\n            },\n            {\n              \"title\": \"Care Obligation\",\n              \"start_index\": 370,\n              \"end_index\": 370,\n              \"node_id\": \"0062\",\n              \"summary\": \"The partial document discusses the estimated ongoing burden hours for broker-dealers under proposed Regulation Best Interest, specifically focusing on the Care Obligation and Record-making and Recordkeeping Obligations. It outlines the requirements for broker-dealers to assess the risks and rewards of recommendations to ensure they are in the best interest of retail customers. Additionally, it details the record-making requirements under proposed Rule 17a-3(a)(25), which include maintaining records of information collected from and provided to retail customers. 
The document also highlights the purpose of these records in aiding compliance, supervision, and regulatory examinations or investigations, and provides calculations for the estimated burden hours associated with these obligations.\"\n            },\n            {\n              \"title\": \"Record-Making and Recordkeeping Obligations\",\n              \"start_index\": 370,\n              \"end_index\": 375,\n              \"node_id\": \"0063\",\n              \"summary\": \"The partial document discusses the estimated costs, burdens, and obligations associated with proposed Regulation Best Interest and amendments to Rules 17a-3(a)(25) and 17a-4(e)(5) for broker-dealers. Key points include:\\n\\n1. **Care Obligation**: Broker-dealers must assess the risks and rewards of recommendations to ensure they align with the best interests of retail customers. Related costs and burdens are addressed under Rule 17a-3(a)(25).\\n\\n2. **Record-Making Obligations**: Broker-dealers are required to document information collected from and provided to retail customers, including the identity of associated persons responsible for accounts. Initial and ongoing costs for compliance, including updates to account disclosure documents, are detailed.\\n\\n3. **Recordkeeping Obligations**: Broker-dealers must retain records for at least six years, leveraging existing systems for compliance. Initial and ongoing burdens for maintaining and updating records, including account documents, fee schedules, and conflict disclosures, are quantified.\\n\\n4. **Cost Estimates**: The document provides detailed calculations of aggregate and per-broker-dealer costs and burden hours for compliance with the proposed rules.\\n\\n5. **Mandatory Compliance**: The collection of information is mandatory for all broker-dealers, with certain disclosures not kept confidential.\\n\\n6. 
**Request for Comments**: The document seeks feedback on assumptions regarding costs, storage requirements, and compliance burdens.\"\n            }\n          ],\n          \"node_id\": \"0059\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest, which requires broker-dealers to act in the best interest of retail customers when recommending securities transactions or investment strategies. Key points include: \\n\\n1. The regulation applies to approximately 435,071 retail-facing, licensed representatives at standalone broker-dealers or dually-registered firms.\\n2. The best interest obligation is satisfied through reasonable disclosure of material facts, exercising diligence and care in recommendations, and establishing written policies to identify, disclose, mitigate, or eliminate material conflicts of interest.\\n3. Proposed amendments to Rules 17a-3(a)(25) and 17a-4(e)(5) introduce new record-making and record-retention obligations for broker-dealers.\\n4. The regulation imposes distinct information collection requirements and associated costs for broker-dealers, particularly regarding conflict of interest obligations, which require broker-dealer entities to maintain policies addressing material conflicts of interest.\"\n        },\n        {\n          \"title\": \"Collection of Information is Mandatory\",\n          \"start_index\": 375,\n          \"end_index\": 375,\n          \"node_id\": \"0064\",\n          \"summary\": \"The partial document discusses the ongoing costs and burdens associated with the proposed amendments to Rule 17a-4(e)(5) and Rule 17a-3(a)(25), estimating an annual burden of 3.17 million hours for recordkeeping. It highlights that compliance costs for the retention schedule are not expected to change from current levels but seeks comments on the frequency and additional costs of record collection, updates, and retention. 
The document also notes that the collection of information under \\\"Regulation Best Interest\\\" and the proposed amendments to Rules 17a-3 and 17a-4 is mandatory for broker-dealers. Additionally, it specifies that written disclosures to retail customers under Regulation Best Interest would not be confidential, while other information may be.\"\n        },\n        {\n          \"title\": \"Confidentiality\",\n          \"start_index\": 375,\n          \"end_index\": 376,\n          \"node_id\": \"0065\",\n          \"summary\": \"The partial document discusses the ongoing costs and burdens associated with proposed amendments to Rule 17a-4(e)(5) and related recordkeeping requirements, estimating an annual burden of 3.17 million hours. It addresses compliance costs, the frequency of record updates, and requests comments on potential additional costs. The document outlines mandatory information collection requirements under \\\"Regulation Best Interest\\\" and amendments to Rules 17a-3 and 17a-4, noting that certain disclosures to retail customers are not confidential, while information provided to the Commission during examinations or investigations is confidential. The Commission seeks public comments on burden estimates, associated costs, and ways to improve the quality and utility of the information collected, as well as feedback on other issues related to Regulation Best Interest.\"\n        },\n        {\n          \"title\": \"Request for Comment\",\n          \"start_index\": 376,\n          \"end_index\": 377,\n          \"node_id\": \"0066\",\n          \"summary\": \"The partial document discusses the confidentiality of information provided to the Commission during examinations or investigations, subject to applicable law. It includes a request for public comments on the estimated reporting burdens and associated costs of Regulation Best Interest, as well as proposed amendments to Rules 17a-3 and 17a-4. 
The Commission seeks feedback on various aspects, including the number of associated persons and broker-dealers making securities-related recommendations, unaddressed costs or burdens, and ways to improve the quality and clarity of information collection. Additionally, it invites comments on minimizing the burden of information collection through technology. The document also addresses the Small Business Regulatory Enforcement Fairness Act (SBREFA), requiring the Commission to determine if the proposed regulation qualifies as a \\\"major\\\" rule, defined by significant economic impact, such as an annual effect of $100 million or more.\"\n        }\n      ],\n      \"node_id\": \"0055\",\n      \"summary\": \"The partial document discusses the proposed rules and amendments under the Regulation Best Interest framework, focusing on the obligations of broker-dealers and their associated persons when making recommendations to retail customers. It seeks public comments on the costs, benefits, and potential alternatives to the proposed rules, as well as their impact on efficiency, competition, and capital formation. The document also addresses the Paperwork Reduction Act (PRA) analysis, detailing new \\\"collection of information\\\" requirements and their submission to the Office of Management and Budget (OMB) for approval. Key provisions include improving disclosure about broker-dealer relationships, enhancing recommendation quality, mitigating conflicts of interest, and providing flexibility for broker-dealers in compliance. 
The document provides data on the number of broker-dealers and associated persons potentially affected by the proposed rules.\"\n    },\n    {\n      \"title\": \"SMALL BUSINESS REGULATORY ENFORCEMENT FAIRNESS ACT\",\n      \"start_index\": 377,\n      \"end_index\": 378,\n      \"node_id\": \"0067\",\n      \"summary\": \"The partial document discusses the evaluation of methods to minimize the burden of information collection, including automated techniques, and provides instructions for submitting comments on Regulation Best Interest to the Office of Management and Budget (OMB) and the Securities and Exchange Commission (SEC). It outlines the requirements under the Small Business Regulatory Enforcement Fairness Act (SBREFA) to determine if a proposed regulation is a \\\"major\\\" rule based on its economic impact, cost implications, or effects on competition, investment, or innovation. The document also requests public comments on the potential economic and industry impacts of Regulation Best Interest and includes an Initial Regulatory Flexibility Act (RFA) analysis, which requires federal agencies to assess the impact of proposed rules on small entities.\"\n    },\n    {\n      \"title\": \"INITIAL REGULATORY FLEXIBILITY ACT ANALYSIS\",\n      \"start_index\": 378,\n      \"end_index\": 379,\n      \"nodes\": [\n        {\n          \"title\": \"Reasons for and Objectives of the Proposed Action\",\n          \"start_index\": 379,\n          \"end_index\": 381,\n          \"node_id\": \"0069\",\n          \"summary\": \"The partial document discusses the proposed Regulation Best Interest by the Commission, which aims to establish a standard of conduct for broker-dealers and associated persons when making recommendations to retail customers. Key points include:\\n\\n1. **Proposed Standard of Conduct**: Broker-dealers must act in the best interest of retail customers, avoiding prioritization of their own financial interests. 
This includes disclosing material facts, exercising diligence, and addressing conflicts of interest through written policies.\\n\\n2. **Objectives of Regulation Best Interest**: \\n   - Enhance the quality of broker-dealer recommendations.\\n   - Improve disclosure of conflicts of interest and relationship terms.\\n   - Reduce investor confusion and align broker-dealer obligations with investor expectations.\\n   - Facilitate consistent regulation across retirement and non-retirement assets.\\n   - Preserve investor choice and access to affordable advice and products.\\n\\n3. **Record-Making and Retention Obligations**: Amendments to Rules 17a-3 and 17a-4 would impose new requirements for broker-dealers to document and retain information related to recommendations made under Regulation Best Interest.\\n\\n4. **Legal Basis**: The proposal is grounded in the Dodd-Frank Act and various sections of the Exchange Act.\\n\\n5. **Impact on Small Entities**: The document outlines criteria for small broker-dealers subject to the proposed rule, focusing on those with total capital below $500,000.\"\n        },\n        {\n          \"title\": \"Legal Basis\",\n          \"start_index\": 381,\n          \"end_index\": 381,\n          \"node_id\": \"0070\",\n          \"summary\": \"The partial document discusses proposed amendments to SEC rules impacting broker-dealers under Regulation Best Interest. It outlines new record-making obligations under Rule 17a-3(a)(25) and new record retention requirements under Rule 17a-4(e)(5). These amendments would require broker-dealers to document and retain all information collected from and provided to retail customers, including the identity of associated persons responsible for accounts, for six years. The legal basis for these changes is rooted in the Dodd-Frank Act and various sections of the Exchange Act. 
Additionally, the document addresses the applicability of these rules to small entities, defining criteria for broker-dealers considered small entities under the Regulatory Flexibility Act (RFA).\"\n        },\n        {\n          \"title\": \"Small Entities Subject to the Proposed Rule\",\n          \"start_index\": 381,\n          \"end_index\": 382,\n          \"node_id\": \"0071\",\n          \"summary\": \"The partial document discusses proposed amendments to SEC rules under Regulation Best Interest, specifically the addition of paragraph (a)(25) to Rule 17a-3 and revisions to Rule 17a-4(e)(5). These amendments would impose new record-making and record retention obligations on broker-dealers, requiring them to document and retain information collected from and provided to retail customers for six years. The legal basis for these changes is rooted in the Dodd-Frank Act and various sections of the Exchange Act. The document also addresses the impact on small entities, defining criteria for small broker-dealers and estimating that approximately 802 small entities would be affected. Additionally, it outlines the projected compliance requirements for small entities, including reporting, recordkeeping, and other obligations under the proposed rules.\"\n        },\n        {\n          \"title\": \"Projected Compliance Requirements of the Proposed Rule for Small Entities\",\n          \"start_index\": 382,\n          \"end_index\": 383,\n          \"nodes\": [\n            {\n              \"title\": \"Conflict of Interest Obligations\",\n              \"start_index\": 383,\n              \"end_index\": 386,\n              \"node_id\": \"0073\",\n              \"summary\": \"The partial document discusses amendments to Rules 17a-3(a)(25) and 17a-4(e)(5) and their impact on small entities, focusing on compliance with proposed Regulation Best Interest. Key points include:\\n\\n1. 
**Conflict of Interest Obligations**:  \\n   - Updating written policies and procedures with the help of outside and in-house legal counsel, with associated costs and burdens.  \\n   - Identifying material conflicts of interest through technology modifications and ongoing reviews, involving costs for programmers and compliance personnel.  \\n\\n2. **Training Requirements**:  \\n   - Development of computerized training modules for registered representatives, including costs for external analysts and programmers.  \\n   - Initial and ongoing training for representatives, with associated time and cost burdens.  \\n\\nThe document provides detailed cost estimates and burden hours for small entities to comply with these obligations.\"\n            },\n            {\n              \"title\": \"Disclosure Obligations\",\n              \"start_index\": 387,\n              \"end_index\": 394,\n              \"node_id\": \"0074\",\n              \"summary\": \"The partial document discusses the disclosure obligations under the proposed Regulation Best Interest, focusing on the requirements for small entities to disclose material facts about their relationship with retail customers, including capacity, fees, charges, types, and scope of services, as well as material conflicts of interest. It provides detailed estimates of the initial and ongoing costs and burdens for small entities, including internal and external costs for drafting, reviewing, and delivering standardized disclosure documents. The document also addresses the obligations for updating disclosures annually and delivering amended documents in case of material changes. Additionally, it covers the record-making and recordkeeping obligations under proposed amendments to Rule 17a-3(a)(25) and Rule 17a-4(e)(5), noting that small entities are already making relevant records and would not face significant additional burdens. 
The document emphasizes compliance with the enhanced best interest standard and provides detailed calculations of time and cost estimates for various compliance activities.\"\n            },\n            {\n              \"title\": \"Obligation to Exercise Reasonable Diligence, Care, Skill and Prudence\",\n              \"start_index\": 394,\n              \"end_index\": 394,\n              \"node_id\": \"0075\",\n              \"summary\": \"The partial document discusses the obligations of small entities under proposed regulations, specifically focusing on the duty to exercise reasonable diligence, care, skill, and prudence when making recommendations, which is not expected to impose additional costs or burdens. It also addresses record-making and recordkeeping obligations under proposed amendments to Rule 17a-3(a)(25) and Rule 17a-4(e)(5). The document highlights that small entities are already maintaining records of customer investment profiles and would not face additional record-making obligations, except for ensuring compliance with the enhanced best interest standard of Regulation Best Interest.\"\n            },\n            {\n              \"title\": \"Record-Making and Recordkeeping Obligations\",\n              \"start_index\": 394,\n              \"end_index\": 397,\n              \"node_id\": \"0076\",\n              \"summary\": \"The partial document discusses the obligations of small entities under proposed amendments to regulations, specifically focusing on the following main points:\\n\\n1. **Obligation to Exercise Reasonable Diligence, Care, Skill, and Prudence**: The document emphasizes that this obligation would not impose additional costs or burdens on small entities beyond their current practices.\\n\\n2. **Record-Making Obligations**: Proposed Rule 17a-3(a)(25) would require broker-dealers, including small entities, to document information collected from and provided to retail customers under Regulation Best Interest. 
The document estimates the costs and time burdens for small entities to comply with these requirements, including amending existing account disclosure documents and identifying associated persons responsible for accounts.\\n\\n3. **Recordkeeping Obligations**: Small entities would need to retain specific records, such as relationship summaries, account disclosures, fee schedules, and conflict disclosures, for six years. The document outlines the initial and ongoing time burdens for small entities to integrate these requirements into their existing recordkeeping systems.\\n\\n4. **Consistency with Other Federal Rules**: The document analyzes potential overlaps or conflicts with other federal rules, such as the DOL Fiduciary Rule and related exemptions, concluding that the principles of Regulation Best Interest are generally consistent with these existing rules.\"\n            }\n          ],\n          \"node_id\": \"0072\",\n          \"summary\": \"The partial document discusses the compliance requirements and associated costs for small entities under the proposed Regulation Best Interest and amendments to Rules 17a-3 and 17a-4. It estimates the number of small retail broker-dealers affected and outlines the projected reporting, recordkeeping, and compliance obligations. Key points include the need for small entities to update written policies and procedures, identify material conflicts of interest, and develop training programs to ensure compliance. 
The document provides cost estimates for these obligations, including reliance on outside legal counsel and in-house review, and highlights the aggregate financial and time burdens for small entities.\"\n        },\n        {\n          \"title\": \"Duplicative, Overlapping, or Conflicting Federal Rules\",\n          \"start_index\": 397,\n          \"end_index\": 398,\n          \"node_id\": \"0077\",\n          \"summary\": \"The partial document discusses the estimated ongoing burden for small entities associated with the proposed amendment to Rule 17a-4(e)(5), calculated at 261.5 burden hours per year. It analyzes duplicative, overlapping, or conflicting federal rules, particularly comparing the principles of Regulation Best Interest with the DOL Fiduciary Rule and related exemptions, concluding they are generally consistent. The document also explores significant alternatives under the Regulatory Flexibility Act (RFA) to minimize the impact on small entities, such as differing compliance requirements, simplification of reporting, or exemptions. However, the Commission preliminarily concludes that exempting small broker-dealers or establishing different requirements would not achieve the proposal's objectives, emphasizing the importance of investor protection benefits for retail customers of both small and large broker-dealers. The proposal aims to enhance the quality of recommendations through a \\\"best interest\\\" obligation under the Exchange Act.\"\n        },\n        {\n          \"title\": \"Significant Alternatives\",\n          \"start_index\": 398,\n          \"end_index\": 401,\n          \"nodes\": [\n            {\n              \"title\": \"Disclosure-Only Alternative\",\n              \"start_index\": 401,\n              \"end_index\": 401,\n              \"node_id\": \"0079\",\n              \"summary\": \"The partial document discusses two alternative approaches to regulatory obligations for broker-dealers. 
The first is the \\\"Disclosure-only alternative,\\\" which would require broker-dealers to disclose all material facts and conflicts but would not mandate acting in the best interest of customers. This approach is considered less effective in protecting retail customers and reducing investor harm compared to the proposed Regulation Best Interest. The second is the \\\"Principles-based alternative,\\\" which would allow broker-dealers to develop their own conduct standards based on their business models without specific regulatory requirements. This approach would rely on existing regulatory baselines, including disclosure obligations under antifraud provisions.\"\n            },\n            {\n              \"title\": \"Principles-Based Alternative\",\n              \"start_index\": 401,\n              \"end_index\": 402,\n              \"node_id\": \"0080\",\n              \"summary\": \"The partial document discusses three alternative regulatory approaches to the proposed Regulation Best Interest for broker-dealers:\\n\\n1. **Disclosure-Only Alternative**: This approach would require broker-dealers to disclose all material facts and conflicts of interest but would not mandate acting in the best interest of customers. While compliance costs for small entities would be lower than the proposed rule, this alternative is considered less effective in protecting retail customers and mitigating investor harm.\\n\\n2. **Principles-Based Alternative**: This approach would allow broker-dealers to develop their own conduct standards based on their business models, offering flexibility and potentially lower compliance costs. However, it is deemed less effective in providing clear standards for customer protection and could increase liability costs due to lack of clarity.\\n\\n3. 
**Enhanced Standards Akin to BIC Exemption**: This alternative would impose a fiduciary standard with disclosure and other requirements similar to the DOL\\u2019s Best Interest Contract (BIC) Exemption, applying to all retail accounts. While it may reduce economic effects for broker-dealers already complying with the BIC Exemption, it could significantly increase costs for others.\\n\\nThe document evaluates these alternatives in terms of effectiveness, compliance costs, and customer protection compared to the proposed Regulation Best Interest.\"\n            },\n            {\n              \"title\": \"Enhanced Standards Akin to BIC Exemption\",\n              \"start_index\": 402,\n              \"end_index\": 403,\n              \"node_id\": \"0081\",\n              \"summary\": \"The partial document discusses the regulatory considerations and potential impacts of proposed Regulation Best Interest on broker-dealers, including small entities. It evaluates different approaches, such as a less prescriptive, principles-based standard and an enhanced fiduciary standard akin to the DOL\\u2019s BIC Exemption. The document highlights the potential benefits and drawbacks of these approaches, including compliance costs, liability risks, and economic effects on retail customers and broker-dealers. It emphasizes the need for a clear and consistent best interest standard to protect retail customers while minimizing adverse impacts on small entities. 
Additionally, the document includes a request for public comments on the potential effects of Regulation Best Interest on small entities, compliance burdens, and related economic impacts, encouraging empirical data to support feedback.\"\n            }\n          ],\n          \"node_id\": \"0078\",\n          \"summary\": \"The partial document discusses the analysis of regulatory alternatives under the Regulatory Flexibility Act (RFA) to minimize the impact on small entities while achieving the objectives of proposed Regulation Best Interest and related amendments. Key points include:\\n\\n1. **Alternatives for Small Entities**: The document evaluates alternatives such as differing compliance requirements, simplification of reporting, performance-based standards, and exemptions for small entities. However, the Commission does not support exemptions or differing requirements for small broker-dealers, emphasizing consistent investor protection across all entities.\\n\\n2. **Investor Protection Goals**: The proposal aims to enhance the quality of broker-dealer recommendations to retail customers by establishing a \\\"best interest\\\" obligation, applicable to both small and large broker-dealers.\\n\\n3. **Flexibility in Compliance**: The proposal allows broker-dealers flexibility in meeting obligations, such as tailoring systems to their business models and focusing on areas of greatest risk. Small entities with fewer conflicts may require simpler policies.\\n\\n4. **Regulatory Alternatives Considered**: The Commission considered alternatives like a disclosure-only approach, a principles-based standard, a fiduciary standard, and an enhanced standard akin to the BIC Exemption. These alternatives were deemed less effective in protecting retail customers compared to the proposed rule.\\n\\n5. 
**Disclosure-Only Alternative**: This approach would require broker-dealers to disclose material facts and conflicts but would not mandate acting in the customer's best interest, making it less effective in reducing investor harm.\\n\\n6. **Principles-Based Alternative**: This would allow broker-dealers to develop their own conduct standards based on their business models but lacks the direct requirements of the proposed rule, potentially reducing its effectiveness in ensuring investor protection.\"\n        },\n        {\n          \"title\": \"General Request for Comment\",\n          \"start_index\": 403,\n          \"end_index\": 403,\n          \"node_id\": \"0082\",\n          \"summary\": \"The partial document discusses the potential economic and regulatory impacts of requiring broker-dealers to comply with a fiduciary standard and conditions similar to the BIC Exemption. It highlights concerns about costs to broker-dealers, including small entities, and the potential effects on retail customers and the investment advice market. The document also includes a general request for public comments on the impact of Regulation Best Interest, particularly on small entities, compliance burdens, and any unconsidered effects, encouraging empirical data to support feedback.\"\n        }\n      ],\n      \"node_id\": \"0068\",\n      \"summary\": \"The partial document discusses the following main points:\\n\\n1. **Major Rule Implications**: It outlines the criteria for a rule to be considered \\\"major,\\\" including significant cost increases for consumers or industries or adverse effects on competition, investment, or innovation. Major rules are subject to a 60-day delay for Congressional review.\\n\\n2. **Request for Comments**: The Commission seeks public comments on the potential impact of Regulation Best Interest and a proposed amendment to Rule 17a-4(e)(5) on the U.S. 
economy, costs for consumers or industries, and effects on competition, investment, or innovation. Commenters are encouraged to provide empirical data.\\n\\n3. **Regulatory Flexibility Act (RFA) Analysis**: The document highlights the RFA requirement for federal agencies to assess the impact of proposed rules on small entities. It notes that a regulatory flexibility analysis is not required if the proposed rules do not significantly impact a substantial number of small entities.\\n\\n4. **Proposed Regulation Best Interest**: The Commission proposes a standard of conduct for broker-dealers and associated persons when recommending securities transactions or investment strategies to retail customers. The standard requires acting in the best interest of the customer, disclosing material facts and conflicts of interest, and exercising reasonable diligence, care, and skill.\"\n    },\n    {\n      \"title\": \"STATUTORY AUTHORITY AND TEXT OF PROPOSED RULE\",\n      \"start_index\": 403,\n      \"end_index\": 408,\n      \"node_id\": \"0083\",\n      \"summary\": \"The partial document outlines the proposed \\\"Regulation Best Interest\\\" by the SEC, which establishes a fiduciary standard for broker-dealers when providing investment advice to retail customers. Key points include:\\n\\n1. **Best Interest Obligation**: Brokers and dealers must act in the best interest of retail customers, prioritizing the customer's interests over their own financial or other interests. This obligation is satisfied through:\\n   - **Disclosure Obligation**: Providing written disclosure of material facts, including conflicts of interest.\\n   - **Care Obligation**: Exercising diligence, care, and prudence to ensure recommendations align with the customer's investment profile and are not excessive.\\n   - **Conflict of Interest Obligation**: Establishing policies to identify, disclose, mitigate, or eliminate material conflicts of interest.\\n\\n2. 
**Definitions**: The document defines key terms such as \\\"Retail Customer\\\" and \\\"Retail Customer Investment Profile,\\\" which include factors like age, financial situation, risk tolerance, and investment objectives.\\n\\n3. **Recordkeeping Requirements**: Amendments to existing rules (\\u00a7 240.17a-3 and \\u00a7 240.17a-4) require brokers to maintain detailed records of customer information, recommendations, and associated persons responsible for accounts, with a retention period of six years.\\n\\n4. **Request for Comments**: The SEC seeks public input on the economic impact of the regulation, particularly on small entities, and invites empirical data on compliance burdens.\\n\\n5. **Statutory Authority**: The proposal is based on authority granted under the Dodd-Frank Act and the Securities Exchange Act of 1934.\\n\\nThe document emphasizes the regulatory framework's goal of enhancing investor protection while considering the economic implications for brokers, dealers, and small entities.\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/results/earthmover_structure.json",
    "content": "{\n  \"doc_name\": \"earthmover.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Earth Mover\\u2019s Distance based Similarity Search at Scale\",\n      \"start_index\": 1,\n      \"end_index\": 1,\n      \"node_id\": \"0000\"\n    },\n    {\n      \"title\": \"ABSTRACT\",\n      \"start_index\": 1,\n      \"end_index\": 1,\n      \"node_id\": \"0001\"\n    },\n    {\n      \"title\": \"INTRODUCTION\",\n      \"start_index\": 1,\n      \"end_index\": 2,\n      \"node_id\": \"0002\"\n    },\n    {\n      \"title\": \"PRELIMINARIES\",\n      \"start_index\": 2,\n      \"end_index\": 2,\n      \"nodes\": [\n        {\n          \"title\": \"Computing the EMD\",\n          \"start_index\": 3,\n          \"end_index\": 3,\n          \"node_id\": \"0004\"\n        },\n        {\n          \"title\": \"Filter-and-Refinement Framework\",\n          \"start_index\": 3,\n          \"end_index\": 4,\n          \"node_id\": \"0005\"\n        }\n      ],\n      \"node_id\": \"0003\"\n    },\n    {\n      \"title\": \"SCALING UP SSP\",\n      \"start_index\": 4,\n      \"end_index\": 5,\n      \"node_id\": \"0006\"\n    },\n    {\n      \"title\": \"BOOSTING THE REFINEMENT PHASE\",\n      \"start_index\": 5,\n      \"end_index\": 5,\n      \"nodes\": [\n        {\n          \"title\": \"Analysis of EMD Calculation\",\n          \"start_index\": 5,\n          \"end_index\": 6,\n          \"node_id\": \"0008\"\n        },\n        {\n          \"title\": \"Progressive Bounding\",\n          \"start_index\": 6,\n          \"end_index\": 6,\n          \"node_id\": \"0009\"\n        },\n        {\n          \"title\": \"Sensitivity to Refinement Order\",\n          \"start_index\": 6,\n          \"end_index\": 7,\n          \"node_id\": \"0010\"\n        },\n        {\n          \"title\": \"Dynamic Refinement Ordering\",\n          \"start_index\": 7,\n          \"end_index\": 8,\n          \"node_id\": \"0011\"\n        },\n        {\n          \"title\": 
\"Running Upper Bound\",\n          \"start_index\": 8,\n          \"end_index\": 8,\n          \"node_id\": \"0012\"\n        }\n      ],\n      \"node_id\": \"0007\"\n    },\n    {\n      \"title\": \"EXPERIMENTAL EVALUATION\",\n      \"start_index\": 8,\n      \"end_index\": 9,\n      \"nodes\": [\n        {\n          \"title\": \"Performance Improvement\",\n          \"start_index\": 9,\n          \"end_index\": 10,\n          \"node_id\": \"0014\"\n        },\n        {\n          \"title\": \"Scalability Experiments\",\n          \"start_index\": 10,\n          \"end_index\": 11,\n          \"node_id\": \"0015\"\n        },\n        {\n          \"title\": \"Parameter Tuning in DRO\",\n          \"start_index\": 11,\n          \"end_index\": 12,\n          \"node_id\": \"0016\"\n        }\n      ],\n      \"node_id\": \"0013\"\n    },\n    {\n      \"title\": \"RELATED WORK\",\n      \"start_index\": 12,\n      \"end_index\": 12,\n      \"node_id\": \"0017\"\n    },\n    {\n      \"title\": \"CONCLUSION\",\n      \"start_index\": 12,\n      \"end_index\": 12,\n      \"node_id\": \"0018\"\n    },\n    {\n      \"title\": \"ACKNOWLEDGMENT\",\n      \"start_index\": 12,\n      \"end_index\": 12,\n      \"node_id\": \"0019\"\n    },\n    {\n      \"title\": \"REFERENCES\",\n      \"start_index\": 12,\n      \"end_index\": 12,\n      \"node_id\": \"0020\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/results/four-lectures_structure.json",
    "content": "{\n  \"doc_name\": \"four-lectures.pdf\",\n  \"structure\": [\n    {\n      \"title\": \"Preface\",\n      \"start_index\": 1,\n      \"end_index\": 1,\n      \"node_id\": \"0000\"\n    },\n    {\n      \"title\": \"ML at a Glance\",\n      \"start_index\": 2,\n      \"end_index\": 2,\n      \"nodes\": [\n        {\n          \"title\": \"An ML session\",\n          \"start_index\": 2,\n          \"end_index\": 3,\n          \"node_id\": \"0002\"\n        },\n        {\n          \"title\": \"Types and Values\",\n          \"start_index\": 3,\n          \"end_index\": 4,\n          \"node_id\": \"0003\"\n        },\n        {\n          \"title\": \"Recursive Functions\",\n          \"start_index\": 4,\n          \"end_index\": 4,\n          \"node_id\": \"0004\"\n        },\n        {\n          \"title\": \"Raising Exceptions\",\n          \"start_index\": 4,\n          \"end_index\": 5,\n          \"node_id\": \"0005\"\n        },\n        {\n          \"title\": \"Structures\",\n          \"start_index\": 5,\n          \"end_index\": 6,\n          \"node_id\": \"0006\"\n        },\n        {\n          \"title\": \"Signatures\",\n          \"start_index\": 6,\n          \"end_index\": 7,\n          \"node_id\": \"0007\"\n        },\n        {\n          \"title\": \"Coercive Signature Matching\",\n          \"start_index\": 7,\n          \"end_index\": 8,\n          \"node_id\": \"0008\"\n        },\n        {\n          \"title\": \"Functor Declaration\",\n          \"start_index\": 8,\n          \"end_index\": 9,\n          \"node_id\": \"0009\"\n        },\n        {\n          \"title\": \"Functor Application\",\n          \"start_index\": 9,\n          \"end_index\": 9,\n          \"node_id\": \"0010\"\n        },\n        {\n          \"title\": \"Summary\",\n          \"start_index\": 9,\n          \"end_index\": 9,\n          \"node_id\": \"0011\"\n        }\n      ],\n      \"node_id\": \"0001\"\n    },\n    {\n      \"title\": 
\"Programming with ML Modules\",\n      \"start_index\": 10,\n      \"end_index\": 10,\n      \"nodes\": [\n        {\n          \"title\": \"Introduction\",\n          \"start_index\": 10,\n          \"end_index\": 11,\n          \"node_id\": \"0013\"\n        },\n        {\n          \"title\": \"Signatures\",\n          \"start_index\": 11,\n          \"end_index\": 12,\n          \"node_id\": \"0014\"\n        },\n        {\n          \"title\": \"Structures\",\n          \"start_index\": 12,\n          \"end_index\": 13,\n          \"node_id\": \"0015\"\n        },\n        {\n          \"title\": \"Functors\",\n          \"start_index\": 13,\n          \"end_index\": 14,\n          \"node_id\": \"0016\"\n        },\n        {\n          \"title\": \"Substructures\",\n          \"start_index\": 14,\n          \"end_index\": 15,\n          \"node_id\": \"0017\"\n        },\n        {\n          \"title\": \"Sharing\",\n          \"start_index\": 15,\n          \"end_index\": 16,\n          \"node_id\": \"0018\"\n        },\n        {\n          \"title\": \"Building the System\",\n          \"start_index\": 16,\n          \"end_index\": 17,\n          \"node_id\": \"0019\"\n        },\n        {\n          \"title\": \"Separate Compilation\",\n          \"start_index\": 17,\n          \"end_index\": 18,\n          \"node_id\": \"0020\"\n        },\n        {\n          \"title\": \"Good Style\",\n          \"start_index\": 18,\n          \"end_index\": 18,\n          \"node_id\": \"0021\"\n        },\n        {\n          \"title\": \"Bad Style\",\n          \"start_index\": 18,\n          \"end_index\": 19,\n          \"node_id\": \"0022\"\n        }\n      ],\n      \"node_id\": \"0012\"\n    },\n    {\n      \"title\": \"The Static Semantics of Modules\",\n      \"start_index\": 20,\n      \"end_index\": 20,\n      \"nodes\": [\n        {\n          \"title\": \"Elaboration\",\n          \"start_index\": 20,\n          \"end_index\": 21,\n          
\"node_id\": \"0024\"\n        },\n        {\n          \"title\": \"Names\",\n          \"start_index\": 21,\n          \"end_index\": 21,\n          \"node_id\": \"0025\"\n        },\n        {\n          \"title\": \"Decorating Structures\",\n          \"start_index\": 21,\n          \"end_index\": 21,\n          \"node_id\": \"0026\"\n        },\n        {\n          \"title\": \"Decorating Signatures\",\n          \"start_index\": 22,\n          \"end_index\": 23,\n          \"node_id\": \"0027\"\n        },\n        {\n          \"title\": \"Signature Instantiation\",\n          \"start_index\": 23,\n          \"end_index\": 24,\n          \"node_id\": \"0028\"\n        },\n        {\n          \"title\": \"Signature Matching\",\n          \"start_index\": 24,\n          \"end_index\": 25,\n          \"node_id\": \"0029\"\n        },\n        {\n          \"title\": \"Signature Constraints\",\n          \"start_index\": 25,\n          \"end_index\": 25,\n          \"node_id\": \"0030\"\n        },\n        {\n          \"title\": \"Decorating Functors\",\n          \"start_index\": 26,\n          \"end_index\": 26,\n          \"node_id\": \"0031\"\n        },\n        {\n          \"title\": \"External Sharing\",\n          \"start_index\": 26,\n          \"end_index\": 27,\n          \"node_id\": \"0032\"\n        },\n        {\n          \"title\": \"Functors with Arguments\",\n          \"start_index\": 27,\n          \"end_index\": 28,\n          \"node_id\": \"0033\"\n        },\n        {\n          \"title\": \"Sharing Between Argument and Result\",\n          \"start_index\": 28,\n          \"end_index\": 28,\n          \"node_id\": \"0034\"\n        },\n        {\n          \"title\": \"Explicit Result Signatures\",\n          \"start_index\": 28,\n          \"end_index\": 29,\n          \"node_id\": \"0035\"\n        }\n      ],\n      \"node_id\": \"0023\"\n    },\n    {\n      \"title\": \"Implementing an Interpreter in ML\",\n      
\"start_index\": 30,\n      \"end_index\": 32,\n      \"nodes\": [\n        {\n          \"title\": \"Version 1: The Bare Typechecker\",\n          \"start_index\": 32,\n          \"end_index\": 33,\n          \"node_id\": \"0037\"\n        },\n        {\n          \"title\": \"Version 2: Adding Lists and Polymorphism\",\n          \"start_index\": 33,\n          \"end_index\": 37,\n          \"node_id\": \"0038\"\n        },\n        {\n          \"title\": \"Version 3: A Different Implementation of Types\",\n          \"start_index\": 37,\n          \"end_index\": 39,\n          \"node_id\": \"0039\"\n        },\n        {\n          \"title\": \"Version 4: Introducing Variables and Let\",\n          \"start_index\": 39,\n          \"end_index\": 43,\n          \"node_id\": \"0040\"\n        },\n        {\n          \"title\": \"Acknowledgement\",\n          \"start_index\": 43,\n          \"end_index\": 43,\n          \"node_id\": \"0041\"\n        }\n      ],\n      \"node_id\": \"0036\"\n    },\n    {\n      \"title\": \"Appendix A: The Bare Interpreter\",\n      \"start_index\": 44,\n      \"end_index\": 44,\n      \"nodes\": [\n        {\n          \"title\": \"Syntax\",\n          \"start_index\": 44,\n          \"end_index\": 44,\n          \"node_id\": \"0043\"\n        },\n        {\n          \"title\": \"Parsing\",\n          \"start_index\": 44,\n          \"end_index\": 45,\n          \"node_id\": \"0044\"\n        },\n        {\n          \"title\": \"Environments\",\n          \"start_index\": 45,\n          \"end_index\": 45,\n          \"node_id\": \"0045\"\n        },\n        {\n          \"title\": \"Evaluation\",\n          \"start_index\": 45,\n          \"end_index\": 46,\n          \"node_id\": \"0046\"\n        },\n        {\n          \"title\": \"Type Checking\",\n          \"start_index\": 46,\n          \"end_index\": 46,\n          \"node_id\": \"0047\"\n        },\n        {\n          \"title\": \"The Interpreter\",\n          
\"start_index\": 46,\n          \"end_index\": 47,\n          \"node_id\": \"0048\"\n        },\n        {\n          \"title\": \"The Evaluator\",\n          \"start_index\": 47,\n          \"end_index\": 48,\n          \"node_id\": \"0049\"\n        },\n        {\n          \"title\": \"The Typechecker\",\n          \"start_index\": 48,\n          \"end_index\": 49,\n          \"node_id\": \"0050\"\n        },\n        {\n          \"title\": \"The Basics\",\n          \"start_index\": 50,\n          \"end_index\": 52,\n          \"node_id\": \"0051\"\n        }\n      ],\n      \"node_id\": \"0042\"\n    },\n    {\n      \"title\": \"Appendix B: Files\",\n      \"start_index\": 53,\n      \"end_index\": 53,\n      \"node_id\": \"0052\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/results/q1-fy25-earnings_structure.json",
    "content": "{\n  \"doc_name\": \"q1-fy25-earnings.pdf\",\n  \"doc_description\": \"A comprehensive financial report detailing The Walt Disney Company's first-quarter fiscal 2025 performance, including revenue growth, segment highlights, guidance for fiscal 2025, and key financial metrics such as adjusted EPS, operating income, and cash flow.\",\n  \"structure\": [\n    {\n      \"title\": \"THE WALT DISNEY COMPANY REPORTS FIRST QUARTER EARNINGS FOR FISCAL 2025\",\n      \"start_index\": 1,\n      \"end_index\": 1,\n      \"nodes\": [\n        {\n          \"title\": \"Financial Results for the Quarter\",\n          \"start_index\": 1,\n          \"end_index\": 1,\n          \"nodes\": [\n            {\n              \"title\": \"Key Points\",\n              \"start_index\": 1,\n              \"end_index\": 1,\n              \"node_id\": \"0002\",\n              \"summary\": \"The partial document outlines The Walt Disney Company's financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\\n\\n1. **Financial Results**: \\n   - Revenue increased by 5% to $24.7 billion.\\n   - Income before taxes rose by 27% to $3.7 billion.\\n   - Diluted EPS grew by 35% to $1.40.\\n   - Total segment operating income increased by 31% to $5.1 billion, with adjusted EPS up 44% to $1.76.\\n\\n2. **Entertainment Segment**:\\n   - Operating income increased by $0.8 billion to $1.7 billion.\\n   - Direct-to-Consumer operating income rose by $431 million to $293 million, with advertising revenue (excluding Disney+ Hotstar in India) up 16%.\\n   - Disney+ and Hulu subscriptions increased by 0.9 million, while Disney+ subscribers decreased by 0.7 million.\\n   - Content sales/licensing income grew by $536 million, driven by the success of *Moana 2*.\\n\\n3. **Sports Segment**:\\n   - Operating income increased by $350 million to $247 million.\\n   - Domestic ESPN advertising revenue grew by 15%.\\n\\n4. 
**Experiences Segment**:\\n   - Operating income remained at $3.1 billion, with a 6 percentage-point adverse impact due to Hurricanes Milton and Helene and pre-opening expenses for the Disney Treasure.\\n   - Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income increased by 28%.\"\n            }\n          ],\n          \"node_id\": \"0001\",\n          \"summary\": \"The partial document is a report from The Walt Disney Company detailing its financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\\n\\n1. **Financial Performance**:\\n   - Revenue increased by 5% to $24.7 billion.\\n   - Income before taxes rose by 27% to $3.7 billion.\\n   - Diluted EPS grew by 35% to $1.40.\\n   - Total segment operating income increased by 31% to $5.1 billion, with adjusted EPS up 44% to $1.76.\\n\\n2. **Segment Highlights**:\\n   - **Entertainment**: Operating income increased by $0.8 billion to $1.7 billion. Direct-to-Consumer income rose by $431 million, though advertising revenue declined 2% (up 16% excluding Disney+ Hotstar in India). Disney+ and Hulu subscriptions increased slightly, while Disney+ subscribers decreased by 0.7 million. Content sales/licensing income grew, driven by the success of *Moana 2*.\\n   - **Sports**: Operating income increased by $350 million to $247 million, with ESPN domestic advertising revenue up 15%.\\n   - **Experiences**: Operating income remained at $3.1 billion, with adverse impacts from hurricanes and pre-opening expenses for the Disney Treasure. Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income rose by 28%.\\n\\n3. 
**Additional Notes**:\\n   - Non-GAAP financial measures are used for certain metrics.\\n   - Disney+ Hotstar in India saw a significant decline in advertising revenue compared to the previous year.\"\n        },\n        {\n          \"title\": \"Guidance and Outlook\",\n          \"start_index\": 2,\n          \"end_index\": 2,\n          \"nodes\": [\n            {\n              \"title\": \"Star India deconsolidated in Q1\",\n              \"start_index\": 2,\n              \"end_index\": 2,\n              \"node_id\": \"0004\",\n              \"summary\": \"The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For fiscal 2025, the company projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\\u2019s digital strategy, and continued investments in the Experiences segment, expressing confidence in Disney's growth strategy.\"\n            },\n            {\n              \"title\": \"Q2 Fiscal 2025\",\n              \"start_index\": 2,\n              \"end_index\": 2,\n              \"node_id\": \"0005\",\n              \"summary\": \"The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. 
It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes Disney's strong start to the fiscal year, citing achievements in box office performance, improved streaming profitability, ESPN's digital strategy, and the enduring appeal of the Experiences segment.\"\n            },\n            {\n              \"title\": \"Fiscal Year 2025\",\n              \"start_index\": 2,\n              \"end_index\": 2,\n              \"node_id\": \"0006\",\n              \"summary\": \"The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes Disney's creative and financial strength, strong box office performance, improved streaming profitability, advancements in ESPN's digital strategy, and continued global investments in the Experiences segment.\"\n            }\n          ],\n          \"node_id\": \"0003\",\n          \"summary\": \"The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. 
It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\\u2019s digital strategy, and continued investment in global experiences.\"\n        },\n        {\n          \"title\": \"Message From Our CEO\",\n          \"start_index\": 2,\n          \"end_index\": 2,\n          \"node_id\": \"0007\",\n          \"summary\": \"The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\\u2019s digital strategy, and continued investment in global experiences.\"\n        }\n      ],\n      \"node_id\": \"0000\",\n      \"summary\": \"The partial document is a report from The Walt Disney Company detailing its financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\\n\\n1. **Financial Results**:  \\n   - Revenue increased by 5% to $24.7 billion.  \\n   - Income before taxes rose by 27% to $3.7 billion.  
\\n   - Diluted EPS grew by 35% to $1.40.  \\n   - Total segment operating income increased by 31% to $5.1 billion, and adjusted EPS rose by 44% to $1.76.  \\n\\n2. **Entertainment Segment**:  \\n   - Operating income increased by $0.8 billion to $1.7 billion.  \\n   - Direct-to-Consumer operating income rose by $431 million to $293 million, with advertising revenue up 16% (excluding Disney+ Hotstar in India).  \\n   - Disney+ and Hulu subscriptions increased by 0.9 million, while Disney+ subscribers decreased by 0.7 million.  \\n   - Content sales/licensing income grew by $536 million, driven by the success of *Moana 2*.  \\n\\n3. **Sports Segment**:  \\n   - Operating income increased by $350 million to $247 million.  \\n   - Domestic ESPN advertising revenue grew by 15%.  \\n\\n4. **Experiences Segment**:  \\n   - Operating income remained at $3.1 billion, with a 6 percentage-point adverse impact due to Hurricanes Milton and Helene and pre-opening expenses for the Disney Treasure.  \\n   - Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income increased by 28%.  \\n\\nThe report also includes non-GAAP financial measures and notes the impact of Disney+ Hotstar's advertising revenue in India.\"\n    },\n    {\n      \"title\": \"SUMMARIZED FINANCIAL RESULTS\",\n      \"start_index\": 3,\n      \"end_index\": 3,\n      \"nodes\": [\n        {\n          \"title\": \"SUMMARIZED SEGMENT FINANCIAL RESULTS\",\n          \"start_index\": 3,\n          \"end_index\": 3,\n          \"node_id\": \"0009\",\n          \"summary\": \"The partial document provides a summarized overview of financial results for the first quarter of fiscal years 2025 and 2024. Key points include:\\n\\n1. 
**Overall Financial Performance**:\\n   - Revenues increased by 5% from $23,549 million in 2024 to $24,690 million in 2025.\\n   - Income before income taxes rose by 27%.\\n   - Total segment operating income grew by 31%.\\n   - Diluted EPS increased by 35%, and diluted EPS excluding certain items rose by 44%.\\n   - Cash provided by operations increased by 47%, while free cash flow decreased by 17%.\\n\\n2. **Segment Financial Results**:\\n   - Revenue growth was observed in the Entertainment segment (9%) and Experiences segment (3%), while Sports revenue remained flat.\\n   - Segment operating income for Entertainment increased significantly by 95%, while Sports shifted from a loss to a positive income. Experiences segment operating income remained stable.\\n\\n3. **Non-GAAP Measures**:\\n   - The document highlights the use of non-GAAP financial measures such as total segment operating income, diluted EPS excluding certain items, and free cash flow, with references to further details and reconciliations provided elsewhere in the report.\"\n        }\n      ],\n      \"node_id\": \"0008\",\n      \"summary\": \"The partial document provides a summarized overview of financial results for the first quarter of fiscal years 2025 and 2024. Key points include:\\n\\n1. **Overall Financial Performance**:\\n   - Revenues increased by 5% from $23,549 million in 2024 to $24,690 million in 2025.\\n   - Income before income taxes rose by 27%.\\n   - Total segment operating income grew by 31%.\\n   - Diluted EPS increased by 35%, and diluted EPS excluding certain items rose by 44%.\\n   - Cash provided by operations increased by 47%, while free cash flow decreased by 17%.\\n\\n2. **Segment Financial Results**:\\n   - Revenue growth was observed in the Entertainment segment (9%) and Experiences segment (3%), while Sports revenue remained flat.\\n   - Segment operating income for Entertainment increased significantly by 95%, while Sports shifted from a loss to a positive income. 
Experiences segment operating income remained stable.\\n\\n3. **Non-GAAP Measures**:\\n   - The document highlights the use of non-GAAP financial measures such as total segment operating income, diluted EPS excluding certain items, and free cash flow, with references to further details and reconciliations provided in later sections.\"\n    },\n    {\n      \"title\": \"DISCUSSION OF FIRST QUARTER SEGMENT RESULTS\",\n      \"start_index\": 4,\n      \"end_index\": 4,\n      \"nodes\": [\n        {\n          \"title\": \"Star India\",\n          \"start_index\": 4,\n          \"end_index\": 4,\n          \"node_id\": \"0011\",\n          \"summary\": \"The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels, Disney+ Hotstar, and certain RIL-controlled media businesses, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\\u2019s results under \\\"Equity in the income of investees.\\\" Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues and a 95% increase in operating income compared to the prior-year quarter. 
The growth in operating income is attributed to improved results in Content Sales/Licensing and Direct-to-Consumer, partially offset by a decline in Linear Networks.\"\n        },\n        {\n          \"title\": \"Entertainment\",\n          \"start_index\": 4,\n          \"end_index\": 4,\n          \"nodes\": [\n            {\n              \"title\": \"Linear Networks\",\n              \"start_index\": 5,\n              \"end_index\": 5,\n              \"node_id\": \"0013\",\n              \"summary\": \"The partial document provides financial performance details for Linear Networks and Direct-to-Consumer segments for the quarters ending December 28, 2024, and December 30, 2023. Key points include:\\n\\n1. **Linear Networks**:\\n   - Revenue decreased by 7%, with domestic revenue remaining flat and international revenue declining by 31%.\\n   - Operating income decreased by 11%, with domestic income stable and international income dropping by 39%.\\n   - Domestic operating income was impacted by higher programming costs (due to the 2023 guild strikes), lower affiliate revenue (fewer subscribers), lower technology costs, and higher advertising revenue (driven by political advertising but offset by lower viewership).\\n   - International operating income decline was attributed to the Star India Transaction.\\n   - Equity income from investees decreased due to lower income from A+E Television Networks, reduced advertising and affiliate revenue, and the absence of a prior-year gain from an investment sale.\\n\\n2. 
**Direct-to-Consumer**:\\n   - Revenue increased by 9%, driven by higher subscription revenue due to increased pricing and more subscribers, partially offset by unfavorable foreign exchange impacts.\\n   - Operating income improved significantly, moving from a loss in the prior year to a profit, reflecting subscription revenue growth.\"\n            },\n            {\n              \"title\": \"Direct-to-Consumer\",\n              \"start_index\": 5,\n              \"end_index\": 7,\n              \"node_id\": \"0014\",\n              \"summary\": \"The partial document provides a financial performance overview of various segments for the quarter ended December 28, 2024, compared to the prior-year quarter. Key points include:\\n\\n1. **Linear Networks**:\\n   - Revenue decreased by 7%, with domestic revenue flat and international revenue down 31%.\\n   - Operating income decreased by 11%, with domestic income flat and international income down 39%, primarily due to the Star India transaction.\\n   - Equity income from investees declined by 29%, driven by lower income from A+E Television Networks and the absence of a prior-year gain on an investment sale.\\n\\n2. **Direct-to-Consumer (DTC)**:\\n   - Revenue increased by 9%, and operating income improved significantly from a loss of $138 million to a profit of $293 million.\\n   - Growth was driven by higher subscription revenue due to pricing increases and more subscribers, partially offset by higher costs and lower advertising revenue.\\n   - Key metrics showed slight changes in Disney+ and Hulu subscriber numbers, with increases in average monthly revenue per paid subscriber due to pricing adjustments.\\n\\n3. **Content Sales/Licensing and Other**:\\n   - Revenue increased by 34%, and operating income improved significantly, driven by strong theatrical performance, particularly from \\\"Moana 2,\\\" and contributions from \\\"Mufasa: The Lion King.\\\"\\n\\n4. 
**Sports**:\\n   - ESPN revenue grew by 8%, with domestic and international segments showing increases, while Star India revenue dropped by 90%.\\n   - Operating income for ESPN improved by 15%, while Star India shifted from a loss to a small profit.\\n\\nThe document highlights revenue trends, operating income changes, and key drivers for each segment, including programming costs, subscriber growth, pricing adjustments, and content performance.\"\n            },\n            {\n              \"title\": \"Content Sales/Licensing and Other\",\n              \"start_index\": 7,\n              \"end_index\": 7,\n              \"node_id\": \"0015\",\n              \"summary\": \"The partial document discusses the financial performance of Disney's streaming services, content sales, and sports segment. Key points include:\\n\\n1. **Disney+ Revenue**: Domestic and international Disney+ average monthly revenue per paid subscriber increased due to pricing hikes, partially offset by promotional offerings. International revenue also benefited from higher advertising revenue.\\n\\n2. **Hulu Revenue**: Hulu SVOD Only revenue remained stable, with pricing increases offsetting lower advertising revenue. Hulu Live TV + SVOD revenue increased due to pricing hikes.\\n\\n3. **Content Sales/Licensing**: Revenue and operating income improved significantly, driven by strong theatrical distribution results, particularly from \\\"Moana 2,\\\" and contributions from \\\"Mufasa: The Lion King.\\\"\\n\\n4. **Sports Revenue**: ESPN domestic and international revenues grew, while Star India revenue declined sharply. Operating income for ESPN improved, with domestic income slightly down and international losses reduced. 
Star India showed a notable recovery in operating income.\"\n            }\n          ],\n          \"node_id\": \"0012\",\n          \"summary\": \"The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels and the Disney+ Hotstar service in India, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\\u2019s results under \\u201cEquity in the income of investees.\\u201d Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues compared to the prior year, driven by growth in Direct-to-Consumer and Content Sales/Licensing and Other, despite a decline in Linear Networks. Operating income increased by 95%, primarily due to improved results in Content Sales/Licensing and Other and Direct-to-Consumer, partially offset by a decrease in Linear Networks.\"\n        },\n        {\n          \"title\": \"Sports\",\n          \"start_index\": 7,\n          \"end_index\": 7,\n          \"nodes\": [\n            {\n              \"title\": \"Domestic ESPN\",\n              \"start_index\": 8,\n              \"end_index\": 8,\n              \"node_id\": \"0017\",\n              \"summary\": \"The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for the current quarter compared to the prior-year quarter. Key points include:\\n\\n1. 
**Domestic ESPN**: \\n   - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights and changes in the College Football Playoff (CFP) format.\\n   - Increase in advertising revenue due to higher rates.\\n   - Revenue from sub-licensing CFP programming rights.\\n   - Affiliate revenue remained comparable, with rate increases offset by fewer subscribers.\\n\\n2. **International ESPN**: \\n   - Decrease in operating loss driven by higher fees from the Entertainment segment for Disney+ sports content.\\n   - Increased programming and production costs due to higher soccer rights costs.\\n   - Lower affiliate revenue due to fewer subscribers.\\n\\n3. **Star India**: \\n   - Improved operating results due to the absence of significant cricket events in the current quarter compared to the prior-year quarter, which included the ICC Cricket World Cup.\\n\\n4. **Key Metrics for ESPN+**:\\n   - Paid subscribers decreased from 25.6 million to 24.9 million.\\n   - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue.\"\n            },\n            {\n              \"title\": \"International ESPN\",\n              \"start_index\": 8,\n              \"end_index\": 8,\n              \"node_id\": \"0018\",\n              \"summary\": \"The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for the current quarter compared to the prior-year quarter. Key points include:\\n\\n1. 
**Domestic ESPN**: \\n   - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights and changes in the College Football Playoff (CFP) format.\\n   - Increase in advertising revenue due to higher rates.\\n   - Revenue from sub-licensing CFP programming rights.\\n   - Affiliate revenue remained comparable, with rate increases offset by fewer subscribers.\\n\\n2. **International ESPN**: \\n   - Decrease in operating loss driven by higher fees from the Entertainment segment for Disney+ sports content.\\n   - Increased programming and production costs due to higher soccer rights costs.\\n   - Lower affiliate revenue due to fewer subscribers.\\n\\n3. **Star India**: \\n   - Improved operating results due to the absence of significant cricket events in the current quarter compared to the ICC Cricket World Cup in the prior-year quarter.\\n\\n4. **Key Metrics for ESPN+**:\\n   - Paid subscribers decreased from 25.6 million to 24.9 million.\\n   - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue.\"\n            },\n            {\n              \"title\": \"Star India\",\n              \"start_index\": 8,\n              \"end_index\": 8,\n              \"node_id\": \"0019\",\n              \"summary\": \"The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for a specific quarter. Key points include:\\n\\n1. 
**Domestic ESPN**: \\n   - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights, including additional College Football Playoff (CFP) games under a revised format.\\n   - Increase in advertising revenue due to higher rates.\\n   - Revenue from sub-licensing CFP programming rights.\\n   - Affiliate revenue remained comparable to the prior year due to effective rate increases offset by fewer subscribers.\\n\\n2. **International ESPN**: \\n   - Decrease in operating loss driven by higher fees from the Entertainment segment for sports content on Disney+.\\n   - Increased programming and production costs due to higher soccer rights costs.\\n   - Lower affiliate revenue due to fewer subscribers.\\n\\n3. **Star India**: \\n   - Improvement in operating results due to the absence of significant cricket events in the current quarter compared to the prior year, which included the ICC Cricket World Cup.\\n\\n4. **Key Metrics for ESPN+**:\\n   - Paid subscribers decreased from 25.6 million to 24.9 million.\\n   - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue.\"\n            }\n          ],\n          \"node_id\": \"0016\",\n          \"summary\": \"The partial document discusses the financial performance of Disney's streaming services, content sales, and sports segment. Key points include:\\n\\n1. **Disney+ Revenue**: Domestic and international Disney+ average monthly revenue per paid subscriber increased due to pricing hikes, partially offset by promotional offerings. International revenue also benefited from higher advertising revenue.\\n\\n2. **Hulu Revenue**: Hulu SVOD Only revenue remained stable, with pricing increases offsetting lower advertising revenue. Hulu Live TV + SVOD revenue increased due to pricing hikes.\\n\\n3. 
**Content Sales/Licensing**: Revenue and operating income improved significantly, driven by strong theatrical performance, particularly from \\\"Moana 2,\\\" and contributions from \\\"Mufasa: The Lion King.\\\"\\n\\n4. **Sports Revenue**: ESPN domestic and international revenues grew, while Star India revenue declined sharply. Operating income for ESPN improved, with domestic income slightly down and international income showing significant recovery. Star India showed a notable turnaround in operating income.\"\n        },\n        {\n          \"title\": \"Experiences\",\n          \"start_index\": 9,\n          \"end_index\": 9,\n          \"node_id\": \"0020\",\n          \"summary\": \"The partial document provides financial performance details for the Parks & Experiences segment, including revenues and operating income for domestic and international operations, as well as consumer products. It highlights a 3% increase in total revenue and stable operating income compared to the prior year. Domestic parks and experiences were negatively impacted by hurricanes, leading to lower volumes and higher costs, despite increased guest spending. International parks and experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings. The document also notes increased corporate expenses due to a legal settlement and a $143 million loss related to the Star India Transaction.\"\n        }\n      ],\n      \"node_id\": \"0010\",\n      \"summary\": \"The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels, Disney+ Hotstar, and certain RIL-controlled media businesses, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. 
The Company now recognizes its 37% share of the joint venture\\u2019s results under \\\"Equity in the income of investees.\\\" Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues and a 95% increase in operating income compared to the prior-year quarter. The growth in operating income is attributed to improved results in Content Sales/Licensing and Direct-to-Consumer, partially offset by a decline in Linear Networks.\"\n    },\n    {\n      \"title\": \"OTHER FINANCIAL INFORMATION\",\n      \"start_index\": 9,\n      \"end_index\": 9,\n      \"nodes\": [\n        {\n          \"title\": \"Corporate and Unallocated Shared Expenses\",\n          \"start_index\": 9,\n          \"end_index\": 9,\n          \"node_id\": \"0022\",\n          \"summary\": \"The partial document provides a financial overview of revenues and operating income for Parks & Experiences, including Domestic, International, and Consumer Products segments, comparing the quarters ending December 28, 2024, and December 30, 2023. It highlights a 3% increase in overall revenue and stable operating income. Domestic Parks and Experiences were negatively impacted by Hurricanes Milton and Helene, leading to closures, cancellations, higher costs, and lower attendance, despite increased guest spending. International Parks and Experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, offset by higher costs. 
The document also notes a $152 million increase in corporate and unallocated shared expenses due to a legal settlement and a $143 million loss related to the Star India Transaction.\"\n        },\n        {\n          \"title\": \"Restructuring and Impairment Charges\",\n          \"start_index\": 9,\n          \"end_index\": 9,\n          \"node_id\": \"0023\",\n          \"summary\": \"The partial document provides financial performance details for the Parks & Experiences segment, including revenues and operating income for domestic and international operations, as well as consumer products. It highlights a 3% increase in overall revenue and stable operating income compared to the prior year. Domestic parks and experiences were negatively impacted by hurricanes, leading to lower volumes and higher costs, despite increased guest spending. International parks and experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, though costs also rose. Additionally, corporate and unallocated shared expenses increased due to a legal settlement, and a $143 million loss was recorded related to the Star India Transaction.\"\n        },\n        {\n          \"title\": \"Interest Expense, net\",\n          \"start_index\": 10,\n          \"end_index\": 10,\n          \"node_id\": \"0024\",\n          \"summary\": \"The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ending December 28, 2024, and December 30, 2023. Key points include:\\n\\n1. **Interest Expense, Net**: A decrease in interest expense due to lower average rates and debt balances, partially offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses compared to prior-year gains.\\n\\n2. 
**Equity in the Income of Investees**: An $89 million decrease in income from investees, primarily due to lower income from A+E and losses from the India joint venture.\\n\\n3. **Income Taxes**: An increase in the effective income tax rate from 25.1% to 27.8%, driven by a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects of employee share-based awards.\"\n        },\n        {\n          \"title\": \"Equity in the Income of Investees\",\n          \"start_index\": 10,\n          \"end_index\": 10,\n          \"node_id\": \"0025\",\n          \"summary\": \"The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ended December 28, 2024, and December 30, 2023. It highlights a decrease in net interest expense due to lower average rates and debt balances, offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses. Equity income from investees decreased significantly, driven by lower income from A+E and losses from the India joint venture. The effective income tax rate increased due to a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects.\"\n        },\n        {\n          \"title\": \"Income Taxes\",\n          \"start_index\": 10,\n          \"end_index\": 10,\n          \"node_id\": \"0026\",\n          \"summary\": \"The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ended December 28, 2024, and December 30, 2023. 
It highlights a decrease in net interest expense due to lower average rates and debt balances, offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses. Equity income from investees dropped significantly, driven by lower income from A+E and losses from the India joint venture. The effective income tax rate increased due to a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects.\"\n        },\n        {\n          \"title\": \"Noncontrolling Interests\",\n          \"start_index\": 11,\n          \"end_index\": 11,\n          \"node_id\": \"0027\",\n          \"summary\": \"The partial document covers two main points:\\n\\n1. **Noncontrolling Interests**: It discusses the net income attributable to noncontrolling interests, which decreased by 63% compared to the prior-year quarter. The decrease is attributed to the prior-year accretion of NBC Universal\\u2019s interest in Hulu. The calculation of net income attributable to noncontrolling interests is based on income after royalties, management fees, financing costs, and income taxes.\\n\\n2. **Cash from Operations**: It details cash provided by operations and free cash flow, showing an increase in cash provided by operations by $1.0 billion to $3.2 billion in the current quarter. The increase is driven by lower tax payments, higher operating income at Entertainment, and higher film and television production spending, along with the timing of payments for sports rights. 
Free cash flow decreased by $147 million compared to the prior-year quarter.\"\n        },\n        {\n          \"title\": \"Cash from Operations\",\n          \"start_index\": 11,\n          \"end_index\": 11,\n          \"node_id\": \"0028\",\n          \"summary\": \"The partial document covers two main points:\\n\\n1. **Noncontrolling Interests**: It discusses the net income attributable to noncontrolling interests, which decreased by 63% in the quarter ended December 28, 2024, compared to the prior-year quarter. The decrease is attributed to the prior-year accretion of NBC Universal\\u2019s interest in Hulu. The calculation of net income attributable to noncontrolling interests includes royalties, management fees, financing costs, and income taxes.\\n\\n2. **Cash from Operations**: It details cash provided by operations and free cash flow for the quarter ended December 28, 2024, compared to the prior-year quarter. Cash provided by operations increased by $1.0 billion, driven by lower tax payments, higher operating income at Entertainment, and higher film and television production spending, along with the timing of payments for sports rights. Free cash flow decreased by $147 million due to increased investments in parks, resorts, and other property.\"\n        },\n        {\n          \"title\": \"Capital Expenditures\",\n          \"start_index\": 12,\n          \"end_index\": 12,\n          \"node_id\": \"0029\",\n          \"summary\": \"The partial document provides details on capital expenditures and depreciation expenses for parks, resorts, and other properties. It highlights an increase in capital expenditures from $1.3 billion to $2.5 billion, primarily due to higher spending on cruise ship fleet expansion in the Experiences segment. 
The document also breaks down investments and depreciation expenses by category (Entertainment, Sports, Domestic and International Experiences, and Corporate) for the quarters ending December 28, 2024, and December 30, 2023. Depreciation expenses increased from $823 million to $909 million, with detailed figures provided for each segment.\"\n        },\n        {\n          \"title\": \"Depreciation Expense\",\n          \"start_index\": 12,\n          \"end_index\": 12,\n          \"node_id\": \"0030\",\n          \"summary\": \"The partial document provides details on capital expenditures and depreciation expenses for parks, resorts, and other properties. It highlights an increase in capital expenditures from $1.3 billion to $2.5 billion, primarily due to higher spending on cruise ship fleet expansion in the Experiences segment. The breakdown of investments and depreciation expenses is provided for Entertainment, Sports, Domestic and International Experiences, and Corporate segments for the quarters ending December 28, 2024, and December 30, 2023. Depreciation expenses also increased from $823 million to $909 million, with detailed segment-wise allocations.\"\n        }\n      ],\n      \"node_id\": \"0021\",\n      \"summary\": \"The partial document provides a financial overview of revenues and operating income for Parks & Experiences, including Domestic, International, and Consumer Products segments, comparing the quarters ending December 28, 2024, and December 30, 2023. It highlights a 3% increase in total revenue and stable operating income. Domestic Parks and Experiences were negatively impacted by Hurricanes Milton and Helene, leading to closures, cancellations, higher costs, and lower attendance, despite increased guest spending. International Parks and Experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, offset by increased costs. 
The document also notes a rise in corporate and unallocated shared expenses due to a legal settlement and a $143 million loss related to the Star India Transaction.\"\n    },\n    {\n      \"title\": \"THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF INCOME\",\n      \"start_index\": 13,\n      \"end_index\": 13,\n      \"node_id\": \"0031\",\n      \"summary\": \"The partial document provides a condensed consolidated statement of income for The Walt Disney Company for the quarters ended December 28, 2024, and December 30, 2023. It includes details on revenues, costs and expenses, restructuring and impairment charges, net interest expense, equity in the income of investees, income before income taxes, income taxes, and net income. It also breaks down net income attributable to noncontrolling interests and The Walt Disney Company. Additionally, it provides earnings per share (diluted and basic) and the weighted average number of shares outstanding (diluted and basic) for both periods.\"\n    },\n    {\n      \"title\": \"THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED BALANCE SHEETS\",\n      \"start_index\": 14,\n      \"end_index\": 14,\n      \"node_id\": \"0032\",\n      \"summary\": \"The partial document is a condensed consolidated balance sheet for The Walt Disney Company, comparing financial data as of December 28, 2024, and September 28, 2024. It details the company's assets, liabilities, and equity. Key points include:\\n\\n1. **Assets**: Breakdown of current assets (cash, receivables, inventories, content advances, and other assets), produced and licensed content costs, investments, property (attractions, buildings, equipment, projects in progress, and land), intangible assets, goodwill, and other assets. Total assets increased slightly from $196.2 billion to $197 billion.\\n\\n2. 
**Liabilities**: Includes current liabilities (accounts payable, borrowings, deferred revenue), long-term borrowings, deferred income taxes, and other long-term liabilities. Total liabilities remained relatively stable.\\n\\n3. **Equity**: Details Disney shareholders' equity, including common stock, retained earnings, accumulated other comprehensive loss, and treasury stock. Noncontrolling interests are also included. Total equity increased from $105.5 billion to $106.7 billion.\\n\\n4. **Overall Financial Position**: The balance sheet reflects a stable financial position with slight changes in assets, liabilities, and equity over the period.\"\n    },\n    {\n      \"title\": \"THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS\",\n      \"start_index\": 15,\n      \"end_index\": 15,\n      \"node_id\": \"0033\",\n      \"summary\": \"The partial document provides a condensed consolidated statement of cash flows for The Walt Disney Company for the quarters ended December 28, 2024, and December 30, 2023. It details cash flow activities categorized into operating, investing, and financing activities. Key points include:\\n\\n1. **Operating Activities**: Net income increased from $2,151 million in 2023 to $2,644 million in 2024. Other significant changes include variations in depreciation, deferred taxes, equity income, content costs, and changes in operating assets and liabilities, resulting in cash provided by operations of $3,205 million in 2024 compared to $2,185 million in 2023.\\n\\n2. **Investing Activities**: Investments in parks, resorts, and other properties increased significantly in 2024 ($2,466 million) compared to 2023 ($1,299 million), leading to higher cash used in investing activities.\\n\\n3. **Financing Activities**: The company saw a net cash outflow in financing activities, including commercial paper borrowings, stock repurchases, and debt reduction. 
In 2024, cash used in financing activities was $997 million, a significant improvement from $8,006 million in 2023.\\n\\n4. **Exchange Rate Impact**: Exchange rates negatively impacted cash in 2024 by $153 million, compared to a positive impact of $79 million in 2023.\\n\\n5. **Overall Cash Position**: The company\\u2019s cash, cash equivalents, and restricted cash decreased from $14,235 million at the beginning of the 2023 period to $5,582 million at the end of the 2024 period.\"\n    },\n    {\n      \"title\": \"DTC PRODUCT DESCRIPTIONS AND KEY DEFINITIONS\",\n      \"start_index\": 16,\n      \"end_index\": 16,\n      \"node_id\": \"0034\",\n      \"summary\": \"The partial document provides an overview of Disney's Direct-to-Consumer (DTC) product offerings, key definitions, and metrics. It details the availability of Disney+, ESPN+, and Hulu as standalone services or bundled offerings in the U.S., including Hulu Live TV + SVOD, which incorporates Disney+ and ESPN+. It explains the global reach of Disney+ in over 150 countries and the various purchase channels, including websites, third-party platforms, and wholesale arrangements. The document defines \\\"paid subscribers\\\" as those generating subscription revenue, excluding extra member add-ons, and outlines how subscribers are counted for multi-product offerings. 
It also describes the calculation of average monthly revenue per paid subscriber for Hulu, ESPN+, and Disney+, including revenue components like subscription fees, advertising, and add-ons, while noting differences in revenue allocation and the impact of wholesale arrangements on average revenue.\"\n    },\n    {\n      \"title\": \"NON-GAAP FINANCIAL MEASURES\",\n      \"start_index\": 17,\n      \"end_index\": 17,\n      \"nodes\": [\n        {\n          \"title\": \"Diluted EPS excluding certain items\",\n          \"start_index\": 17,\n          \"end_index\": 18,\n          \"node_id\": \"0036\",\n          \"summary\": \"The partial document discusses the use of non-GAAP financial measures, specifically diluted EPS excluding certain items (adjusted EPS), total segment operating income, and free cash flow. It explains that these measures are not defined by GAAP but are important for evaluating the company's performance. The document highlights that these measures should be reviewed alongside comparable GAAP measures and may not be directly comparable to similar measures from other companies. It provides details on the adjustments made to diluted EPS, including the exclusion of certain items affecting comparability and amortization of TFCF and Hulu intangible assets, to better reflect operational performance. The document also includes a reconciliation table comparing reported diluted EPS to adjusted EPS for specific quarters, showing the impact of excluded items such as restructuring charges and intangible asset amortization. 
Additionally, it notes the challenges in providing forward-looking GAAP measures due to unpredictable factors.\"\n        },\n        {\n          \"title\": \"Total segment operating income\",\n          \"start_index\": 19,\n          \"end_index\": 20,\n          \"node_id\": \"0037\",\n          \"summary\": \"The partial document focuses on the evaluation of the company's performance through two key financial metrics: total segment operating income and free cash flow. It explains that total segment operating income is used to assess the performance of operating segments separately from non-operational factors, providing insights into operational results. A reconciliation table is provided, showing the calculation of total segment operating income for two quarters, highlighting changes in various components such as corporate expenses, restructuring charges, and interest expenses. Additionally, the document discusses free cash flow as a measure of cash available for purposes beyond capital expenditures, such as debt servicing, acquisitions, and shareholder returns. A summary of consolidated cash flows and a reconciliation of cash provided by operations to free cash flow are presented, comparing figures for two quarters and highlighting changes in cash flow components.\"\n        },\n        {\n          \"title\": \"Free cash flow\",\n          \"start_index\": 20,\n          \"end_index\": 20,\n          \"node_id\": \"0038\",\n          \"summary\": \"The partial document provides a reconciliation of the company's consolidated cash provided by operations to free cash flow for the quarters ended December 28, 2024, and December 30, 2023. 
It highlights a $1,020 million increase in cash provided by operations, a $1,167 million increase in investments in parks, resorts, and other property, and a $147 million decrease in free cash flow.\"\n        }\n      ],\n      \"node_id\": \"0035\",\n      \"summary\": \"The partial document discusses the use of non-GAAP financial measures by the company, including diluted EPS excluding certain items (adjusted EPS), total segment operating income, and free cash flow. It explains that these measures are not defined by GAAP but are important for evaluating the company's performance. The document emphasizes that these measures should be reviewed alongside comparable GAAP measures and may not be directly comparable to similar measures from other companies. It highlights the company's inability to provide forward-looking GAAP measures or reconciliations due to uncertainties in predicting significant items. Additionally, the document details the rationale for excluding certain items and amortization of TFCF and Hulu intangible assets from diluted EPS to enhance comparability and provide a clearer evaluation of operational performance, particularly given the significant impact of the 2019 TFCF and Hulu acquisition.\"\n    },\n    {\n      \"title\": \"FORWARD-LOOKING STATEMENTS\",\n      \"start_index\": 21,\n      \"end_index\": 21,\n      \"node_id\": \"0039\",\n      \"summary\": \"The partial document outlines the inclusion of forward-looking statements in an earnings release, emphasizing that these statements are based on management's views and assumptions about future events and business performance. 
It highlights that actual results may differ materially due to various factors, including company actions (e.g., restructuring, strategic initiatives, cost rationalization), external developments (e.g., economic conditions, competition, consumer behavior, regulatory changes, technological advancements, labor market activities, and natural disasters), and their potential impacts on operations, profitability, content performance, advertising markets, and taxation. The document also references additional risk factors and analyses detailed in the company's filings with the SEC, such as annual and quarterly reports.\"\n    },\n    {\n      \"title\": \"PREPARED EARNINGS REMARKS AND CONFERENCE CALL INFORMATION\",\n      \"start_index\": 22,\n      \"end_index\": 22,\n      \"node_id\": \"0040\",\n      \"summary\": \"The partial document provides information about The Walt Disney Company's prepared management remarks and a conference call scheduled for February 5, 2025, at 8:30 AM EST/5:30 AM PST, accessible via a live webcast on their investor website. It also mentions that a replay of the webcast will be available on the site. Additionally, contact details for Corporate Communications (David Jefferson) and Investor Relations (Carlos Gomez) are provided.\"\n    }\n  ]\n}"
  },
  {
    "path": "tutorials/doc-search/README.md",
    "content": "\n\n## Document Search Examples\n\n\nPageIndex currently enables reasoning-based RAG within a single document by default.\nFor users who need to search across multiple documents, we provide three best-practice workflows for different scenarios below.\n\n* [**Search by Metadata**:](metadata.md) for documents that can be distinguished by metadata.\n* [**Search by Semantics**:](semantics.md) for documents that differ in semantic content or cover diverse topics.\n* [**Search by Description**:](description.md) a lightweight strategy for a small number of documents.\n\n\n## 💬 Support\n\n* 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)\n* 📨 [Contact Us](https://ii2abc2jejf.typeform.com/to/meB40zV0)"
  },
  {
    "path": "tutorials/doc-search/description.md",
    "content": "\n## Document Search by Description\n\nFor documents that don't have metadata, you can use LLM-generated descriptions to help with document selection. This is a lightweight approach that works best with a small number of documents.\n\n\n### Example Pipeline\n\n\n#### PageIndex Tree Generation\nUpload all documents into PageIndex to get their `doc_id` and tree structure.\n\n#### Description Generation\n\nGenerate a description for each document based on its PageIndex tree structure and node summaries.\n```python\nprompt = f\"\"\"\nYou are given a table of contents structure of a document. \nYour task is to generate a one-sentence description for the document that makes it easy to distinguish from other documents.\n    \nDocument tree structure: {PageIndex_Tree}\n\nDirectly return the description, do not include any other text.\n\"\"\"\n```\n\n#### Search with LLM\n\nUse an LLM to select relevant documents by comparing the user query against the generated descriptions.\n\nBelow is a sample prompt for document selection based on their descriptions:\n\n```python\nprompt = f\"\"\" \nYou are given a list of documents with their IDs, file names, and descriptions. Your task is to select documents that may contain information relevant to answering the user query.\n\nQuery: {query}\n\nDocuments: [\n    {{\n        \"doc_id\": \"xxx\",\n        \"doc_name\": \"xxx\",\n        \"doc_description\": \"xxx\"\n    }}\n]\n\nResponse Format:\n{{\n    \"thinking\": \"<Your reasoning for document selection>\",\n    \"answer\": <Python list of relevant doc_ids>, e.g. ['doc_id1', 'doc_id2']. 
Return [] if no documents are relevant.\n}}\n\nReturn only the JSON structure, with no additional output.\n\"\"\"\n```\n\n#### Retrieve with PageIndex\n\nUse the PageIndex `doc_id` of the retrieved documents to perform further retrieval via the PageIndex retrieval API.\n\n\n\n## 💬 Help & Community\nContact us if you need any advice on conducting document searches for your use case.\n\n- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)  \n- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0)"
  },
  {
    "path": "tutorials/doc-search/metadata.md",
    "content": "\n\n## Document Search by Metadata\n<callout>PageIndex with metadata support is in closed beta. Fill out this form to request early access to this feature.</callout>\n\nFor documents that can be easily distinguished by metadata, we recommend using metadata to search the documents.\nThis method is ideal for the following document types:\n- Financial reports categorized by company and time period\n- Legal documents categorized by case type\n- Medical records categorized by patient or condition\n- And many others\n\nIn such cases, you can search documents by leveraging their metadata. A popular method is to use \"Query to SQL\" for document retrieval.\n\n\n### Example Pipeline\n\n#### PageIndex Tree Generation\nUpload all documents into PageIndex to get their `doc_id`.\n\n#### Set up SQL tables\n\nStore documents along with their metadata and the PageIndex `doc_id` in a database table.\n\n#### Query to SQL\n\nUse an LLM to transform a user’s retrieval request into a SQL query to fetch relevant documents.\n\n#### Retrieve with PageIndex\n\nUse the PageIndex `doc_id` of the retrieved documents to perform further retrieval via the PageIndex retrieval API.\n\n## 💬 Help & Community\nContact us if you need any advice on conducting document searches for your use case.\n\n- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)  \n- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0)"
  },
  {
    "path": "tutorials/doc-search/semantics.md",
    "content": "## Document Search by Semantics\n\nFor documents that cover diverse topics, you can also use vector-based semantic search to select documents. The procedure is slightly different from the classic vector-search-based method.\n\n### Example Pipeline\n\n\n#### Chunking and Embedding\nDivide the documents into chunks, choose an embedding model to convert the chunks into vectors, and store each vector with its corresponding `doc_id` in a vector database.\n\n\n#### Vector Search\n\nFor each query, conduct a vector-based search to get the top-K chunks with their corresponding documents. \n\n#### Compute Document Score\n\nFor each document, calculate a relevance score. For a given document, let N be the number of content chunks associated with it, and let **ChunkScore**(n) be the relevance score of chunk n. The document score is computed as:\n\n\n$$\n\\text{DocScore}=\\frac{1}{\\sqrt{N+1}}\\sum_{n=1}^N \\text{ChunkScore}(n)\n$$\n\n- The sum aggregates relevance from all related chunks.\n- The +1 inside the square root keeps the denominator nonzero, so the formula also handles documents with zero chunks.\n- Using the square root in the denominator allows the score to increase with the number of relevant chunks, but with diminishing returns. This rewards documents with more relevant chunks, while preventing large documents from dominating due to quantity alone.\n- This scoring favors documents with fewer, highly relevant chunks over those with many weakly relevant ones.\n\n\n#### Retrieve with PageIndex\n\nSelect the documents with the highest DocScore, then use their `doc_id` to perform further retrieval via the PageIndex retrieval API.\n\n\n\n## 💬 Help & Community\nContact us if you need any advice on conducting document searches for your use case.\n\n- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)  \n- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0)"
  },
  {
    "path": "tutorials/tree-search/README.md",
    "content": "## Tree Search Examples\nThis tutorial provides a basic example of how to perform retrieval using the PageIndex tree.\n\n### Basic LLM Tree Search Example\nA simple strategy is to use an LLM agent to conduct tree search. Here is a basic tree search prompt.\n\n```python\nprompt = f\"\"\"\nYou are given a query and the tree structure of a document.\nYou need to find all nodes that are likely to contain the answer.\n\nQuery: {query}\n\nDocument tree structure: {PageIndex_Tree}\n\nReply in the following JSON format:\n{{\n  \"thinking\": <your reasoning about which nodes are relevant>,\n  \"node_list\": [node_id1, node_id2, ...]\n}}\n\"\"\"\n```\n<callout>\nIn our dashboard and retrieval API, we use a combination of LLM tree search and value function-based Monte Carlo Tree Search ([MCTS](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search)). More details will be released soon.\n</callout>\n\n### Integrating User Preference or Expert Knowledge\nUnlike vector-based RAG where integrating expert knowledge or user preference requires fine-tuning the embedding model, in PageIndex, you can incorporate user preferences or expert knowledge by simply adding knowledge to the LLM tree search prompt. Here is an example pipeline.\n\n\n#### 1. Preference Retrieval\n\nWhen a query is received, the system selects the most relevant user preference or expert knowledge snippets from a database or a set of domain-specific rules. This can be done using keyword matching, semantic similarity, or LLM-based relevance search.\n\n#### 2. 
Tree Search with Preference\nIntegrate the retrieved preference into the tree search prompt.\n\n**Enhanced Tree Search with Expert Preference Example**\n\n```python\nprompt = f\"\"\"\nYou are given a query and the tree structure of a document.\nYou need to find all nodes that are likely to contain the answer.\n\nQuery: {query}\n\nDocument tree structure: {PageIndex_Tree}\n\nExpert Knowledge of relevant sections: {Preference}\n\nReply in the following JSON format:\n{{\n  \"thinking\": <reasoning about which nodes are relevant>,\n  \"node_list\": [node_id1, node_id2, ...]\n}}\n\"\"\"\n```\n\n**Example Expert Preference**\n> If the query mentions EBITDA adjustments, prioritize Item 7 (MD&A) and footnotes in Item 8 (Financial Statements) in 10-K reports.\n\n\n\nBy integrating user or expert preferences, node search becomes more targeted and effective, leveraging both the document structure and domain-specific insights.\n\n## 💬 Help & Community\nContact us if you need any advice on conducting document searches for your use case.\n\n- 🤝 [Join our Discord](https://discord.gg/VuXuf29EUj)  \n- 📨 [Leave us a message](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)\n"
  }
]