Repository: datawhalechina/happy-llm
Branch: main
Commit: ee0a220acdfd
Files: 76
Total size: 1.1 MB

Directory structure:
gitextract_wk0u_t6d/

├── .github/
│   └── ISSUE_TEMPLATE/
│       ├── book_issue.yml
│       └── config.yml
├── .gitignore
├── Extra-Chapter/
│   ├── CDDRS/
│   │   ├── CDDRS.ipynb
│   │   └── readme.md
│   ├── Readme.md
│   ├── generation-method/
│   │   ├── llm_generation.py
│   │   ├── model_down.py
│   │   └── readme.md
│   ├── s1-vllm-thinking-budget/
│   │   ├── output/
│   │   │   ├── output_1754208752.txt
│   │   │   └── output_1754209653.txt
│   │   ├── readme.md
│   │   └── s1.py
│   ├── text-data-processing/
│   │   └── readme.md
│   ├── transformer-architecture/
│   │   └── readme.md
│   ├── vlm-concatenation-finetune/
│   │   └── README.md
│   └── why-fine-tune-small-large-language-models/
│       ├── readme.md
│       └── why-fine-tune-small-large-language-models.ipynb
├── LICENSE.txt
├── README.md
├── README_en.md
└── docs/
    ├── .nojekyll
    ├── README.md
    ├── _sidebar.md
    ├── chapter1/
    │   └── 第一章 NLP基础概念.md
    ├── chapter2/
    │   ├── code/
    │   │   ├── requirements.txt
    │   │   └── transformer.py
    │   └── 第二章 Transformer架构.md
    ├── chapter3/
    │   └── 第三章 预训练语言模型.md
    ├── chapter4/
    │   └── 第四章 大语言模型.md
    ├── chapter5/
    │   ├── code/
    │   │   ├── dataset.py
    │   │   ├── ddp_pretrain.py
    │   │   ├── ddp_sft_full.py
    │   │   ├── deal_dataset.py
    │   │   ├── download_dataset.sh
    │   │   ├── export_model.py
    │   │   ├── k_model.py
    │   │   ├── model_sample.py
    │   │   ├── requirements.txt
    │   │   ├── tokenizer_k/
    │   │   │   ├── special_tokens_map.json
    │   │   │   ├── tokenizer.json
    │   │   │   └── tokenizer_config.json
    │   │   ├── train_tokenizer.py
    │   │   └── windows_download_dataset.sh
    │   └── 第五章 动手搭建大模型.md
    ├── chapter6/
    │   ├── 6.4[WIP] 偏好对齐.md
    │   ├── code/
    │   │   ├── download_dataset.py
    │   │   ├── download_model.py
    │   │   ├── ds_config_zero2.json
    │   │   ├── finetune.py
    │   │   ├── finetune.sh
    │   │   ├── pretrain.ipynb
    │   │   ├── pretrain.py
    │   │   ├── pretrain.sh
    │   │   ├── process_dataset.ipynb
    │   │   ├── requirements.txt
    │   │   └── whole.ipynb
    │   ├── readme.md
    │   └── 第六章 大模型训练流程实践.md
    ├── chapter7/
    │   ├── Agent/
    │   │   ├── demo.py
    │   │   ├── requirements.txt
    │   │   ├── src/
    │   │   │   ├── __init__.py
    │   │   │   ├── core.py
    │   │   │   ├── tools.py
    │   │   │   └── utils.py
    │   │   └── web_demo.py
    │   ├── RAG/
    │   │   ├── .env_example
    │   │   ├── Embeddings.py
    │   │   ├── LLM.py
    │   │   ├── VectorBase.py
    │   │   ├── demo.py
    │   │   ├── requirements.txt
    │   │   └── utils.py
    │   └── 第七章 大模型应用.md
    ├── index.html
    └── 前言.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/ISSUE_TEMPLATE/book_issue.yml
================================================
name: Happy-LLM Textbook Issue Report
description: Report code/content issues in the Happy-LLM textbook
title: "[Issue] ChapterX.X: Brief description"
labels: ["documentation", "需要审核/needs-review"]
body:
  - type: markdown
    attributes:
      value: |
        🐛 **Please provide the following information to help us locate the issue**  

  - type: input
    id: chapter
    attributes:
      label: 1. Affected Chapter
      description: "Format example: Chapter3.2 or Chapter6.4"
      placeholder: "ChapterX.X"
    validations:
      required: true

  - type: textarea
    id: problem
    attributes:
      label: 2. Problem Description
      description: |
        Clearly describe the issue or error message
        - Code issue: What error occurred? Expected behavior?
        - Content issue: What is inaccurate/ambiguous?
      placeholder: |
        E.g.: XX error when running code...
        E.g.: Inaccurate description in Figure 3-5...
    validations:
      required: true

  - type: textarea
    id: reproduction
    attributes:
      label: 3. Reproduction Materials
      description: |
        🔧 Provide based on issue type:
        - Code issue: Full code snippet + error log
        - Content issue: Related paragraph/figure + suggested fix
        - Format issue: Screenshot + location description
      placeholder: |
        ```python
        # Paste full code for code issues
        print("reproduction code")
        ```
        
        **Error log**:
        ```
        Paste full error log
        ```
        
        **Suggested fix**:
        Your suggested modification...
    validations:
      required: true

  - type: checkboxes
    id: checks
    attributes:
      label: Verification
      description: Please confirm the following
      options:
        - label: This issue hasn't been reported before
          required: true

================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: false
contact_links:
  - name: 📚 Happy-LLM Tutorial Documentation
    url: https://github.com/datawhalechina/happy-llm
    about: View the complete Happy-LLM tutorial documentation and code examples
  - name: 🐛 Quick Feedback
    url: https://github.com/datawhalechina/happy-llm/issues/new?template=book_issue.yml
    about: Quickly report issues in the tutorial

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
.idea/
# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy
.history
# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.DS_Store
.obsidian

.claude/

================================================
FILE: Extra-Chapter/CDDRS/CDDRS.ipynb
================================================
{
 "cells": [
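  {
   "cell_type": "markdown",
   "id": "4d1f9d85",
   "metadata": {},
   "source": [
    "# Intelligent RAG Review System for Construction Documents\n",
    "\n",
    "A construction-document review system implemented from scratch, intended to help developers understand the core principles and implementation details of knowledge-guided retrieval for document review in specialized domains."
   ]
  },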
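  {
   "cell_type": "markdown",
   "id": "2248f779",
   "metadata": {},
   "source": [
    "## Project Motivation\n",
    "\n",
    "Compliance review of construction technical disclosure documents is a key step in ensuring the safety and cost-effectiveness of construction projects. Throughout the project life cycle, every operation must comply with the relevant code provisions to ensure the safety and sustainability of the project. However, the relevant reference material is often scattered across many project files, and traditional manual review struggles to handle the large and complex body of construction provisions. Manual review depends on the reviewer's experience and expertise, so it is subjective, time-consuming, and error-prone.\n",
    "\n",
    "With the development of large language model technology, LLMs offer new hope for automating construction document review. However, LLMs are usually trained on general-purpose corpora and lack construction-specific background knowledge, so they can hallucinate severely on complex reasoning problems in a construction context. A RAG approach based on vector similarity matching can supply LLMs with preliminary, similar reference knowledge, mitigating the high error rates that manual or rule-based review suffers when handling large volumes of construction text.\n",
    "\n",
    "However, conventional RAG has key limitations for reviewing specialized construction documents. Fixed-size chunking causes knowledge to be lost across chunk boundaries. Retrieval matches on an embedding of the whole query sentence, without identifying or weighing the query's fine-grained features, which makes retrieval inefficient. Construction technical disclosure documents describe in detail the characteristics and methods of construction processes, quality specifications, operating procedures, and safety protocols; they contain many knowledge details and are highly specialized. An intelligent system that can accurately understand and retrieve construction-domain knowledge is therefore needed.\n",
    "\n",
    "This project therefore proposes a generative, knowledge-guided review system for construction documents, aimed at improving the reliability and accuracy of review. The system has two core innovations. First, a dynamic semantic knowledge chunking strategy builds a knowledge base with better semantic coherence and completeness. Second, building on the enhanced knowledge representation, a generative knowledge-guided retrieval framework increases attention to fine-grained information during semantic embedding retrieval, improving the accuracy of reference retrieval and the reliability of corrections in construction document review tasks.\n",
    "\n",
    "Note that, due to space constraints, we cannot show the complete implementation. We will, however, explain each necessary implementation step and the reasoning behind it, so that you can quickly understand how to build an intelligent review system for construction documents."
   ]
  },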
  {
   "cell_type": "markdown",
   "id": "1f052666",
   "metadata": {},
   "source": [
    "## Preliminary Implementation\n",
    "\n",
    "Next, we will build an intelligent construction-document review system from scratch. We start with some basic preparation."
   ]
  },
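  {
   "cell_type": "markdown",
   "id": "76df1ef0",
   "metadata": {},
   "source": [
    "### 1. Implementing the LLM Module\n",
    "\n",
    "First we need to implement the LLM module, the most basic module in the system. We use the LLM to clean documents, extract information, and so on; part of the essence of this system is using an LLM to preprocess document information so that later retrieval is easier. Here we use the DeepSeek API."
   ]
  },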
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "13e0d4fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from abc import ABC, abstractmethod\n",
    "from typing import Any, Optional\n",
    "\n",
    "class BaseLLM(ABC):\n",
    "    \"\"\"Interface for large language models.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        model_name: str,\n",
    "        model_params: Optional[dict[str, Any]] = None,\n",
    "        **kwargs: Any,\n",
    "    ):\n",
    "        self.model_name = model_name\n",
    "        self.model_params = model_params or {}\n",
    "\n",
    "    @abstractmethod\n",
    "    def predict(self, input: str) -> str:\n",
    "        \"\"\"Sends a text input to the LLM and retrieves a response.\"\"\""
   ]
  },
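  {
   "cell_type": "markdown",
   "id": "d0783425",
   "metadata": {},
   "source": [
    "The code above is an abstract interface for calling an LLM, which lets us standardize how models are called. We subclass this base class to implement the actual call."
   ]
  },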
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b80f5410",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "from typing import Any, Optional\n",
    "\n",
    "class DeepSeekLLM(BaseLLM):\n",
    "    \"\"\"Implementation of the BaseLLM interface using DeepSeek API.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        model_name: str,\n",
    "        api_key: str,\n",
    "        base_url: str = \"https://api.deepseek.com/v1\",\n",
    "        model_params: Optional[dict[str, Any]] = None,\n",
    "        **kwargs: Any,\n",
    "    ):\n",
    "        super().__init__(model_name, model_params, **kwargs)\n",
    "        # DeepSeek exposes an OpenAI-compatible endpoint, so the OpenAI client is reused.\n",
    "        self.client = OpenAI(api_key=api_key, base_url=base_url)\n",
    "\n",
    "    def predict(self, input: str) -> str:\n",
    "        # Forward any optional request settings stored in model_params (empty by default).\n",
    "        response = self.client.chat.completions.create(\n",
    "            model=self.model_name,\n",
    "            messages=[{\"role\": \"user\", \"content\": input}],\n",
    "            **self.model_params,\n",
    "        )\n",
    "        # content can be None; fall back to an empty string to honor the str return type.\n",
    "        return response.choices[0].message.content or \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99e82db8",
   "metadata": {},
   "source": [
    "Once this is set up, we can test it by calling the predict method."
   ]
  },
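  {
   "cell_type": "code",
   "execution_count": null,
   "id": "910377cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Of course! I'd be happy to help you review construction documents. To give you more accurate advice, though, I need some specifics:\n",
      "\n",
      "**Please provide the following details (share whatever applies):**\n",
      "1. **Document type**: construction drawings, technical specifications, contract documents, design notes, a bill of quantities (BOQ), or something else?\n",
      "2. **Specific needs**: which aspects should be reviewed? (e.g. standardization, compliance, technical details, data consistency, errors and omissions)\n",
      "3. **Project type**: residential, commercial, industrial facility, or infrastructure?\n",
      "4. **Focus**: are there specific standards to meet (e.g. national standards, green building certification, safety codes)?\n",
      "\n",
      "---\n",
      "\n",
      "### Common review areas I can help with:\n",
      "1. **Format and completeness**  \n",
      "   - Check that the table of contents, numbering scheme, and figure/table labels are clear and consistent.  \n",
      "   - Confirm that key sections (e.g. design notes, material specifications, construction requirements) are all present.\n",
      "\n",
      "2. **Compliance notes** (please provide the name of the local code)  \n",
      "   - Basic codes for fire evacuation, accessibility, structural loads, etc. (e.g. China's Code for Fire Protection Design of Buildings, GB 50016).  \n",
      "   - Environmental or energy-efficiency requirements (e.g. LEED, BREEAM, or China's green building standards).\n",
      "\n",
      "3. **Logical consistency**  \n",
      "   - Whether dimensions and annotations conflict among plans, elevations, and sections.  \n",
      "   - Whether the design notes match the technical details (e.g. material specifications versus construction methods).\n",
      "\n",
      "4. **Common error checks**  \n",
      "   - Unit errors (e.g. mm confused with cm), missing annotations, legend errors, etc.  \n",
      "   - Whether items in the bill of quantities (BOQ) match the quantities in the drawings.\n",
      "\n",
      "5. **Terminology and wording**  \n",
      "   - Suggest more professional wording or standardized terminology.\n",
      "\n",
      "---\n",
      "\n",
      "### Notes:\n",
      "- I am an AI and cannot replace review by a professional engineer or lawyer, but I can offer preliminary reference opinions.  \n",
      "- For content involving structural safety or mandatory regulations, be sure to have a licensed professional verify it.  \n",
      "- You can send specific passages or describe the problem directly, and I will do my best to help!\n",
      "\n",
      "Please provide more details and I will give you targeted advice.\n"
     ]
    }
   ],
   "source": [
    "llm = DeepSeekLLM(\n",
    "    model_name=\"deepseek-chat\",\n",
    "    api_key=\"your-api-key\",\n",
    "    base_url=\"https://api.deepseek.com/v1\"\n",
    ")\n",
    "print(llm.predict(\"Hello, can you help me review construction documents?\"))"
   ]
  },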
  {
   "cell_type": "markdown",
   "id": "d8506fa9",
   "metadata": {},
   "source": [
    "Once the LLM replies correctly, this module is complete."
   ]
  },
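  {
   "cell_type": "markdown",
   "id": "8d049de7",
   "metadata": {},
   "source": [
    "### 2. Implementing the Embedding Module\n",
    "\n",
    "Besides calling the LLM, we also need an Embedding module, which converts text into vectors. Representing document information as vectors lets us measure how similar a document is to a query by the similarity of their vectors, and so recall the documents most helpful for answering the user's question.\n",
    "\n",
    "The Embedding module is built in much the same way as the LLM module."
   ]
  },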
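  {
   "cell_type": "markdown",
   "id": "emb-similarity-sketch-note",
   "metadata": {},
   "source": [
    "The retrieval idea described above, ranking documents by how similar their vectors are to the query vector, can be sketched as follows. This is a minimal illustration, not the project's retrieval code: it assumes cosine similarity as the metric, and the helper names `cosine_similarity` and `top_k` as well as the toy 3-dimensional vectors are hypothetical stand-ins for real embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "emb-similarity-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def cosine_similarity(a, b):\n",
    "    # Cosine of the angle between two vectors; defined as 0.0 if either has zero length.\n",
    "    a = np.asarray(a, dtype=float)\n",
    "    b = np.asarray(b, dtype=float)\n",
    "    denom = np.linalg.norm(a) * np.linalg.norm(b)\n",
    "    return float(a @ b / denom) if denom else 0.0\n",
    "\n",
    "def top_k(query_vec, doc_vecs, k=2):\n",
    "    # Score every document against the query and return the k best (index, score) pairs.\n",
    "    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]\n",
    "    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)\n",
    "    return [(i, scores[i]) for i in order[:k]]\n",
    "\n",
    "# Toy vectors standing in for real embeddings (hypothetical data).\n",
    "doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]\n",
    "query_vec = [1.0, 0.1, 0.0]\n",
    "print(top_k(query_vec, doc_vecs, k=2))"
   ]
  },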
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "40f7b8da",
   "metadata": {},
   "outputs": [],
   "source": [
    "from abc import ABC, abstractmethod\n",
    "from typing import List, Any, Optional\n",
    "\n",
    "import numpy as np\n",
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "\n",
    "class BaseEmb(ABC):\n",
    "    \"\"\"Interface for text embedding models.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        model_name: str,\n",
    "        model_params: Optional[dict[str, Any]] = None,\n",
    "        **kwargs: Any,\n",
    "    ):\n",
    "        self.model_name = model_name\n",
    "        self.model_params = model_params or {}\n",
    "\n",
    "    @abstractmethod\n",
    "    def get_emb(self, text: str) -> List[float]:\n",
    "        \"\"\"Sends a text input to the embedding model and retrieves the embedding.\"\"\"\n",
    "        pass\n",
    "\n",
    "class BGEEmbedding(BaseEmb):\n",
    "    \"\"\"BaseEmb implementation backed by a HuggingFace model (BAAI/bge-m3 by default).\"\"\"\n",
    "\n",
    "    def __init__(self, model_name: str = \"BAAI/bge-m3\", **kwargs):\n",
    "        super().__init__(model_name=model_name, **kwargs)\n",
    "        self.embed_model = HuggingFaceEmbedding(\n",
    "            model_name=model_name,\n",
    "            trust_remote_code=True,\n",
    "            cache_folder=\"./model_cache\"\n",
    "        )\n",
    "\n",
    "    def get_emb(self, text: str) -> List[float]:\n",
    "        # Embed a single string.\n",
    "        return self.embed_model.get_text_embedding(text)\n",
    "\n",
    "    def encode(self, texts, show_progress_bar=False):\n",
    "        # Accept one string or a list of strings; return an array of shape (n_texts, dim).\n",
    "        # show_progress_bar is accepted but not used.\n",
    "        if isinstance(texts, str):\n",
    "            texts = [texts]\n",
    "        return np.array([self.get_emb(text) for text in texts])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60f3c300",
   "metadata": {},
   "source": [
    "Once this is set up, we can test it by calling the get_emb method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "bf21d439",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-0.008159242570400238, 0.005754395853728056, -0.024663547053933144, 0.0038987270090729, 0.0221844669431448, 0.0007059215568006039, 0.004481502342969179, -0.0015061416197568178, -0.022438108921051025, -0.0030729181598871946, -0.028854981064796448, 0.0223580002784729, -0.014190766029059887, -0.02054421603679657, -0.0019247722811996937, -0.041896067559719086, 0.013186885975301266, -0.04618095979094505, 0.009962386451661587, -0.002442343160510063, -0.013730430044233799, -0.0670018196105957, -0.028913620859384537, -0.01982208900153637, 0.02447838895022869, 0.03258971497416496, -0.03164232149720192, -0.002734268084168434, -0.020909657701849937, 0.05742897093296051, 0.021285204216837883, -0.02225871942937374, 0.07097937166690826, -0.004255959764122963, 0.021880434826016426, -0.04358939453959465, 0.047468990087509155, -0.0665409043431282, -0.05259682610630989, 0.004260585643351078, -0.010267225094139576, -0.03244507685303688, 0.0223003588616848, -0.019258055835962296, -0.0003713485202752054, -0.04361735284328461, 0.026206018403172493, -0.03023737110197544, 0.013313274830579758, 0.0002251705009257421, 0.008343675173819065, -0.030304472893476486, -0.004975242540240288, -0.026585601270198822, 0.011065107770264149, 0.017322678118944168, 0.02283507212996483, -0.0435573048889637, -0.05924893915653229, -0.022334391251206398, -0.030627364292740822, 0.029643332585692406, -0.004384420812129974, -0.001198050333186984, 0.03390760347247124, 0.01960756443440914, -0.005116869695484638, -0.04911072552204132, -0.007285997737199068, -0.06285838782787323, 0.02450537495315075, -0.02004382573068142, -0.010927724651992321, 0.028427692130208015, -0.04490334168076515, -0.011161834001541138, 0.005235197953879833, -0.01957201212644577, -0.04021269828081131, -0.02049367129802704, -0.03427192196249962, 0.050749681890010834, 0.026393454521894455, 0.02682250551879406, -0.008772539906203747, 0.025520754978060722, -0.010053625330328941, 0.03046583943068981, 0.015397612005472183, 
-0.0509209930896759, -0.0539475716650486, 0.004840736277401447, 0.03514740616083145, -0.0034363295417279005, -0.055750150233507156, -0.0026254449039697647, -0.004683077801018953, -0.03790610656142235, 0.03699930012226105, -0.0014170100912451744, 0.007965784519910812, -0.051486629992723465, 0.028585148975253105, -0.019408004358410835, 0.03357052803039551, -0.02611122652888298, 0.016895202919840813, 0.005833339411765337, -0.005644122138619423, -0.013383697718381882, 0.015408149920403957, -0.011529901996254921, 0.033537622541189194, 0.006104541476815939, 0.0181846022605896, -0.025680096819996834, 0.009130419231951237, -0.022236522287130356, -0.03444993868470192, 0.002107771812006831, 0.06445536017417908, -0.015476440079510212, 0.04229044169187546, -0.061122022569179535, 0.014955008402466774, -0.0019445978105068207, -0.003454635851085186, 0.04888900741934776, 0.0026477354113012552, 0.03745702654123306, 0.002253544284030795, 0.016443034633994102, -0.00763604324311018, -0.016121556982398033, -0.028234101831912994, 0.02534499578177929, -0.002640634309500456, 0.04827441647648811, 0.016369467601180077, -0.026182588189840317, 0.05507376044988632, 0.008636887185275555, -0.02046351693570614, -0.019356289878487587, 0.01482913177460432, -0.019503336399793625, 0.004169851075857878, -0.026032783091068268, 0.030682329088449478, -0.04218831658363342, 0.0005025395657867193, 0.0010827184887602925, 0.04963816702365875, 0.02726980298757553, 0.02827153541147709, 0.0007039656047709286, -0.039251312613487244, -0.009724192321300507, 0.04078090563416481, -0.016981666907668114, 0.013814247213304043, 0.0032257679849863052, -0.0013130503939464688, 0.014551229774951935, 0.010753488168120384, -0.00707739032804966, -0.0038439284544438124, -0.005789074581116438, -0.04882612079381943, 0.033391907811164856, -0.016739539802074432, -0.05458156019449234, -0.005394328851252794, -0.015763062983751297, 0.013416395522654057, -0.0021539125591516495, 0.08293414115905762, 0.030252447351813316, 
-0.017012350261211395, -0.042549606412649155, -0.01920093223452568, 0.025530759245157242, -0.02751781791448593, 0.032338351011276245, -0.0034977809991687536, 0.010373717173933983, 0.024666251614689827, 0.02892077900469303, 0.015242754481732845, 0.014347690157592297, -0.01561523787677288, -0.02215687930583954, 0.06355950981378555, -0.03061477281153202, 0.013036615215241909, 0.007974004372954369, 0.04307020828127861, 0.04244697839021683, 0.055088672786951065, 0.010855956934392452, 0.020225398242473602, -0.0047472394071519375, 0.049362748861312866, -0.014738965779542923, -0.0182197168469429, 0.0019958424381911755, -0.0336240753531456, -0.030366841703653336, 0.019157813861966133, -0.04555022343993187, -0.01577058620750904, -0.01285738218575716, 0.048405565321445465, 0.03940584138035774, -0.03844134137034416, 0.023708857595920563, -0.025616571307182312, 0.0037186769768595695, -0.014304162934422493, 0.00843740813434124, -0.012598640285432339, -0.01618318073451519, 0.010255432687699795, -0.05655519291758537, 0.034144721925258636, 0.0007591137546114624, 0.016931485384702682, 0.011007283814251423, 0.02576807513833046, -0.04384667053818703, 0.01992262899875641, 0.01989668421447277, -0.00542093813419342, -0.03282667696475983, 0.05913662537932396, 0.04952588677406311, 0.009882078506052494, 0.018577586859464645, 0.05728962644934654, -0.010044380091130733, 0.012747026048600674, -0.06368867307901382, 0.044708989560604095, -0.010439618490636349, -0.025499198585748672, -0.02203153818845749, 0.01491384394466877, 0.02911507338285446, -0.03340592607855797, -0.027025623247027397, 0.0016051143174991012, -0.024251878261566162, 0.014879211783409119, -0.004962841048836708, 0.03983527794480324, 0.01239701546728611, -0.00918954610824585, -0.011987953446805477, 0.048828013241291046, 0.0033441183622926474, 0.033875368535518646, 0.038586921989917755, 0.00328968558460474, 0.025484472513198853, 0.04570803418755531, -0.007011420093476772, -0.016749851405620575, -0.02737160213291645, 
0.0074218763038516045, 0.013574519194662571, 0.008099310100078583, 0.019064737483859062, 0.02236461639404297, -0.021313967183232307, -0.010747293941676617, 0.03744882345199585, -0.0050834789872169495, 0.003102300688624382, 0.06860381364822388, -0.00735899293795228, -0.0007147222640924156, -0.05509135127067566, -0.011860791593790054, 0.00011540263221831992, -0.0013021506601944566, -0.05093668773770332, -0.020819270983338356, -0.04814756289124489, -0.002467456506565213, -0.05456923320889473, 0.0015354141360148787, -0.048741698265075684, 0.08611465245485306, -0.030389651656150818, 0.020445002242922783, 0.015324302949011326, -0.005680671893060207, -0.15097397565841675, -0.0018968378426507115, -0.01568077690899372, 0.025341948494315147, 0.0059852986596524715, -0.0011712362756952643, -0.023526951670646667, -0.03041907772421837, -0.04523899406194687, 0.00013821253378409892, -0.06493302434682846, -0.06562570482492447, 0.009819632396101952, -0.00178980128839612, 0.05128961801528931, -0.0014715674333274364, -0.04730747640132904, 0.021933168172836304, -0.010594665072858334, -0.028410566970705986, -0.018882660195231438, -0.00906360987573862, 0.06769031286239624, 0.02508336678147316, 0.03553688898682594, -0.020766498520970345, 0.07419787347316742, -0.02149971015751362, -0.024415716528892517, 0.013685734942555428, -0.016621537506580353, 0.008740579709410667, -0.008354954421520233, 0.007627969607710838, 0.04790123179554939, -0.0035329859238117933, 0.008899709209799767, -0.02104680798947811, -0.025366663932800293, 0.029143797233700752, 0.019871309399604797, 0.015991905704140663, -0.01845811866223812, -0.0166324395686388, 0.024127844721078873, -0.05731573700904846, 0.028131384402513504, -0.01046037208288908, 0.001185560249723494, -0.0060633160173892975, -0.044790688902139664, 0.010853624902665615, -0.004852153826504946, -0.00883186049759388, -0.06479669362306595, -0.013680394738912582, -0.03005005232989788, 0.01836826652288437, -0.02213100716471672, -0.014201287180185318, 
0.015034623444080353, -0.08682654798030853, -0.005147515796124935, -0.02640114352107048, 0.018159715458750725, -0.0361277274787426, -0.012911590747535229, -0.00467587960883975, -0.009012139402329922, -0.010229379869997501, -0.013843361288309097, -0.03788858652114868, -0.042526692152023315, -0.04254135489463806, 0.01150231808423996, 0.02659926377236843, -0.018250424414873123, 0.04497944936156273, -0.00014919576642569155, -0.09051945805549622, 0.011776473373174667, 0.007612489629536867, 0.0015711288433521986, 0.004494070075452328, -0.03757219389081001, 0.007738599553704262, 0.011990596540272236, 0.028499284759163857, 0.022048959508538246, 0.1985333412885666, -0.04426541179418564, -0.037347856909036636, -0.027941251173615456, 0.0010355062549933791, -0.019851993769407272, 0.025895407423377037, 0.020690426230430603, 0.0013027037493884563, -0.03536158800125122, -0.06401663273572922, 0.02169177308678627, 0.027167029678821564, -0.00950489193201065, 0.0383734405040741, -0.026381129398941994, -0.016536174342036247, 0.025340868160128593, 0.09294909983873367, -0.013789238408207893, 0.01720455102622509, 0.012751719914376736, 0.0008590622455812991, 0.007423927541822195, -0.05435327813029289, -0.03942925110459328, -0.0157479178160429, 0.04002939537167549, -0.037848036736249924, 0.005167141556739807, -0.004750934895128012, 0.027016988024115562, 0.03793930634856224, -0.0021914951503276825, -0.03640374541282654, 0.043172162026166916, -0.017139365896582603, 0.02823808044195175, -0.05737394094467163, 0.048020925372838974, 0.004737760405987501, -0.01879766955971718, -0.02250038832426071, -0.0013258701656013727, -0.017133887857198715, -0.024853866547346115, -0.005083943717181683, 0.013188464567065239, -0.016682926565408707, 0.018507882952690125, -0.014458547346293926, 0.025885671377182007, -0.02617962658405304, 0.026932578533887863, -0.034904152154922485, 0.042742617428302765, -0.029168149456381798, -0.026066835969686508, -0.00531662767753005, 0.046957630664110184, 0.033596962690353394, 
-0.01823245733976364, -0.035914432257413864, 0.012456627562642097, 0.02694978378713131, -0.0066740200854837894, -0.03440346196293831, -0.013075420632958412, 0.031181251630187035, 0.035029660910367966, 0.06028195098042488, -0.023016994819045067, -0.019278833642601967, 0.004756495822221041, 0.03375618904829025, 0.008311769925057888, 0.030073100700974464, 0.0760876014828682, -0.00802715215831995, 0.02579597197473049, 0.0006852707592770457, 0.014750344678759575, 0.0017241452587768435, 0.005290407687425613, 0.011468152515590191, 0.027898697182536125, -0.06464944034814835, 0.004964817781001329, 0.02021591179072857, 0.012583546340465546, -0.01791338622570038, -0.0024353712797164917, -0.022329390048980713, 0.0021304956171661615, 0.0036976851988583803, 0.03899001330137253, 0.011300807818770409, -0.014965500682592392, -0.046065475791692734, -0.03753596916794777, 0.027129262685775757, -0.024559300392866135, -0.02639039047062397, 0.022491294890642166, 0.009033432230353355, -0.008362609893083572, 0.0001487734989495948, 0.039778005331754684, 0.014077797532081604, 0.05170590430498123, -0.022254720330238342, 0.016157470643520355, 0.002221450675278902, -0.014121792279183865, -0.029643535614013672, 0.002873690566048026, -0.037266165018081665, 0.014971030876040459, 0.061250921338796616, 0.04801003262400627, -0.010305630974471569, 0.07863421738147736, -0.028787901625037193, -0.0012510212836787105, -0.05323777347803116, -0.0009261402883566916, -0.05675649270415306, -0.03329698368906975, -0.0021774962078779936, 0.06076507270336151, 0.02608843520283699, -0.013583504594862461, -0.02154897153377533, -0.013955802656710148, 0.05170121416449547, 0.007481550797820091, -0.004648298025131226, 0.051262687891721725, 0.000945074949413538, 0.022759323939681053, -0.010619102045893669, -0.012822180055081844, -0.030171940103173256, -0.01176955085247755, -0.016073077917099, -0.01429969072341919, -0.01441648043692112, -0.01920558325946331, -0.002067919122055173, -0.062922403216362, -0.007660774048417807, 
0.00793564971536398, -0.007822266779839993, -0.06949237734079361, 0.04801243171095848, 0.0036780438385903835, -0.023699555546045303, -0.007212137337774038, -2.1765999917988665e-05, -0.0345299057662487, 0.02497580088675022, -0.014896790497004986, -0.008217303082346916, 0.07821899652481079, 0.047570597380399704, 0.015609372407197952, 0.004776008892804384, 0.012515947222709656, 0.011383699253201485, 0.005764748901128769, 0.05192238837480545, -0.007776946295052767, -0.06453675776720047, 0.04706624895334244, 0.08395755290985107, -0.01908132992684841, -0.11242877691984177, -0.011603251099586487, 0.03921223804354668, 0.027820877730846405, 0.042384177446365356, -0.0036653948482125998, -0.01978554204106331, -0.022139208391308784, -0.014130666851997375, -0.011148289777338505, 0.024651428684592247, -0.021974893286824226, 0.04416286572813988, -0.025216246023774147, -0.02720513381063938, 0.08111189305782318, -0.00949477031826973, -0.005520079284906387, 0.048297442495822906, -0.011168230324983597, -0.02132153883576393, -0.013992955908179283, -0.026559675112366676, 0.026964262127876282, -0.0038850067649036646, 0.0009995036525651813, 0.014295326545834541, 0.024056699126958847, -0.021158399060368538, -0.02968648448586464, -0.046675484627485275, 0.010086658410727978, -0.04817875474691391, 0.0017965140286833048, 0.010042015463113785, 0.012909766286611557, 0.010214140638709068, 0.008607500232756138, -0.0408991314470768, 0.014573400840163231, -0.025889338925480843, 0.04696730896830559, 0.0545382983982563, -0.04756659269332886, 0.004670929163694382, 0.025322075933218002, 0.04067161679267883, -0.030523329973220825, 0.023255759850144386, 0.008356871083378792, 0.05639335513114929, -0.02657676860690117, 0.01303358469158411, -0.001974142389371991, 0.005305653903633356, -0.01812334917485714, -0.014432300813496113, -0.04239063709974289, 0.035735342651605606, -0.0253747571259737, -0.04058588296175003, 0.02939145639538765, -0.014323804527521133, -0.014731739647686481, -0.00027179718017578125, 
-0.004914026241749525, 0.04327885061502457, 0.013477833941578865, 0.023266315460205078, 0.024892501533031464, -0.020088836550712585, 0.017984040081501007, -0.004191387444734573, 0.032651517540216446, 0.009538357146084309, -0.0042325942777097225, 0.022759458050131798, -0.0028504966758191586, -0.031356580555438995, -0.0017651666421443224, -0.0017029891023412347, -0.011442538350820541, -0.010529697872698307, 0.000798829656559974, -0.02697906456887722, -0.012902587652206421, -0.0326249897480011, 0.020969677716493607, -0.002441784832626581, -0.049833089113235474, 0.017128603532910347, -0.01148754172027111, -0.04498909413814545, 0.009785826317965984, -0.026679018512368202, -0.011309775523841381, -0.03883408382534981, -0.0032419587951153517, -0.07190737873315811, -0.0025989296846091747, -0.07983973622322083, 0.027425453066825867, -0.039434172213077545, 0.000113294692710042, 0.043311141431331635, 0.008025635965168476, -0.009404486045241356, 0.03524414449930191, -0.0018946131458505988, -0.027102893218398094, -0.005990078207105398, -0.01817081868648529, -0.007102642208337784, 0.004550714511424303, 0.03653610870242119, 0.011947095394134521, -0.022481150925159454, -0.02041434496641159, 0.001818244461901486, 0.06134777143597603, -0.035601906478405, 0.035721637308597565, -0.045185357332229614, 0.02865186519920826, -0.019968224689364433, 0.012533607892692089, -0.05320208519697189, 0.0040062167681753635, 0.0030551603995263577, 0.027050429955124855, 0.005466382950544357, 0.03785153105854988, 0.016277272254228592, -0.008986697532236576, 0.012693117372691631, -0.08303315937519073, 0.036484960466623306, -0.028784997761249542, 0.006555391009896994, -0.01160408928990364, 0.016153018921613693, -0.040073759853839874, 0.010013018734753132, 0.03954056277871132, -0.011690113693475723, 0.01969950459897518, -0.013315856456756592, -0.016824085265398026, 0.0038188984617590904, -0.006109373643994331, -0.017293602228164673, 0.04615958407521248, -0.00044023891678079963, 0.007981272414326668, 
-0.05818289890885353, -0.0026455027982592583, -0.01792796142399311, 0.0014871961902827024, 0.03274844214320183, 0.04940430447459221, 0.0017527458257973194, -0.03890381380915642, 0.05372945964336395, 0.003851843299344182, 0.02785513736307621, -0.022414304316043854, 0.0451260544359684, -0.008142136037349701, -0.008295497857034206, -0.058595504611730576, 0.0007978644571267068, -0.030595725402235985, -0.022097865119576454, 0.014339243061840534, -0.00782725028693676, 0.0245810579508543, -0.0063614631071686745, -0.055110398679971695, -0.010170851834118366, 0.03121359832584858, -0.004042973276227713, 0.00584474578499794, 0.026070119813084602, -0.009010711684823036, 0.03935391083359718, 0.00025671112234704196, 0.014945556409657001, -0.029459843412041664, -0.023990990594029427, -0.038410481065511703, 0.0025528273545205593, -0.004377412144094706, -0.00675099715590477, 0.006689084693789482, 0.0026730282697826624, 0.01695932261645794, -0.022455744445323944, -0.014170611277222633, 0.018368937075138092, 0.05317522585391998, -0.019511310383677483, -0.04786750674247742, 0.011356540024280548, 0.036167509853839874, -0.018453096970915794, -0.013108968734741211, -0.003705380717292428, 0.0031310259364545345, 0.01957552507519722, 0.022996235638856888, -0.04692375287413597, 0.010523623786866665, 0.0151978749781847, 0.009586344473063946, -0.009694855660200119, 0.06290755420923233, -0.020792188122868538, -0.045424968004226685, -0.15302127599716187, -0.0014960793778300285, 0.016392981633543968, -0.007223495282232761, -0.012082898057997227, -0.02785315550863743, 0.02535806968808174, 0.0009754220372997224, -0.018229473382234573, -0.00011553549120435491, -0.016246603801846504, 0.018446525558829308, 0.012868187390267849, 0.015233148820698261, 0.008403956890106201, -0.0035992716439068317, -0.042287036776542664, 0.018119484186172485, 0.0014472078764811158, 0.006772890686988831, -0.037261154502630234, 0.04294487088918686, -0.04072646424174309, 0.008717454969882965, 0.014839038252830505, 
0.002356856595724821, -0.028509268537163734, -0.0587744303047657, -0.007323797792196274, -0.07020425796508789, -0.020460154861211777, 0.0013530286960303783, 0.010395233519375324, 0.03011184185743332, 0.021195784211158752, -0.010125909931957722, 0.019076786935329437, -0.025666862726211548, 0.012932072393596172, 0.003098204266279936, -0.02096043899655342, 0.030774418264627457, -0.007811218500137329, -0.0074181207455694675, 0.013342125341296196, 0.02827904373407364, 0.02116740681231022, -0.011440825648605824, -0.05386372283101082, 0.02591989003121853, -0.055587124079465866, -0.0024650339037179947, -0.10242019593715668, -0.009212649427354336, -0.013324497267603874, 0.0174571480602026, 0.004072034731507301, -0.015411539934575558, -0.052660007029771805, 0.03861368075013161, -0.015914669260382652, 0.028986552730202675, -0.014176104217767715, -0.029218900948762894, -0.007927855476737022, 0.01631328836083412, -0.048870109021663666, 0.03039037063717842, 0.04374617710709572, -0.04552233964204788, -0.04128720611333847, 0.01328893844038248, -0.01783362776041031, 0.021039430052042007, 0.020082302391529083, -0.0026240693405270576, 0.014590953476727009, 0.0021560601890087128, 0.015555495396256447, -0.0180243831127882, 0.00387466698884964, -0.04059351608157158, -0.017667148262262344, 0.013516885228455067, 0.06464220583438873, -0.0053078969940543175, 0.020719559863209724, 0.027941640466451645, 0.055165331810712814, -0.02014738693833351, -0.045160628855228424, -0.020908063277602196, -0.008618908002972603, 0.024866502732038498, -0.013092943467199802, 0.01663080044090748, 0.015296213328838348, -0.0397157222032547, -0.01881938986480236, -0.02327280305325985, 0.037480369210243225, 0.020219353958964348, 0.00946501549333334, 0.03273487836122513, -0.04057733342051506, 0.04875422641634941, 0.04313022643327713, -0.0031367503106594086, 0.01641390472650528, -0.0166687723249197, 0.0052114990539848804, -0.006858861073851585, -0.0009919543517753482, 0.011691082268953323, -0.026969704777002335, 
0.027193116024136543, -0.026623165234923363, -0.009551705792546272, -0.0812152624130249, 0.01891920529305935, -0.012630821205675602, 0.03687284514307976, -0.02955334447324276, -0.04365872964262962, -0.06358622014522552, -0.004390461836010218, 0.016415808349847794, 0.013363885693252087, 0.015943197533488274, -0.04743891581892967, 0.029777981340885162, -0.023721372708678246, -0.017798207700252533, 0.010984212160110474, 0.014349219389259815, -0.008148356340825558, -0.02200779691338539, 0.056212782859802246, -0.030449382960796356, -0.038457322865724564, -0.03935403749346733, 0.012095058336853981, 0.007333962246775627, -0.010145832784473896, 0.04818291217088699, -0.006354400888085365, 0.014433770440518856, 0.013005473650991917, 0.022247975692152977, -0.043795693665742874, 0.03307293355464935, -0.01626189611852169, -0.03347942978143692, 0.030900413170456886, 0.01975036785006523, 0.0037053741980344057, 0.054877035319805145, 0.00715956324711442, -0.02806491032242775, 0.043243926018476486, 0.027810178697109222, -0.026569802314043045, -0.03132997825741768, -0.027514895424246788, -0.016161726787686348, 0.004605554975569248, -0.019134240224957466, 0.03173599764704704, -0.05500989779829979, -0.006796687841415405, 0.02597261592745781, -0.006349897477775812, -0.00793306715786457, 0.03255782648921013, 0.023534180596470833, 0.034416571259498596, 0.02472779154777527, 0.008446929045021534, 0.04494461789727211, 0.05273646116256714, 0.027852870523929596, 0.042726922780275345, -0.012875341810286045, 0.014557878486812115, -0.006294076796621084, 0.0563679076731205, 0.02326614409685135, 0.0435221828520298, -0.03701339662075043, -0.03432697057723999, 0.027957096695899963, 0.015457337722182274, 0.006810764316469431, 0.027425691485404968, 0.006900117266923189, -0.012477259151637554, 0.04869560897350311, -0.0059214429929852486, -0.015709098428487778, 0.03524148464202881, -0.03533501178026199, 0.006197778508067131, -0.02276379056274891, 0.005504221189767122, -0.005742272362112999, 
0.00864323414862156, 0.01449623517692089, -0.08615878969430923, -0.019902745261788368, -0.012709731236100197, -0.042176391929388046, 0.006437498610466719, 0.0015244637615978718, 0.05173034593462944, -0.049557872116565704, 0.014568259008228779, 0.02895624376833439, 0.010240156203508377, -0.001294239773415029, -0.051832351833581924, 0.011399160139262676, 0.057161569595336914, 0.05856688693165779, 0.009896301664412022, 0.0004667540779337287, -0.0018993027042597532, -0.014930336736142635, -0.009480022825300694, -0.049976982176303864, -0.025915687903761864, 0.03031117282807827, -0.004909704439342022, -0.032247595489025116, 0.0592353381216526, 0.08491376042366028, -0.04929697513580322, -0.01218048669397831, -0.027961745858192444, -0.006770662497729063, 0.012578483670949936, 0.019840944558382034, 0.008873778395354748, 0.021609125658869743, 0.0015359801473096013]\n"
     ]
    }
   ],
   "source": [
    "emb = BGEEmbedding(model_name=\"BAAI/bge-m3\")\n",
    "print(emb.get_emb(\"建筑结构的安全性检查包括哪些方面？\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08f9a445",
   "metadata": {},
   "source": [
    "当观察到 Embedding 正确给出了编码后的向量，我们这一模块的构建就完成了。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "878401f7",
   "metadata": {},
   "source": [
    "### 3. 实现文档预处理模块\n",
    "\n",
    "为了处理建筑文档，我们需要预先准备好文档读取模块。本系统假设所有建筑规范和标准已经转换为Markdown格式，便于后续的文本处理和分析。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "317fa91a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "from typing import Dict, List\n",
    "\n",
    "class DocumentProcessor:\n",
    "    def __init__(self):\n",
    "        pass\n",
    "    \n",
    "    def load_documents(self, directory_path: str) -> List[str]:\n",
    "        documents = []\n",
    "        \n",
    "        for file_path in Path(directory_path).rglob('*.md'):\n",
    "            try:\n",
    "                with open(file_path, 'r', encoding='utf-8') as f:\n",
    "                    content = f.read()\n",
    "                    documents.append(content)\n",
    "            except Exception as e:\n",
    "                print(f\"Error reading {file_path}: {e}\")\n",
    "                    \n",
    "        return documents"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62252a8f",
   "metadata": {},
   "source": [
    "完成文档预处理模块的设置后，我们就可以采用下面的方法来加载建筑规范文档了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3e6c0e71",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "加载了 1 个建筑规范文档\n"
     ]
    }
   ],
   "source": [
    "processor = DocumentProcessor()\n",
    "documents = processor.load_documents(\"./construction_standards\")\n",
    "print(f\"加载了 {len(documents)} 个建筑规范文档\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "679c7210",
   "metadata": {},
   "source": [
    "## 核心实现\n",
    "\n",
    "建筑文档审查系统的主要流程如下。首先，让我们来梳理一下建筑文档审查的工作流程，系统的一个核心思想在于，我们需要把用户提供的文档内容通过智能化的问询生成和知识引导检索来识别潜在的合规性问题。与传统RAG方法不同，我们的系统专门针对建筑领域的专业特点进行了优化，能够更准确地理解建筑规范要求，提供更可靠的审查建议。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adb5f361",
   "metadata": {},
   "source": [
    "### 动态语义知识分块\n",
    "\n",
    "在传统RAG流程中，文本通过设置固定的token数量划分文本区块。然而，固定token数量会在句子中间截断，导致信息缺失。为此，本系统使用基于建筑文本语义动态划分的方式，通过双重语义聚类的方式，完成考虑建筑语义连贯性的知识chunk划分。\n",
    "\n",
    "首先，将整个文档内容处理成单独句子序列 $S = \\{s_0, s_1, \\ldots, s_a\\}$。通过计算相邻句子间的语义差异度来识别潜在的语义边界：\n",
    "\n",
    "$$\\gamma_i = 1 - \\frac{s_{i-1} \\cdot s_i}{\\|s_{i-1}\\| \\|s_i\\|}$$\n",
    "\n",
    "基于语义差异度分布自动确定动态阈值：\n",
    "\n",
    "$$\\psi = \\text{Quantile}(\\Gamma, \\frac{a-p}{a})$$\n",
    "\n",
    "确保最终的分块既保持语义连贯性又满足长度约束：\n",
    "\n",
    "$$\\mathbb{E}[\\gamma_{\\text{intra}}] < \\mathbb{E}[\\gamma_{\\text{inter}}]$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5fbb0886",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import math\n",
    "import numpy as np\n",
    "from typing import List, Dict, Tuple\n",
    "from sentence_transformers import SentenceTransformer\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "class DynamicSemanticChunker:\n",
    "   def __init__(self, \n",
    "                embedding_model: str = \"BAAI/bge-m3\",\n",
    "                max_chunk_length: int = 512,\n",
    "                min_chunk_length: int = 50):\n",
    "       self.embedding_model = SentenceTransformer(\n",
    "           embedding_model,\n",
    "           cache_folder=\"./model_cache\"\n",
    "       )\n",
    "       self.max_chunk_length = max_chunk_length\n",
    "       self.min_chunk_length = min_chunk_length\n",
    "   \n",
    "   def split_text(self, text: str) -> Dict[str, str]:\n",
    "       sentences = self._split_into_sentences(text)\n",
    "       if len(sentences) == 0:\n",
    "           return {}\n",
    "       \n",
    "       total_tokens = sum(len(self.embedding_model.tokenizer.encode(s)) for s in sentences)\n",
    "       baseline_chunks = math.ceil(total_tokens / self.max_chunk_length)\n",
    "       alpha = (len(sentences) - baseline_chunks) / len(sentences)\n",
    "       \n",
    "       sentence_embeddings = self.embedding_model.encode(sentences)\n",
    "       gamma_values = self._compute_semantic_discrepancy(sentence_embeddings)\n",
    "       threshold = np.quantile(gamma_values, alpha) if len(gamma_values) > 0 and alpha > 0 else 0.5\n",
    "       \n",
    "       boundaries = self._identify_boundaries(gamma_values, threshold)\n",
    "       initial_chunks = self._create_initial_chunks(sentences, boundaries)\n",
    "       final_chunks = self._enforce_length_constraints(initial_chunks)\n",
    "       \n",
    "       chunks_dict = {}\n",
    "       for i, chunk in enumerate(final_chunks):\n",
    "           if chunk.strip():\n",
    "               chunk_id = f\"chunk-{i+1:03d}\"\n",
    "               chunks_dict[chunk_id] = chunk.strip()\n",
    "       \n",
    "       return chunks_dict\n",
    "   \n",
    "   def _split_into_sentences(self, text: str) -> List[str]:\n",
    "       sentence_pattern = r'[。！？；\\n]+'\n",
    "       sentences = re.split(sentence_pattern, text)\n",
    "       \n",
    "       cleaned_sentences = []\n",
    "       for sentence in sentences:\n",
    "           sentence = sentence.strip()\n",
    "           if len(sentence) > 5:\n",
    "               cleaned_sentences.append(sentence)\n",
    "       \n",
    "       return cleaned_sentences\n",
    "   \n",
    "   def _compute_semantic_discrepancy(self, embeddings: np.ndarray) -> List[float]:\n",
    "       gamma_values = []\n",
    "       \n",
    "       for i in range(1, len(embeddings)):\n",
    "           similarity = cosine_similarity(\n",
    "               embeddings[i-1].reshape(1, -1),\n",
    "               embeddings[i].reshape(1, -1)\n",
    "           )[0][0]\n",
    "           \n",
    "           gamma = 1 - similarity\n",
    "           gamma_values.append(gamma)\n",
    "       \n",
    "       return gamma_values\n",
    "   \n",
    "   def _identify_boundaries(self, gamma_values: List[float], threshold: float) -> List[int]:\n",
    "       boundaries = [0]\n",
    "       \n",
    "       for i, gamma in enumerate(gamma_values):\n",
    "           if gamma > threshold:\n",
    "               boundaries.append(i + 1)\n",
    "       \n",
    "       boundaries.append(len(gamma_values) + 1)\n",
    "       return sorted(set(boundaries))\n",
    "   \n",
    "   def _create_initial_chunks(self, sentences: List[str], boundaries: List[int]) -> List[str]:\n",
    "       chunks = []\n",
    "       \n",
    "       for i in range(len(boundaries) - 1):\n",
    "           start = boundaries[i]\n",
    "           end = boundaries[i + 1]\n",
    "           \n",
    "           chunk_sentences = sentences[start:end]\n",
    "           chunk_text = ' '.join(chunk_sentences)\n",
    "           chunks.append(chunk_text)\n",
    "       \n",
    "       return chunks\n",
    "   \n",
    "   def _enforce_length_constraints(self, chunks: List[str]) -> List[str]:\n",
    "       final_chunks = []\n",
    "       \n",
    "       for chunk in chunks:\n",
    "           chunk_tokens = len(self.embedding_model.tokenizer.encode(chunk))\n",
    "           \n",
    "           if chunk_tokens <= self.max_chunk_length:\n",
    "               if chunk_tokens >= self.min_chunk_length:\n",
    "                   final_chunks.append(chunk)\n",
    "           else:\n",
    "               split_chunks = self._split_overlong_chunk(chunk)\n",
    "               final_chunks.extend(split_chunks)\n",
    "       \n",
    "       return final_chunks\n",
    "   \n",
    "   def _split_overlong_chunk(self, chunk: str) -> List[str]:\n",
    "       sentences = re.split(r'[。！？；\\n]+', chunk)\n",
    "       sentences = [s.strip() for s in sentences if s.strip()]\n",
    "       \n",
    "       if not sentences:\n",
    "           return [chunk]\n",
    "       \n",
    "       result_chunks = []\n",
    "       current_chunk_sentences = []\n",
    "       current_tokens = 0\n",
    "       \n",
    "       for sentence in sentences:\n",
    "           if current_chunk_sentences:\n",
    "               temp_text = ' '.join(current_chunk_sentences + [sentence])\n",
    "           else:\n",
    "               temp_text = sentence\n",
    "               \n",
    "           temp_tokens = len(self.embedding_model.tokenizer.encode(temp_text))\n",
    "           \n",
    "           if temp_tokens > self.max_chunk_length and current_chunk_sentences:\n",
    "               chunk_text = ' '.join(current_chunk_sentences)\n",
    "               if len(self.embedding_model.tokenizer.encode(chunk_text)) >= self.min_chunk_length:\n",
    "                   result_chunks.append(chunk_text)\n",
    "               \n",
    "               current_chunk_sentences = [sentence]\n",
    "               current_tokens = len(self.embedding_model.tokenizer.encode(sentence))\n",
    "           else:\n",
    "               current_chunk_sentences.append(sentence)\n",
    "               current_tokens = temp_tokens\n",
    "       \n",
    "       if current_chunk_sentences:\n",
    "           chunk_text = ' '.join(current_chunk_sentences)\n",
    "           if len(self.embedding_model.tokenizer.encode(chunk_text)) >= self.min_chunk_length:\n",
    "               result_chunks.append(chunk_text)\n",
    "       \n",
    "       return result_chunks if result_chunks else [chunk]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab99fff",
   "metadata": {},
   "source": [
    "### 建筑文档审查系统\n",
    "\n",
    "整体的审查过程如下图所示。系统获取需要审查的区域后，依据提示生成审查问题推荐，此部分也可供工程师进行相关问题输入或推荐问题选择，生成待审查问题。随后，系统通过生成式知识引导检索框架，依据审查问题在所建文本知识库中检索出相应的知识参考。最终，依据检索的部分与审查原文，进行问题分析与审查修正，完成最终的审查流程。\n",
    "\n",
    "![picture](images/pic1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f156b1d",
   "metadata": {},
   "source": [
    "#### 审查问题生成\n",
    "\n",
    "在文档审查流程中，系统引入了双阶段Prompt工程驱动的智能化问询生成机制，旨在对建筑施工交底文档进行预见性分析与风险挖掘，实现对文档潜在问题的高效、精准定位。\n",
    "\n",
    "阶段1为待查文档主旨目标解构，模型被指示从文本中提炼核心事件、关键技术、工艺流程等要素，结构化地总结文档的核心内容，由此界定本次审查的靶向目标，为后续的精细化问询奠定基础。阶段2为多维度风险探测与定制化问询生成，基于第一阶段提炼的核心要素，通过few-shot等方式引导 LLM 从合规性、安全性、可操作性等多维度对文档进行风险探测。Prompt 指示模型围绕潜在的限制条件、操作流程、以及可能存在的合规性隐患等方面，进行细粒度、多角度的审查提问。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d283d9fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "CORE_COMPONENTS_PROMPT = \"\"\"\n",
    "Task: Extract key information components from the construction document below. Focus on technical requirements, construction methods, and compliance-related elements.\n",
    "\n",
    "Please identify:\n",
    "1. Technical specifications and standards\n",
    "2. Construction techniques and processes  \n",
    "3. Quality requirements and restrictions\n",
    "\n",
    "Input: {document_chunk}\n",
    "\n",
    "Please provide a concise summary in English:\n",
    "\"\"\"\n",
    "\n",
    "REVIEW_QUERIES_PROMPT = \"\"\"\n",
    "Task: Generate 3-5 specific review questions based on the construction document and extracted components. These questions should help identify potential compliance issues by retrieving relevant construction codes and standards.\n",
    "\n",
    "Document: {document_chunk}\n",
    "Key components: {core_components}\n",
    "\n",
    "Generate review questions (one question per line):\n",
    "1.\n",
    "2.\n",
    "3.\n",
    "4.\n",
    "5.\n",
    "\"\"\"\n",
    "\n",
    "def generate_review_queries(llm, document_chunk: str) -> List[str]:\n",
    "    core_prompt = CORE_COMPONENTS_PROMPT.format(document_chunk=document_chunk)\n",
    "    core_response = llm.predict(core_prompt)\n",
    "    \n",
    "    queries_prompt = REVIEW_QUERIES_PROMPT.format(\n",
    "        document_chunk=document_chunk,\n",
    "        core_components=core_response\n",
    "    )\n",
    "    queries_response = llm.predict(queries_prompt)\n",
    "    \n",
    "    queries = []\n",
    "    lines = queries_response.strip().split('\\n')\n",
    "    \n",
    "    for line in lines:\n",
    "        line = line.strip()\n",
    "        line = re.sub(r'^\\d+[\\.\\)]\\s*', '', line)\n",
    "        line = re.sub(r'^\\*\\s*', '', line)\n",
    "        line = re.sub(r'^-\\s*', '', line) \n",
    "        \n",
    "        if line and len(line) > 5:\n",
    "            queries.append(line)\n",
    "    \n",
    "    return queries[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7b93451",
   "metadata": {},
   "source": [
    "#### 知识引导生成式检索\n",
    "\n",
    "系统的核心创新在于知识引导的检索框架，整个过程分为三个关键步骤。步骤1为句子级编码，主要负责输入查询句子的初始表示学习，计算查询与知识库chunks间的句子级相似度分数。步骤2为知识引导检索，进一步从查询中提取关键信息，利用这些信息结合文档长度自适应加权等机制，对每个知识库chunk进行更详细的评分。步骤3为重排序与增强，使用大语言模型对步骤2检索的结果进行进一步重排序，并利用精炼的知识来增强原始查询。\n",
    "![picture](images/pic2.png)\n",
    "\n",
    "首先建立专门针对建筑领域文本分析的深度提取模块，集成领域预训练BERT进行上下文编码，结合双向LSTM进行建筑法规依赖建模。建立三级重要性分类层次：max（最高）、mid（中等）、lit（字面）优先级。本项目直接通过大语言模型进行关键信息提取，如果需要更精准的效果，可以自行训练BERT模型进行专门的关键信息提取。\n",
    "![picture](images/pic3.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9bc88609",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from typing import Dict, Tuple, List\n",
    "\n",
    "KEY_INFO_EXTRACTION_PROMPT = \"\"\"\n",
    "Your task is to extract key information from the query with three different priority levels:\n",
    "\n",
    "Maximum priority (max): The most important core concepts or entities\n",
    "Medium priority (mid): Important modifiers or qualifying conditions  \n",
    "Literal priority (lit): Specific values, standards or specifications\n",
    "\n",
    "Query: {query}\n",
    "max:\n",
    "mid:\n",
    "lit:\n",
    "\"\"\"\n",
    "\n",
    "class KeyInfoExtractor:\n",
    "    def __init__(self, llm):\n",
    "        self.llm = llm\n",
    "\n",
    "    def extract_key_info(self, query: str) -> Dict[str, Tuple[str, float]]:\n",
    "        prompt = KEY_INFO_EXTRACTION_PROMPT.format(query=query)\n",
    "        response = self.llm.predict(prompt)\n",
    "        \n",
    "        lines = response.strip().split('\\n')\n",
    "        key_info = {}\n",
    "        weights = {'max': 0.5, 'mid': 0.3, 'lit': 0.2}\n",
    "        \n",
    "        for line in lines:\n",
    "            if line.startswith('max:'):\n",
    "                key_info['max'] = (line[4:].strip(), weights['max'])\n",
    "            elif line.startswith('mid:'):\n",
    "                key_info['mid'] = (line[4:].strip(), weights['mid'])\n",
    "            elif line.startswith('lit:'):\n",
    "                key_info['lit'] = (line[4:].strip(), weights['lit'])\n",
    "        \n",
    "        return key_info"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "322faa01",
   "metadata": {},
   "source": [
    "#### 文档长度自适应因子\n",
    "\n",
    "在知识引导检索过程中，文档长度自适应因子用于调整不同长度文档的权重分配，确保长短文档都能得到公平的评分机会。该因子的计算考虑了当前文档chunk的长度与平均文档长度的关系。\n",
    "\n",
    "$$\\Lambda_{\\text{DL}} = \\frac{\\overline{|k|} + |k_j|}{2\\overline{|k|}}$$\n",
    "\n",
    "其中 $|k_j|$ 表示当前文档chunk的长度，$\\overline{|k|}$ 表示平均文档长度。通过这种归一化处理，可以避免因文档长度差异导致的评分偏差。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "290be1ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_document_length_factor(chunk_length: int, avg_length: int = 100) -> float:\n",
    "    lambda_dl = (avg_length + chunk_length) / (2 * avg_length)\n",
    "    return lambda_dl"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8602cb08",
   "metadata": {},
   "source": [
    "#### 术语重要性计算\n",
    "\n",
    "术语重要性指标衡量术语在文档中的显著程度，结合术语频率和文档长度自适应因子，能够更准确地评估术语在当前文档中的重要性。计算公式考虑了术语频率的非线性增长特性。\n",
    "\n",
    "$$\\text{Sign}(t_{e_i}^\\tau, k_j) = \\frac{2 \\cdot f(t_{e_i}^\\tau, k_j) \\cdot \\Lambda_{\\text{DL}}}{f(t_{e_i}^\\tau, k_j) + 1}$$\n",
    "\n",
    "其中 $f(t_{e_i}^\\tau, k_j)$ 表示术语在文档chunk中的出现频率，$\\Lambda_{\\text{DL}}$ 为文档长度自适应因子。这种计算方式能够防止高频术语过度影响评分。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ba8bcc78",
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_term_significance(term_freq: int, doc_length_factor: float) -> float:\n",
    "    significance = (2 * term_freq * doc_length_factor) / (term_freq + 1)\n",
    "    return significance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6bf4a3b",
   "metadata": {},
   "source": [
    "#### 术语稀有度计算\n",
    "\n",
    "术语稀有度用于衡量术语在整个知识库中的稀缺程度，稀有度越高的术语在检索中的权重越大。计算采用了改进的IDF公式，增加了平滑处理以避免零除问题。\n",
    "\n",
    "$\\text{Rarity}(t_{e_i}^\\tau) = \\log\\left(\\frac{D - \\text{df}(t_{e_i}^\\tau) + 0.5}{\\text{df}(t_{e_i}^\\tau) + 0.5} + 1\\right)$\n",
    "\n",
    "其中 $D$ 表示文档总数，$\\text{df}(t_{e_i}^\\tau)$ 表示包含该术语的文档数量。加一操作确保了对数值始终为正数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d6dc2700",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def compute_term_rarity(doc_freq: int, total_docs: int) -> float:\n",
    "    rarity = np.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)\n",
    "    return rarity"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c19ec6e7",
   "metadata": {},
   "source": [
    "#### 连贯性指数评估\n",
    "\n",
    "连贯性指数反映术语在文档中的分布连贯性，通过滑动窗口技术分析术语在文档中的局部分布情况。连贯性高的术语往往在文档的特定区域集中出现，表明其与文档主题的强相关性。\n",
    "\n",
    "$$\\text{CI}(t_{e_i}^\\tau, k_j) = \\max_{w \\in W, \\, t \\in w} \\frac{\\sum I(t = t_{e_i}^\\tau) \\cdot |w|}{|k_j|}$$\n",
    "\n",
    "其中 $W$ 表示文档中的滑动窗口集合，$I(t = t_{e_i}^\\tau)$ 为指示函数，当窗口中包含该术语时为1，否则为0。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "33644f65",
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_coherence_index(term: str, chunk: str, window_size: int = 50) -> float:\n",
    "    chunk_tokens = chunk.lower().split()\n",
    "    chunk_length = len(chunk_tokens)\n",
    "    \n",
    "    if chunk_length == 0:\n",
    "        return 0.0\n",
    "    \n",
    "    max_coherence = 0.0\n",
    "    \n",
    "    for i in range(0, chunk_length - window_size + 1, 10):\n",
    "        window = chunk_tokens[i:i + window_size]\n",
    "        term_count = window.count(term.lower())\n",
    "        \n",
    "        if term_count > 0:\n",
    "            coherence = (term_count * window_size) / chunk_length\n",
    "            max_coherence = max(max_coherence, coherence)\n",
    "    \n",
    "    return max_coherence"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "494fe967",
   "metadata": {},
   "source": [
    "#### 评分融合与检索\n",
    "\n",
    "将句子级相似度评分与知识级评分进行融合，形成最终的文档相关性评分。融合过程采用加权平均的方式，平衡参数λ控制两种评分方式的重要性。\n",
    "\n",
    "$\\Phi = \\lambda \\Phi(\\mathcal{K}) + (1 - \\lambda) \\Phi(\\mathcal{S})$\n",
    "\n",
    "其中 $\\lambda$ 为平衡参数，$\\Phi(\\mathcal{K})$ 为知识级评分，$\\Phi(\\mathcal{S})$ 为句子级评分。通过调整λ值，可以控制系统更偏向语义相似还是知识匹配。当λ=0时，系统完全依赖句子级语义相似度；当λ=1时，系统完全依赖知识匹配评分；λ=0.5时，两种评分方式权重相等。在建筑文档审查场景中，通常设置λ=0.5以平衡专业知识匹配和语义理解。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "66d44da3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "from typing import List, Tuple, Dict, Any\n",
    "\n",
    "class GKGRRetriever:\n",
    "    def __init__(self, \n",
    "                 knowledge_base: List[str],\n",
    "                 embedding_model,\n",
    "                 key_info_extractor: KeyInfoExtractor,\n",
    "                 llm,\n",
    "                 config: Dict[str, Any] = None):\n",
    "        self.knowledge_base = knowledge_base\n",
    "        self.embedding_model = embedding_model\n",
    "        self.key_info_extractor = key_info_extractor\n",
    "        self.llm = llm\n",
    "        \n",
    "        default_config = {\n",
    "            \"lambda_param\": 0.5,\n",
    "            \"top_k\": 5,\n",
    "            \"rerank_enabled\": True,\n",
    "            \"query_expansion\": True,\n",
    "            \"similarity_threshold\": 0.1\n",
    "        }\n",
    "        self.config = {**default_config, **(config or {})}\n",
    "        \n",
    "        self.kb_embeddings = self._precompute_embeddings()\n",
    "    \n",
    "    def _precompute_embeddings(self) -> np.ndarray:\n",
    "        embeddings = self.embedding_model.encode(self.knowledge_base, show_progress_bar=True)\n",
    "        return embeddings\n",
    "    \n",
    "    def retrieve_with_scores(self, query: str) -> List[Tuple[str, float, Dict[str, float]]]:\n",
    "        query_embedding = self.embedding_model.encode([query])[0]\n",
    "        sentence_scores = cosine_similarity(\n",
    "            query_embedding.reshape(1, -1), \n",
    "            self.kb_embeddings\n",
    "        )[0]\n",
    "        \n",
    "        key_info = self.key_info_extractor.extract_key_info(query)\n",
    "        knowledge_scores = self._compute_knowledge_scores(key_info)\n",
    "        \n",
    "        final_scores = []\n",
    "        for i in range(len(self.knowledge_base)):\n",
    "            norm_sent = sentence_scores[i]\n",
    "            norm_know = knowledge_scores[i] / max(knowledge_scores) if max(knowledge_scores) > 0 else 0\n",
    "            \n",
    "            final_score = (self.config[\"lambda_param\"] * norm_know + \n",
    "                          (1 - self.config[\"lambda_param\"]) * norm_sent)\n",
    "            final_scores.append(final_score)\n",
    "        \n",
    "        results_with_scores = []\n",
    "        for i, final_score in enumerate(final_scores):\n",
    "            if final_score > self.config[\"similarity_threshold\"]:\n",
    "                score_details = {\n",
    "                    \"sentence_score\": float(sentence_scores[i]),\n",
    "                    \"knowledge_score\": float(knowledge_scores[i]),\n",
    "                    \"final_score\": float(final_score)\n",
    "                }\n",
    "                results_with_scores.append((self.knowledge_base[i], final_score, score_details))\n",
    "        \n",
    "        results_with_scores.sort(key=lambda x: x[1], reverse=True)\n",
    "        return results_with_scores[:self.config[\"top_k\"]]\n",
    "    \n",
    "    def _compute_knowledge_scores(self, key_info: Dict[str, Tuple[str, float]]) -> List[float]:\n",
    "        scores = []\n",
    "        avg_length = sum(len(chunk.split()) for chunk in self.knowledge_base) / len(self.knowledge_base)\n",
    "        \n",
    "        for chunk in self.knowledge_base:\n",
    "            chunk_score = 0.0\n",
    "            chunk_tokens = chunk.lower().split()\n",
    "            chunk_length = len(chunk_tokens)\n",
    "            \n",
    "            lambda_dl = compute_document_length_factor(chunk_length, avg_length)\n",
    "            \n",
    "            for priority, (info_text, weight) in key_info.items():\n",
    "                if not info_text.strip():\n",
    "                    continue\n",
    "                \n",
    "                terms = info_text.lower().split()\n",
    "                for term in terms:\n",
    "                    if term in chunk_tokens:\n",
    "                        tf = chunk_tokens.count(term)\n",
    "                        \n",
    "                        significance = compute_term_significance(tf, lambda_dl)\n",
    "                        \n",
    "                        segments_with_term = sum(1 for kb_chunk in self.knowledge_base \n",
    "                                                if term in kb_chunk.lower())\n",
    "                        rarity = compute_term_rarity(segments_with_term, len(self.knowledge_base))\n",
    "                        \n",
    "                        coherence = compute_coherence_index(term, chunk)\n",
    "                        \n",
    "                        term_score = significance * rarity * (1 + coherence) * weight\n",
    "                        chunk_score += term_score\n",
    "            \n",
    "            scores.append(chunk_score)\n",
    "        \n",
    "        return scores\n",
    "    \n",
    "    def retrieve(self, query: str) -> Tuple[List[str], str]:\n",
    "        results_with_scores = self.retrieve_with_scores(query)\n",
    "        \n",
    "        documents = [doc for doc, _, _ in results_with_scores]\n",
    "        \n",
    "        if self.config[\"rerank_enabled\"] and len(documents) > 1:\n",
    "            documents = self._llm_rerank(query, documents)\n",
    "        \n",
    "        augmented_query = query\n",
    "        if self.config[\"query_expansion\"]:\n",
    "            augmented_query = self._augment_query(query, documents[:3])\n",
    "        \n",
    "        return documents, augmented_query"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d23e7375",
   "metadata": {},
   "source": [
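    "#### Reranking Optimization\n",
    "\n",
    "The system then uses the large language model to rerank the retrieved results, relying on the LLM's semantic understanding to refine the relevance order. During reranking, the system builds a prompt that contains the query and the candidate documents and asks the LLM to reorder the documents by relevance."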
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "3ae7c7a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "def _llm_rerank(self, query: str, documents: List[str]) -> List[str]:\n",
    "    if len(documents) <= 1:\n",
    "        return documents\n",
    "    \n",
    "    rerank_prompt = f\"\"\"\n",
    "Task: A list of documents is shown below. Each document has a number next to it. A question is also provided. Your task is to return the numbers of ALL documents in order of relevance from MOST to LEAST relevant. MUST include EVERY document number exactly once.\n",
    "\n",
    "Example format:\n",
    "    Document 1: <document 1>\n",
    "    Document 2: <document 2>\n",
    "    Document 3: <document 3>\n",
    "    Question: <question>\n",
    "    Answer: 3,1,2\n",
    "\n",
    "Now here are the actual documents and question.\n",
    "\n",
    "\"\"\"\n",
    "    for i, doc in enumerate(documents):\n",
    "        rerank_prompt += f\"Document {i+1}: {doc[:150]}...\\n\"\n",
    "    \n",
    "    rerank_prompt += f\"Question: {query}\\nAnswer:\"\n",
    "    \n",
    "    try:\n",
    "        response = self.llm.predict(rerank_prompt)\n",
    "        order_nums = [int(x.strip()) - 1 for x in response.split(',') \n",
    "                     if x.strip().isdigit() and 0 <= int(x.strip()) - 1 < len(documents)]\n",
    "        \n",
    "        reranked = [documents[i] for i in order_nums if i < len(documents)]\n",
    "        \n",
    "        # Append any documents the LLM left out of its ranking\n",
    "        used_indices = set(order_nums)\n",
    "        for i, doc in enumerate(documents):\n",
    "            if i not in used_indices:\n",
    "                reranked.append(doc)\n",
    "        \n",
    "        return reranked[:len(documents)]\n",
    "    except Exception:\n",
    "        return documents\n",
    "    \n",
    "GKGRRetriever._llm_rerank = _llm_rerank"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96ee71c1",
   "metadata": {},
   "source": [
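    "#### Query Augmentation\n",
    "\n",
    "The system also uses the retrieved knowledge to augment the original query, producing a more specific and detailed query for further retrieval. Query augmentation analyzes the context of the retrieved results to identify key concepts and terms that the original query may have missed."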
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "ce6e9451",
   "metadata": {},
   "outputs": [],
   "source": [
    "def _augment_query(self, original_query: str, top_results: List[str]) -> str:\n",
    "    if not top_results:\n",
    "        return original_query\n",
    "    \n",
    "    document_list = \"\"\n",
    "    for i, doc in enumerate(top_results):\n",
    "        document_list += f\"Document {i+1}: {doc[:100]}...\\n\"\n",
    "    \n",
    "    augment_prompt = f\"\"\"\n",
    "Task: Your task is to generate a detailed answer to the question by synthesizing information from ALL provided documents. Prioritize relevance, cite document numbers, and structure your response as follows:\n",
    "\n",
    "Question: {original_query}\n",
    "{document_list}\n",
    "Answer:\n",
    "\"\"\"\n",
    "    \n",
    "    try:\n",
    "        augmented = self.llm.predict(augment_prompt)\n",
    "        return augmented.strip()\n",
    "    except Exception:\n",
    "        return original_query\n",
    "\n",
    "GKGRRetriever._augment_query = _augment_query"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c783319",
   "metadata": {},
   "source": [
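    "#### Deviation Detection and Analysis\n",
    "\n",
    "After the knowledge-augmented retrieval stage has gathered the domain knowledge, the system enters the error analysis module. Using the retrieved knowledge references together with the preset review queries, this module examines the original text in detail to detect and assess deviations."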
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "6d84e7a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "class ErrorAnalyzer:\n",
    "    def __init__(self, llm):\n",
    "        self.llm = llm\n",
    "    \n",
    "    def analyze_errors(self, document_chunk: str, query: str, retrieved_knowledge: List[str]) -> Dict[str, Any]:\n",
    "        \n",
    "        analysis_prompt = f\"\"\"\n",
    "Task: Your task is to conduct an error analysis on a given review document, based on a provided review query and relevant reference specifications. This analysis MUST strictly adhere to the provided reference and focus specifically on reviewing and analyzing the original descriptive sections within the review document.\n",
    "\n",
    "Review document: {document_chunk}\n",
    "Query: {query}\n",
    "Reference: {chr(10).join([f\"{i+1}. {ref}\" for i, ref in enumerate(retrieved_knowledge)])}\n",
    "Analysis:\n",
    "\"\"\"\n",
    "        \n",
    "        analysis = self.llm.predict(analysis_prompt)\n",
    "        \n",
    "        return {\n",
    "            \"analysis\": analysis,\n",
    "            \"reference_support\": retrieved_knowledge\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9421c272",
   "metadata": {},
   "source": [
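    "#### Revision Suggestion Generation\n",
    "\n",
    "Once the error analysis module finishes, the system outputs the flagged deviation regions along with the supporting knowledge. It then enters the revision generation module, which uses the error analysis and the knowledge references to produce targeted revision suggestions for the flagged regions, yielding a knowledge-driven automatic correction of the original text."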
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "6d4ed81c",
   "metadata": {},
   "outputs": [],
   "source": [
    "class RevisionGenerator:\n",
    "    def __init__(self, llm):\n",
    "        self.llm = llm\n",
    "    \n",
    "    def generate_revisions(self, document_chunk: str, analysis: Dict[str, Any]) -> Dict[str, str]:     \n",
    "        revision_prompt = f\"\"\"\n",
    "Task: Your task is to review and revise the provided document based on the given analysis and corresponding reference specifications. STRICT adherence to the provided reference specifications is required. If the review document aligns with the analysis and reference specifications WITHOUT discrepancies, revision is not necessary.\n",
    "\n",
    "Review document: {document_chunk}\n",
    "Analysis: {analysis['analysis']}\n",
    "Reference: {chr(10).join([f\"- {ref}\" for ref in analysis['reference_support']])}\n",
    "Revision:\n",
    "\"\"\"\n",
    "        \n",
    "        revision = self.llm.predict(revision_prompt)\n",
    "        \n",
    "        return {\n",
    "            \"original_text\": document_chunk,\n",
    "            \"revision_suggestions\": revision,\n",
    "            \"modified_regions\": analysis.get(\"error_regions\", []),\n",
    "            \"confidence\": self._calculate_confidence(analysis)\n",
    "        }\n",
    "    \n",
    "    def _calculate_confidence(self, analysis: Dict[str, Any]) -> float:\n",
    "        ref_count = len(analysis.get(\"reference_support\", []))\n",
    "        error_count = len(analysis.get(\"error_regions\", []))\n",
    "        \n",
    "        confidence = min(0.9, 0.5 + (ref_count * 0.1) + (error_count * 0.05))\n",
    "        return confidence"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ef2f9f5",
   "metadata": {},
   "source": [
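    "#### Complete Review Workflow\n",
    "\n",
    "Integrating all of the modules above yields the complete document review workflow. The system first generates review queries, then performs knowledge-guided retrieval, then runs error analysis, and finally generates revision suggestions."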
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "e7378cf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def complete_review_process(document_chunk: str, \n",
    "                          gkgr_framework: GKGRRetriever, \n",
    "                          error_analyzer: ErrorAnalyzer,\n",
    "                          revision_generator: RevisionGenerator) -> Dict[str, Any]:    \n",
    "    review_queries = generate_review_queries(gkgr_framework.llm, document_chunk)\n",
    "    \n",
    "    results = {}\n",
    "    for query in review_queries[:3]:\n",
    "        retrieved_docs, augmented_query = gkgr_framework.retrieve(query)\n",
    "        \n",
    "        knowledge_refs = retrieved_docs\n",
    "        analysis = error_analyzer.analyze_errors(document_chunk, query, knowledge_refs)\n",
    "        \n",
    "        revision = revision_generator.generate_revisions(document_chunk, analysis)\n",
    "        \n",
    "        results[query] = {\n",
    "            \"retrieved_knowledge\": retrieved_docs,\n",
    "            \"augmented_query\": augmented_query,\n",
    "            \"analysis\": analysis,\n",
    "            \"revision\": revision\n",
    "        }\n",
    "    \n",
    "    return results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bc6173a",
   "metadata": {},
   "source": [
    "This completes the core implementation of the intelligent construction document review system."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "326ff27c",
   "metadata": {},
   "source": [
    "## Application Example\n",
    "\n",
    "Let's walk through a complete example to show how the system is used:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29ba3b1d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "审查问题: Is the minimum concrete compressive strength of C25 verified by test reports and does it meet the design requirements specified in standards like GB 50010?\n",
      "修订建议: 根据提供的分析报告和参考规范GB 50496-2018《大体积混凝土施工标准》，原审查文档存在重大不符合项，特别是关于混凝土强度验收指标和养护要求。审查文档未体现大体积混凝土施工的特殊性，与参考规范存在严重偏差。必须进行修订。\n",
      "\n",
      "**修订后的文档：**\n",
      "\n",
      "钢筋混凝土柱（大体积混凝土）的施工应符合以下要求：\n",
      "1.  混凝土强度等级应符合结构设计要求（如基于GB 50010），并宜采用60d或90d龄期的强度指标作为配合比设计和验收依据。应提供法定检测机构出具的混凝土抗压强度检验报告进行验证。\n",
      "2.  钢筋保护层厚度应符合设计要求及GB 50496等现行有关标准的规定。\n",
      "3.  混凝土浇筑应连续进行，采用整体分层或推移式方法施工。层间间歇时间不应大于混凝土的初凝时间。\n",
      "4.  混凝土应采用保温保湿养护，专人负责，持续时间不宜少于14d，并应经常检查保持混凝土表面湿润。保温覆盖层拆除应分层逐步进行，当混凝土表面温度与环境最大温差小于20℃时，方可全部拆除。\n",
      "\n",
      "**修订说明：**\n",
      "1.  **第1条**：根据分析报告指出的“未指定强度龄期”和“未要求验证”的关键错误，严格依据GB 50496-2018第4.3.1条，将强度要求修订为鼓励采用60d或90d龄期强度，并明确要求通过试验报告进行验证。\n",
      "2.  **第2条**：保护层厚度首先应满足设计要求，参考规范中未对保护层厚度做统一规定，故修订为更具通用性的表述。\n",
      "3.  **第3条**：参考GB 50496-2018第5.4.1条，将“间歇时间不超过1小时”的绝对化表述，修订为更科学的“不应大于混凝土的初凝时间”，并补充了大体积混凝土推荐的浇筑方法。\n",
      "4.  **第4条**：根据分析报告指出的养护要求不具体的问题，依据GB 50496-2018第5.5.1条，将“保持湿润”的普通养护要求，修订为针对大体积混凝土的“保温保湿养护”，并明确了养护时间、责任人以及保温层的拆除条件。\n",
      "--------------------------------------------------\n",
      "审查问题: Is the 25mm clear cover to reinforcement for the columns confirmed through inspection and does it comply with the minimum cover requirements for durability in GB 50010 based on the environmental category?\n",
      "修订建议: 钢筋混凝土柱的施工应符合以下要求：\n",
      "1. 混凝土强度等级不低于C25\n",
      "2. 钢筋的混凝土保护层最小厚度应符合设计规定及现行国家标准《混凝土结构设计规范》GB 50010中关于环境类别的要求，并应通过现场检测进行确认\n",
      "3. 混凝土浇筑应连续进行，间歇时间不超过1小时\n",
      "4. 养护期间应保持混凝土表面湿润，且保湿养护持续时间不宜少于14d\n",
      "\n",
      "**修订说明：**\n",
      "根据分析意见及GB 50496-2018标准要求，对原文档第2条关于钢筋保护层厚度的内容进行了修订。原条款仅规定了单一厚度，未考虑环境类别差异且未提及验收标准，不符合GB 50010的设计原则。修订后条款明确了保护层厚度应遵循设计规定和GB 50010标准，并增加了通过现场检测进行确认的要求，以满足施工质量控制和验收的需要。其余条款符合大体积混凝土施工标准的相关规定，予以保留。\n",
      "--------------------------------------------------\n",
      "审查问题: Is there a documented continuous concrete pouring sequence and method statement to ensure the maximum 1-hour间歇时间 between batches is not exceeded, as required to prevent cold joints?\n",
      "修订建议: 钢筋混凝土柱的施工应符合以下要求：\n",
      "1. 混凝土强度等级不低于C25\n",
      "2. 钢筋保护层厚度为25mm\n",
      "3. 混凝土浇筑应连续进行，层间间歇时间不应大于混凝土初凝时间（通过试验确定）\n",
      "4. 养护期间应保持混凝土表面湿润\n",
      "\n",
      "---\n",
      "**改写说明**：\n",
      "- **修正间歇时间要求**：将“间歇时间不超过1小时”改为“层间间歇时间不应大于混凝土初凝时间（通过试验确定）”，以严格遵循《大体积混凝土施工标准》GB 50496-2018中关于间歇时间应根据混凝土初凝时间试验确定、不宜简单设定固定时长的规定。\n",
      "- **保持其他条款一致**：对强度等级、保护层厚度及养护要求等无争议内容予以保留，未做变动。\n",
      "\n",
      "如果您有其他风格或用途上的需求，我可以进一步调整文本表达。\n",
      "--------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "embedding = BGEEmbedding(model_name=\"BAAI/bge-m3\")\n",
    "key_extractor = KeyInfoExtractor(llm)\n",
    "\n",
    "# Build the knowledge base from Markdown documents\n",
    "processor = DocumentProcessor()\n",
    "documents = processor.load_documents(\"./construction_standards\")\n",
    "\n",
    "# Split the documents with dynamic semantic chunking\n",
    "chunker = DynamicSemanticChunker()\n",
    "knowledge_base = []\n",
    "for doc in documents:\n",
    "    chunks = chunker.split_text(doc)\n",
    "    knowledge_base.extend(chunks.values())\n",
    "\n",
    "# Initialize the retriever\n",
    "gkgr_retriever = GKGRRetriever(\n",
    "    knowledge_base=knowledge_base,\n",
    "    embedding_model=embedding,\n",
    "    key_info_extractor=key_extractor,\n",
    "    llm=llm\n",
    ")\n",
    "\n",
    "# Initialize the analyzers\n",
    "error_analyzer = ErrorAnalyzer(llm)\n",
    "revision_generator = RevisionGenerator(llm)\n",
    "\n",
    "# Document content to be reviewed\n",
    "sample_document = \"\"\"\n",
    "钢筋混凝土柱的施工应符合以下要求：\n",
    "1. 混凝土强度等级不低于C25\n",
    "2. 钢筋保护层厚度为25mm\n",
    "3. 混凝土浇筑应连续进行，间歇时间不超过1小时\n",
    "4. 养护期间应保持混凝土表面湿润\n",
    "\"\"\"\n",
    "\n",
    "# Run the review\n",
    "result = complete_review_process(\n",
    "    sample_document, \n",
    "    gkgr_retriever, \n",
    "    error_analyzer, \n",
    "    revision_generator\n",
    ")\n",
    "\n",
    "# Inspect the review results\n",
    "for query, analysis in result.items():\n",
    "    print(f\"审查问题: {query}\")\n",
    "    print(f\"修订建议: {analysis['revision']['revision_suggestions']}\")\n",
    "    print(\"-\" * 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96070e33",
   "metadata": {},
   "source": [
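    "## Extensibility\n",
    "\n",
    "The system can be adapted to other domains by swapping the knowledge base. For a specific company or project, accuracy can be improved by fine-tuning the key information extraction model. For performance, dynamic semantic chunking improves retrieval quality, precomputing and caching the knowledge base embeddings speeds up retrieval, batch processing suits large document collections, and the λ parameter and top-k value should be tuned to the application scenario.\n",
    "\n",
    "## Closing Remarks\n",
    "\n",
    "Congratulations on reaching the end. You have now seen how to implement an intelligent construction document review system and the reasoning behind it. The system shows how dynamic semantic chunking, knowledge-guided retrieval, and large language models can be combined into a practical solution for document review in the construction industry.\n",
    "\n",
    "The current system already performs well, but there is room for improvement. For global association, the chunk-based retrieval could be combined with techniques such as knowledge graphs. For multimodal support, it could be extended to visual information such as CAD drawings and construction drawings. For real-time updates, the knowledge base could support incremental updates and dynamic maintenance. For customization, the system could be tailored to different companies and projects.\n",
    "\n",
    "Readers can run the example code in the project to experience the full review workflow. We believe this system can improve review efficiency and contribute to the digital transformation of the construction industry.\n",
    "\n",
    "## Acknowledgments\n",
    "\n",
    "While developing this project we studied both the domain knowledge of construction engineering and recent natural language processing techniques. We especially thank the construction industry experts for their valuable advice and the open-source community for its support on the technical implementation. The code draws on the design ideas of open-source projects such as LlamaIndex and Transformers.\n",
    "\n",
    "Note that this project is specifically optimized for document review in construction. If you need to process documents from other domains, adjust the system to your requirements.\n",
    "\n",
    "## Source Code\n",
    "\n",
    "The source code and sample data for this project are available in the [GitHub repository](https://github.com/Hongru0306/CDDRS).\n",
    "\n",
    "## Citation\n",
    "\n",
    "If you use the results of this project in your research, please cite it as follows:\n",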
    "\n",
    "```bibtex\n",
    "@article{XIAO2025103618,\n",
    "  title = {Generative knowledge-guided review system for construction disclosure documents},\n",
    "  journal = {Advanced Engineering Informatics},\n",
    "  volume = {68},\n",
    "  pages = {103618},\n",
    "  year = {2025},\n",
    "  issn = {1474-0346},\n",
    "  doi = {https://doi.org/10.1016/j.aei.2025.103618},\n",
    "  url = {https://www.sciencedirect.com/science/article/pii/S1474034625005117},\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llamaindex",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: Extra-Chapter/CDDRS/readme.md
================================================
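# Intelligent RAG Review System for Construction Documents

An intelligent construction document review system implemented from scratch, intended to help developers understand the core principles and implementation details of knowledge-guided retrieval for document review in specialized domains.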

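## Motivation

Compliance review of construction disclosure documents is a key step in ensuring the safety and cost-effectiveness of construction projects. Throughout the project life cycle, every operation must comply with the relevant code provisions to ensure the safety and sustainability of the project. However, the reference material needed for review is often scattered across project files, and traditional manual review struggles with the large and complex body of construction provisions. Manual review depends on the reviewer's experience and expertise, so it is subjective, time-consuming, and error-prone.

Advances in large language models (LLMs) offer a new path toward automated construction document review. However, LLMs are usually trained on general-purpose corpora and lack construction background knowledge, so they hallucinate severely on complex reasoning problems in a construction context. Retrieval-augmented generation (RAG) based on vector similarity matching can supply LLMs with preliminary reference knowledge, which reduces the high error rate that manual or rule-based review suffers when handling large volumes of construction text.

Traditional RAG methods nevertheless have key limitations for reviewing specialized construction documents. Fixed-size chunking causes knowledge to be lost across chunk boundaries. During retrieval, embedding the whole query sentence for similarity matching ignores the fine-grained features of the query, which makes retrieval inefficient. Construction disclosure documents describe construction process characteristics and methods, quality specifications, operating procedures, and safety protocols in detail, so they are dense with knowledge and highly specialized. A system that can precisely understand and retrieve construction domain knowledge is therefore needed.

This project therefore proposes a generative knowledge-guided review system for construction documents, aiming to improve the reliability and accuracy of review. The system has two core innovations. First, a dynamic semantic knowledge chunking strategy builds a knowledge base with better semantic coherence and completeness. Second, building on this enhanced knowledge representation, a generative knowledge-guided retrieval framework strengthens the focus on fine-grained information during semantic embedding retrieval, which improves the accuracy of knowledge reference retrieval and the reliability of corrections in the review task.

Note that space does not permit showing the complete implementation. This document instead explains each necessary implementation step and the reasoning behind it, so that you can quickly understand how to build an intelligent construction document review system.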

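## Prerequisites

Next, we will build an intelligent construction document review system from scratch. We start with some basic preparation.

### 1. Implementing the LLM Module

First we need the LLM module, the most basic module in the system. We use the large model for document cleaning, information extraction, and similar tasks. Part of the essence of this system is using a large model to preprocess document information so that later retrieval is easier. Here we use the DeepSeek API.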

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class BaseLLM(ABC):
    """Interface for large language models."""

    def __init__(
        self,
        model_name: str,
        model_params: Optional[dict[str, Any]] = None,
        **kwargs: Any,
    ):
        self.model_name = model_name
        self.model_params = model_params or {}

    @abstractmethod
    def predict(self, input: str) -> str:
        """Sends a text input to the LLM and retrieves a response."""
```

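The class above is an abstract interface for calling a large model, which lets us standardize how models are invoked. We subclass it to implement the actual call.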

```python
from openai import OpenAI
from typing import Any, Optional
from .base import BaseLLM

class DeepSeekLLM(BaseLLM):
    """Implementation of the BaseLLM interface using DeepSeek API."""

    def __init__(
        self,
        model_name: str,
        api_key: str,
        base_url: str = "https://api.deepseek.com/v1",
        model_params: Optional[dict[str, Any]] = None,
        **kwargs: Any,
    ):
        super().__init__(model_name, model_params, **kwargs)
        self.client = OpenAI(api_key=api_key, base_url=base_url)

    def predict(self, input: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": input}],
        )
        return response.choices[0].message.content
```

Once this is in place, we can test it by calling the `predict` method.

```python
llm = DeepSeekLLM(
    model_name="deepseek-chat", 
    api_key="your-api-key-here",
    base_url="https://api.deepseek.com/v1"
)
print(llm.predict("Hello, can you help me review construction documents?"))
```

Once the LLM returns a proper reply, this module is complete.
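
Because the rest of the system only ever calls `predict`, a stand-in object with canned replies can be handy for exercising the pipeline without network access or an API key. The sketch below is an assumption for illustration: the class name `StubLLM` and its keyword-matching rule are hypothetical and not part of the project.

```python
from typing import Dict, List


class StubLLM:
    """Offline stand-in exposing the same predict() method as BaseLLM (hypothetical)."""

    def __init__(self, canned: Dict[str, str], default: str = ""):
        self.canned = canned
        self.default = default
        self.calls: List[str] = []  # prompts received, kept for inspection

    def predict(self, input: str) -> str:
        self.calls.append(input)
        # Return the first canned reply whose key occurs in the prompt
        for key, reply in self.canned.items():
            if key in input:
                return reply
        return self.default


stub = StubLLM({"Queries:": "'Is C25 sufficient?'"}, default="ok")
print(stub.predict("Task ...\nQueries:"))
print(stub.predict("hello"))
```

Any component that accepts an `llm` argument can be handed such an object during testing, since only the `predict` method is used.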

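### 2. Implementing the Embedding Module

Besides calling the large model, we also need an Embedding module, which converts text into vectors. We represent the information in documents as vectors so that vector similarity can measure how similar a document is to a query, and so retrieve the documents most helpful for answering the user's question.

The Embedding module is built in the same way as the LLM module.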

```python
from abc import ABC, abstractmethod
from typing import List, Any, Optional

class BaseEmb(ABC):
    def __init__(
        self,
        model_name: str,
        model_params: Optional[dict[str, Any]] = None,
        **kwargs: Any,
    ):
        self.model_name = model_name
        self.model_params = model_params or {}

    @abstractmethod
    def get_emb(self, input: str) -> List[float]:
        """Sends a text input to the embedding model and retrieves the embedding."""
        pass

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from .base import BaseEmb

class BGEEmbedding(BaseEmb):
    def __init__(self, model_name: str = "BAAI/bge-m3", **kwargs):
        super().__init__(model_name=model_name, **kwargs)
        self.embed_model = HuggingFaceEmbedding(
            model_name=model_name,
            trust_remote_code=True,
            cache_folder="./model_cache"
        )

    def get_emb(self, text: str) -> List[float]:
        embedding = self.embed_model.get_text_embedding(text)
        return embedding
```

Once this is in place, we can test it by calling the `get_emb` method.

```python
emb = BGEEmbedding(model_name="BAAI/bge-m3")
print(emb.get_emb("What does a structural safety inspection of a building cover?"))
```

Once the Embedding module returns the encoded vector, this module is complete.
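
To make the role of these vectors concrete, here is a minimal sketch of the cosine similarity that retrieval relies on. It uses tiny hand-written vectors in place of real `BAAI/bge-m3` embeddings, and the helper name `cosine` is an assumption introduced only for this illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


# Toy stand-ins for the lists returned by emb.get_emb(...)
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query
doc_far = [0.0, 3.0, 0.0]    # orthogonal to the query

print(cosine(query_vec, doc_close))
print(cosine(query_vec, doc_far))
```

A document pointing in the same direction as the query scores close to 1 regardless of vector length, while an unrelated document scores close to 0.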

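### 3. Implementing the Document Preprocessing Module

To process construction documents, we first need a document loading module. This system assumes that all construction codes and standards have already been converted to Markdown, which simplifies later text processing and analysis.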

```python
import os
from pathlib import Path
from typing import Dict, List

class DocumentProcessor:
    def __init__(self):
        pass
    
    def load_documents(self, directory_path: str) -> List[str]:
        documents = []
        
        for file_path in Path(directory_path).rglob('*.md'):
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                    documents.append(content)
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
                    
        return documents
```

With the document preprocessing module in place, we can load the construction standard documents as follows.

```python
processor = DocumentProcessor()
documents = processor.load_documents("./construction_standards")
print(f"Loaded {len(documents)} construction standard documents")
```

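## Core Implementation

Let us first outline the review workflow. The core idea is to identify potential compliance issues in the user-provided document through intelligent query generation and knowledge-guided retrieval. Unlike traditional RAG methods, our system is optimized for the specialized characteristics of the construction domain, so it understands code requirements more accurately and gives more reliable review suggestions.

### Dynamic Semantic Knowledge Chunking

In a traditional RAG pipeline, text is split into chunks of a fixed number of tokens. A fixed token count, however, can cut a sentence in the middle and lose information. This system instead splits text dynamically according to the semantics of the construction text, using a two-stage semantic clustering to produce knowledge chunks that preserve semantic coherence.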

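First, the document is processed into a sequence of sentences $S = \{s_0, s_1, \ldots, s_a\}$. Potential semantic boundaries are identified by computing the semantic discrepancy between adjacent sentences:

$$\gamma_i = 1 - \frac{s_{i-1} \cdot s_i}{\|s_{i-1}\| \|s_i\|}$$

A dynamic threshold is then determined automatically from the distribution $\Gamma$ of the semantic discrepancies. In the implementation that follows, the quantile level is `(len(sentences) - baseline_chunks) / len(sentences)`, which suggests that $p$ is the baseline number of chunks:

$$\psi = \text{Quantile}(\Gamma, \frac{a-p}{a})$$

The final chunks should be both semantically coherent and within the length constraints, so the expected discrepancy within a chunk is lower than the expected discrepancy across chunk boundaries:

$$\mathbb{E}[\gamma_{\text{intra}}] < \mathbb{E}[\gamma_{\text{inter}}]$$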

```python
import re
import numpy as np
from typing import List, Dict, Tuple
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class DynamicSemanticChunker:
    def __init__(self, 
                 embedding_model: str = "BAAI/bge-m3",
                 max_chunk_length: int = 512,
                 min_chunk_length: int = 50):
        self.embedding_model = SentenceTransformer(embedding_model)
        self.max_chunk_length = max_chunk_length
        self.min_chunk_length = min_chunk_length
    
    def split_text(self, text: str) -> Dict[str, str]:
        sentences = self._split_into_sentences(text)
        if len(sentences) == 0:
            return {}
        
        sentence_embeddings = self.embedding_model.encode(sentences)
        gamma_values = self._compute_semantic_discrepancy(sentence_embeddings)
        
        total_tokens = sum(len(s.split()) for s in sentences)
        baseline_chunks = max(1, total_tokens // self.max_chunk_length)
        alpha = max(0.1, (len(sentences) - baseline_chunks) / len(sentences))
        threshold = np.quantile(gamma_values, alpha) if len(gamma_values) > 0 else 0.5
        
        boundaries = self._identify_boundaries(gamma_values, threshold)
        initial_chunks = self._create_initial_chunks(sentences, boundaries)
        final_chunks = self._enforce_length_constraints(initial_chunks)
        
        chunks_dict = {}
        for i, chunk in enumerate(final_chunks):
            chunk_id = f"chunk-{i+1:03d}"
            chunks_dict[chunk_id] = chunk
        
        return chunks_dict
    
    def _split_into_sentences(self, text: str) -> List[str]:
        sentence_pattern = r'[。！？；\n]+'
        sentences = re.split(sentence_pattern, text)
        
        cleaned_sentences = []
        for sentence in sentences:
            sentence = sentence.strip()
            if len(sentence) > 5:
                cleaned_sentences.append(sentence)
        
        return cleaned_sentences
    
    def _compute_semantic_discrepancy(self, embeddings: np.ndarray) -> List[float]:
        gamma_values = []
        
        for i in range(1, len(embeddings)):
            similarity = cosine_similarity(
                embeddings[i-1].reshape(1, -1),
                embeddings[i].reshape(1, -1)
            )[0][0]
            
            gamma = 1 - similarity
            gamma_values.append(gamma)
        
        return gamma_values
    
    def _identify_boundaries(self, gamma_values: List[float], threshold: float) -> List[int]:
        boundaries = [0]
        
        for i, gamma in enumerate(gamma_values):
            if gamma > threshold:
                boundaries.append(i + 1)
        
        boundaries.append(len(gamma_values) + 1)
        return sorted(set(boundaries))
    
    def _create_initial_chunks(self, sentences: List[str], boundaries: List[int]) -> List[str]:
        chunks = []
        
        for i in range(len(boundaries) - 1):
            start = boundaries[i]
            end = boundaries[i + 1]
            
            chunk_sentences = sentences[start:end]
            chunk_text = ' '.join(chunk_sentences)
            chunks.append(chunk_text)
        
        return chunks
    
    def _enforce_length_constraints(self, chunks: List[str]) -> List[str]:
        final_chunks = []
        
        for chunk in chunks:
            chunk_length = len(chunk.split())
            
            if chunk_length <= self.max_chunk_length:
                if chunk_length >= self.min_chunk_length:
                    final_chunks.append(chunk)
            else:
                sub_chunks = self._split_long_chunk(chunk)
                final_chunks.extend(sub_chunks)
        
        return final_chunks
    
    def _split_long_chunk(self, chunk: str) -> List[str]:
        sentences = chunk.split('。')
        sub_chunks = []
        current_chunk = ""
        
        for sentence in sentences:
            if sentence.strip():
                test_chunk = current_chunk + sentence + "。"
                if len(test_chunk.split()) <= self.max_chunk_length:
                    current_chunk = test_chunk
                else:
                    if current_chunk:
                        sub_chunks.append(current_chunk.strip())
                    current_chunk = sentence + "。"
        
        if current_chunk:
            sub_chunks.append(current_chunk.strip())
        
        return sub_chunks
```
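
To make the boundary rule concrete, the following sketch applies the discrepancy formula and a quantile threshold to a handful of hand-written 2-D vectors. The vectors, the helper names `discrepancy` and `quantile`, and the quantile level of 0.8 are assumptions for illustration. A real run would use `BAAI/bge-m3` sentence embeddings and the level computed inside `split_text`.

```python
import math
from typing import List, Sequence


def discrepancy(u: Sequence[float], v: Sequence[float]) -> float:
    """gamma = 1 - cosine similarity of two adjacent sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def quantile(values: List[float], q: float) -> float:
    """Linear-interpolation quantile (the rule np.quantile uses by default)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical embeddings: sentences 0-2 share one topic, sentences 3-4 another
vectors = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]
gammas = [discrepancy(vectors[i - 1], vectors[i]) for i in range(1, len(vectors))]

threshold = quantile(gammas, 0.8)
# A gap larger than the threshold means the next sentence starts a new chunk
boundaries = [i + 1 for i, g in enumerate(gammas) if g > threshold]
print(boundaries)  # [3]
```

Only the gap between sentences 2 and 3 exceeds the threshold, so the toy document splits into the chunks `[0, 1, 2]` and `[3, 4]`.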

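### Construction Document Review System

The overall review process is shown in the figure below. After receiving the region to be reviewed, the system generates recommended review queries from a prompt. At this stage engineers can also enter their own questions or select from the recommendations to form the queries to be reviewed. The system then uses the generative knowledge-guided retrieval framework to retrieve the relevant knowledge references from the text knowledge base. Finally, based on the retrieved references and the original text, it analyzes the issues and produces corrections, completing the review.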

![picture](images/pic1.png)

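#### Review Query Generation

In the review workflow, the system uses a two-stage prompt-driven query generation mechanism. It performs forward-looking analysis and risk discovery on construction disclosure documents so that potential problems can be located efficiently and precisely.

Stage 1 deconstructs the main objectives of the document under review. The model is instructed to extract core events, key techniques, process flows, and similar elements from the text and to summarize the core content in a structured way, which defines the targets of the review and lays the groundwork for fine-grained querying. Stage 2 performs multi-dimensional risk probing and tailored query generation. Based on the core elements from Stage 1, it guides the LLM, for example through few-shot prompting, to probe the document for risks along dimensions such as compliance, safety, and operability. The prompt instructs the model to ask fine-grained, multi-angle review questions about potential constraints, operating procedures, and possible compliance risks.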

```python
import re
from typing import List

CORE_COMPONENTS_PROMPT = """
Task: Your task involves the extraction of crucial information components from a designated text segment. The purpose of this extraction is to assist in uncovering hidden descriptions indicative of regulatory non-compliance. Key information components encompass, but are not limited to, core descriptive events, essential construction techniques, technologies, and associated limitations and restrictions.

Input: {document_chunk}
Answer:
"""

REVIEW_QUERIES_PROMPT = """
Task: Your task is to generate relevant search queries based on the text under review and provided core descriptive references. These queries should target potential areas of non-compliance within the text, facilitating the subsequent retrieval of original regulatory documents for detailed examination.

Input: {document_chunk}
Core components: {core_components}
Queries:
"""

def generate_review_queries(llm, document_chunk: str) -> List[str]:
    core_prompt = CORE_COMPONENTS_PROMPT.format(document_chunk=document_chunk)
    core_response = llm.predict(core_prompt)
    
    # Generate the review queries
    queries_prompt = REVIEW_QUERIES_PROMPT.format(
        document_chunk=document_chunk,
        core_components=core_response
    )
    queries_response = llm.predict(queries_prompt)
    
    # Extract the queries, which are expected to be wrapped in single quotes
    queries = re.findall(r"'([^']*)'", queries_response)
    
    return queries[:5]
```
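
The `re.findall` call above only picks up queries wrapped in single quotes. As a hedged sketch, the hypothetical helper `parse_queries` below shows that behavior on a made-up response and falls back to one query per non-empty line when no quoted strings are found. The fallback rule is an assumption and not part of the original system.

```python
import re
from typing import List


def parse_queries(response: str, limit: int = 5) -> List[str]:
    """Extract queries from an LLM response (hypothetical helper)."""
    # Same rule as generate_review_queries: text between single quotes
    queries = re.findall(r"'([^']*)'", response)
    if not queries:
        # Assumed fallback: one query per line, dropping markers such as "1." or "-"
        stripped = (
            re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
            for line in response.splitlines()
        )
        queries = [line for line in stripped if line]
    return queries[:limit]


quoted = "['Is C25 the required minimum strength?', 'Is a 25mm cover sufficient?']"
numbered = "1. Is C25 the required minimum strength?\n2. Is a 25mm cover sufficient?"

print(parse_queries(quoted))
print(parse_queries(numbered))
```

Both response styles yield the same two queries here. Queries that themselves contain an apostrophe would still confuse the quoted-string rule.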

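#### Knowledge-Guided Generative Retrieval

The core innovation of the system is the knowledge-guided retrieval framework, which has three key steps. Step 1 is sentence-level encoding: it learns an initial representation of the input query sentence and computes sentence-level similarity scores between the query and the knowledge base chunks. Step 2 is knowledge-guided retrieval: it extracts key information from the query and uses it, together with mechanisms such as document-length adaptive weighting, to score each knowledge base chunk in more detail. Step 3 is reranking and augmentation: it uses the LLM to rerank the results of Step 2 and uses the refined knowledge to augment the original query.
![picture](images/pic2.png)

The framework first builds a deep extraction module dedicated to construction text analysis. It integrates a domain-pretrained BERT for contextual encoding with a bidirectional LSTM for modeling dependencies in construction regulations, and it defines a three-level importance hierarchy: max (highest), mid (medium), and lit (literal) priority. In this project, key information is extracted directly with the LLM. For more precise results, you can train a dedicated BERT model for key information extraction.
![picture](images/pic3.png)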

```python
import re
from typing import Dict, Tuple, List

KEY_INFO_EXTRACTION_PROMPT = """
Your task is to extract key information from the query with three different priority levels:

Maximum priority (max): The most important core concepts or entities
Medium priority (mid): Important modifiers or qualifying conditions  
Literal priority (lit): Specific values, standards or specifications

Query: {query}
max:
mid:
lit:
"""

class KeyInfoExtractor:
    def __init__(self, llm):
        self.llm = llm

    def extract_key_info(self, query: str) -> Dict[str, Tuple[str, float]]:
        prompt = KEY_INFO_EXTRACTION_PROMPT.format(query=query)
        response = self.llm.predict(prompt)
        
        lines = response.strip().split('\n')
        key_info = {}
        weights = {'max': 0.5, 'mid': 0.3, 'lit': 0.2}
        
        for line in lines:
            if line.startswith('max:'):
                key_info['max'] = (line[4:].strip(), weights['max'])
            elif line.startswith('mid:'):
                key_info['mid'] = (line[4:].strip(), weights['mid'])
            elif line.startswith('lit:'):
                key_info['lit'] = (line[4:].strip(), weights['lit'])
        
        return key_info
```
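
The parsing step can be tried in isolation on a made-up response. The sketch below mirrors the `max:`/`mid:`/`lit:` convention and the weights used by `KeyInfoExtractor`, but the function name `parse_key_info` and its tolerance for indentation and letter case are assumptions added for illustration.

```python
from typing import Dict, Tuple

WEIGHTS = {"max": 0.5, "mid": 0.3, "lit": 0.2}  # same weights as KeyInfoExtractor


def parse_key_info(response: str) -> Dict[str, Tuple[str, float]]:
    """Parse 'max:/mid:/lit:' lines, ignoring indentation and letter case."""
    key_info: Dict[str, Tuple[str, float]] = {}
    for line in response.splitlines():
        label, sep, text = line.strip().partition(":")
        label = label.lower()
        if sep and label in WEIGHTS:
            key_info[label] = (text.strip(), WEIGHTS[label])
    return key_info


# Made-up response in the format requested by KEY_INFO_EXTRACTION_PROMPT
sample = "max: concrete cover\n  Mid: reinforced columns\nlit: 25mm GB 50010"
print(parse_key_info(sample))
```

Each priority level maps to its extracted text and weight, which is the shape the knowledge-level scoring consumes.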

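#### Document-Length Adaptive Factor

During knowledge-guided retrieval, the document-length adaptive factor adjusts the weight given to documents of different lengths so that long and short documents are scored fairly. It is computed from the length of the current chunk relative to the average document length.

$$\Lambda_{\text{DL}} = \frac{\overline{|k|} + |k_j|}{2\overline{|k|}}$$

Here $|k_j|$ is the length of the current chunk and $\overline{|k|}$ is the average document length. The factor equals 1 for a chunk of average length and grows linearly with chunk length. This normalization avoids scoring bias caused by differences in document length.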

```python
def compute_document_length_factor(chunk_length: int, avg_length: int = 100) -> float:
    lambda_dl = (avg_length + chunk_length) / (2 * avg_length)
    return lambda_dl
```

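#### Term Significance

Term significance measures how prominent a term is within a document. By combining term frequency with the document-length adaptive factor, it assesses the importance of the term in the current document more accurately. The formula accounts for the nonlinear growth of term frequency.

$$\text{Sign}(t_{e_i}^\tau, k_j) = \frac{2 \cdot f(t_{e_i}^\tau, k_j) \cdot \Lambda_{\text{DL}}}{f(t_{e_i}^\tau, k_j) + 1}$$

Here $f(t_{e_i}^\tau, k_j)$ is the frequency of the term in the chunk and $\Lambda_{\text{DL}}$ is the document-length adaptive factor. Because $f / (f + 1)$ stays below 1, the score is bounded by $2\Lambda_{\text{DL}}$, which prevents high-frequency terms from dominating the score.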

```python
def compute_term_significance(term_freq: int, doc_length_factor: float) -> float:
    significance = (2 * term_freq * doc_length_factor) / (term_freq + 1)
    return significance
```
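
A few sample values show how the score saturates. The sketch below combines the two formulas above into a single hypothetical helper, `term_significance`. The frequencies and the average length of 100 are made-up inputs.

```python
def term_significance(term_freq: int, chunk_length: int, avg_length: float = 100.0) -> float:
    """Sign = 2 * f * Lambda_DL / (f + 1), with Lambda_DL = (avg + len) / (2 * avg)."""
    lambda_dl = (avg_length + chunk_length) / (2 * avg_length)
    return (2 * term_freq * lambda_dl) / (term_freq + 1)


# For a chunk of average length, Lambda_DL is 1 and the score stays below 2
for tf in (1, 2, 5, 50):
    print(tf, round(term_significance(tf, chunk_length=100), 3))

# A chunk twice the average length scales the same frequency by Lambda_DL = 1.5
print(term_significance(1, chunk_length=200))
```

Going from 1 to 5 occurrences raises the score noticeably, while going from 5 to 50 adds comparatively little.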

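#### Term Rarity

Term rarity measures how scarce a term is across the whole knowledge base. The rarer the term, the greater its weight in retrieval. The calculation uses a modified IDF formula with smoothing to avoid division by zero.

$$\text{Rarity}(t_{e_i}^\tau) = \log\left(\frac{D - \text{df}(t_{e_i}^\tau) + 0.5}{\text{df}(t_{e_i}^\tau) + 0.5} + 1\right)$$

Here $D$ is the total number of documents and $\text{df}(t_{e_i}^\tau)$ is the number of documents containing the term. Adding 1 inside the logarithm keeps the value positive.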

```python
import numpy as np

def compute_term_rarity(doc_freq: int, total_docs: int) -> float:
    rarity = np.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    return rarity
```
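
The effect of document frequency is easy to see on sample numbers. This sketch re-implements the formula with the standard library only. The name `term_rarity` and the knowledge base size of 1000 are assumptions for illustration.

```python
import math


def term_rarity(doc_freq: int, total_docs: int) -> float:
    """Smoothed IDF: log((D - df + 0.5) / (df + 0.5) + 1)."""
    return math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)


total = 1000  # hypothetical number of chunks in the knowledge base
for df in (1, 10, 500, 1000):
    print(df, round(term_rarity(df, total), 3))
```

A term found in a single chunk gets the highest rarity, and the value shrinks toward zero, while staying positive, as the term appears in more chunks.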

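#### Coherence Index

The coherence index reflects how coherently a term is distributed within a document, using a sliding window to analyze its local distribution. A term with high coherence tends to be concentrated in a particular region of the document, which indicates a strong association with the document's topic.

$$\text{CI}(t_{e_i}^\tau, k_j) = \max_{w \in W, \, t \in w} \frac{\sum I(t = t_{e_i}^\tau) \cdot |w|}{|k_j|}$$

Here $W$ is the set of sliding windows over the document and $I(t = t_{e_i}^\tau)$ is the indicator function, equal to 1 when the token $t$ in the window matches the term and 0 otherwise.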

```python
def compute_coherence_index(term: str, chunk: str, window_size: int = 50) -> float:
    chunk_tokens = chunk.lower().split()
    chunk_length = len(chunk_tokens)
    
    if chunk_length == 0:
        return 0.0
    
    max_coherence = 0.0
    
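    # Use at least one window so that chunks shorter than window_size are still scored
    for i in range(0, max(1, chunk_length - window_size + 1), 10):
        window = chunk_tokens[i:i + window_size]
        term_count = window.count(term.lower())
        
        if term_count > 0:
            # len(window) equals window_size except when the chunk is shorter than one window
            coherence = (term_count * len(window)) / chunk_length
            max_coherence = max(max_coherence, coherence)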
    
    return max_coherence
```
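
The difference between a clustered and a scattered term can be seen on toy token lists. The sketch below is a standalone variant that takes pre-split tokens, exposes the stride as a parameter, and always evaluates at least one window. The name `coherence_index`, the `stride` parameter, and the sample tokens are assumptions for illustration.

```python
from typing import List


def coherence_index(term: str, tokens: List[str], window_size: int = 50, stride: int = 10) -> float:
    """Max over sliding windows of (term count in window * window length) / chunk length."""
    n = len(tokens)
    if n == 0:
        return 0.0
    best = 0.0
    # max(1, ...) guarantees one window even when the chunk is shorter than window_size
    for start in range(0, max(1, n - window_size + 1), stride):
        window = tokens[start:start + window_size]
        best = max(best, window.count(term) * len(window) / n)
    return best


clustered = ["cover"] * 4 + ["x"] * 16    # term packed into the first window
scattered = (["cover"] + ["x"] * 4) * 4   # same term count, spread evenly

print(coherence_index("cover", clustered, window_size=5, stride=1))  # 1.0
print(coherence_index("cover", scattered, window_size=5, stride=1))  # 0.25
```

Both lists contain the term four times, but only the clustered one has a window where the occurrences pile up, so it scores four times higher.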

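#### Score Fusion and Retrieval

The sentence-level similarity score and the knowledge-level score are fused into the final document relevance score. Fusion uses a weighted average in which the balance parameter $\lambda$ controls the relative importance of the two scores.

$$\Phi = \lambda \Phi(\mathcal{K}) + (1 - \lambda) \Phi(\mathcal{S})$$

Here $\lambda$ is the balance parameter, $\Phi(\mathcal{K})$ is the knowledge-level score, and $\Phi(\mathcal{S})$ is the sentence-level score. Adjusting $\lambda$ controls whether the system leans toward semantic similarity or knowledge matching. With $\lambda = 0$ the system relies entirely on sentence-level semantic similarity, with $\lambda = 1$ it relies entirely on the knowledge matching score, and with $\lambda = 0.5$ the two are weighted equally. For construction document review, $\lambda = 0.5$ is a typical setting that balances domain knowledge matching and semantic understanding.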
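
The effect of the balance parameter can be sketched on made-up scores. The helper `fuse_scores` below is hypothetical. It scales the knowledge-level scores by their maximum, mirroring what `GKGRRetriever` does before mixing, and the score values are invented for illustration.

```python
from typing import List


def fuse_scores(sentence_scores: List[float], knowledge_scores: List[float], lam: float = 0.5) -> List[float]:
    """Phi = lam * Phi(K) + (1 - lam) * Phi(S), with Phi(K) scaled by its maximum."""
    top = max(knowledge_scores, default=0.0)
    norm_know = [k / top if top > 0 else 0.0 for k in knowledge_scores]
    return [lam * k + (1 - lam) * s for k, s in zip(norm_know, sentence_scores)]


sentence = [0.82, 0.80, 0.40]  # made-up cosine similarities for three chunks
knowledge = [0.0, 3.0, 1.5]    # made-up knowledge-level scores

for lam in (0.0, 0.5, 1.0):
    fused = fuse_scores(sentence, knowledge, lam)
    best = max(range(len(fused)), key=fused.__getitem__)
    print(lam, [round(f, 2) for f in fused], "best chunk:", best)
```

With $\lambda = 0$ the first chunk wins on embedding similarity alone. Once the knowledge-level score is mixed in, the second chunk, which actually contains the key terms, moves to the top.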

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Tuple, Dict, Any

class GKGRRetriever:
    def __init__(self, 
                 knowledge_base: List[str],
                 embedding_model,
                 key_info_extractor: KeyInfoExtractor,
                 llm,
                 config: Dict[str, Any] = None):
        self.knowledge_base = knowledge_base
        self.embedding_model = embedding_model
        self.key_info_extractor = key_info_extractor
        self.llm = llm
        
        default_config = {
            "lambda_param": 0.5,
            "top_k": 5,
            "rerank_enabled": True,
            "query_expansion": True,
            "similarity_threshold": 0.1
        }
        self.config = {**default_config, **(config or {})}
        
        self.kb_embeddings = self._precompute_embeddings()
    
    def _precompute_embeddings(self) -> np.ndarray:
        embeddings = self.embedding_model.encode(self.knowledge_base, show_progress_bar=True)
        return embeddings
    
    def retrieve_with_scores(self, query: str) -> List[Tuple[str, float, Dict[str, float]]]:
        query_embedding = self.embedding_model.encode([query])[0]
        sentence_scores = cosine_similarity(
            query_embedding.reshape(1, -1), 
            self.kb_embeddings
        )[0]
        
        key_info = self.key_info_extractor.extract_key_info(query)
        knowledge_scores = self._compute_knowledge_scores(key_info)
        
        # Normalize knowledge scores by their maximum (computed once, not once per chunk)
        max_know = max(knowledge_scores, default=0.0)
        lam = self.config["lambda_param"]
        
        final_scores = []
        for i in range(len(self.knowledge_base)):
            norm_sent = sentence_scores[i]
            norm_know = knowledge_scores[i] / max_know if max_know > 0 else 0.0
            
            final_scores.append(lam * norm_know + (1 - lam) * norm_sent)
        
        results_with_scores = []
        for i, final_score in enumerate(final_scores):
            if final_score > self.config["similarity_threshold"]:
                score_details = {
                    "sentence_score": float(sentence_scores[i]),
                    "knowledge_score": float(knowledge_scores[i]),
                    "final_score": float(final_score)
                }
                results_with_scores.append((self.knowledge_base[i], final_score, score_details))
        
        results_with_scores.sort(key=lambda x: x[1], reverse=True)
        return results_with_scores[:self.config["top_k"]]
    
    def _compute_knowledge_scores(self, key_info: Dict[str, Tuple[str, float]]) -> List[float]:
        scores = []
        avg_length = sum(len(chunk.split()) for chunk in self.knowledge_base) / len(self.knowledge_base)
        
        for chunk in self.knowledge_base:
            chunk_score = 0.0
            chunk_tokens = chunk.lower().split()
            chunk_length = len(chunk_tokens)
            
            lambda_dl = compute_document_length_factor(chunk_length, avg_length)
            
            for priority, (info_text, weight) in key_info.items():
                if not info_text.strip():
                    continue
                
                terms = info_text.lower().split()
                for term in terms:
                    if term in chunk_tokens:
                        tf = chunk_tokens.count(term)
                        
                        significance = compute_term_significance(tf, lambda_dl)
                        
                        segments_with_term = sum(1 for kb_chunk in self.knowledge_base 
                                                if term in kb_chunk.lower())
                        rarity = compute_term_rarity(segments_with_term, len(self.knowledge_base))
                        
                        coherence = compute_coherence_index(term, chunk)
                        
                        term_score = significance * rarity * (1 + coherence) * weight
                        chunk_score += term_score
            
            scores.append(chunk_score)
        
        return scores
    
    def retrieve(self, query: str) -> Tuple[List[str], str]:
        results_with_scores = self.retrieve_with_scores(query)
        
        documents = [doc for doc, _, _ in results_with_scores]
        
        if self.config["rerank_enabled"] and len(documents) > 1:
            documents = self._llm_rerank(query, documents)
        
        augmented_query = query
        if self.config["query_expansion"]:
            augmented_query = self._augment_query(query, documents[:3])
        
        return documents, augmented_query
```

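#### Reranking

The system then uses a large language model to rerank the retrieved results, relying on the LLM's semantic understanding to refine the relevance order. The prompt contains the query and the candidate documents and asks the LLM to return the document numbers ordered from most to least relevant.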

```python
def _llm_rerank(self, query: str, documents: List[str]) -> List[str]:
    if len(documents) <= 1:
        return documents
    
    rerank_prompt = f"""
Task: A list of documents is shown below. Each document has a number next to it. A question is also provided. Your task is to return the numbers of ALL documents in order of relevance from MOST to LEAST relevant. MUST include EVERY document number exactly once.

Example format:
    Document 1: <document 1>
    Document 2: <document 2>
    Document 3: <document 3>
    Question: <question>
    Answer: 3,1,2

Now here are the actual documents and question.

"""
    for i, doc in enumerate(documents):
        rerank_prompt += f"Document {i+1}: {doc[:150]}...\n"
    
    rerank_prompt += f"Question: {query}\nAnswer:"
    
    try:
        response = self.llm.predict(rerank_prompt)
    except Exception:
        # Fall back to the original order if the LLM call fails
        return documents
    
    # Parse "3,1,2"-style output: keep valid 1-based numbers and drop duplicates
    order_nums = []
    for x in response.split(','):
        x = x.strip()
        if x.isdigit() and 0 <= int(x) - 1 < len(documents) and int(x) - 1 not in order_nums:
            order_nums.append(int(x) - 1)
    
    reranked = [documents[i] for i in order_nums]
    
    # Append any documents the LLM left out, preserving their original order
    reranked.extend(doc for i, doc in enumerate(documents) if i not in order_nums)
    return reranked
```

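#### Query Augmentation

The system also uses the retrieved knowledge to augment the original query, producing a more specific and detailed query for further retrieval. Query augmentation draws on the context of the retrieved results to surface key concepts and terms that the original query may have missed.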

```python
def _augment_query(self, original_query: str, top_results: List[str]) -> str:
    if not top_results:
        return original_query
    
    document_list = ""
    for i, doc in enumerate(top_results):
        document_list += f"Document {i+1}: {doc[:100]}...\n"
    
    augment_prompt = f"""
Task: Your task is to generate a detailed answer to the question by synthesizing information from ALL provided documents. Prioritize relevance, cite document numbers, and structure your response as follows:

Question: {original_query}
{document_list}
Answer:
"""
    
    try:
        augmented = self.llm.predict(augment_prompt)
        return augmented.strip()
    except Exception:
        # Fall back to the original query if the LLM call fails
        return original_query
```

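#### Deviation Detection and Analysis

Once the knowledge-enhanced retrieval stage has gathered the relevant domain knowledge, the system moves on to the error analysis module. Using the retrieved knowledge as reference, together with the preset review query, this module checks the original text closely for deviations and assesses them.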

```python
class ErrorAnalyzer:
    def __init__(self, llm):
        self.llm = llm
    
    def analyze_errors(self, document_chunk: str, query: str, retrieved_knowledge: List[str]) -> Dict[str, Any]:
        
        analysis_prompt = f"""
Task: Your task is to conduct an error analysis on a given review document, based on a provided review query and relevant reference specifications. This analysis MUST strictly adhere to the provided reference and focus specifically on reviewing and analyzing the original descriptive sections within the review document.

Review document: {document_chunk}
Query: {query}
Reference: {chr(10).join([f"{i+1}. {ref}" for i, ref in enumerate(retrieved_knowledge)])}
Analysis:
"""
        
        analysis = self.llm.predict(analysis_prompt)
        
        return {
            "analysis": analysis,
            "reference_support": retrieved_knowledge
        }
```

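#### Revision Suggestion Generation

The error analysis module outputs the flagged deviations together with the supporting knowledge. The system then enters the revision generation module, which uses the analysis results and the knowledge references to generate targeted revision suggestions for the flagged regions, yielding a knowledge-driven correction of the original text.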

```python
class RevisionGenerator:
    def __init__(self, llm):
        self.llm = llm
    
    def generate_revisions(self, document_chunk: str, analysis: Dict[str, Any]) -> Dict[str, str]:     
        revision_prompt = f"""
Task: Your task is to review and revise the provided document based on the given analysis and corresponding reference specifications. STRICT adherence to the provided reference specifications is required. If the review document aligns with the analysis and reference specifications WITHOUT discrepancies, revision is not necessary.

Review document: {document_chunk}
Analysis: {analysis['analysis']}
Reference: {chr(10).join([f"- {ref}" for ref in analysis['reference_support']])}
Revision:
"""
        
        revision = self.llm.predict(revision_prompt)
        
        return {
            "original_text": document_chunk,
            "revision_suggestions": revision,
            "modified_regions": analysis.get("error_regions", []),
            "confidence": self._calculate_confidence(analysis)
        }
    
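    def _calculate_confidence(self, analysis: Dict[str, Any]) -> float:
        # Heuristic confidence in [0.5, 0.9]: grows with the number of supporting
        # references and flagged regions. "error_regions" is not returned by
        # ErrorAnalyzer.analyze_errors above, so error_count is 0 unless the
        # caller adds that key to the analysis dict.
        ref_count = len(analysis.get("reference_support", []))
        error_count = len(analysis.get("error_regions", []))
        
        confidence = min(0.9, 0.5 + (ref_count * 0.1) + (error_count * 0.05))
        return confidence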
```

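#### Complete Review Pipeline

Combining all of the modules above gives the complete document review pipeline: the system first generates review queries, then performs knowledge-guided retrieval, then runs error analysis, and finally generates revision suggestions.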

```python
def complete_review_process(document_chunk: str, 
                          gkgr_framework: GKGRRetriever, 
                          error_analyzer: ErrorAnalyzer,
                          revision_generator: RevisionGenerator) -> Dict[str, Any]:    
    review_queries = generate_review_queries(gkgr_framework.llm, document_chunk)
    
    results = {}
    for query in review_queries[:3]:
        retrieved_docs, augmented_query = gkgr_framework.retrieve(query)
        
        knowledge_refs = retrieved_docs
        analysis = error_analyzer.analyze_errors(document_chunk, query, knowledge_refs)
        
        revision = revision_generator.generate_revisions(document_chunk, analysis)
        
        results[query] = {
            "retrieved_knowledge": retrieved_docs,
            "augmented_query": augmented_query,
            "analysis": analysis,
            "revision": revision
        }
    
    return results
```

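This completes the core implementation of the intelligent construction document review system.

## Usage Example

The following end-to-end example shows how the system is used: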

```python
# 初始化系统组件
llm = DeepSeekLLM(
    model_name='deepseek-chat',
    api_key='your-api-key',
    base_url='https://api.deepseek.com/v1'
)

embedding = BGEEmbedding(model_name="BAAI/bge-m3")
key_extractor = KeyInfoExtractor(llm)

# 从markdown文档构建知识库
processor = DocumentProcessor()
documents = processor.load_documents("./construction_standards")

# 对文档进行动态语义分块
chunker = DynamicSemanticChunker()
knowledge_base = []
for doc in documents:
    chunks = chunker.split_text(doc)
    knowledge_base.extend(chunks.values())

# 初始化检索器
gkgr_retriever = GKGRRetriever(
    knowledge_base=knowledge_base,
    embedding_model=embedding,
    key_info_extractor=key_extractor,
    llm=llm
)

# 初始化分析器
error_analyzer = ErrorAnalyzer(llm)
revision_generator = RevisionGenerator(llm)

# 待审查的文档内容
sample_document = """
钢筋混凝土柱的施工应符合以下要求：
1. 混凝土强度等级不低于C25
2. 钢筋保护层厚度为25mm
3. 混凝土浇筑应连续进行，间歇时间不超过1小时
4. 养护期间应保持混凝土表面湿润
"""

# 执行审查
result = complete_review_process(
    sample_document, 
    gkgr_retriever, 
    error_analyzer, 
    revision_generator
)

# 查看审查结果
for query, analysis in result.items():
    print(f"审查问题: {query}")
    print(f"修订建议: {analysis['revision']['revision_suggestions']}")
    print("-" * 50)
```

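## Extensibility

The system can be adapted to other domains by swapping the knowledge base. For a specific company or project, accuracy can be improved by fine-tuning the key information extraction model. For performance:

- use dynamic semantic chunking to improve retrieval quality;
- precompute and cache the knowledge-base embeddings to speed up retrieval;
- use batch processing for large document collections;
- tune the λ parameter and the top-k value for the application at hand.

## Closing Remarks

Congratulations on reaching the end of this article. You have now seen how an intelligent construction document review system can be implemented and the reasoning behind its design. The system shows how dynamic semantic chunking, knowledge-guided retrieval, and large language models can be combined into a practical solution for document review in the construction industry.

The current system already gives good results, but there is room for improvement:

- Global association: the current chunk-based retrieval could be combined with techniques such as knowledge graphs.
- Multimodal support: future versions could handle visual information such as CAD drawings and construction drawings.
- Real-time updates: incremental updates and dynamic maintenance of the knowledge base.
- Customization: tailoring the system to the characteristics of different companies and projects.

Readers can run the example code in the project to try the complete review pipeline. We believe the system can improve review efficiency and contribute to the digital transformation of the construction industry.

## Acknowledgements

While developing this project, we studied both the domain knowledge of construction engineering and recent natural language processing techniques. Special thanks go to the construction industry experts for their valuable advice and to the open-source community for its support on the technical implementation. The code draws on the design ideas of open-source projects such as LlamaIndex and Transformers.

Note that this project is optimized specifically for document review in the construction field. If you need to process documents from other domains, adjust the system to your requirements.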

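## Source Code

The source code and sample data for this project are available in the [GitHub repository](https://github.com/Hongru0306/CDDRS).

## Citation

If you use this work in your research, please cite it as follows: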

```bibtex
@article{XIAO2025103618,
    title = {Generative knowledge-guided review system for construction disclosure documents},
    journal = {Advanced Engineering Informatics},
    volume = {68},
    pages = {103618},
    year = {2025},
    issn = {1474-0346},
    doi = {https://doi.org/10.1016/j.aei.2025.103618},
    url = {https://www.sciencedirect.com/science/article/pii/S1474034625005117},
    author = {Hongru Xiao and Jiankun Zhuang and Bin Yang and Jiale Han and Yantao Yu and Songning Lai},
    keywords = {Construction documents review, Large language model (LLM), Knowledge-guided retrieval, Natural Language Processing (NLP)}
}
```


================================================
FILE: Extra-Chapter/Readme.md
================================================
<div align="center">
  <h2>🚀 Happy-LLM 扩展内容</h2>
  <p><em>社区驱动的大语言模型学习资源</em></p>
</div>

---

## 📖 为什么会有 Extra Chapter？

&emsp;&emsp;在 Happy-LLM 主教程的基础上，我们发现社区中有许多优秀的学习者和实践者，他们在学习和使用大语言模型的过程中积累了宝贵的经验、独到的见解和实用的技巧。这些内容虽然不属于主教程的核心知识体系，但对于深入理解和应用大语言模型具有重要价值。

**Extra Chapter 的设立目的：**

- 🌟 **汇聚智慧**：收集社区成员的优秀学习笔记、实践经验和技术博客
- 🔄 **持续更新**：保持内容的时效性，跟上大语言模型领域的快速发展
- 🤝 **促进交流**：为社区成员提供分享和交流的平台
- 📚 **补充完善**：对主教程内容进行有益的补充和扩展
- 💡 **启发思考**：通过不同视角和实践案例，启发读者的深度思考

**Extra Chapter 包含的内容类型：**

- 📝 **学习笔记**：深度学习心得和知识总结
- 🛠️ **实践案例**：真实项目中的应用经验
- 🔬 **技术探索**：前沿技术的研究和实验
- 💭 **思考感悟**：对大语言模型发展的独特见解
- 🎯 **专题研究**：特定领域或问题的深入分析

---

## 📋 PR 贡献规范

&emsp;&emsp;我们热烈欢迎社区成员为 Extra Chapter 贡献优质内容！为了保证内容质量和项目的整体性，请遵循以下规范：

### 🗂️ 目录结构规范

每个贡献的内容应按照以下目录结构组织：

```
Extra-Chapter/
├── your-topic-name/                    # 你的主题文件夹
│   ├── readme.md                       # 主要内容文件（必需）
│   ├── images/                         # 图片资源文件夹（可选）
│   │   ├── figure1.png
│   │   └── figure2.jpg
│   ├── code/                           # 代码文件夹（可选）
│   │   ├── example.py
│   │   └── requirements.txt
│   ├── data/                           # 数据文件夹（可选）
│   │   └── sample_data.json
│   └── notebook.ipynb                  # Jupyter Notebook（如涉及代码必选）
└── Readme.md                           # 本文件
```

### 📝 文件命名规范

1. **主题文件夹命名**：
   - 使用小写字母和连字符
   - 名称要简洁明了，能够概括主题内容
   - 例如：`why-fine-tune-small-large-language-models`、`rag-optimization-techniques`

2. **主要内容文件**：
   - 必须命名为 `readme.md`
   - 使用 Markdown 格式编写

3. **图片文件**：
   - 统一放在 `images/` 文件夹下
   - 使用描述性的文件名
   - 支持格式：`.png`、`.jpg`、`.jpeg`、`.gif`、`.svg`

4. **代码文件**：
   - 如涉及代码，请尽量提供可直接运行的 Jupyter Notebook 文件
   - 统一放在 `code/` 文件夹下
   - 使用标准的文件扩展名
   - 如有依赖，请提供 `requirements.txt`
   - 如有 Jupyter Notebook 文件，请放在主文件夹下

### ✍️ 内容质量要求

1. **原创性**：
   - 内容必须是原创或经过授权的
   - 如引用他人内容，请注明出处

2. **技术准确性**：
   - 确保技术内容的准确性
   - 代码示例应能正常运行
   - 提供必要的环境说明

3. **结构清晰**：
   - 使用清晰的标题层次
   - 合理使用列表、表格等格式
   - 重要内容使用适当的强调

4. **语言规范**：
   - 使用规范的中文表达
   - 技术术语使用准确
   - 避免错别字和语法错误

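### PR commit message

Please include the following in your PR commit message:

- The name of the new topic folder
- A summary of the contribution
- A detailed description of the contribution
- A link to your GitHub profile and a short self-introduction
- Your title and work experience, school, or research area

For example: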

```
Extra Chapter: 你的主题名称

详细描述你的贡献内容，包括新增的主题文件夹、文件内容和目录结构。

- 新增的主题文件夹名称：your-topic-name
- 贡献内容的概述：详细介绍你的贡献内容
- 贡献内容的详细描述：详细描述你的贡献内容，包括新增的主题文件夹、文件内容和目录结构。
- 你的 Github 个人主页链接及个人介绍：[你的个人主页链接](https://example.com)，介绍你的研究方向、技术专长等。
- 个人 title 及工作经历 or 学校 or 研究方向：内容贡献者-xxxx学校，研究方向为自然语言处理。
```


================================================
FILE: Extra-Chapter/generation-method/llm_generation.py
================================================
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

def test_decoding_strategies():
    """
    测试三种解码策略：贪婪解码、随机采样、束搜索
    """
    model_id = "../model/kmno4zx/happy-llm-215M-sft/"

    print("正在加载模型和tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="cpu").eval()

    # 测试prompt
    test_prompt = "请介绍一下自己"
    messages = [
        {"role": "system", "content": "你是一个AI助手"},
        {"role": "user", "content": test_prompt}
    ]

    # 准备输入
    input_ids = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer(input_ids).data['input_ids']
    x = (torch.tensor(input_ids, dtype=torch.long)[None, ...]).to(model.device)

    print(f"测试prompt: {test_prompt}")
    print(f"输入token数量: {len(input_ids)}")
    print("=" * 60)

    # 测试1: 贪婪解码 (Greedy Search)
    print("🔍 测试1: 贪婪解码 (Greedy Search)")
    print("参数: do_sample=False, num_beams=1, temperature=0.0")
    print("特点: 每步选择概率最大的token，结果确定，速度快")

    with torch.no_grad():
        greedy_output = model.generate_super(
            x,
            stop_id=tokenizer.eos_token_id,
            max_new_tokens=50,
            temperature=0.0,
            do_sample=False,
            num_beams=1
        )
        greedy_response = tokenizer.decode(greedy_output[0].tolist(), skip_special_tokens=True)

    print(f"贪婪解码结果: {greedy_response}")
    print()

    # 测试2: 随机采样 (Random Sampling)
    print("🎲 测试2: 随机采样 (Random Sampling)")
    print("参数: do_sample=True, num_beams=1, temperature=0.8, top_k=50")
    print("特点: 基于概率分布随机采样，结果多样，创造性高")

    with torch.no_grad():
        # 运行多次以展示随机性
        for i in range(3):
            sampling_output = model.generate_super(
                x,
                stop_id=tokenizer.eos_token_id,
                max_new_tokens=50,
                temperature=0.8,
                top_k=50,
                do_sample=True,
                num_beams=1
            )
            sampling_response = tokenizer.decode(sampling_output[0].tolist(), skip_special_tokens=True)
            print(f"随机采样结果 {i+1}: {sampling_response}")

    print()

    # 测试3: 束搜索 (Beam Search)
    print("🔦 测试3: 束搜索 (Beam Search)")
    print("参数: do_sample=False, num_beams=3, temperature=1.0")
    print("特点: 维护多条候选路径，选择总概率最高的序列，质量更高")

    with torch.no_grad():
        beam_output = model.generate_super(
            x,
            stop_id=tokenizer.eos_token_id,
            max_new_tokens=50,
            temperature=1.0,
            do_sample=False,
            num_beams=3
        )
        beam_response = tokenizer.decode(beam_output[0].tolist(), skip_special_tokens=True)

    print(f"束搜索结果: {beam_response}")
    print()

    # 测试4: 不同的温度参数对随机采样的影响
    print("🌡️ 测试4: 不同温度参数对随机采样的影响")
    print("参数: do_sample=True, num_beams=1, 测试不同temperature值")

    temperatures = [0.2, 0.8, 1.5]
    for temp in temperatures:
        with torch.no_grad():
            temp_output = model.generate_super(
                x,
                stop_id=tokenizer.eos_token_id,
                max_new_tokens=30,
                temperature=temp,
                do_sample=True,
                num_beams=1
            )
            temp_response = tokenizer.decode(temp_output[0].tolist(), skip_special_tokens=True)
            print(f"温度 {temp}: {temp_response}")

    print()
    print("=" * 60)
    print("✅ 三种解码策略测试完成！")
    print()
    print("📊 总结对比:")
    print("• 贪婪解码: 速度快，结果确定，适合确定性任务")
    print("• 随机采样: 创造性强，结果多样，适合创意生成")
    print("• 束搜索: 质量较高，平衡速度和质量，适合一般对话")

def test_original_generation():
    """
    原始的生成代码作为对比
    """
    model_id = "../model/kmno4zx/happy-llm-215M-sft/"

    print("运行原始生成代码...")
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="cpu").eval()

    messages = [
        {"role": "system", "content": "你是一个AI助手"},
        {"role": "user", "content": "你好，请介绍一下自己。"}
    ]

    input_ids = tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True)
    input_ids = tokenizer(input_ids).data['input_ids']

    x = (torch.tensor(input_ids, dtype=torch.long)[None, ...]).to(model.device)

    with torch.no_grad():
        y = model.generate_super(x, stop_id=tokenizer.eos_token_id, max_new_tokens=512, temperature=0.6)
        response = tokenizer.decode(y[0].tolist(), skip_special_tokens=True)

    print(f"Assistant: {response}")

if __name__ == "__main__":
    print("开始测试三种解码策略...")
    print()

    try:
        test_decoding_strategies()
    except Exception as e:
        print(f"测试过程中出现错误: {e}")
        print("运行原始生成代码...")
        test_original_generation()


================================================
FILE: Extra-Chapter/generation-method/model_down.py
================================================
from modelscope import snapshot_download

model_dir = snapshot_download('kmno4zx/happy-llm-215M-sft', cache_dir='your/cache/dir', revision='master')

================================================
FILE: Extra-Chapter/generation-method/readme.md
================================================
# 大模型生成Token的方式

> 代码已更新到 Happy-LLM 仓库第五章的代码中。

## 贪婪解码（Greedy Decoding）

### 原理说明
贪婪解码是最简单直接的文本生成策略。在每一步生成时，它总是选择概率最大的那个token作为下一个token，然后继续生成，直到遇到停止条件或达到最大长度。

**核心思想**：局部最优选择 → 希望全局最优

**数学表达**：
```
token_t = argmax P(token_t | token_1, token_2, ..., token_{t-1})
```

### 代码实现
基于我们实现的 `_greedy_decode` 方法：

```python
def _greedy_decode(self, logits: torch.Tensor) -> torch.Tensor:
    """
    贪婪解码：选择概率最大的token

    Args:
        logits: 模型输出的logits，形状为 (batch_size, vocab_size)

    Returns:
        选择的token索引，形状为 (batch_size, 1)
    """
    _, idx_next = torch.topk(logits, k=1, dim=-1)
    return idx_next
```

**关键步骤解析**：
1. `torch.topk(logits, k=1, dim=-1)`：找到logits中最大值的位置
2. 返回最大概率token的索引
3. 该token被添加到序列中，继续下一轮生成

### 使用示例
```python
# 在 generate_super 函数中调用贪婪解码
output = model.generate_super(
    input_ids,
    do_sample=False,      # 不使用采样
    num_beams=1,          # 不使用束搜索
    temperature=0.0,      # 温度为0确保确定性
    max_new_tokens=100
)
```

### 优缺点分析

**优点**：
- ✅ **速度快**：每步只需要一次前向传播和简单的argmax操作
- ✅ **结果确定**：相同的输入总是产生相同的输出
- ✅ **内存效率高**：不需要维护多个候选序列
- ✅ **实现简单**：算法逻辑直观易懂

**缺点**：
- ❌ **容易陷入局部最优**：每步的局部最优不一定等于全局最优
- ❌ **缺乏多样性**：总是产生相同的序列，缺乏创造性
- ❌ **可能产生重复内容**：容易陷入重复循环
- ❌ **忽略长程依赖**：不考虑序列的整体连贯性

### 典型例子
假设模型生成了以下概率分布：

```
输入: "今天天气"
下一token概率:
- "很" (0.4)
- "不错" (0.3)
- "真好" (0.2)
- "不太好" (0.1)
```

贪婪解码会选择"很"，生成"今天天气很"，然后继续这个过程。
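The selection rule in this example can be written out in a few lines of plain Python. This is only a sketch of the argmax step on the illustrative distribution above; `greedy_pick` is a hypothetical helper, not part of the repository code.

```python
# Next-token distribution from the example above (illustrative numbers).
next_token_probs = {"很": 0.4, "不错": 0.3, "真好": 0.2, "不太好": 0.1}


def greedy_pick(probs: dict) -> str:
    """Return the token with the highest probability (argmax over the distribution)."""
    return max(probs, key=probs.get)


chosen = greedy_pick(next_token_probs)
print("今天天气" + chosen)
```

Because the rule is a pure argmax, running it repeatedly on the same distribution always returns the same token, which is the determinism discussed in the pros and cons above.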

### 使用场景
- **确定性任务**：如数学计算、代码生成
- **需要一致性的应用**：如API服务、自动化脚本
- **计算资源受限的环境**：需要快速生成结果
- **基准测试**：作为其他算法的对比基准

## 采样解码（Sampling Decoding）

### 原理说明
采样解码不是选择概率最大的token，而是基于模型的概率分布进行随机采样。这样可以在每次生成时产生不同的结果，增加文本的多样性和创造性。

**核心思想**：基于概率分布随机选择 → 增加多样性

**数学表达**：

```
token_t ~ P(token_t | token_1, token_2, ..., token_{t-1})
```

### 关键参数

#### 1. Temperature（温度）
- **作用**：控制概率分布的平滑程度
- **原理**：将logits除以temperature，然后进行softmax
- **效果**：
  - `temperature > 1`：分布更平滑，增加随机性
  - `temperature < 1`：分布更尖锐，更接近贪婪解码
  - `temperature → 0`：等价于贪婪解码

#### 2. Top-k Sampling
- **作用**：限制候选token的范围
- **原理**：只考虑概率最高的k个token，其他token概率设为0
- **效果**：避免选择概率很低的"奇怪"token，提高质量

### 代码实现
基于我们实现的 `_random_sample` 方法：

```python
import torch
import torch.nn.functional as F

def _random_sample(self, logits: torch.Tensor, temperature: float = 1.0, top_k: int = None) -> torch.Tensor:
    """
    随机采样：基于概率分布随机选择token

    Args:
        logits: 模型输出的logits，形状为 (batch_size, vocab_size)
        temperature: 温度参数，控制随机性
        top_k: 只考虑概率最高的k个token

    Returns:
        选择的token索引，形状为 (batch_size, 1)
    """
    # 1. 温度缩放
    logits = logits / temperature

    # 2. Top-k过滤
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = -float('Inf')

    # 3. 计算概率并采样
    probs = F.softmax(logits, dim=-1)
    idx_next = torch.multinomial(probs, num_samples=1)
    return idx_next
```

**关键步骤解析**：
1. **温度缩放**：调整概率分布的平滑程度
2. **Top-k过滤**：移除低概率候选，提高质量
3. **概率归一化**：使用softmax得到概率分布
4. **随机采样**：根据概率分布随机选择token

### 使用示例
```python
# 基本采样
output = model.generate_super(
    input_ids,
    do_sample=True,         # 启用采样
    num_beams=1,           # 不使用束搜索
    temperature=0.8,       # 中等温度
    max_new_tokens=100
)

# 带top-k的采样
output = model.generate_super(
    input_ids,
    do_sample=True,
    num_beams=1,
    temperature=1.0,       # 较高温度增加随机性
    top_k=50,             # 只考虑前50个候选
    max_new_tokens=100
)
```

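### Temperature in Detail

**Effect of different temperatures** (values obtained by dividing the log-probabilities by the temperature and applying softmax, rounded):

```python
# Example probability distribution
original_probs = [0.6, 0.2, 0.1, 0.05, 0.05]

# Temperature = 0.1 (low temperature, close to greedy)
scaled_probs = [0.99998, 0.00002, 0.0, 0.0, 0.0]
# Result: the first token is selected almost every time

# Temperature = 1.0 (standard temperature)
scaled_probs = [0.6, 0.2, 0.1, 0.05, 0.05]
# Result: sampling follows the original probabilities

# Temperature = 2.0 (high temperature, more randomness)
scaled_probs = [0.39, 0.23, 0.16, 0.11, 0.11]
# Result: every token has a reasonable chance of being selected
```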
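To see how temperature reshapes a distribution, the rescaling can be computed directly. The sketch below uses only the standard library and assumes the input is a list of strictly positive probabilities; `apply_temperature` is a hypothetical helper written for illustration, equivalent to dividing logits by the temperature before softmax.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled_logits = [math.log(p) / temperature for p in probs]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled_logits)
    exps = [math.exp(z - m) for z in scaled_logits]
    total = sum(exps)
    return [e / total for e in exps]


original_probs = [0.6, 0.2, 0.1, 0.05, 0.05]
for t in (0.1, 1.0, 2.0):
    print(t, [round(p, 5) for p in apply_temperature(original_probs, t)])
```

The guard on `temperature <= 0` reflects that the limit of zero temperature is greedy decoding, which has to be handled as a separate argmax branch rather than by dividing by zero.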

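### Top-k in Detail

**Top-k filtering process**:

```python
import torch

# Assume a vocabulary of 1000 tokens and top_k=50
logits = torch.randn(1000)  # stand-in for real model logits, shape (vocab_size,)

# Step 1: find the 50 largest values
v, _ = torch.topk(logits, 50)
threshold = v[-1]  # the 50th-largest value

# Step 2: filter
logits[logits < threshold] = -float('Inf')
# Result: after softmax, only 50 tokens have non-zero probability; the other 950 are 0
```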

### 优缺点分析

**优点**：
- ✅ **多样性好**：每次生成可能产生不同的结果
- ✅ **创造性高**：能产生意想不到的内容
- ✅ **避免重复**：不容易陷入重复循环
- ✅ **可调性强**：通过参数控制随机程度

**缺点**：
- ❌ **结果不确定**：相同输入可能产生不同输出
- ❌ **质量不稳定**：可能产生低质量或不连贯的内容
- ❌ **需要调参**：temperature和top_k需要仔细调节
- ❌ **计算开销**：需要计算完整的概率分布

### 使用场景
- **创意写作**：故事生成、诗歌创作
- **对话系统**：让对话更加自然和有趣
- **数据增强**：生成多样化的训练数据
- **探索性任务**：需要探索多种可能性的场景

## 束搜索（Beam Search）

### 原理说明
束搜索是一种启发式搜索算法，它在每一步生成时保留多个候选序列（束），而不是只选择一个最佳序列。通过维护多条路径，它能够在计算效率和生成质量之间取得平衡。

**核心思想**：维护多条候选路径 → 选择累积概率最高的序列

**算法流程**：
1. **初始化**：从输入序列开始
2. **扩展**：为每个候选序列生成多个扩展
3. **评分**：计算每个新序列的累积概率
4. **筛选**：保留分数最高的N个候选
5. **重复**：继续扩展直到结束条件

### 关键概念

#### 束宽度（Beam Width）
- **定义**：每步保留的候选序列数量
- **权衡**：
  - 宽度=1：等价于贪婪解码
  - 宽度越大：搜索空间越大，质量越高，但计算成本也越大

#### 累积概率
- **计算方式**：序列概率 = 各个token概率的乘积
- **数值稳定性**：通常使用对数概率求和
- **公式**：`log P(sequence) = Σ log P(token_i | context)`

### 代码实现
基于我们实现的 `_beam_search` 方法：

```python
def _beam_search(self, idx: torch.Tensor, max_new_tokens: int, num_beams: int,
                 temperature: float = 1.0, top_k: int = None, stop_id: int = None) -> torch.Tensor:
    """
    束搜索：维护多个候选序列，选择最优路径

    Args:
        idx: 输入序列，形状为 (batch_size, seq_len)
        max_new_tokens: 最大生成token数量
        num_beams: 束宽度，表示保留的候选路径数量
        temperature: 温度参数，控制分布的平滑程度
        top_k: top-k过滤参数，限制候选token范围
        stop_id: 停止生成的token ID，遇到则停止

    Returns:
        生成的token序列，形状为 (batch_size, generated_length)
    """
    # 1. 初始化束
    beams = [idx.clone() for _ in range(num_beams)]
    beam_scores = torch.zeros(num_beams, device=idx.device)
    beam_scores[0] = 0.0  # 第一个候选是原始序列
    beam_scores[1:] = float('-inf')  # 其他候选初始分数为负无穷

    # 2. 主循环：逐步生成token
    for step in range(max_new_tokens):
        new_beams = []
        new_scores = []

        # 3. 扩展每个候选序列
        for beam_idx, beam in enumerate(beams):
            if beam_scores[beam_idx] == float('-inf'):
                continue  # 跳过无效候选

            # 前向传播获取logits
            output = self(beam)
            logits = output.logits[:, -1, :]

            # 应用温度和top-k
            if temperature != 1.0:
                logits = logits / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')

            # 计算对数概率
            log_probs = F.log_softmax(logits, dim=-1)

            # 获取前num_beams个候选token
            top_log_probs, top_indices = torch.topk(log_probs, k=num_beams, dim=-1)

            # 4. 为当前候选生成多个扩展
            for k in range(num_beams):
                token = top_indices[:, k:k+1]
                log_prob = top_log_probs[:, k]

                new_beam = torch.cat([beam, token], dim=1)
                new_score = beam_scores[beam_idx] + log_prob.item()

                new_beams.append(new_beam)
                new_scores.append(new_score)

        # 5. 筛选最佳候选
        if not new_beams:
            break

        # 按分数排序，选择前num_beams个
        sorted_indices = sorted(range(len(new_scores)), key=lambda i: new_scores[i], reverse=True)
        beams = [new_beams[i] for i in sorted_indices[:num_beams]]
        beam_scores = [new_scores[i] for i in sorted_indices[:num_beams]]

        # 检查停止条件
        if stop_id is not None and beams[0][0, -1] == stop_id:
            break

    # 6. 返回最佳序列
    return beams[0][:, idx.shape[1]:]  # 只返回生成部分
```

### 束搜索过程示例

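Assume beam width = 3 and the input "今天天气". The score of a candidate is its cumulative probability, i.e. the product of the conditional probabilities of its tokens (the implementation sums log-probabilities, which yields the same ranking).

**Step 1 expansion**:
```
Candidate 1: "今天天气很好" (score: 0.4)
Candidate 2: "今天天气不错" (score: 0.3)
Candidate 3: "今天天气真好" (score: 0.2)
```

**Step 2 expansion** (each candidate is expanded 3 ways):
```
Candidate 1.1: "今天天气很好啊" (score: 0.4 × 0.1 = 0.04)
Candidate 1.2: "今天天气很好。" (score: 0.4 × 0.2 = 0.08) ← kept
Candidate 1.3: "今天天气很好，" (score: 0.4 × 0.05 = 0.02)

Candidate 2.1: "今天天气不错啊" (score: 0.3 × 0.15 = 0.045) ← kept
Candidate 2.2: "今天天气不错。" (score: 0.3 × 0.1 = 0.03)
Candidate 2.3: "今天天气不错，" (score: 0.3 × 0.08 = 0.024)

Candidate 3.1: "今天天气真好啊" (score: 0.2 × 0.12 = 0.024)
Candidate 3.2: "今天天气真好。" (score: 0.2 × 0.25 = 0.05) ← kept
Candidate 3.3: "今天天气真好，" (score: 0.2 × 0.1 = 0.02)
```

**Pruning result** (keep the 3 highest scores across all 9 candidates):
```
Best candidate:   "今天天气很好。" (score: 0.08)
Second candidate: "今天天气真好。" (score: 0.05)
Third candidate:  "今天天气不错啊" (score: 0.045)
```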
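The expand, score, and prune cycle can be reproduced on the numbers from this example, read as conditional probabilities and combined by summing log-probabilities (equivalent to multiplying the probabilities). This is a toy sketch over a hand-written table, not the repository's `_beam_search`; `beam_step` and `expansions` are hypothetical names.

```python
import math


def beam_step(beams, expansions, num_beams):
    """One expand-score-prune step.

    beams: list of (text, cumulative_log_prob)
    expansions: dict mapping text -> list of (next_token, conditional_prob)
    """
    candidates = []
    for text, score in beams:
        for token, prob in expansions[text]:
            # Cumulative score = previous log-prob + log of the conditional probability.
            candidates.append((text + token, score + math.log(prob)))
    # Keep the num_beams candidates with the highest cumulative score.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:num_beams]


beams = [
    ("今天天气很好", math.log(0.4)),
    ("今天天气不错", math.log(0.3)),
    ("今天天气真好", math.log(0.2)),
]
expansions = {
    "今天天气很好": [("啊", 0.1), ("。", 0.2), ("，", 0.05)],
    "今天天气不错": [("啊", 0.15), ("。", 0.1), ("，", 0.08)],
    "今天天气真好": [("啊", 0.12), ("。", 0.25), ("，", 0.1)],
}

for text, score in beam_step(beams, expansions, num_beams=3):
    print(text, round(math.exp(score), 3))
```

Note that pruning ranks all nine candidates together, so a single parent can contribute several survivors or none; the step does not keep one child per parent.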

### 使用示例
```python
# 基本束搜索
output = model.generate_super(
    input_ids,
    do_sample=False,        # 不使用采样
    num_beams=3,           # 束宽度为3
    temperature=1.0,       # 标准温度
    max_new_tokens=100
)

# 带top-k的束搜索
output = model.generate_super(
    input_ids,
    do_sample=False,
    num_beams=5,           # 更大的束宽度
    temperature=0.8,       # 稍微降低温度
    top_k=50,             # 限制候选范围
    max_new_tokens=100
)
```

### 优缺点分析

**优点**：
- ✅ **质量较高**：比贪婪解码质量更好
- ✅ **确定性**：结果相对稳定（相同输入产生相同输出）
- ✅ **平衡性好**：在质量和效率之间取得平衡
- ✅ **避免明显错误**：不容易选择明显不合适的token

**缺点**：
- ❌ **计算开销大**：需要维护多个候选序列
- ❌ **内存占用高**：存储多个候选序列和分数
- ❌ **仍可能局部最优**：虽然比贪婪好，但仍可能错过全局最优
- ❌ **多样性有限**：仍然偏向高概率路径，创造性不如采样

### 束宽度选择建议

| 束宽度 | 适用场景 | 优点 | 缺点 |
|--------|----------|------|------|
| 1-2 | 实时应用、计算资源有限 | 速度快、资源占用少 | 质量相对较低 |
| 3-5 | 一般对话、文本生成 | 质量较好、速度适中 | 资源占用中等 |
| 6-10 | 高质量生成、翻译 | 质量很高 | 计算开销大 |
| 10+ | 专业应用、研究 | 最高质量 | 开销很大 |

### 使用场景
- **机器翻译**：需要准确性和流畅性的平衡
- **文本摘要**：生成连贯的摘要内容
- **对话系统**：生成有逻辑的回复
- **代码生成**：需要语法正确和逻辑合理
- **长文本生成**：如文章写作、报告生成

## 辅助模型投机解码（Assisted Decoding）

### 原理说明
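Speculative decoding is a technique that **uses a small model to speed up inference of a large model**. It follows a "draft, then verify" scheme: the small model first generates candidate tokens, and the large model then verifies them in one pass, which reduces the number of forward passes the large model has to run.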

**核心思想**：小模型投机生成 → 大模型批量验证 → 减少大模型计算负担

### 工作流程

#### 1. 草稿生成阶段
```
输入: "今天天气"
小模型快速生成草稿: "今天天气很好，适合出门散步"
```

#### 2. 验证阶段
大模型一次性验证整个草稿序列：
- ✅ 接受的token："今天天气很好，"
- ❌ 拒绝的token：从"适合"开始拒绝
- 🔧 大模型重新生成："适合在家休息"

#### 3. 最终结果
```
输出: "今天天气很好，适合在家休息"
```

### 关键优势

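**Speed-up**:
- The small model is fast at inference → it drafts several candidate tokens
- The large model verifies in batch → several tokens are processed in one pass
- Fewer large-model forward passes; the actual saving depends on how many draft tokens are accepted in each round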

**质量保证**：
- 大模型有最终否决权
- 只有大模型认可的token才会被保留
- 不会降低生成质量

### 具体例子对比

**传统方式**（大模型逐个生成）：
```
第1步: 大模型 → "今天"
第2步: 大模型 → "今天天气"
第3步: 大模型 → "今天天气很"
第4步: 大模型 → "今天天气很好"
第5步: 大模型 → "今天天气很好，"
第6步: 大模型 → "今天天气很好，适合"
... (每步都需要大模型前向传播)
```

**投机解码**：
```
第1步: 小模型快速草稿 → "今天天气很好，适合出门散步"
第2步: 大模型批量验证 → 接受"今天天气很好，"，拒绝"适合出门散步"
第3步: 大模型重新生成 → "适合在家休息"
```

这样原本需要6次大模型推理的过程，现在只需要2次！

### 技术实现要点

#### 1. 草稿长度控制
- **草稿不宜过长**：通常2-10个token
- **接受率平衡**：太长接受率低，太短加速效果不明显
- **动态调整**：根据接受率调整草稿长度

#### 2. 验证机制
```python
# 伪代码
def assisted_decoding(input_ids, assistant_model, main_model):
    # 小模型生成草稿
    draft_tokens = assistant_model.generate_draft(input_ids, max_draft_len=5)

    # 大模型验证
    accepted_count = main_model.verify_draft(input_ids, draft_tokens)

    # 构建最终结果
    if accepted_count == len(draft_tokens):
        return draft_tokens  # 全部接受
    else:
        # 部分接受，大模型重新生成剩余部分
        accepted_part = draft_tokens[:accepted_count]
        remaining_part = main_model.generate_remaining(input_ids + accepted_part)
        return accepted_part + remaining_part
```
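The pseudocode above can be turned into a small runnable toy. The sketch below makes several simplifying assumptions: both "models" are hypothetical stand-ins that replay a fixed script, acceptance is greedy (a draft token is kept only if it equals the target model's own choice, whereas sampling-based variants use a probabilistic accept/reject rule), and verification is written as a loop for readability even though a real implementation scores all draft positions in one batched forward pass. All names (`speculative_step`, `make_model`, `TARGET`, `DRAFT`) are illustrative.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def speculative_step(context: List[str], draft_next: NextToken,
                     target_next: NextToken, max_draft_len: int = 5) -> List[str]:
    """One draft-then-verify round with greedy acceptance; returns tokens to append."""
    # 1. Draft: the small model proposes max_draft_len tokens.
    draft, ctx = [], list(context)
    for _ in range(max_draft_len):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Verify: accept draft tokens while they match the target's own choice.
    accepted, ctx = [], list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch, then stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all accepted: the target adds one more token
    return accepted


def make_model(script: List[str], prompt_len: int) -> NextToken:
    # Hypothetical stand-in for a language model: replays a fixed script by position.
    def next_token(ctx: List[str]) -> str:
        i = len(ctx) - prompt_len
        return script[i] if i < len(script) else "<eos>"
    return next_token


TARGET = ["很", "好", "，", "适合", "在家", "休息", "<eos>"]
DRAFT = ["很", "好", "，", "适合", "出门", "散步", "<eos>"]

prompt = ["今天", "天气"]
target = make_model(TARGET, len(prompt))
draft = make_model(DRAFT, len(prompt))

out, rounds = list(prompt), 0
while "<eos>" not in out:
    out.extend(speculative_step(out, draft, target, max_draft_len=4))
    rounds += 1
out = out[:out.index("<eos>")]  # cut at the first end-of-sequence token

print("".join(out), "| verification rounds:", rounds)
```

With greedy acceptance the final text is identical to what the target model would produce on its own; the draft model only changes how many verification rounds are needed to get there.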

### 总结
投机解码本质上是用**计算资源换时间**，通过小模型的"投机"来减少大模型的计算负担。它是一种聪明的工程优化，在不牺牲质量的前提下显著提升推理速度。

================================================
FILE: Extra-Chapter/s1-vllm-thinking-budget/output/output_1754208752.txt
================================================
<|im_start|>system
Please reason step by step, and put your final answer within \boxed{{}}.<|im_end|>
<|im_start|>user
There are exactly three positive real numbers $ k $ such that the function
$ f(x) = \frac{(x - 18)(x - 72)(x - 98)(x - k)}{x} $
defined over the positive real numbers achieves its minimum value at exactly two positive real numbers $ x $. Find the sum of these three values of $ k $.<|im_end|>
<|im_start|>assistant
<think>
Okay, so I need to find three positive real numbers k such that the function f(x) = [(x - 18)(x - 72)(x - 98)(x - k)] / x has its minimum value at exactly two positive real numbers x. Then, I need to find the sum of these three k's. 

First, let me understand the problem. The function f(x) is a rational function, which is a polynomial divided by x. The numerator is a quartic polynomial (degree 4) because it's the product of four linear terms (x - 18), (x - 72), (x - 98), and (x - k). The denominator is x, which is a linear term, so f(x) is a rational function with degree 4 in the numerator and degree 1 in the denominator, making the overall degree 3. However, since we're dealing with minima, we need to analyze its critical points.

The problem states that there are exactly three such k's where the function achieves its minimum value at exactly two positive real numbers x. So, my goal is to find those k's and sum them up.

First, let me think about how to approach this. Since we need to find minima, I need to find the critical points of f(x), which are the solutions to f'(x) = 0. The function f(x) has a minimum at a point x if the derivative changes from negative to positive there. However, the problem states that the minimum value is achieved at exactly two positive real numbers x. So, there must be exactly two critical points where the function attains its minimum. Wait, but minima can be local or global? The problem says "achieves its minimum value at exactly two positive real numbers x". So, maybe the global minimum is achieved at two different points? Or perhaps there are two local minima? Wait, but the wording is "achieves its minimum value", which is a bit ambiguous. Maybe they mean that there are two distinct points where the function attains its minimum value. But if it's a global minimum, then it would be achieved at only one point unless there's a plateau or something. But since the function is a rational function, it might not be constant over an interval, so the minimum would be achieved at a single point. But the problem says "exactly two positive real numbers x". Therefore, perhaps there are two points where the function has local minima, and those minima have the same value? Or maybe the function has two critical points where the function is minimized. Hmm, this is a bit confusing. Let me think again.

Alternatively, maybe the function has two critical points which are minima, and the rest are maxima. But the problem says "achieves its minimum value at exactly two positive real numbers x". So, perhaps the function has two points where it reaches its minimum value. But if there are two minima, then the function has two distinct minima. However, depending on the shape of the graph, maybe the function can have two minima with the same value. But how would that happen? Let me think of an example. Suppose the function is symmetric in some way, but since it's a rational function with varying roots, maybe not. Alternatively, maybe the function's derivative has two roots where the function reaches minima. But how do I ensure that?

Alternatively, maybe the function has exactly two critical points which are minima, but that seems unlikely because for higher degree polynomials, the number of critical points is related to the degree. Let me recall that for a function f(x), the derivative f'(x) would be a polynomial of degree one less than the original function. Since f(x) is a rational function, but when we take the derivative, we need to compute f'(x) properly.

Wait, let me first compute f'(x). Let me write f(x) as [(x - 18)(x - 72)(x - 98)(x - k)] / x. Let me denote the numerator as N(x) = (x - 18)(x - 72)(x - 98)(x - k). Then f(x) = N(x)/x. To find the critical points, we need to compute f'(x) and set it equal to zero.

Using the quotient rule: f'(x) = [N'(x) * x - N(x) * 1] / x². Therefore, f'(x) = 0 when N'(x) * x - N(x) = 0, i.e., N'(x) * x = N(x). So, N'(x)/N(x) = 1/x.

Alternatively, taking logarithmic derivatives? Maybe not necessary. Let me proceed step by step.

First, compute N(x) = (x - 18)(x - 72)(x - 98)(x - k). Let me denote the roots as 18, 72, 98, and k. Let me think about expanding N(x), but that might be complicated. Alternatively, take the derivative N'(x) using the product rule. But since N(x) is a product of four terms, the derivative would be the sum of the derivatives of each term times the rest. That is:

N'(x) = (x - 72)(x - 98)(x - k) + (x - 18)(x - 98)(x - k) + (x - 18)(x - 72)(x - k) + (x - 18)(x - 72)(x - 98).

Therefore, f'(x) = [N'(x) * x - N(x)] / x² = 0 when N'(x) * x - N(x) = 0. So, N'(x) * x = N(x). Therefore, we can write:

N'(x)/N(x) = 1/x.

But N'(x)/N(x) is the derivative of ln|N(x)|, so integrating that would give ln|N(x)|, but not sure if that helps here. Alternatively, maybe we can write:

Let me denote that equation N'(x)/N(x) = 1/x. Let me think about this. Let me consider that N'(x)/N(x) is the sum of the reciprocals of each factor. Wait, if N(x) = product_{i=1}^4 (x - a_i), then N'(x)/N(x) = sum_{i=1}^4 1/(x - a_i). Therefore, that's a standard result. Therefore, N'(x)/N(x) = 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k).

Therefore, according to our previous equation, we have:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Therefore, the critical points of f(x) are the solutions to this equation. Therefore, the equation:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Therefore, solving this equation for x will give the critical points.

Therefore, the problem reduces to finding the values of k such that this equation has exactly two positive real solutions x. Wait, but the problem says that the function achieves its minimum at exactly two positive real numbers x. So, perhaps the equation f'(x) = 0 has exactly two solutions where the function has minima? Wait, but the number of critical points depends on the number of solutions to that equation. However, the critical points can be maxima or minima. Therefore, maybe we need that the equation N'(x) * x = N(x) has exactly two positive real solutions, but with the additional condition that those solutions correspond to minima. However, the problem states that the minimum value is achieved at exactly two positive real numbers. So, perhaps there are two critical points where the function attains its minimum value, which could be two local minima with the same value? Or maybe the function has a single minimum which is achieved at two different points? But for smooth functions, if two different points have the same minimum value, they might be symmetric or have some relation.

Alternatively, maybe the equation f'(x) = 0 has exactly two real solutions (critical points), and both of them are minima. However, since f(x) is a rational function with leading term x^3 (since numerator is degree 4, denominator is degree 1), as x approaches infinity, f(x) behaves like x^3, which tends to infinity. Similarly, as x approaches zero from the right, the denominator approaches zero, but the numerator approaches (-18)(-72)(-98)(-k) which is positive (since four negative terms multiplied together). Therefore, near x = 0, f(x) approaches positive infinity. Therefore, the function tends to infinity as x approaches 0+ and as x approaches infinity. Therefore, the function must have at least one minimum somewhere in between. However, the problem says that there are exactly two points where the minimum is achieved. Therefore, perhaps there are two local minima, and the function has only those two points where it attains its minimum value? Wait, but if there are two local minima, then the global minimum would be the lower of the two. However, the problem says "achieves its minimum value at exactly two positive real numbers x". So, maybe the function has two local minima with the same value, which is the global minimum. Therefore, the function touches the same minimum value at two different points. That would require that the function is symmetric in some way around those points?

Alternatively, maybe the function has two critical points which are both minima, and the rest are maxima. However, for a quartic divided by x, the derivative would be a cubic? Wait, let me check the degree of N'(x) * x - N(x). Wait, N(x) is a quartic, so N'(x) is cubic. Then N'(x) * x is quartic, and N(x) is quartic. Therefore, N'(x) * x - N(x) is a quartic polynomial. Therefore, the equation N'(x) * x - N(x) = 0 is a quartic equation, which can have up to four real roots. Therefore, the equation f'(x) = 0 can have up to four real solutions. However, since we are considering positive real numbers x, we need to consider how many positive real solutions there are.

Therefore, the problem states that there are exactly three values of k such that the function f(x) has exactly two positive real numbers where it achieves its minimum. Therefore, for these k's, the equation f'(x) = 0 has exactly two positive real roots, and those roots correspond to minima. However, how do we ensure that those roots are minima?

Alternatively, maybe the equation f'(x) = 0 has exactly two positive real roots, and the function has a minimum at both of them. However, if there are two critical points, they could be a minimum and a maximum, or two minima or two maxima. But since the function tends to infinity at both ends, there must be at least one minimum. Therefore, if there are two critical points, they could be one minimum and one maximum. But the problem states that the minimum value is achieved at exactly two points. Therefore, maybe there are two minima. Therefore, perhaps the equation f'(x) = 0 has four real roots, but two of them are minima and two are maxima, but the problem wants only those k where exactly two of them are minima? Wait, no, the problem states that the function achieves its minimum value at exactly two positive real numbers x. So, regardless of how many critical points there are, the minimum value is achieved at exactly two points. So, maybe there are two points where the function has local minima with the same value, or one point where it has a minimum with multiplicity two? But multiplicity two would mean that the function just touches the minimum there.

Alternatively, perhaps the function has exactly two critical points, both of which are minima. However, given that the derivative equation is quartic, which can have up to four roots, but depending on k, maybe the number of positive real roots can be two. Therefore, the problem wants the k's for which the equation f'(x) = 0 has exactly two positive real roots, and both are minima. However, how do we ensure that those roots are minima? Well, perhaps by analyzing the second derivative or using test points. But since this is getting complicated, maybe I need another approach.

Alternatively, perhaps the equation f'(x) = 0 can be transformed into some equation that can be analyzed for the number of solutions. Let me recall that earlier, we had:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Let me denote this as:

Sum_{i=1}^4 1/(x - a_i) = 1/x,

where a_1=18, a_2=72, a_3=98, and a_4=k.

Let me rearrange this equation:

Sum_{i=1}^4 1/(x - a_i) - 1/x = 0.

Let me combine terms:

Sum_{i=1}^4 [1/(x - a_i) - 1/x] = 0.

Wait, no, that would be Sum_{i=1}^4 1/(x - a_i) - 1/x = 0, which is equivalent to Sum_{i=1}^4 [1/(x - a_i)] - 1/x = 0. Alternatively, maybe compute each term:

Each term 1/(x - a_i) - 1/x = [x - (x - a_i)] / [x(x - a_i)] = a_i / [x(x - a_i)]. Therefore:

Sum_{i=1}^4 [a_i / (x(x - a_i))] = 0.

But that would be:

Sum_{i=1}^4 [a_i / (x(x - a_i))] = 0.

Multiplying both sides by x(x - a_1)(x - a_2)(x - a_3)(x - a_4) would eliminate denominators, but that seems too complicated. Alternatively, let me note that:

Sum_{i=1}^4 1/(x - a_i) = 1/x.

Let me denote that equation again:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Let me combine the left-hand side:

Let me compute the left-hand side:

Let me denote S = 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k).

Then, S = 1/x.

Let me rearrange the equation:

S - 1/x = 0.

Which is:

[1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k)] - 1/x = 0.

Let me combine the terms:

Let me consider combining 1/(x - a) - 1/x = [x - (x - a)] / [x(x - a)] = a / [x(x - a)]. Therefore, if I do this for each term except maybe one? Wait, but here we have four terms on the left and one term subtracted. Alternatively, maybe group terms differently. Let me think.

Alternatively, let me bring 1/x to the left:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) - 1/x = 0.

Let me combine all terms over a common denominator. However, that would be very tedious. Alternatively, maybe consider that this equation can be rewritten as:

Sum_{i=1}^4 1/(x - a_i) = 1/x.

Let me multiply both sides by x(x - 18)(x - 72)(x - 98)(x - k) to eliminate denominators. However, this would result in a complicated equation, but maybe manageable.

Let me denote the equation as:

[1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k)] = 1/x.

Multiply both sides by x(x - 18)(x - 72)(x - 98)(x - k):

Left side:

x(x - 72)(x - 98)(x - k) + x(x - 18)(x - 98)(x - k) + x(x - 18)(x - 72)(x - k) + x(x - 18)(x - 72)(x - 98).

Right side:

(x - 18)(x - 72)(x - 98)(x - k).

Therefore, bringing all terms to left side:

x(x - 72)(x - 98)(x - k) + x(x - 18)(x - 98)(x - k) + x(x - 18)(x - 72)(x - k) + x(x - 18)(x - 72)(x - 98) - (x - 18)(x - 72)(x - 98)(x - k) = 0.

This seems very complicated, but maybe we can factor some terms. Let me note that each term on the left has a factor of x except the last term, which has a factor of (x - k). Wait, no. Let me check:

First term: x(...) with (x - k) in it.

Second term: x(...) with (x - k) in it.

Third term: x(...) with (x - k) in it.

Fourth term: x(...) with (x - k) not present? Wait, fourth term is x(x - 18)(x - 72)(x - 98). So, no (x - k) in it. Then the last term is -(x - 18)...(x - k).

Therefore, maybe we can factor out (x - k) from some terms?

Alternatively, perhaps factor out (x - 18)(x - 72)(x - 98) from some terms?

Alternatively, let me consider that this equation is complicated, but maybe there's a smarter approach.

Alternatively, note that the original function f(x) can be rewritten as:

f(x) = [(x - 18)(x - 72)(x - 98)(x - k)] / x.

Let me expand the numerator. However, expanding a quartic might be tedious, but maybe I can write it as:

Let me denote the numerator as (x - 18)(x - 72)(x - 98)(x - k). Let me first compute (x - 18)(x - 72) and (x - 98)(x - k) separately, then multiply them.

First, (x - 18)(x - 72) = x² - (18 + 72)x + 18*72 = x² - 90x + 1296.

Similarly, (x - 98)(x - k) = x² - (98 + k)x + 98k.

Then, multiplying these two quadratics:

(x² - 90x + 1296)(x² - (98 + k)x + 98k).

This will result in a quartic. Let me denote this as:

Let me compute term by term:

First term: x² * x² = x⁴.

Second term: x² * [ - (98 + k)x ] = - (98 + k)x³.

Third term: x² * 98k = 98k x².

Fourth term: -90x * x² = -90x³.

Fifth term: -90x * [ - (98 + k)x ] = +90(98 + k) x².

Sixth term: -90x * 98k = -90*98k x.

Seventh term: 1296 * x² = 1296x².

Eighth term: 1296 * [ - (98 + k)x ] = -1296(98 + k)x.

Ninth term: 1296 * 98k = 1296*98k.

So combining all terms:

x⁴ - (98 + k)x³ - 90x³ + [98k x² + 90(98 + k)x² + 1296x²] + [ -90*98k x - 1296(98 + k)x ] + 1296*98k.

Simplify term by term:

x⁴ - [ (98 + k) + 90 ]x³ + [ 98k + 90(98 + k) + 1296 ]x² - [ 90*98k + 1296(98 + k) ]x + 1296*98k.

Let me compute coefficients step by step.

First, coefficient of x³:

- [ (98 + k) + 90 ] = - (188 + k).

Coefficient of x²:

Let me compute 98k + 90*(98 + k) + 1296.

First, expand 90*(98 + k) = 90*98 + 90k.

Then total:

98k + 90*98 + 90k + 1296 = (98k + 90k) + 90*98 + 1296 = 188k + 90*98 + 1296.

Compute 90*98: 90*(100 - 2) = 9000 - 180 = 8820.

Then 8820 + 1296 = 10116.

Therefore, coefficient of x² is 188k + 10116.

Coefficient of x:

- [ 90*98k + 1296*(98 + k) ].

Compute 90*98k = 8820k.

Compute 1296*(98 + k) = 1296*98 + 1296k.

So total inside the brackets is 8820k + 1296*98 + 1296k = (8820k + 1296k) + 1296*98 = 10116k + 1296*98.

Therefore, the coefficient of x is - [ 10116k + 1296*98 ].

Constant term is 1296*98k.

Therefore, the numerator N(x) is:

N(x) = x⁴ - (188 + k)x³ + (188k + 10116)x² - [10116k + 1296*98]x + 1296*98k.

Then, f(x) = N(x)/x = x³ - (188 + k)x² + (188k + 10116)x - [10116k + 1296*98] + [1296*98k]/x.

But perhaps this expansion is not helpful. Alternatively, maybe I can compute f'(x) directly from this expression? Since f(x) = N(x)/x, then f'(x) = [N'(x)x - N(x)] / x². But N'(x) can be derived from the expanded N(x). However, this seems very tedious. Maybe there's another way.

Alternatively, since we had earlier that f'(x) = 0 corresponds to the equation:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Let me denote this equation again:

Sum_{i=1}^4 1/(x - a_i) = 1/x, where a_1=18, a_2=72, a_3=98, a_4=k.

Let me rearrange this equation:

Sum_{i=1}^4 1/(x - a_i) - 1/x = 0.

Let me combine the terms:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) - 1/x = 0.

Let me combine the 1/x with the other terms. Let me think about combining terms pairwise. Alternatively, let me consider moving 1/x to the left:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) - 1/x = 0.

Let me consider that 1/(x - a_i) - 1/x = [x - (x - a_i)] / [x(x - a_i)] = a_i / [x(x - a_i)]. Therefore, if I take each term 1/(x - a_i) - 1/x, then sum over i=1 to 4:

Sum_{i=1}^4 [a_i / (x(x - a_i))] = 0.

Wait, but that's not exactly the case, because in our equation, we have Sum_{i=1}^4 1/(x - a_i) - 1/x = 0, which is equivalent to Sum_{i=1}^4 [1/(x - a_i) - 1/x] + 1/x = 0? No, that's not correct. Wait, if I take Sum_{i=1}^4 [1/(x - a_i) - 1/x], that would be Sum_{i=1}^4 1/(x - a_i) - 4/x. Therefore, our original equation is Sum_{i=1}^4 1/(x - a_i) - 1/x = 0, which is different. Therefore, that approach may not help.

Alternatively, maybe think of the equation as:

Sum_{i=1}^4 1/(x - a_i) = 1/x.

Let me denote y = x, then:

1/(y - 18) + 1/(y - 72) + 1/(y - 98) + 1/(y - k) = 1/y.

Let me rearrange:

1/(y - 18) + 1/(y - 72) + 1/(y - 98) + 1/(y - k) - 1/y = 0.

Let me denote this as:

[1/(y - 18) - 1/y] + [1/(y - 72) - 1/y] + [1/(y - 98) - 1/y] + [1/(y - k) - 1/y] = 0 + 0 + 0 + 0? No, that would be adding four terms each of which is [1/(y - a_i) - 1/y], which would sum to Sum_{i=1}^4 [1/(y - a_i) - 1/y] = Sum_{i=1}^4 1/(y - a_i) - 4/y. But our equation is Sum_{i=1}^4 1/(y - a_i) - 1/y = 0, which is different. Therefore, not helpful.

Alternatively, maybe think of moving 1/y to the left:

Sum_{i=1}^4 1/(y - a_i) - 1/y = 0.

Let me combine terms:

Let me combine all terms over a common denominator. However, this seems too complex. Alternatively, think of the equation as:

Sum_{i=1}^4 1/(y - a_i) = 1/y.

Let me consider substituting z = 1/y, but not sure. Alternatively, think of the left side as a function of y and analyze its behavior.

Alternatively, think of the equation as:

Let me denote the left side as L(y) = 1/(y - 18) + 1/(y - 72) + 1/(y - 98) + 1/(y - k).

We need L(y) = 1/y.

Let me analyze the function L(y) - 1/y = 0.

To find the number of positive real solutions for y, we need to analyze how many times L(y) - 1/y crosses zero.

But since this is complicated, maybe think about the behavior of L(y):

For y approaching 0 from the positive side:

Each term 1/(y - a_i) approaches -1/a_i (since y is near 0, and a_i are positive). Therefore, L(y) approaches - [1/18 + 1/72 + 1/98 + 1/k]. Then, 1/y approaches +infty. Therefore, L(y) - 1/y approaches -something - infty = -infty.

For y approaching each a_i from the right, L(y) approaches +infty (since one term goes to +infty, others are finite). For y approaching each a_i from the left, L(y) approaches -infty. Also, for y approaching infinity, each term 1/(y - a_i) approaches 0, so L(y) approaches 0, and 1/y approaches 0. Therefore, the difference L(y) - 1/y approaches 0.

Therefore, the function L(y) - 1/y goes from -infty near y=0, has vertical asymptotes at y=18, 72, 98, k, and approaches 0 as y approaches infinity. Therefore, depending on the value of k, the number of times it crosses zero can vary.

But how can I determine the number of positive real roots? It might depend on the position of k relative to the other roots 18, 72, 98. Since the problem says there are exactly three such k's, which probably correspond to positions where the graph of L(y) - 1/y touches the x-axis at two points (i.e., tangency, leading to multiplicity two) or crosses it in some specific way.

However, since the problem states that there are exactly three k's for which the equation has exactly two positive real solutions, this suggests that for these k's, the equation has exactly two solutions, and for other k's, it has more or less. Therefore, we need to find k such that the equation L(y) = 1/y has exactly two positive solutions.

Alternatively, think of the equation as:

Let me consider that if we have four terms on the left and one on the right. Maybe if k is chosen such that one of the roots of the equation is a double root? For example, if the equation has a double root at some point, then the total number of roots (counting multiplicity) would be less. But since we need exactly two positive real roots, maybe there are two distinct roots or one double root?

But the problem says "exactly two positive real numbers x", so if there is a double root, that would count as one point with multiplicity two, but if the question counts distinct points, then maybe it's different. However, the problem says "exactly two positive real numbers", so probably distinct points. Therefore, maybe the equation has two distinct positive roots. But how to find the k's for which this occurs?

Alternatively, since the equation is complicated, maybe think about Vieta's formula or consider the equation as a rational equation and find conditions on k such that the equation has two solutions.

Alternatively, think of the equation:

Let me consider that the equation is:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Let me denote this as:

Let me combine terms:

Let me bring all terms to one side:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) - 1/x = 0.

Let me consider that this equation can be written as:

[1/(x - 18) - 1/x] + [1/(x - 72) - 1/x] + [1/(x - 98) - 1/x] + [1/(x - k) - 1/x] = 0 + 0 + 0 + 0? No, that would be adding four terms each of which is [1/(x - a_i) - 1/x], but our original equation is Sum_{i=1}^4 [1/(x - a_i)] - 1/x = 0, which is equal to Sum_{i=1}^4 [1/(x - a_i) - 1/x] + 3/x = 0. Not sure if helpful.

Alternatively, think of the equation as:

Let me combine terms pairwise. Let me pair 1/(x - 18) and 1/(x - 98), and 1/(x - 72) and 1/(x - k). Let me compute:

[1/(x - 18) + 1/(x - 98)] + [1/(x - 72) + 1/(x - k)] = 1/x.

Compute each pair:

For the first pair: 1/(x - 18) + 1/(x - 98) = [ (x - 98) + (x - 18) ] / [(x - 18)(x - 98)] = [2x - 116] / [(x - 18)(x - 98)].

Similarly, the second pair: 1/(x - 72) + 1/(x - k) = [ (x - k) + (x - 72) ] / [(x - 72)(x - k)] = [2x - (72 + k)] / [(x - 72)(x - k)].

Therefore, the equation becomes:

[2x - 116]/[(x - 18)(x - 98)] + [2x - (72 + k)] / [(x - 72)(x - k)] = 1/x.

This still seems complicated, but maybe manageable.

Let me denote A = [2x - 116]/[(x - 18)(x - 98)] and B = [2x - (72 + k)] / [(x - 72)(x - k)], so that A + B = 1/x.

Alternatively, cross-multiplying or finding common denominators.

Alternatively, consider specific substitutions. For example, let me suppose that k is chosen such that the two pairs are symmetric in some way? For example, if k is chosen such that 72 and k are symmetric with respect to some axis, similar to 18 and 98?

Note that 18 and 98 are symmetric with respect to (18 + 98)/2 = 58. Similarly, if k is chosen such that 72 and k are symmetric with respect to 58, then k = 58*2 - 72 = 116 - 72 = 44. So, k = 44. Maybe this is one of the k's? Similarly, if we consider symmetry between other pairs?

Alternatively, if there is some symmetry in the equation, then maybe k is related to the existing roots in a symmetric way. However, this is speculative.

Alternatively, think about the case when the equation has a double root. If the equation has a double root, then the discriminant is zero. But calculating the discriminant of a quartic is complicated. Alternatively, if we suppose that for certain k, the equation has a double root, which would correspond to the case where the function f(x) has a critical point which is also an inflection point or something, but I need more precise reasoning.

Alternatively, consider that if there are exactly two points where the minimum is achieved, then the function f(x) must have two critical points which are minima, and these two minima are equal in value. Therefore, the function is tangent to some horizontal line at two points. Therefore, the equation f'(x) = 0 and f(x) = m (for some m) would have two solutions. But this seems too vague.

Alternatively, think about the fact that if the equation f'(x) = 0 has exactly two solutions, then the quartic equation we derived earlier (N'(x)x - N(x) = 0) has exactly two positive real roots. Therefore, maybe we can analyze the number of positive real roots of that quartic equation depending on k.

However, this seems difficult. Alternatively, think about the original equation:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Let me consider that for large x, the left side approaches 0 + 0 + 0 + 0 = 0, and the right side also approaches 0. Therefore, there may be solutions at large x. Also, near each vertical asymptote, the left side tends to ±infty, so there may be solutions near each asymptote.

But how to ensure there are exactly two positive solutions? Perhaps if the function L(y) - 1/y has exactly two zeros. Given that L(y) has vertical asymptotes at y = 18, 72, 98, k, and approaches 0 at infinity, the number of zeros depends on the graph's behavior.

Alternatively, think about the case when k is one of the existing roots. For example, if k = 18, then the numerator would have a double root at 18, but the function f(x) would have a vertical asymptote at x = 0 and x = 18 (since denominator is x and numerator has (x - 18)). However, this might not be relevant here.

Alternatively, consider that for the equation to have exactly two solutions, the graph of L(y) must intersect the graph of 1/y exactly twice. Given that L(y) has vertical asymptotes, maybe this happens when k is placed in such a way that two of the asymptotes are positioned so that the intersections occur only twice.

Alternatively, think of the equation as follows: Let me define g(x) = 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k). We need to solve g(x) = 1/x. For each k, the number of solutions depends on how g(x) behaves.

Let me analyze the behavior of g(x):

- For x approaching 0+, g(x) approaches -[1/18 + 1/72 + 1/98 + 1/k], which is a negative number, and 1/x approaches +infty, so g(x) - 1/x approaches -infty.

- For x approaching 18 from the right, g(x) approaches +infty, and 1/x approaches 1/18, so g(x) - 1/x approaches +infty.

- Between 0 and 18: The function g(x) goes from -infty to +infty, so by Intermediate Value Theorem, there is at least one solution in (0, 18).

- Between 18 and 72: Let's see. As x approaches 18 from the right, g(x) approaches +infty. As x approaches 72 from the left, g(x) approaches -infty (since 1/(x - 72) approaches -infty). Therefore, g(x) goes from +infty to -infty, so crosses 1/x somewhere in (18, 72). Similarly, if there are other asymptotes in this interval.

Wait, but if k is not in (18, 72), then between 18 and 72, there is only one asymptote at x=18. However, if k is in (18, 72), then there's another asymptote at x=k. Similarly for other intervals.

This is getting too case-based. Maybe considering specific positions of k:

Case 1: k < 18

Case 2: 18 < k < 72

Case 3: 72 < k < 98

Case 4: k > 98

In each case, the number of vertical asymptotes in the positive real line changes.

But since the problem says there are exactly three such k's, it's possible that these k's are in different cases. However, without knowing, it's hard to proceed.

Alternatively, think of the equation having exactly two solutions, which would require that some of the intervals between asymptotes do not contain a solution. For example, if in some intervals, the function g(x) - 1/x does not cross zero, leading to fewer solutions.

But how to determine that?

Alternatively, think of the total number of solutions. Normally, with four vertical asymptotes, the function g(x) - 1/x would have several intervals where it goes from -infty to +infty or vice versa, leading to multiple solutions. To have exactly two solutions, maybe two of these intervals contain one solution each, and the rest do not.

Alternatively, consider that if the equation g(x) = 1/x has exactly two solutions, then it's possible that two of the intervals between asymptotes each contain one solution, and the other intervals have no solutions. For example, if k is very large, such that the asymptote at x=k is far to the right, then maybe the function has solutions in (0,18), (18,72), and (98, k), but depending on k, maybe only two of these intervals have solutions.

However, this is too vague. Maybe think about specific values of k.

Alternatively, consider that when k is such that one of the solutions coincides with an existing root or something. Wait, but the original function f(x) has roots at x=18, 72, 98, k. However, the critical points are determined by f'(x)=0, which is different.

Alternatively, think about the case where the equation g(x) = 1/x has a solution at x= some specific point, like x=... Maybe if we set x to be a symmetric point with respect to some of the existing roots?

For example, if x is the midpoint between 18 and 98, which is (18 + 98)/2 = 58. Let me check if x=58 is a solution for some k.

Let me plug x=58 into the equation:

1/(58 - 18) + 1/(58 - 72) + 1/(58 - 98) + 1/(58 - k) = 1/58.

Compute each term:

1/(40) + 1/(-14) + 1/(-40) + 1/(58 - k) = 1/58.

Simplify:

1/40 - 1/14 - 1/40 + 1/(58 - k) = 1/58.

The 1/40 and -1/40 cancel out, so:

-1/14 + 1/(58 - k) = 1/58.

Therefore:

1/(58 - k) = 1/58 + 1/14.

Compute RHS:

Find common denominator, which is 58*14 = 812.

1/58 = 14/812, and 1/14 = 58/812.

So total is (14 + 58)/812 = 72/812 = 18/203.

Thus:

1/(58 - k) = 18/203 => 58 - k = 203/18 => k = 58 - 203/18.

Convert 58 to 58/1 = 1044/18, so k = 1044/18 - 203/18 = 841/18 ≈ 46.722...

So k ≈ 46.722. This is between 18 and 72. So this is one possible k. Maybe this is one of the three k's? Let me check if this k leads to two solutions.

But I don't know yet. However, this suggests that if k is chosen such that x=58 is a solution, then that's one k. Perhaps there are similar k's for other symmetric points?

For example, consider the midpoint between 72 and k, or something else? Alternatively, the midpoint between 72 and 98 is (72 + 98)/2 = 85. Let me try x=85.

Plug x=85 into the equation:

1/(85 - 18) + 1/(85 - 72) + 1/(85 - 98) + 1/(85 - k) = 1/85.

Compute:

1/67 + 1/13 + 1/(-13) + 1/(85 - k) = 1/85.

Simplify:

1/67 + 1/13 - 1/13 + 1/(85 - k) = 1/85 => 1/67 + 1/(85 - k) = 1/85.

Then:

1/(85 - k) = 1/85 - 1/67 = (67 - 85)/(85*67) = (-18)/5695.

Therefore, 85 - k = -5695/18 => k = 85 + 5695/18.

Convert 85 to 1530/18, so k = 1530/18 + 5695/18 = 7225/18 ≈ 401.388... This is much larger than 98. So this is another possible k.

Similarly, let me check if there is another symmetric point. For example, the midpoint between 18 and 72 is (18 + 72)/2 = 45. Let me try x=45.

Plug x=45 into the equation:

1/(45 - 18) + 1/(45 - 72) + 1/(45 - 98) + 1/(45 - k) = 1/45.

Compute:

1/27 + 1/(-27) + 1/(-53) + 1/(45 - k) = 1/45.

Simplify:

1/27 - 1/27 - 1/53 + 1/(45 - k) = 1/45 => -1/53 + 1/(45 - k) = 1/45.

Then:

1/(45 - k) = 1/45 + 1/53 = (53 + 45)/(45*53) = 98/2385.

Thus:

45 - k = 2385/98 => k = 45 - 2385/98.

Convert 45 to 4410/98, so k = 4410/98 - 2385/98 = 2025/98 ≈ 20.663... This is between 18 and 72.

Therefore, we have three k's: approximately 46.72, 401.39, and 20.66. These are three different k's. Perhaps these are the three k's we need? If so, their sum would be 2025/98 + 841/18 + 7225/18. Wait, let me check:

Wait, for x=58, we had k = 841/18 ≈ 46.7222...

For x=85, we had k = 7225/18 ≈ 401.388...

For x=45, we had k = 2025/98 ≈ 20.663...
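
The three values above can be reproduced with exact rational arithmetic. This is only a sketch of the "force a chosen x to be a solution, then solve for k" step; the helper name `k_for_solution` is hypothetical and introduced here for illustration.

```python
from fractions import Fraction

ROOTS = (18, 72, 98)  # the three fixed roots of the numerator

def k_for_solution(x):
    """Return the k for which x solves sum 1/(x - a) + 1/(x - k) = 1/x."""
    x = Fraction(x)
    # Isolate 1/(x - k) = 1/x - sum(1/(x - a)), then solve for k.
    rhs = 1 / x - sum(1 / (x - a) for a in ROOTS)
    return x - 1 / rhs

for x in (45, 58, 85):
    k = k_for_solution(x)
    print(x, k, float(k))
```

This only confirms the arithmetic; it says nothing about how many solutions the equation has for each k.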

But need to check if these values of k indeed result in the equation having exactly two solutions. However, how can I be sure? Because when I forced x to be a specific value (midpoint), I found k such that x is a solution. But does this k result in the equation having exactly two solutions?

Alternatively, these k's might be the ones where the equation has a double root at x=58, x=45, or x=85, leading to the equation having fewer roots. For example, if x=58 is a double root, then the total number of roots would be reduced by one, leading to three roots instead of four? But the problem wants exactly two solutions. So, if there's a double root and two other roots, that would be three solutions. But if there's a double root and one other root, that would be two solutions. However, how to determine that?

Alternatively, set up the equation so that x is both a solution and a point where the derivative of g(x) - 1/x is zero, which gives a double root. Since the derivative of 1/x is -1/x², that means solving both g(x) = 1/x and g'(x) = -1/x². This would give the values of x and k where the equation has a double root.

Let me try this approach. Suppose that for some x, both g(x) = 1/x and g'(x) = -1/x². Then this x would be a double root of the equation g(x) - 1/x = 0.

So, for the case x=58, we found a k such that g(58) = 1/58. Now, we need to check whether, for that k, g'(58) = -1/58². If yes, then x=58 is a double root, and the equation would have fewer solutions.

Let me compute g'(x). Recall that g(x) = 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k). Therefore, g'(x) = -1/(x - 18)^2 - 1/(x - 72)^2 - 1/(x - 98)^2 - 1/(x - k)^2.

So, for x=58, and the k we found earlier (k=841/18), let's compute g'(58):

Compute each term:

-1/(58 - 18)^2 = -1/(40)^2 = -1/1600

-1/(58 - 72)^2 = -1/(-14)^2 = -1/196

-1/(58 - 98)^2 = -1/(-40)^2 = -1/1600

-1/(58 - k)^2. Let's compute 58 - k = 58 - 841/18 = (1044 - 841)/18 = 203/18. Therefore, -1/(203/18)^2 = - (18^2)/(203^2) = -324/41209.

Therefore, total g'(58) = -1/1600 - 1/196 - 1/1600 - 324/41209.

Let me compute this numerically:

First, -1/1600 ≈ -0.000625

-1/196 ≈ -0.005102

-1/1600 ≈ -0.000625

-324/41209 ≈ -0.00786 (since 41209 ≈ 203², and 324 ≈ (18²), so approx - (18/203)^2 ≈ -(0.088)^2 ≈ -0.0077)

Adding them up: -0.000625 - 0.005102 - 0.000625 - 0.00786 ≈ -0.014212.

On the other hand, the derivative of 1/x at x=58 is -1/(58²) = -1/3364 ≈ -0.000297.

Clearly, g'(58) ≈ -0.0142 ≠ -0.000297. Therefore, x=58 is not a double root. Therefore, my previous assumption was wrong.
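
The slope comparison can be checked exactly. The sketch below assumes the tangency condition is that g'(x) equals the derivative of 1/x, which is -1/x²; the conclusion is the same for either sign, since g'(58) is far from ±0.000297 in magnitude. The function name `g_prime` is hypothetical.

```python
from fractions import Fraction

def g_prime(x, k, roots=(18, 72, 98)):
    """Derivative of g(x) = sum of 1/(x - a) over the fixed roots and k."""
    return -sum(1 / (Fraction(x) - a) ** 2 for a in (*roots, k))

k = Fraction(841, 18)                 # the k that makes x = 58 a solution
slope_g = g_prime(58, k)
slope_recip = -Fraction(1, 58 ** 2)   # derivative of 1/x at x = 58

print(float(slope_g), float(slope_recip))
print("double root at 58:", slope_g == slope_recip)
```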

Therefore, forcing g(x) = 1/x at a specific x does not necessarily lead to a double root. Therefore, this approach might not be helpful.

Alternatively, maybe think of the equation g(x) = 1/x and the condition that it has exactly two solutions. This likely happens when the curve y=1/x is tangent to the graph of g(x) at two points, or when some other condition is met. However, this seems too abstract.

Alternatively, think back to the original problem: the function f(x) has its minimum at exactly two points. For this to happen, f(x) must have exactly two critical points that are minima. Therefore, the equation f'(x) = 0 must have exactly two positive real roots, both of which are minima. To ensure that, the second derivative test would be needed at those points. However, without knowing the exact roots, this seems difficult.

Alternatively, think about the original function f(x). Since f(x) is a rational function, maybe analyze its behavior and find when it has two minima. Given that f(x) tends to infinity as x approaches 0+ and infinity, it must have at least one minimum. To have two minima, there must be two distinct points where the function reaches a local minimum. Therefore, the equation f'(x) = 0 must have two positive real roots, both of which are minima. Therefore, the number of positive real roots of f'(x) = 0 is two.

Since f'(x) = 0 leads to a quartic equation, which can have up to four positive real roots, the problem wants the values of k for which this equation has exactly two positive real roots. Therefore, the question reduces to finding the values of k for which the quartic equation N'(x)x - N(x) = 0 has exactly two positive real roots.

But how to find such k? This seems very abstract. However, perhaps the quartic equation can be transformed into a quadratic equation in some way, or have some symmetry.

Alternatively, think about the original equation again:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Let me multiply both sides by x(x - 18)(x - 72)(x - 98)(x - k) to eliminate denominators, as I tried before, but this might lead to a polynomial equation which can be analyzed.

Let me denote P(x) = x(x - 72)(x - 98)(x - k) + x(x - 18)(x - 98)(x - k) + x(x - 18)(x - 72)(x - k) + x(x - 18)(x - 72)(x - 98) - (x - 18)(x - 72)(x - 98)(x - k) = 0.

This is the same as before. Let me try to factor this equation. At first glance the terms seem to share a common factor of (x - 18)(x - 72)(x - 98)(x - k), but they do not: each term omits a different factor.

Alternatively, notice that the first four terms have x multiplied by three of the four factors, and the last term is -(x - 18)...(x - k). Let me try to factor out (x - 18)(x - 72)(x - 98):

Let me look at the first term: x(x - 72)(x - 98)(x - k)

Second term: x(x - 18)(x - 98)(x - k)

Third term: x(x - 18)(x - 72)(x - k)

Fourth term: x(x - 18)(x - 72)(x - 98)

Fifth term: -(x - 18)(x - 72)(x - 98)(x - k)

Let me factor out (x - 18)(x - 72)(x - 98)(x - k) from all terms? Not sure. Let me see:

Let me take the first term: x(x - 72)(x - 98)(x - k) = x(x - 72)(x - 98)(x - k)

The fifth term is -(x - 18)(x - 72)(x - 98)(x - k)

So if I factor out (x - 72)(x - 98)(x - k) from the first and fifth terms, I get:

(x - 72)(x - 98)(x - k)[x - (x - 18)] = (x - 72)(x - 98)(x - k)[x - x + 18] = 18(x - 72)(x - 98)(x - k).

Similarly, let me do the same for the other terms.

Second term: x(x - 18)(x - 98)(x - k). If I factor out (x - 18)(x - 98)(x - k), then it is x * (x - 18)(x - 98)(x - k). Similarly, if I look at the fifth term, which is -(x - 18)(x - 72)(x - 98)(x - k). If I factor out (x - 18)(x - 98)(x - k), then the fifth term is -(x - 72)(x - 18)(x - 98)(x - k) = - (x - 18)(x - 98)(x - k)(x - 72). So, if I combine the second and fifth terms, but I already did the first and fifth. Let me try to do this for all terms.

Wait, this seems too complicated. Let me try to regroup the terms:

Let me write P(x) as:

x(x - 72)(x - 98)(x - k) + x(x - 18)(x - 98)(x - k) + x(x - 18)(x - 72)(x - k) + x(x - 18)(x - 72)(x - 98) - (x - 18)(x - 72)(x - 98)(x - k) = 0.

Let me factor out (x - 18)(x - 72)(x - 98)(x - k) from all terms? Let me see:

First term: x(x - 72)(x - 98)(x - k) = x * (x - 72)(x - 98)(x - k) = x * [ (x - 18) + 18 ] * (x - 98)(x - k) ? Not sure.

Alternatively, notice that if I add all the first four terms, they are x multiplied by the products missing one of the (x - a_i) terms, and the last term is -product of all (x - a_i). This resembles the expansion of the derivative of the product, but I'm not sure.

Wait, let me recall that for a function N(x) = (x - a)(x - b)(x - c)(x - d), then N'(x) = (x - b)(x - c)(x - d) + (x - a)(x - c)(x - d) + (x - a)(x - b)(x - d) + (x - a)(x - b)(x - c). Therefore, if I multiply N'(x) by x, I get x*N'(x) = x[(x - b)... + ...]. Then, the equation x*N'(x) - N(x) = 0 is exactly our P(x) = 0. Therefore, P(x) = x*N'(x) - N(x) = 0.

But N(x) = (x - 18)(x - 72)(x - 98)(x - k). Therefore, P(x) = x*N'(x) - N(x) = 0.

But how does this help? Well, perhaps we can write this as x*N'(x) = N(x), which implies that d/dx [N(x)/x] = 0, which is f'(x) = 0, as we had before.

But how to find the roots of P(x) = 0? It might be related to the roots of N(x) = 0, but not directly.

Alternatively, notice that if x is a root of N(x) = 0, then x is 18, 72, 98, or k. These are zeros of f(x) and poles of g(x), so they cannot satisfy g(x) = 1/x. But for P(x) = x*N'(x) - N(x), let's check if x = 18 is a root:

At x = 18, N(x) = 0, and N'(x) is the derivative at x=18, which is (18 - 72)(18 - 98)(18 - k) + ... (the other terms). But x*N'(x) - N(x) at x=18 is 18*N'(18) - 0 = 18*N'(18). Since N'(18) is not zero (unless k is chosen such that...), so x=18 is not a root of P(x) = 0. Similarly for x=72, 98, k.

Therefore, the roots of P(x) = 0 are not the roots of N(x) = 0.

This seems like a dead end.

Let me think back to the original problem. The problem says there are exactly three positive real numbers k such that the function achieves its minimum at exactly two positive real numbers x. The answer is the sum of these three k's. Given that the problem is from a competition, likely there is a clever substitution or symmetry to find these k's.

Earlier, I found three k's by assuming that x is the midpoint between two of the existing roots, and solving for k. These k's were:

- For x = 45 (midpoint between 18 and 72): k = 2025/98 ≈ 20.663

- For x = 58 (midpoint between 18 and 98): k = 841/18 ≈ 46.722

- For x = 85 (midpoint between 72 and 98): k = 7225/18 ≈ 401.388

If these are the three k's, then their sum is 2025/98 + 841/18 + 7225/18.

Let me compute this:

First, note that 841/18 + 7225/18 = (841 + 7225)/18 = 8066/18.

Then, add 2025/98:

Total sum = 8066/18 + 2025/98.

Find a common denominator, which is 18*98 = 1764.

Convert 8066/18 to (8066 * 98)/1764.

Convert 2025/98 to (2025 * 18)/1764.

Compute numerator:

First part: 8066 * 98. Let me compute 8066 * 100 = 806600, minus 8066 * 2 = 16132, so 806600 - 16132 = 790,468.

Second part: 2025 * 18. 2000*18 = 36,000, 25*18 = 450, total 36,450.

Total numerator: 790,468 + 36,450 = 826,918.

Therefore, total sum is 826918 / 1764.

Simplify this fraction:

Divide numerator and denominator by 2: 413,459 / 882.

Check if 413,459 and 882 have common factors. 882 = 2 * 441 = 2 * 21^2 = 2 * 3^2 * 7^2.

Check if 413,459 is divisible by 2: No, it's odd.

Check divisibility by 3: Sum of digits of 413,459: 4 + 1 + 3 + 4 + 5 + 9 = 26. 26 is not divisible by 3, so no.

Check divisibility by 7: Let's perform the division.

Divide 413,459 by 7:

7 * 59,000 = 413,000. 413,459 - 413,000 = 459. 459 / 7 = 65.571... Not integer. So no.

Therefore, the fraction is reduced: 413,459 / 882. But this seems like a very large number, and it's unlikely that this is the correct answer. Moreover, the problem says "three positive real numbers k", and their sum is likely a nice number, not a fraction like this. Therefore, my previous assumption that these midpoints give the correct k's is likely wrong.
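
The hand computation of the sum can be double-checked with Python's `fractions` module, which reduces to lowest terms automatically. This is only an arithmetic check of the candidate values, not evidence that they are the right k's.

```python
from fractions import Fraction

# Candidate k values obtained from the midpoints x = 45, 58, 85.
candidates = [Fraction(2025, 98), Fraction(841, 18), Fraction(7225, 18)]

total = sum(candidates)
print(total, float(total))
```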

Alternatively, perhaps the three k's are the midpoints between the existing roots, but calculated differently. Let me think.

Wait, when I calculated k for x=45, I got k = 2025/98 ≈ 20.663. But 2025 is 45², and 98 is 49*2. Similarly, for x=58, k = 841/18, and 841 is 29², and 18 is 2*9. For x=85, k = 7225/18, and 7225 is 85². So there is a pattern here: k = x² / (something). Let me check:

For x=45, k = 45² / (something). Let me see, in the equation when I set x=45, I had:

-1/53 + 1/(45 - k) = 1/45 => 1/(45 - k) = 1/45 + 1/53 = (53 + 45)/ (45*53) = 98/2385, so 45 - k = 2385/98 => k = 45 - 2385/98 = (45*98 - 2385)/98.

Calculate 45*98: 45*(100 - 2) = 4500 - 90 = 4410. Then 4410 - 2385 = 2025, so k = 2025/98, which is 45² / 98.

Similarly, for x=58, we had k = 841/18, and 841 is 29², and 18 is 2*9. Wait, 58 is 29*2, so 29² / 9? No, 29² is 841, and 18 is 9*2. Hmm.

Wait, for x=58, the calculation was:

1/(58 - k) = 18/203, so 58 - k = 203/18, so k = 58 - 203/18 = (1044 - 203)/18 = 841/18. Now, 58 is 29*2, and 203 is 7*29. So 203 = 7*29, and 203/18 = 7*29/18. Therefore, k = 58 - 7*29/18 = (58*18 - 7*29)/18. Let me compute 58*18: 58*10 + 58*8 = 580 + 464 = 1044. 7*29 = 203. 1044 - 203 = 841. So yes, k = 841/18.

Similarly, for x=85, k = 7225/18, and 7225 is 85², and 85 is 5*17, but 18 is 2*9. Not sure about the relation.

However, if I assume that these three k's are indeed the correct ones, even though the sum seems complex, maybe there is a simplification.

Let me calculate the sum again:

2025/98 + 841/18 + 7225/18 = 2025/98 + (841 + 7225)/18 = 2025/98 + 8066/18.

Convert 8066/18 to a decimal: 8066 ÷ 18 ≈ 448.111...

Convert 2025/98 ≈ 20.663...

Total ≈ 448.111 + 20.663 ≈ 468.774...

But if the problem is from a competition, the answer is likely an integer or a simple fraction. My calculation gives a complex fraction, so this suggests that my assumption is wrong.

Alternatively, maybe the three k's are 45, 58, and 85? Let me check.

If k=45, then let's see if the equation has exactly two solutions. But this is just a guess.

Alternatively, think about the original equation: when k is such that the function f(x) has two minima, it might be when the function is symmetric in some way. For example, if k is chosen so that the function is symmetric about some axis, leading to two minima.

For example, if k is such that 18 and 98 are symmetric with respect to some point, and 72 and k are symmetric with respect to the same point. The midpoint between 18 and 98 is 58, so if 72 and k are symmetric about 58, then k = 58*2 - 72 = 116 - 72 = 44. So k = 44. Similarly, if the function is symmetric about x=58, then this might create two minima.

Let me try k=44. Let's analyze the equation for k=44:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - 44) = 1/x.

Let me check if there is symmetry here. If we let y = 58 - x, then perhaps some terms transform. Let me try substituting x = 58 - t.

Let x = 58 - t, then:

1/(58 - t - 18) + 1/(58 - t - 72) + 1/(58 - t - 98) + 1/(58 - t - 44) = 1/(58 - t).

Simplify each term:

1/(40 - t) + 1/(-14 - t) + 1/(-40 - t) + 1/(14 - t) = 1/(58 - t).

Simplify:

1/(40 - t) + 1/(14 - t) + 1/(-14 - t) + 1/(-40 - t) = 1/(58 - t).

Notice that 1/(14 - t) + 1/(-14 - t) = [1/(14 - t) - 1/(14 + t)] = [ (14 + t - 14 + t) / ((14 - t)(14 + t)) ] = (2t)/(196 - t²).

Similarly, 1/(40 - t) + 1/(-40 - t) = [1/(40 - t) - 1/(40 + t)] = [ (40 + t - 40 + t) / (1600 - t²) ] = (2t)/(1600 - t²).

Therefore, the left side becomes:

(2t)/(1600 - t²) + (2t)/(196 - t²) = 1/(58 - t).

Factor out 2t:

2t [1/(1600 - t²) + 1/(196 - t²)] = 1/(58 - t).

The left side is an odd function of t, but the equation is still complex. However, if t=0, let's check:

Left side: 0. Right side: 1/58 ≠ 0. So t=0 is not a solution. If t=14, let's see:

Left side: 2*14 [1/(1600 - 196) + 1/(196 - 196)] = 28 [1/1404 + 1/0]. Division by zero, so t=14 is not in domain. Similarly, t=40 would give division by zero. Not helpful.

However, this symmetry suggests that if t is a solution, then -t is also a solution? Let me see. If we replace t with -t in the left side:

2(-t)[1/(1600 - t²) + 1/(196 - t²)] = -2t[...], while the right side becomes 1/(58 - (-t)) = 1/(58 + t). So unless 1/(58 + t) = -2t[...], which is not generally true. Therefore, the equation is not symmetric in t, so this approach might not help.
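
The pairing used in the substitution x = 58 - t can be verified exactly at a few sample points. The sketch below assumes k = 44 and avoids the poles t = ±14 and t = ±40; the function names are hypothetical.

```python
from fractions import Fraction

def g_shifted(t, k=44):
    """g(x) evaluated at x = 58 - t for roots 18, 72, 98 and k."""
    x = 58 - Fraction(t)
    return sum(1 / (x - a) for a in (18, 72, 98, k))

def paired_form(t):
    """The grouped expression 2t[1/(1600 - t^2) + 1/(196 - t^2)]."""
    t = Fraction(t)
    return 2 * t * (1 / (1600 - t * t) + 1 / (196 - t * t))

for t in (1, 5, Fraction(7, 2), -9):
    # The grouped form agrees with the direct sum ...
    assert g_shifted(t) == paired_form(t)
    # ... and the left side is odd in t, while 1/(58 - t) is not.
    assert paired_form(-t) == -paired_form(t)
    print(t, paired_form(t), Fraction(1, 1) / (58 - Fraction(t)))
```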

However, the fact that k=44 is symmetric with respect to 58 (midpoint between 18 and 98) suggests that it might be one of the desired k's. Let me try to find the number of solutions for k=44.

But this is time-consuming. Let me try to estimate.

For k=44, let's consider the behavior of the function g(x) = 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - 44).

Vertical asymptotes at x=18, 44, 72, 98.

Let me analyze the intervals:

1. (0, 18): every term of g(x) is negative here, so g(x) < 0 < 1/x and there is no crossing.

2. (18, 44): As x approaches 18+ g(x) → +infty, as x approaches 44- g(x) → -infty, so it crosses 1/x at least once.

3. (44, 72): As x approaches 44+ g(x) → +infty, as x approaches 72- g(x) → -infty, so it crosses 1/x at least once.

4. (72, 98): As x approaches 72+ g(x) → +infty, as x approaches 98- g(x) → -infty, so it crosses 1/x at least once.

5. (98, ∞): every term 1/(x - a) exceeds 1/x here, so g(x) > 4/x > 1/x and there is no crossing.

Therefore, there are at least three positive solutions, one in each of the three middle intervals. But we are looking for k with exactly two solutions. Therefore, k=44 is likely not one of them.
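
One way to check the interval count for k = 44 is to look at the sign of P(x) = x*N'(x) - N(x) at the interval endpoints, since the solutions of g(x) = 1/x are the roots of P. A sign change between two checkpoints guarantees at least one root between them. This is a rough sketch; the checkpoint 10**6 is an arbitrary stand-in for "large x".

```python
from math import prod

def P(x, roots):
    """Evaluate x*N'(x) - N(x) for N(x) = prod(x - r)."""
    n = prod(x - r for r in roots)
    # N'(x) by the product rule: drop one factor at a time.
    n_prime = sum(
        prod(x - r for j, r in enumerate(roots) if j != i)
        for i in range(len(roots))
    )
    return x * n_prime - n

roots = (18, 44, 72, 98)                 # k = 44
checkpoints = (0, 18, 44, 72, 98, 10**6)
signs = [P(x, roots) > 0 for x in checkpoints]
changes = sum(s != t for s, t in zip(signs, signs[1:]))

print(signs)     # True means P > 0 at that checkpoint
print(changes)   # lower bound on the number of positive roots
```

The sign pattern changes only across (18, 44), (44, 72), and (72, 98), so this check finds at least three positive roots and none forced in (0, 18) or (98, ∞).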

This suggests that my previous approach of assuming midpoints was incorrect.

Given that I'm stuck, let me try to recall that in problems where a function has its minimum at two points, it often relates to the function being symmetric or having a double root in its derivative. Given that, and the earlier calculation where for x=58, k=841/18, and similar for other midpoints, perhaps these are the correct k's, and the sum is indeed 2025/98 + 841/18 + 7225/18, which simplifies to:

First, note that 841/18 + 7225/18 = (841 + 7225)/18 = 8066/18.

Now, 8066 divided by 2 is 4033, and 18 divided by 2 is 9, so 4033/9.

Then, add 2025/98. To add 4033/9 and 2025/98, find a common denominator, which is 9*98 = 882.

Convert 4033/9 to (4033*98)/882.

Convert 2025/98 to (2025*9)/882.

Compute 4033*98: Let's calculate 4033*100 = 403300, minus 4033*2 = 8066, so 403300 - 8066 = 395,234.

Compute 2025*9 = 18,225.

Total numerator: 395,234 + 18,225 = 413,459.

Thus, the sum is 413,459 / 882. Let me see if this reduces. As before, 413,459 and 882. Let's check if 413,459 is divisible by 7: 7*59,000 = 413,000, 413,459 - 413,000 = 459. 459 / 7 = 65.571... No. Divisible by 3? 4 + 1 + 3 + 4 + 5 + 9 = 26, not divisible by 3. Divisible by 2? No. So it's irreducible.

But this seems too complex, and likely not the intended answer. Therefore, my earlier approach must be wrong.

Let me try a different approach. Consider the original function f(x) = [(x - 18)(x - 72)(x - 98)(x - k)] / x. To find where it has minima, take the derivative and set it to zero, which we know leads to the equation with the sum of reciprocals.

Let me consider that for the function to have its minimum at two points, the equation f'(x) = 0 must have two solutions. This is a quartic equation, and the number of positive real solutions depends on k. To find the values of k where this equation has exactly two positive real solutions, we can think of this as a parameterized equation and use the concept of discriminants or resultants, but this is advanced.

Alternatively, recall that in the equation 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x, if we let x be a root, then it must satisfy this equation. Let me assume that this equation has two solutions, say x1 and x2, and that these are the points where the function has minima.

But without additional information, this is not helpful.

Wait, let me think about the original function f(x). At x = 18, 72, 98, or k, the function is zero; it is undefined only at x = 0. Let me consider the behavior of f(x) between the roots.

However, this is too vague. Given that I'm stuck, and considering the time I've spent, perhaps I should look for another way.

Let me recall that in similar problems, when a function has its minimum at two points, it often relates to the function having a double root in its derivative, which would mean that the equation f'(x) = 0 has a double root. For the equation to have exactly two positive real roots, it could have a double root and another root, or two double roots, or two simple roots. But since the problem says "exactly two positive real numbers x", it likely means two distinct points, so the equation has two simple roots or one double root and one simple root. However, if it has a double root, the total number of roots counting multiplicity would be three or four, depending on the case.

But how to find k such that the equation has a double root? This would require solving the system:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x,

and

-1/(x - 18)^2 - 1/(x - 72)^2 - 1/(x - 98)^2 - 1/(x - k)^2 = -1/x².

This is a system of two equations in variables x and k. Solving this system would give the values of x and k where the equation has a double root. This is complex, but let's try to proceed.

Let me denote the first equation as (1) and the second as (2). From (1), we have:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

From (2), we have:

1/(x - 18)^2 + 1/(x - 72)^2 + 1/(x - 98)^2 + 1/(x - k)^2 = 1/x².

Let me call the left side of (1) as S and the left side of (2) as T. So S = 1/x and T = 1/x².

Now, note that S and T come from the function g(x) = 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k): S = g(x) and T = -g'(x).

But how to relate S and T? Let me consider that if x is a double root, then the function g(x) - 1/x has a double root at x, so both g(x) - 1/x and its derivative are zero at x. This is what we have.

Let me try to find a relationship between S and T. Let me square the first equation:

S² = [1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k)]² = 1/x².

Expanding the left side:

S² = [1/(x - 18)^2 + 1/(x - 72)^2 + 1/(x - 98)^2 + 1/(x - k)^2] + 2[1/(x - 18)(x - 72) + 1/(x - 18)(x - 98) + 1/(x - 18)(x - k) + 1/(x - 72)(x - 98) + 1/(x - 72)(x - k) + 1/(x - 98)(x - k)] = 1/x².

But from (2), we know that T = 1/x², so:

T + 2[...] = S² = 1/x².

Therefore:

T + 2[...] = T => 2[...] = 0 => [...] = 0.
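
The identity behind this step, S² = T + 2C with C the sum of pairwise products, can be confirmed exactly at a sample point. In the sketch below the value 44 for k and the point x = 10 are arbitrary choices for illustration; they are not a double root, so C is not expected to vanish there.

```python
from fractions import Fraction
from itertools import combinations

def reciprocals(x, poles):
    """The terms 1/(x - p) for each pole p."""
    return [1 / (Fraction(x) - p) for p in poles]

poles = (18, 72, 98, 44)    # 44 is only a sample value for k
u = reciprocals(10, poles)

S = sum(u)                                         # left side of (1)
T = sum(v * v for v in u)                          # left side of (2)
C = sum(a * b for a, b in combinations(u, 2))      # cross terms

print(S * S == T + 2 * C)
print(C)
```

Whenever S = 1/x and T = 1/x² hold together, this identity forces C = 0, which is the extra equation used in the text.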

So, the sum of the cross terms is zero. This gives us another equation:

1/[(x - 18)(x - 72)] + 1/[(x - 18)(x - 98)] + 1/[(x - 18)(x - k)] + 1/[(x - 72)(x - 98)] + 1/[(x - 72)(x - k)] + 1/[(x - 98)(x - k)] = 0.

This is a complex equation, but perhaps we can use the first equation to substitute some terms.

From the first equation, S = 1/x, so:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

Let me denote this as:

1/(x - 18) + 1/(x - 72) + 1/(x - 98) = 1/x - 1/(x - k).

Let me call this equation (1a).

Now, let me focus on the cross terms. Let me denote the cross terms as C:

C = 1/[(x - 18)(x - 72)] + 1/[(x - 18)(x - 98)] + 1/[(x - 18)(x - k)] + 1/[(x - 72)(x - 98)] + 1/[(x - 72)(x - k)] + 1/[(x - 98)(x - k)] = 0.

Let me group terms:

C = [1/(x - 18)(x - 72) + 1/(x - 18)(x - 98) + 1/(x - 18)(x - k)] + [1/(x - 72)(x - 98) + 1/(x - 72)(x - k)] + 1/(x - 98)(x - k).

Let me factor out 1/(x - 18) from the first group:

1/(x - 18) [1/(x - 72) + 1/(x - 98) + 1/(x - k)] + 1/(x - 72)[1/(x - 98) + 1/(x - k)] + 1/(x - 98)(x - k).

From equation (1), we have:

1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x - 1/(x - 18).

Substitute this into the first group:

1/(x - 18) [1/x - 1/(x - 18)] + 1/(x - 72)[1/(x - 98) + 1/(x - k)] + 1/(x - 98)(x - k).

Now, let me focus on the second group: 1/(x - 72)[1/(x - 98) + 1/(x - k)]. From equation (1), if I isolate 1/(x - 98) + 1/(x - k), it would be 1/x - 1/(x - 18) - 1/(x - 72). Let me verify:

From (1): 1/(x - 18) + 1/(x - 72) + 1/(x - 98) + 1/(x - k) = 1/x.

So, 1/(x - 98) + 1/(x - k) = 1/x - 1/(x - 18) - 1/(x - 72).

Therefore, the second group becomes:

1/(x - 72)[1/x - 1/(x - 18) - 1/(x - 72)].

Therefore, now, C is:

1/(x - 18)[1/x - 1/(x - 18)] + 1/(x - 72)[1/x - 1/(x - 18) - 1/(x - 72)] + 1/(x - 98)(x - k) = 0.

This is still complex, but let's expand the terms:

First term: 1/(x - 18) * 1/x - 1/(x - 18)^2.

Second term: 1/(x - 72) * 1/x - 1/(x - 72) * 1/(x - 18) - 1/(x - 72)^2.

Third term: 1/(x - 98)(x - k).

So, combining all terms:

1/x(x - 18) - 1/(x - 18)^2 + 1/x(x - 72) - 1/[(x - 72)(x - 18)] - 1/(x - 72)^2 + 1/[(x - 98)(x - k)] = 0.

This is very complex. Let me combine the terms with 1/x:

1/x[1/(x - 18) + 1/(x - 72)] - 1/(x - 18)^2 - 1/(x - 72)^2 - 1/[(x - 72)(x - 18)] + 1/[(x - 98)(x - k)] = 0.

Let me combine the terms in the first group: 1/x[1/(x - 18) + 1/(x - 72)].

Let me compute 1/(x - 18) + 1/(x - 72) = [2x - 90]/[(x - 18)(x - 72)].

Therefore, the first group is 1/x * [2x - 90]/[(x - 18)(x - 72)] = [2x - 90]/[x(x - 18)(x - 72)].

Next, the terms -1/(x - 18)^2 - 1/(x - 72)^2 - 1/[(x - 72)(x - 18)].

Let me combine these:

Let me write them as -[1/(x - 18)^2 + 1/(x - 72)^2 + 1/((x - 18)(x - 72))].

Let me combine the last two terms: 1/((x - 18)(x - 72)) = [1/(x - 18) + 1/(x - 72)] / (x - 18 + x - 72)? No, better to find a common denominator.

Wait, let me note that 1/(x - 18)^2 + 1/(x - 72)^2 + 1/((x - 18)(x - 72)) = [1/(x - 18) + 1/(x - 72)]^2 - 2/(x - 18)(x - 72) + 1/((x - 18)(x - 72)) = [1/(x - 18) + 1/(x - 72)]^2 - 1/((x - 18)(x - 72)).

But this might not help. Let me compute it directly:

Let me denote A = x - 18, B = x - 72. Then, the expression is 1/A² + 1/B² + 1/(AB). This is (B² + A² + AB)/A²B². Not sure.

Alternatively, let me compute it as is:

-1/(x - 18)^2 - 1/(x - 72)^2 - 1/[(x - 18)(x - 72)].

Let me combine the terms:

Let me factor out -1:

- [1/(x - 18)^2 + 1/(x - 72)^2 + 1/((x - 18)(x - 72))].

Let me combine the first and third term:

1/(x - 18)^2 + 1/((x - 18)(x - 72)) = [1/(x - 18)] [1/(x - 18) + 1/(x - 72)].

Similarly, this is [1/(x - 18)] * [2x - 90]/[(x - 18)(x - 72)] = [2x - 90]/[(x - 18)^2(x - 72)].

Then, adding the second term 1/(x - 72)^2 inside the bracket, the entire expression becomes:

- [ [2x - 90]/[(x - 18)^2(x - 72)] + 1/(x - 72)^2 ].

This is getting too complex, and I'm not making progress. Given the time I've spent and the lack of progress, I think I need to switch strategies.

Let me recall that in the original problem, there are three such k's, and the answer is the sum of these k's. In many such problems, the sum of the k's can be found using Vieta's formula, if the equation for k is a cubic, and the sum is the negative coefficient of x² divided by the coefficient of x³, or something like that.

Let me try to derive an equation for k.

From the earlier substitution, when I assumed x is a midpoint, I found k in terms of x. For example, for x = 45, k = 2025/98. Similarly, for x = 58, k = 841/18, and for x = 85, k = 7225/18. If these are the three k's, their sum is 2025/98 + 841/18 + 7225/18. Let me calculate this fraction:

Convert all to denominator 882 (which is 98*9):

2025/98 = (2025 * 9)/882 = 18225/882

841/18 = (841 * 49)/882 = 41209/882

7225/18 = (7225 * 49)/882. Let me compute 7225 * 49: 7225 * 50 = 361,250, minus 7,225 gives 354,025. So 7225/18 = 354,025/882.

Total numerator: 18,225 + 41,209 = 59,434, and 59,434 + 354,025 = 413,459, so the sum is 413,459/882.

This is the same as before. Now, note that 413,459 and 882: let me check if 413,459 is divisible by 7: 7*59,000 = 413,000, and 413,459 - 413,000 = 459. 459 divided by 7 is 65.571... No. Divisible by 3? 4 + 1 + 3 + 4 + 5 + 9 = 26, not divisible by 3. Divisible by 2? No. So it's irreducible.

But if the problem is from a competition, it's unlikely to have such a complex fraction as an answer. This suggests that my initial assumption about the k's is incorrect.

Given that, I must have made a wrong assumption. Let me try to think differently.

Let me recall that the original function f(x) has a minimum at two points. This can happen if the function is tangent to a horizontal line at two points, which would mean that the equation f(x) = m has two solutions, and f'(x) = 0 at those points. This is similar to having a double root in the equation f(x) = m and f'(x) = 0.

But this is too abstract. Alternatively, let me consider that the function f(x) is a rational function, and its critical points are determined by the equation we derived. For the function to have exactly two minima, the equation f'(x) = 0 must have exactly two positive real solutions, both of which are minima. This would happen if the quartic equation has two positive real roots, and the other two roots are either complex or negative.

To find the values of k for which this happens, we can analyze the number of positive real roots of the quartic equation. This is a classic problem in algebra, and it can be approached using Descartes' Rule of Signs or other methods.

Let me recall that Descartes' Rule of Signs states that the number of positive real roots of a polynomial is either equal to the number of sign changes in the sequence of coefficients or less by an even number.

However, since our equation is not a polynomial but a rational equation, it's complex to apply this directly. But if we recall that we had earlier derived the equation P(x) = 0, which is a quartic polynomial, then we can apply Descartes' Rule to P(x).

But to apply Descartes' Rule, I need the coefficients of P(x). However, deriving P(x) is complex, but let me try.

Earlier, we had P(x) = x*N'(x) - N(x) = 0, where N(x) = (x - 18)(x - 72)(x - 98)(x - k). Let me expand N(x) and N'(x).

But this is very time-consuming, but let me try to find the general form.

Let me denote N(x) = (x - a)(x - b)(x - c)(x - d), where a=18, b=72, c=98, d=k.

Then, N(x) = x^4 - (a + b + c + d)x^3 + (ab + ac + ad + bc + bd + cd)x^2 - (abc + abd + acd + bcd)x + abcd.

Then, N'(x) = 4x^3 - 3(a + b + c + d)x^2 + 2(ab + ac + ad + bc + bd + cd)x - (abc + abd + acd + bcd).

Then, x*N'(x) - N(x) = x*4x^3 - x*3(a + b + c + d)x^2 + x*2(ab + ... )x - x*(abc + ... ) - [x^4 - ... ].

Let me compute term by term:

x*N'(x) = 4x^4 - 3(a + b + c + d)x^3 + 2(ab + ac + ad + bc + bd + cd)x^2 - (abc + abd + acd + bcd)x.

Subtract N(x) = x^4 - (a + b + c + d)x^3 + (ab + ac + ad + bc + bd + cd)x^2 - (abc + abd + acd + bcd)x + abcd.

Thus, P(x) = x*N'(x) - N(x) = [4x^4 - 3(a + b + c + d)x^3 + 2(ab + ... )x^2 - (abc + ... )x] - [x^4 - (a + b + c + d)x^3 + (ab + ... )x^2 - (abc + ... )x + abcd].

Simplify term by term:

4x^4 - x^4 = 3x^4.

-3(a + b + c + d)x^3 + (a + b + c + d)x^3 = -2(a + b + c + d)x^3.

2(ab + ... )x^2 - (ab + ... )x^2 = (ab + ... )x^2.

- (abc + ... )x + (abc + ... )x = 0.

- abcd.

Thus, P(x) = 3x^4 - 2(a + b + c + d)x^3 + (ab + ac + ad + bc + bd + cd)x^2 - abcd.

Therefore, P(x) = 3x^4 - 2(a + b + c + d)x^3 + (ab + ac + ad + bc + bd + cd)x^2 - abcd.

This is a quartic polynomial!

Now, this is a significant simplification! Therefore, the equation P(x) = 0 is:

3x^4 - 2(a + b + c + d)x^3 + (ab + ac + ad + bc + bd + cd)x^2 - abcd = 0.

Where a=18, b=72, c=98, d=k.

Now, this is the polynomial we need to analyze for the number of positive real roots. The problem states that there are exactly three values of k such that this polynomial has exactly two positive real roots. Our task is to find these k's and sum them.

Now, this is a crucial insight! Let me write this out:

P(x) = 3x^4 - 2(18 + 72 + 98 + k)x^3 + (18*72 + 18*98 + 18*k + 72*98 + 72*k + 98*k)x^2 - (18*72*98*k) = 0.

Let me compute the coefficients step by step.

First, compute the coefficient of x^3: -2(18 + 72 + 98 + k) = -2(188 + k).

Next, the coefficient of x^2: let's compute the sum S = 18*72 + 18*98 + 18*k + 72*98 + 72*k + 98*k.

Compute each term:

18*72 = 1296

18*98 = 1764

18*k = 18k

72*98 = 7056

72*k = 72k

98*k = 98k

So S = 1296 + 1764 + 18k + 7056 + 72k + 98k.

Combine like terms:

Constants: 1296 + 1764 = 3060, and 3060 + 7056 = 10116.

k terms: 18k + 72k + 98k = (18 + 72 + 98)k = 188k.

So S = 10116 + 188k.

Finally, the constant term is -18*72*98*k.

Compute 18*72 = 1296, then 1296*98. Let me calculate 1296*100 = 129600, minus 1296*2 = 2592, so 129600 - 2592 = 127,008. Thus, the constant term is -127,008k.

Therefore, the polynomial P(x) is:

3x^4 - 2(188 + k)x^3 + (10116 + 188k)x^2 - 127008k = 0.
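
The numeric coefficients can be recomputed from the symmetric sums of the roots. A minimal sketch, assuming the general form P(x) = 3x^4 - 2*e1*x^3 + e2*x^2 - e4 derived above; `p_coefficients` is a hypothetical helper name.

```python
from itertools import combinations
from math import prod

FIXED = (18, 72, 98)   # the three known roots; k is the fourth

def p_coefficients(k):
    """Coefficients [x^4, x^3, x^2, x^1, x^0] of P(x) for a given k."""
    r = (*FIXED, k)
    e1 = sum(r)                                       # sum of roots
    e2 = sum(a * b for a, b in combinations(r, 2))    # sum of pairwise products
    e4 = prod(r)                                      # product of roots
    return [3, -2 * e1, e2, 0, -e4]

# Building blocks that appear in the text: 188, 10116, 127008.
print(sum(FIXED), sum(a * b for a, b in combinations(FIXED, 2)), prod(FIXED))
print(p_coefficients(1))
```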

Now, we need to find the values of k for which this polynomial has exactly two positive real roots.

To analyze the number of positive real roots, we can use Descartes' Rule of Signs, which tells us the number of positive real roots is equal to the number of sign changes in the sequence of coefficients, or less by an even number.

Let me write the coefficients of P(x):

3x^4 - 2(188 + k)x^3 + (10116 + 188k)x^2 - 127008k.

Let me denote the coefficients as:

a4 = 3

a3 = -2(188 + k)

a2 = 10116 + 188k

a1 = 0 (since there is no x term)

a0 = -127008k.

Wait, but the polynomial is 3x^4 + a3x^3 + a2x^2 + 0x + a0 = 0.

So the coefficients are: 3, a3, a2, 0, a0.

Now, applying Descartes' Rule of Signs for positive real roots, we need to count the number of sign changes in the sequence of coefficients: 3, a3, a2, 0, a0.

But note that the coefficient of x is 0, which we can ignore for the purpose of counting sign changes, as it doesn't contribute to a sign change.

Let me analyze the sign of each coefficient:

1. a4 = 3 > 0

2. a3 = -2(188 + k). Since k is positive (given), 188 + k > 0, so a3 < 0.

3. a2 = 10116 + 188k. Since k > 0, this is positive.

4. a0 = -127008k < 0.

So the sequence of signs is: +, -, +, 0, -.

Now, the number of sign changes:

From + to -: 1

From - to +: 2

From + to 0: no change (0 is neither positive nor negative)

From 0 to -: 3

So there are 3 sign changes. Therefore, by Descartes' Rule of Signs, the number of positive real roots is either 3 or 1.
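
The sign-change count can be reproduced with a few lines of code. This is an illustrative sketch under the assumption that coefficients are listed from highest to lowest degree; the function names are hypothetical. Zero coefficients are skipped, exactly as in the manual count above. Applying the same rule to P(-x) additionally bounds the number of negative roots.

```python
def descartes_sign_changes(coeffs):
    """Count sign changes in a coefficient list, ignoring zero coefficients."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def possible_positive_root_counts(coeffs):
    """Counts allowed by Descartes' rule: v, v - 2, v - 4, ... down to 0 or 1."""
    v = descartes_sign_changes(coeffs)
    return list(range(v, -1, -2))


def substitute_minus_x(coeffs):
    """Coefficients of P(-x): flip the sign of every odd-degree term."""
    n = len(coeffs) - 1
    return [-c if (n - i) % 2 else c for i, c in enumerate(coeffs)]


k = 18  # any positive k gives the same sign pattern
p = [3, -2 * (188 + k), 10116 + 188 * k, 0, -127008 * k]
print(possible_positive_root_counts(p))                      # positive roots
print(possible_positive_root_counts(substitute_minus_x(p)))  # negative roots
```

For k > 0 the pattern of P(-x) is +, +, +, 0, -, which has a single sign change, so the rule also implies exactly one negative root.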

But the problem states that there are exactly two positive real roots. This contradicts Descartes' Rule, which says it can only have 3 or 1 positive real roots. This suggests that my earlier derivation is incorrect, or there's a miscalculation.

Wait, let me check the polynomial derivation again. Earlier, I derived P(x) = 3x^4 - 2(a + b + c + d)x^3 + (ab + ac + ad + bc + bd + cd)x^2 - abcd. Is this correct?

Let me verify with a simple example. Let me take a simple case where N(x) = (x - a)(x - b)(x - c)(x - d), then P(x) = x*N'(x) - N(x). Let me compute this for a simple N(x).

Let N(x) = (x - a)(x - b)(x - c)(x - d). Then N'(x) = (x - b)(x - c)(x - d) + (x - a)(x - c)(x - d) + (x - a)(x - b)(x - d) + (x - a)(x - b)(x - c).

Then x*N'(x) - N(x) = x[(x - b)(x - c)(x - d) + ...] - (x - a)(x - b)(x - c)(x - d).

Let me factor out (x - a)(x - b)(x - c)(x - d). Wait, no. Let me take a specific example with a=0, b=0, c=0, d=0. No, that's trivial. Let me take a=1, b=2, c=3, d=4. Then N(x) = (x - 1)(x - 2)(x - 3)(x - 4). Let me compute P(x) = x*N'(x) - N(x).

But this would be time-consuming, but let me try for x=1, but this is not helpful.

Alternatively, let me trust the derivation, since it followed from expanding x*N'(x) - N(x) and simplifying, and the coefficients were derived step by step. If the sign changes are 3, then the number of positive roots is 3 or 1, but the problem says there are exactly two positive roots, which is not possible. This suggests a contradiction, which means my earlier derivation must be incorrect.

Wait, but the problem says that there are exactly three values of k such that the function has its minimum at exactly two positive real numbers x. This implies that for these three k's, the equation P(x) = 0 has exactly two positive real roots, which would violate Descartes' Rule of Signs, which says it should have 3 or 1. This suggests that there is an error in the derivation of P(x).

Let me check the derivation again. P(x) = x*N'(x) - N(x). Let me compute this for a simple N(x). Let N(x) = (x - a)(x - b)(x - c)(x - d). Let me expand x*N'(x) - N(x).

Let me take N(x) = (x - a)(x - b)(x - c)(x - d). Then N'(x) is the derivative, and x*N'(x) - N(x) is the derivative of x*N(x) - N(x)? No, it's x*N'(x) - N(x). Let me compute this:

Let me recall that d/dx [N(x)/x] = [N'(x)x - N(x)] / x² = f'(x). So, P(x) = x*N'(x) - N(x) is the numerator of f'(x). But this is a quartic polynomial, as we derived.

But if for N(x) = (x - a)(x - b)(x - c)(x - d), then P(x) = 3x^4 - 2(a + b + c + d)x^3 + ... as derived. But let me check with a simple case where a = b = c = d = 0. Then N(x) = x^4, N'(x) = 4x^3, P(x) = x*4x^3 - x^4 = 4x^4 - x^4 = 3x^4, which matches the formula: 3x^4 - 2(0 + 0 + 0 + 0)x^3 + (0 + 0 + 0 + 0 + 0 + 0)x^2 - 0 = 3x^4, which is correct. So the derivation is correct.
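
The general formula can also be confirmed without picking numbers. As a sketch, write N(x) in terms of the elementary symmetric polynomials of a, b, c, d (the symbols e_1, ..., e_4 are notation introduced here, not in the transcript):

```latex
\begin{aligned}
N(x) &= x^4 - e_1 x^3 + e_2 x^2 - e_3 x + e_4,\\
x\,N'(x) &= 4x^4 - 3 e_1 x^3 + 2 e_2 x^2 - e_3 x,\\
P(x) = x\,N'(x) - N(x) &= 3x^4 - 2 e_1 x^3 + e_2 x^2 - e_4,
\end{aligned}
\qquad
\begin{aligned}
e_1 &= a + b + c + d, & e_2 &= ab + ac + ad + bc + bd + cd,\\
e_3 &= abc + abd + acd + bcd, & e_4 &= abcd.
\end{aligned}
```

The e_3 terms cancel, which is why P(x) has no linear term.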

Therefore, the polynomial P(x) is correct, and the sign changes are 3, implying 3 or 1 positive real roots. But the problem states that for three values of k, there are exactly two positive real roots, which is impossible under Descartes' Rule. This suggests a contradiction, which must mean that my application of Descartes' Rule is incorrect.

Wait, the polynomial P(x) is 3x^4 - 2(188 + k)x^3 + (10116 + 188k)x^2 - 127008k. Let me check the signs again, but with specific values of k.

Let me choose a value of k and see.

Let me choose k = 18. Then the polynomial becomes:

3x^4 - 2(188 + 18)x^3 + (10116 + 188*18)x^2 - 127008*18.

Compute each term:

188 + 18 = 206, so -2*206 = -412.

188*18 = 3384, so 10116 + 3384 = 13500.

-127008*18 = -2,286,144.

So P(x) = 3x^4 - 412x^3 + 13500x^2 - 2,286,144.

Now, applying Descartes' Rule of Signs: coefficients are 3, -412, 13500, 0, -2,286,144. Sign changes: + to - (1), - to + (2), + to - (3). So 3 sign changes, implying 3 or 1 positive roots. Let me check if this polynomial has 3 or 1 positive roots. It's difficult to say without graphing, but let's assume it has 3 positive roots.
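
The count for k = 18 does not have to be assumed. One way to settle it exactly is Sturm's theorem, which is a different tool from Descartes' rule and is swapped in here only as an illustration. The sketch below uses exact rational arithmetic from the standard library; all function names are hypothetical.

```python
from fractions import Fraction


def derivative(p):
    """Derivative of a polynomial given as coefficients, highest degree first."""
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])]


def poly_rem(a, b):
    """Remainder of a divided by b, with leading zeros stripped."""
    a = [Fraction(c) for c in a]
    b = [Fraction(c) for c in b]
    while len(a) >= len(b):
        factor = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= factor * b[i]
        a.pop(0)  # the leading coefficient is now zero
    while a and a[0] == 0:
        a.pop(0)
    return a


def sturm_chain(p):
    """Sturm sequence: p, p', then negated remainders until the remainder vanishes."""
    chain = [list(p), derivative(p)]
    while True:
        r = poly_rem(chain[-2], chain[-1])
        if not r:
            return chain
        chain.append([-c for c in r])


def sign_changes(values):
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def count_positive_roots(p):
    """Number of distinct real roots in (0, +inf); assumes p(0) != 0."""
    chain = sturm_chain(p)
    at_zero = [q[-1] for q in chain]      # value of each polynomial at x = 0
    at_infinity = [q[0] for q in chain]   # sign at +inf is the leading coefficient's sign
    return sign_changes(at_zero) - sign_changes(at_infinity)


print(count_positive_roots([3, -412, 13500, 0, -2286144]))
```

For this k the polynomial vanishes at x = 18 (314928 - 2402784 + 4374000 - 2286144 = 0), and the Sturm count reports three distinct positive roots, which is consistent with the "3 or 1" allowed by Descartes' rule.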

Now, if I choose k such that the constant term is zero, i.e., -127008k = 0, which implies k = 0, but k must be positive. So no.

Alternatively, if k is such that the coefficient of x^2 is zero: 10116 + 188k = 0, which would require k negative, which is not allowed.

Therefore, the number of positive roots is always 3 or 1, depending on k. But the problem states that for three values of k, there are exactly two positive real roots. This is a contradiction, which suggests that there is a misunderstanding in the problem.

Wait, but the original function f(x) is defined over positive real numbers, but the polynomial P(x) = x*N'(x) - N(x) is derived from f'(x) = 0, which includes x=0 as a potential root, but x=0 is not in the domain of f(x). Therefore, the roots of P(x) = 0 include x=0? Let me check.

If x=0, P(0) = 0 - 0 + 0 - 127008k = -127008k ≠ 0, since k > 0. So x=0 is not a root, and every positive root of P(x) = 0 lies in the domain of f(x) (positive reals). But according to Descartes' Rule, the number of positive roots is 3 or 1. But the problem says there are exactly two. This suggests that there is an error in the problem statement or my understanding of it.

Wait, but the original problem says "achieves its minimum value at exactly two positive real numbers x". This could mean that the function has two minima, but the equation f'(x) = 0 has three positive roots, one of which is a maximum. Therefore, the number of minima is two, and the number of maxima is one. In this case, the total number of critical points is three, but the number of minima is two. However, how to determine when this occurs?

This would require analyzing the second derivative or using the first derivative test. However, this is complex without knowing the exact roots.

Alternatively, if the polynomial P(x) has 3 positive roots, but one of them is a point of inflection or a maximum, then the function f(x) would have two minima and one maximum. In this case, the problem's condition would be satisfied. Therefore, the problem is asking for the values of k for which P(x) = 0 has three positive roots, but two of them are minima and one is a maximum. However, the number of positive roots is still three, not two. The problem says "achieves its minimum value at exactly two positive real numbers x", which could mean that there are two points where the function is minimized, even if there are three critical points. But how to determine when this happens?

This suggests that the problem is not about the number of critical points, but about the number of minima. This is a crucial distinction. Therefore, even if P(x) = 0 has three positive roots, if two of them are minima and one is a maximum, then the function achieves its minimum at two points. Therefore, the problem is not about the number of roots of P(x) = 0, but about the number of minima, which requires analyzing the nature of the critical points.

This complicates things, but perhaps we can use the fact that for the function to have two minima, the derivative must change from negative to positive at two points and from positive to negative at one point. This would mean that P(x) = 0 has three positive roots, with two of them being minima and one being a maximum. However, how to find the values of k for which this occurs?

This is very complex, but perhaps we can use the fact that for the function to have exactly two minima, the polynomial P(x) must have three positive real roots, and the second derivative f''(x) must be positive at the two minima and negative at the maximum. This would require solving for k such that P(x) = 0 and f''(x) > 0 or < 0 at the roots, but this is too complex.
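
The second-derivative test at a critical point can be phrased in terms of P alone. A short derivation, using only f'(x) = P(x)/x² from earlier in the transcript (x_0 denotes any root of P and is notation introduced here):

```latex
f'(x) = \frac{P(x)}{x^2}
\quad\Longrightarrow\quad
f''(x) = \frac{P'(x)}{x^2} - \frac{2P(x)}{x^3},
\qquad\text{so}\qquad
f''(x_0) = \frac{P'(x_0)}{x_0^{2}} \quad\text{whenever } P(x_0) = 0 .
```

A simple root x_0 > 0 of P is therefore a local minimum of f when P'(x_0) > 0 and a local maximum when P'(x_0) < 0; no separate computation of f'' is needed.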

Given the time I've spent and the complexity of the problem, I think I need to refer back to the initial assumption where I found k values by assuming x is the midpoint between two roots, and despite the complex sum, it might be the correct answer, and perhaps the sum simplifies.

Let me compute the sum again: 2025/98 + 841/18 + 7225/18.

Let me compute each fraction as decimals:

2025/98 ≈ 20.6632653061

841/18 ≈ 46.7222222222

7225/18 ≈ 401.3888888889

Sum ≈ 20.6632653061 + 46.7222222222 + 401.3888888889 ≈ 468.7743764172
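
As a cross-check on the decimals above, the same sum can be evaluated exactly with the standard-library `fractions` module. This is a small illustrative sketch; the variable names are arbitrary.

```python
from fractions import Fraction

# The three candidate k values from the earlier midpoint assumption.
terms = [Fraction(2025, 98), Fraction(841, 18), Fraction(7225, 18)]
total = sum(terms)

print(total)         # exact value as a reduced fraction
print(float(total))  # decimal approximation
```

Exact rational arithmetic avoids the rounding that accumulates when the three decimal approximations are added by hand.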

Now, let me check if this is a rational number. The exact value is 413,459/882. L

SYMBOL INDEX (157 symbols across 21 files)

FILE: Extra-Chapter/generation-method/llm_generation.py
  function test_decoding_strategies (line 4) | def test_decoding_strategies():
  function test_original_generation (line 117) | def test_original_generation():

FILE: Extra-Chapter/s1-vllm-thinking-budget/s1.py
  function build_input (line 5) | def build_input(prompt, tokenizer):
  function count_thinking_token (line 18) | def count_thinking_token(outputs, tokenizer):
  function count_token (line 24) | def count_token(string, tokenizer):
  function run_thinking_budget_sample (line 28) | def run_thinking_budget_sample(llm_model, tokenizer, user_input, thinkin...
  function run_sample (line 87) | def run_sample(llm_model, tokenizer, user_input):

FILE: docs/chapter2/code/transformer.py
  class ModelArgs (line 9) | class ModelArgs:
  class MultiHeadAttention (line 21) | class MultiHeadAttention(nn.Module):
    method __init__ (line 23) | def __init__(self, args: ModelArgs, is_causal=False):
    method forward (line 55) | def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
  class LayerNorm (line 100) | class LayerNorm(nn.Module):
    method __init__ (line 102) | def __init__(self, features, eps=1e-6):
    method forward (line 109) | def forward(self, x):
  class MLP (line 116) | class MLP(nn.Module):
    method __init__ (line 118) | def __init__(self, dim: int, hidden_dim: int, dropout: float):
    method forward (line 127) | def forward(self, x):
  class EncoderLayer (line 134) | class EncoderLayer(nn.Module):
    method __init__ (line 135) | def __init__(self, args):
    method forward (line 144) | def forward(self, x):
  class Encoder (line 153) | class Encoder(nn.Module):
    method __init__ (line 155) | def __init__(self, args):
    method forward (line 161) | def forward(self, x):
  class DecoderLayer (line 167) | class DecoderLayer(nn.Module):
    method __init__ (line 169) | def __init__(self, args):
    method forward (line 182) | def forward(self, x, enc_out):
  class Decoder (line 194) | class Decoder(nn.Module):
    method __init__ (line 196) | def __init__(self, args):
    method forward (line 202) | def forward(self, x, enc_out):
  class PositionalEncoding (line 208) | class PositionalEncoding(nn.Module):
    method __init__ (line 211) | def __init__(self, args):
    method forward (line 229) | def forward(self, x):
  class Transformer (line 235) | class Transformer(nn.Module):
    method __init__ (line 238) | def __init__(self, args):
    method get_num_params (line 262) | def get_num_params(self, non_embedding=False):
    method _init_weights (line 272) | def _init_weights(self, module):
    method forward (line 283) | def forward(self, idx, targets=None):
  function main (line 322) | def main():

FILE: docs/chapter5/code/dataset.py
  class PretrainDataset (line 10) | class PretrainDataset(Dataset):
    method __init__ (line 11) | def __init__(self, data_path, tokenizer, max_length=512):
    method __len__ (line 25) | def __len__(self):
    method __getitem__ (line 28) | def __getitem__(self, index: int):
  class SFTDataset (line 48) | class SFTDataset(Dataset):
    method __init__ (line 49) | def __init__(self, data_path, tokenizer, max_length=512):
    method __len__ (line 62) | def __len__(self):
    method generate_loss_mask (line 65) | def generate_loss_mask(self, input_ids):
    method __getitem__ (line 101) | def __getitem__(self, index: int):

FILE: docs/chapter5/code/ddp_pretrain.py
  function Logger (line 25) | def Logger(content):
  function get_lr (line 34) | def get_lr(it, all):
  function train_epoch (line 68) | def train_epoch(epoch):
  function init_model (line 171) | def init_model():

FILE: docs/chapter5/code/ddp_sft_full.py
  function Logger (line 24) | def Logger(content):
  function get_lr (line 28) | def get_lr(it, all):
  function train_epoch (line 51) | def train_epoch(epoch):
  function init_model (line 122) | def init_model():

FILE: docs/chapter5/code/deal_dataset.py
  function split_text (line 14) | def split_text(text, chunk_size=512):
  function convert_message (line 29) | def convert_message(data):

FILE: docs/chapter5/code/export_model.py
  function count_parameters (line 9) | def count_parameters(model):
  function export_model (line 13) | def export_model(tokenizer_path, model_config, model_ckpt_path, save_dir...

FILE: docs/chapter5/code/k_model.py
  class ModelConfig (line 14) | class ModelConfig(PretrainedConfig):
    method __init__ (line 16) | def __init__(
  class RMSNorm (line 46) | class RMSNorm(nn.Module):
    method __init__ (line 47) | def __init__(self, dim: int, eps: float):
    method _norm (line 54) | def _norm(self, x):
    method forward (line 61) | def forward(self, x):
  function precompute_freqs_cis (line 70) | def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
  function reshape_for_broadcast (line 85) | def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
  function apply_rotary_emb (line 97) | def apply_rotary_emb(
  function repeat_kv (line 124) | def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
  class Attention (line 139) | class Attention(nn.Module):
    method __init__ (line 140) | def __init__(self, args: ModelConfig):
    method forward (line 182) | def forward(self, x: torch.Tensor, freqs_cos: torch.Tensor, freqs_sin:...
  class MLP (line 250) | class MLP(nn.Module):
    method __init__ (line 251) | def __init__(self, dim: int, hidden_dim: int, multiple_of: int, dropou...
    method forward (line 268) | def forward(self, x):
  class DecoderLayer (line 276) | class DecoderLayer(nn.Module):
    method __init__ (line 277) | def __init__(self, layer_id: int, args: ModelConfig):
    method forward (line 301) | def forward(self, x, freqs_cos, freqs_sin, attention_mask: Optional[to...
  class Transformer (line 309) | class Transformer(PreTrainedModel):
    method __init__ (line 313) | def __init__(self, args: ModelConfig = None):
    method _init_weights (line 355) | def _init_weights(self, module):
    method _prepare_attention_mask (line 364) | def _prepare_attention_mask(self, attention_mask: Optional[torch.Tenso...
    method _left_pad_by_attention_mask (line 378) | def _left_pad_by_attention_mask(
    method forward (line 403) | def forward(self, tokens: torch.Tensor, targets: Optional[torch.Tensor...
    method generate (line 463) | def generate(
    method _greedy_decode (line 527) | def _greedy_decode(self, logits: torch.Tensor) -> torch.Tensor:
    method _random_sample (line 540) | def _random_sample(self, logits: torch.Tensor, temperature: float = 1....
    method _beam_search (line 566) | def _beam_search(self, idx: torch.Tensor, max_new_tokens: int, num_bea...
    method generate_super (line 677) | def generate_super(self,

FILE: docs/chapter5/code/model_sample.py
  class TextGenerator (line 9) | class TextGenerator:
    method __init__ (line 10) | def __init__(self,
    method chat_template (line 62) | def chat_template(self, prompt):
    method sft_sample (line 69) | def sft_sample(self,
    method pretrain_sample (line 99) | def pretrain_sample(self,

FILE: docs/chapter5/code/train_tokenizer.py
  function read_texts_from_jsonl (line 17) | def read_texts_from_jsonl(file_path: str) -> Generator[str, None, None]:
  function create_tokenizer_config (line 33) | def create_tokenizer_config(save_dir: str) -> None:
  function train_tokenizer (line 77) | def train_tokenizer(data_path: str, save_dir: str, vocab_size: int = 819...
  function eval_tokenizer (line 128) | def eval_tokenizer(tokenizer_path: str) -> None:
  function main (line 174) | def main():

FILE: docs/chapter6/code/finetune.py
  class ModelArguments (line 45) | class ModelArguments:
  class DataTrainingArguments (line 70) | class DataTrainingArguments:
  function preprocess (line 87) | def preprocess(sources, tokenizer, max_len, system_message: str = "You a...
  class SupervisedDataset (line 157) | class SupervisedDataset(Dataset):
    method __init__ (line 159) | def __init__(self, raw_data, tokenizer, max_len: int):
    method __len__ (line 169) | def __len__(self):
    method __getitem__ (line 172) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function main (line 180) | def main():

FILE: docs/chapter6/code/pretrain.py
  class ModelArguments (line 41) | class ModelArguments:
  class DataTrainingArguments (line 72) | class DataTrainingArguments:
  function main (line 92) | def main():

FILE: docs/chapter7/Agent/src/core.py
  class Agent (line 14) | class Agent:
    method __init__ (line 15) | def __init__(self, client: OpenAI, model: str = "Qwen/Qwen2.5-32B-Inst...
    method get_tool_schema (line 24) | def get_tool_schema(self) -> List[Dict[str, Any]]:
    method handle_tool_call (line 28) | def handle_tool_call(self, tool_call):
    method get_completion (line 42) | def get_completion(self, prompt) -> str:

FILE: docs/chapter7/Agent/src/tools.py
  function get_current_datetime (line 6) | def get_current_datetime() -> str:
  function add (line 15) | def add(a: float, b: float):
  function mul (line 24) | def mul(a: float, b: float):
  function compare (line 33) | def compare(a: float, b: float):
  function count_letter_in_string (line 47) | def count_letter_in_string(a: str, b: str):
  function search_wikipedia (line 60) | def search_wikipedia(query: str) -> str:
  function get_current_temperature (line 84) | def get_current_temperature(latitude: float, longitude: float) -> str:

FILE: docs/chapter7/Agent/src/utils.py
  function function_to_json (line 5) | def function_to_json(func) -> dict:

FILE: docs/chapter7/Agent/web_demo.py
  function load_agent (line 22) | def load_agent():

FILE: docs/chapter7/RAG/Embeddings.py
  class BaseEmbeddings (line 21) | class BaseEmbeddings:
    method __init__ (line 25) | def __init__(self, path: str, is_api: bool) -> None:
    method get_embedding (line 35) | def get_embedding(self, text: str, model: str) -> List[float]:
    method cosine_similarity (line 49) | def cosine_similarity(cls, vector1: List[float], vector2: List[float])...
  class OpenAIEmbedding (line 82) | class OpenAIEmbedding(BaseEmbeddings):
    method __init__ (line 86) | def __init__(self, path: str = '', is_api: bool = True) -> None:
    method get_embedding (line 95) | def get_embedding(self, text: str, model: str = "BAAI/bge-m3") -> List...

FILE: docs/chapter7/RAG/LLM.py
  class BaseModel (line 29) | class BaseModel:
    method __init__ (line 30) | def __init__(self, model) -> None:
    method chat (line 33) | def chat(self, prompt: str, history: List[dict], content: str) -> str:
    method load_model (line 36) | def load_model(self):
  class OpenAIChat (line 39) | class OpenAIChat(BaseModel):
    method __init__ (line 40) | def __init__(self, model: str = "Qwen/Qwen2.5-32B-Instruct") -> None:
    method chat (line 43) | def chat(self, prompt: str, history: List[dict], content: str) -> str:

FILE: docs/chapter7/RAG/VectorBase.py
  class VectorStore (line 19) | class VectorStore:
    method __init__ (line 20) | def __init__(self, document: List[str] = ['']) -> None:
    method get_vector (line 23) | def get_vector(self, EmbeddingModel: BaseEmbeddings) -> List[List[floa...
    method persist (line 30) | def persist(self, path: str = 'storage'):
    method load_vector (line 39) | def load_vector(self, path: str = 'storage'):
    method get_similarity (line 45) | def get_similarity(self, vector1: List[float], vector2: List[float]) -...
    method query (line 48) | def query(self, query: str, EmbeddingModel: BaseEmbeddings, k: int = 1...

FILE: docs/chapter7/RAG/utils.py
  class ReadFiles (line 25) | class ReadFiles:
    method __init__ (line 30) | def __init__(self, path: str) -> None:
    method get_files (line 34) | def get_files(self):
    method get_content (line 50) | def get_content(self, max_token_len: int = 600, cover_content: int = 1...
    method get_chunk (line 61) | def get_chunk(cls, text: str, max_token_len: int = 600, cover_content:...
    method read_file_content (line 136) | def read_file_content(cls, file_path: str):
    method read_pdf (line 148) | def read_pdf(cls, file_path: str):
    method read_markdown (line 158) | def read_markdown(cls, file_path: str):
    method read_text (line 171) | def read_text(cls, file_path: str):
  class Documents (line 177) | class Documents:
    method __init__ (line 181) | def __init__(self, path: str = '') -> None:
    method get_content (line 184) | def get_content(self):

Condensed preview — 76 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,243K chars).

[
  {
    "path": ".github/ISSUE_TEMPLATE/book_issue.yml",
    "chars": 2250,
    "preview": "name: Happy-LLM 问题反馈 / Textbook Issue Report\ndescription: 报告 Happy-LLM 中的代码/内容问题 / Report code/content issues in textboo"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "chars": 343,
    "preview": "blank_issues_enabled: false\ncontact_links:\n  - name: 📚 Happy-LLM 教程文档\n    url: https://github.com/datawhalechina/happy-l"
  },
  {
    "path": ".gitignore",
    "chars": 3121,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n.idea/\n# C extensions\n*.so\n\n# Distribution / p"
  },
  {
    "path": "Extra-Chapter/CDDRS/CDDRS.ipynb",
    "chars": 71544,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4d1f9d85\",\n   \"metadata\": {},\n   \"source\": [\n    \"# 建筑文档智能RAG审查系"
  },
  {
    "path": "Extra-Chapter/CDDRS/readme.md",
    "chars": 29537,
    "preview": "# 建筑文档智能RAG审查系统\n\n一个从零开始实现的建筑文档智能审查系统，旨在帮助开发者理解知识引导检索在专业领域文档审查中的核心原理和实现细节。\n\n## 项目动机\n\n建筑施工交底文档的合规性审查是保障施工项目安全性、经济性的关键环节。在施"
  },
  {
    "path": "Extra-Chapter/Readme.md",
    "chars": 2440,
    "preview": "<div align=\"center\">\n  <h2>🚀 Happy-LLM 扩展内容</h2>\n  <p><em>社区驱动的大语言模型学习资源</em></p>\n</div>\n\n---\n\n## 📖 为什么会有 Extra Chapter？"
  },
  {
    "path": "Extra-Chapter/generation-method/llm_generation.py",
    "chars": 4823,
    "preview": "import torch\nfrom modelscope import AutoModelForCausalLM, AutoTokenizer\n\ndef test_decoding_strategies():\n    \"\"\"\n    测试三"
  },
  {
    "path": "Extra-Chapter/generation-method/model_down.py",
    "chars": 148,
    "preview": "from modelscope import snapshot_download\n\nmodel_dir = snapshot_download('kmno4zx/happy-llm-215M-sft', cache_dir='your/ca"
  },
  {
    "path": "Extra-Chapter/generation-method/readme.md",
    "chars": 10567,
    "preview": "# 大模型生成Token的方式\n\n> 代码已更新到 Happy-LLM 仓库第五章的代码中。\n\n## 贪婪解码（Greedy Decoding）\n\n### 原理说明\n贪婪解码是最简单直接的文本生成策略。在每一步生成时，它总是选择概率最大的那"
  },
  {
    "path": "Extra-Chapter/s1-vllm-thinking-budget/output/output_1754208752.txt",
    "chars": 93050,
    "preview": "<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n<|im_start|>user\nT"
  },
  {
    "path": "Extra-Chapter/s1-vllm-thinking-budget/output/output_1754209653.txt",
    "chars": 96598,
    "preview": "<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n<|im_start|>user\nT"
  },
  {
    "path": "Extra-Chapter/s1-vllm-thinking-budget/readme.md",
    "chars": 5537,
    "preview": "# S1: Thinking Budget with vLLM\n\n首先，我们来了解一下AI教母李飞飞教授关于 Test-time scaling 的论文：[*《s1: Simple test-time scaling》*](http://a"
  },
  {
    "path": "Extra-Chapter/s1-vllm-thinking-budget/s1.py",
    "chars": 4778,
    "preview": "from vllm import LLM, SamplingParams\nfrom transformers import AutoTokenizer\nimport time\n\ndef build_input(prompt, tokeniz"
  },
  {
    "path": "Extra-Chapter/text-data-processing/readme.md",
    "chars": 52168,
    "preview": "# Extra-Chapter：text-data-processing\n\n本补充章节旨在对大模型数据处理做一个简单的梳理与具体的代码实现，帮助大家对于大模型的数据处理有一个更加清晰的认识。\n\n## 1.理解词嵌入\n\n在深度神经网络模型中，"
  },
  {
    "path": "Extra-Chapter/transformer-architecture/readme.md",
    "chars": 11941,
    "preview": "# transformer-architecture\n\n当本节内容作为第二章 Transformer架构中2.2章节Encoder-Decoder的补充，结合Pytorch的API源码，从Transformer的整体设计上来解释Transf"
  },
  {
    "path": "Extra-Chapter/vlm-concatenation-finetune/README.md",
    "chars": 26292,
    "preview": "# Qwen3-\"VL\"——超小中文多模态模型的“拼接微调”之路1（附代码和SwanLab记录）\n\n* 作者：情感机器实验室——陈少宏\n\n* 邮箱：<shaohon_chen@115lab.club>\n\n* GitHub：[https://"
  },
  {
    "path": "Extra-Chapter/why-fine-tune-small-large-language-models/readme.md",
    "chars": 9967,
    "preview": "# 大模型都这么厉害了，微调0.6B的小模型有什么意义？\n\n大家在日常使用Deepseek-R1或者是阿里新发布的Qwen3模型，他们的模型都是能力很强，所提供的API服也都可以满足大家的日常或者是公司开发所需。但大家也可以想一个简单的问题"
  },
  {
    "path": "Extra-Chapter/why-fine-tune-small-large-language-models/why-fine-tune-small-large-language-models.ipynb",
    "chars": 32846,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"QspZPC3nYqVv\"\n   },\n   \"source\": [\n    \"# 大模型都"
  },
  {
    "path": "LICENSE.txt",
    "chars": 20845,
    "preview": "Attribution-NonCommercial-ShareAlike 4.0 International\n\n================================================================"
  },
  {
    "path": "README.md",
    "chars": 7295,
    "preview": "<div align='center'>\n    <img src=\"./images/head.jpg\" alt=\"alt text\" width=\"100%\">\n    <h1>Happy-LLM</h1>\n</div>\n\n<div a"
  },
  {
    "path": "README_en.md",
    "chars": 11621,
    "preview": "<div align='center'>\n    <img src=\"./images/head.jpg\" alt=\"alt text\" width=\"100%\">\n    <h1>Happy-LLM</h1>\n</div>\n\n<div a"
  },
  {
    "path": "docs/.nojekyll",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "docs/README.md",
    "chars": 5298,
    "preview": "<div align='center'>\n    <img src=\"./images/head.jpg\" alt=\"alt text\" width=\"100%\">\n    <h1>Happy-LLM</h1>\n</div>\n\n<div a"
  },
  {
    "path": "docs/_sidebar.md",
    "chars": 382,
    "preview": "- [Happy-LLM](./README.md)\n  - [前言](./前言.md)\n  - [第一章 NLP 基础概念](./chapter1/第一章%20NLP基础概念.md)\n  - [第二章 Transformer 架构](./"
  },
  {
    "path": "docs/chapter1/第一章 NLP基础概念.md",
    "chars": 11552,
    "preview": "# 第一章 NLP 基础概念\n\n自然语言处理（Natural Language Processing，NLP）作为人工智能领域的一个重要分支，旨在使计算机能够理解和处理人类语言，实现人机之间的自然交流。随着信息技术的飞速发展，文本数据已成为"
  },
  {
    "path": "docs/chapter2/code/requirements.txt",
    "chars": 590,
    "preview": "certifi==2025.4.26\ncharset-normalizer==3.4.2\nfilelock==3.18.0\nfsspec==2025.5.1\nhf-xet==1.1.3\nhuggingface-hub==0.32.4\nidn"
  },
  {
    "path": "docs/chapter2/code/transformer.py",
    "chars": 12062,
    "preview": "import torch\nimport math\nfrom torch import nn\nfrom dataclasses import dataclass\nfrom transformers import BertTokenizer\ni"
  },
  {
    "path": "docs/chapter2/第二章 Transformer架构.md",
    "chars": 32889,
    "preview": "# 第二章 Transformer 架构\n\n## 2.1 注意力机制\n\n### 2.1.1 什么是注意力机制\n\n随着 NLP 从统计机器学习向深度学习迈进，作为 NLP 核心问题的文本表示方法也逐渐从统计学习向深度学习迈进。正如我们在第一章"
  },
  {
    "path": "docs/chapter3/第三章 预训练语言模型.md",
    "chars": 33461,
    "preview": "# 第三章 预训练语言模型\n\n## 3.1 Encoder-only PLM\n\n在上一章，我们详细讲解了给 NLP 领域带来巨大变革注意力机制以及使用注意力机制搭建的模型 Transformer，NLP 模型的里程碑式转变也就自此而始。在上"
  },
  {
    "path": "docs/chapter4/第四章 大语言模型.md",
    "chars": 22201,
    "preview": "# 第四章 大语言模型\n\n## 4.1 什么是 LLM\n\n在前三章，我们从 NLP 的定义与主要任务出发，介绍了引发 NLP 领域重大变革的核心思想——注意力机制与 Transformer 架构。随着 Transformer 架构的横空出世"
  },
  {
    "path": "docs/chapter5/code/dataset.py",
    "chars": 4515,
    "preview": "import json\nimport random\nimport re\nimport pandas as pd\nimport numpy as np\nfrom torch.utils.data import Dataset, DataLoa"
  },
  {
    "path": "docs/chapter5/code/ddp_pretrain.py",
    "chars": 9940,
    "preview": "# -*- coding: utf-8 -*-\nimport os\nimport platform\nimport argparse\nimport time\nimport warnings\nimport math\nimport pandas "
  },
  {
    "path": "docs/chapter5/code/ddp_sft_full.py",
    "chars": 7764,
    "preview": "import os\nimport platform\nimport argparse\nimport time\nimport warnings\nimport math\nimport pandas as pd\nimport torch\nfrom "
  },
  {
    "path": "docs/chapter5/code/deal_dataset.py",
    "chars": 1745,
    "preview": "import os\nimport json\nfrom tqdm import tqdm\n\n# pretrain_data 为运行download_dataset.sh时，下载的pretrain_data本地路径\npretrain_data "
  },
  {
    "path": "docs/chapter5/code/download_dataset.sh",
    "chars": 576,
    "preview": "#!/bin/bash\n\n# 设置环境变量\nexport HF_ENDPOINT=https://hf-mirror.com\n\n# dataset dir 下载到本地目录\ndataset_dir=\"your local dataset di"
  },
  {
    "path": "docs/chapter5/code/export_model.py",
    "chars": 1827,
    "preview": "import torch\nimport warnings\nfrom transformers import AutoTokenizer\nfrom k_model import Transformer, ModelConfig\n\nwarnin"
  },
  {
    "path": "docs/chapter5/code/k_model.py",
    "chars": 30249,
    "preview": "import math\nimport inspect\nfrom dataclasses import dataclass\nfrom typing import Any, Optional, Tuple\nimport torch\nimport"
  },
  {
    "path": "docs/chapter5/code/model_sample.py",
    "chars": 7241,
    "preview": "import os\nimport pickle\nfrom contextlib import nullcontext\nimport torch\nfrom k_model import ModelConfig, Transformer\nfro"
  },
  {
    "path": "docs/chapter5/code/requirements.txt",
    "chars": 379,
    "preview": "torch==2.4.0 \ntorchvision==0.19.0\ndatasets==2.16.1\njsonlines==4.0.0\nmarshmallow==3.22.0\nmatplotlib==3.5.1\nngrok==1.4.0\nn"
  },
  {
    "path": "docs/chapter5/code/tokenizer_k/special_tokens_map.json",
    "chars": 195,
    "preview": "{\n    \"bos_token\": \"<|im_start|>\",\n    \"eos_token\": \"<|im_end|>\",\n    \"unk_token\": \"<unk>\",\n    \"pad_token\": \"<|im_end|>"
  },
  {
    "path": "docs/chapter5/code/tokenizer_k/tokenizer.json",
    "chars": 221907,
    "preview": "{\n  \"version\": \"1.0\",\n  \"truncation\": null,\n  \"padding\": null,\n  \"added_tokens\": [\n    {\n      \"id\": 0,\n      \"content\":"
  },
  {
    "path": "docs/chapter5/code/tokenizer_k/tokenizer_config.json",
    "chars": 791,
    "preview": "{\n    \"add_bos_token\": false,\n    \"add_eos_token\": false,\n    \"add_prefix_space\": false,\n    \"bos_token\": \"<|im_start|>\""
  },
  {
    "path": "docs/chapter5/code/train_tokenizer.py",
    "chars": 6088,
    "preview": "import random\nimport json\nimport os\nfrom transformers import AutoTokenizer, PreTrainedTokenizerFast\nfrom tokenizers impo"
  },
  {
    "path": "docs/chapter5/code/windows_download_dataset.sh",
    "chars": 1144,
    "preview": "# Windows下载方式\n# 使用PowerShell下载\n# 暂时为当前PowerShell界面设置环境，关闭Powershell环境自动消失\n$env:HF_ENDPOINT = \"https://hf-mirror.com\"\n\n# "
  },
  {
    "path": "docs/chapter5/第五章 动手搭建大模型.md",
    "chars": 75705,
    "preview": "# 第五章 动手搭建大模型\n\n\n\n## 5.1 动手实现一个 LLaMA2 大模型\n\nMeta（原Facebook）于2023年2月发布第一款基于Transformer结构的大型语言模型LLaMA，并于同年7月发布同系列模型LLaMA2。我"
  },
  {
    "path": "docs/chapter6/6.4[WIP] 偏好对齐.md",
    "chars": 4277,
    "preview": "# 6.4 通过强化学习进行偏好对齐\n\n在我们进入强化学习的细节之前，让我们先看看它的起源。强化学习（Reinforcement Learning，简称RL）其实并不是什么新鲜事物。它的理论基础可以追溯到20世纪初的行为心理学，尤其是Edw"
  },
  {
    "path": "docs/chapter6/code/download_dataset.py",
    "chars": 572,
    "preview": "import os\nimport json\nfrom tqdm import tqdm\n\n# 下载预训练数据集\n# os.system(\"modelscope download --dataset ddzhu123/seq-monkey m"
  },
  {
    "path": "docs/chapter6/code/download_model.py",
    "chars": 186,
    "preview": "import os\n\n# 设置环境变量\nos.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n\n# 下载模型\nos.system('huggingface-cli download --re"
  },
  {
    "path": "docs/chapter6/code/ds_config_zero2.json",
    "chars": 1227,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_"
  },
  {
    "path": "docs/chapter6/code/finetune.py",
    "chars": 8359,
    "preview": "'''\nSFT 脚本\n'''\n\nimport logging\nimport math\nimport os\nimport sys\nfrom dataclasses import dataclass, field\nfrom torchdata."
  },
  {
    "path": "docs/chapter6/code/finetune.sh",
    "chars": 821,
    "preview": "CUDA_VISIBLE_DEVICES=0,1\n\ndeepspeed finetune.py \\\n    --model_name_or_path autodl-tmp/qwen-1.5b \\\n    --train_files auto"
  },
  {
    "path": "docs/chapter6/code/pretrain.ipynb",
    "chars": 46263,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"bb9102c3-5b8d-4295-8f29-113b35ec5679\",\n   \"metadata\": {},\n   \"so"
  },
  {
    "path": "docs/chapter6/code/pretrain.py",
    "chars": 8130,
    "preview": "'''\n预训练脚本\n'''\n\nimport logging\nimport math\nimport os\nimport sys\nfrom dataclasses import dataclass, field\nfrom torchdata.d"
  },
  {
    "path": "docs/chapter6/code/pretrain.sh",
    "chars": 930,
    "preview": "CUDA_VISIBLE_DEVICES=0,1\n\ndeepspeed pretrain.py \\\n    --config_name autodl-tmp/qwen-1.5b \\\n    --tokenizer_name autodl-t"
  },
  {
    "path": "docs/chapter6/code/process_dataset.ipynb",
    "chars": 1035,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"8c3d4501-a268-418a-b5f6-59078094aab5\",\n   \""
  },
  {
    "path": "docs/chapter6/code/requirements.txt",
    "chars": 70,
    "preview": "transformers\ndatasets\ntorch\ntorchdata==0.9.0\ndeepspeed\npandas\nswanlab\n"
  },
  {
    "path": "docs/chapter6/code/whole.ipynb",
    "chars": 2058,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"整体代码拆分\"\n   ]\n  },\n  {\n   \"cell_type"
  },
  {
    "path": "docs/chapter6/readme.md",
    "chars": 363,
    "preview": "# 第六章 基于 transformers 的 LLM 训练\n\n注：本章的核心内容是，基于 transformers 框架实现 LLM 预训练和微调\n\n1. 框架简述：\n   1. transformers\n   2. deepspeed\n"
  },
  {
    "path": "docs/chapter6/第六章 大模型训练流程实践.md",
    "chars": 34754,
    "preview": "# 第六章 大模型训练流程实践\n\n第五章中，我们从零开始动手搭建了 LLaMA2 模型，并完整实现了其预训练和微调的全流程。在本章中，我们将深入探讨大模型的训练流程实践，重点介绍如何利用主流的大模型框架高效地进行模型训练和性能优化。\n\n##"
  },
  {
    "path": "docs/chapter7/Agent/demo.py",
    "chars": 771,
    "preview": "from src.core import Agent\nfrom src.tools import add, count_letter_in_string, compare, get_current_datetime, search_wiki"
  },
  {
    "path": "docs/chapter7/Agent/requirements.txt",
    "chars": 930,
    "preview": "altair==5.5.0\nannotated-types==0.7.0\nanyio==4.9.0\nattrs==25.3.0\nbeautifulsoup4==4.13.4\nblinker==1.9.0\ncachetools==6.1.0\n"
  },
  {
    "path": "docs/chapter7/Agent/src/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "docs/chapter7/Agent/src/core.py",
    "chars": 3218,
    "preview": "from openai import OpenAI\nimport json\nfrom typing import List, Dict, Any\nfrom src.utils import function_to_json\nfrom src"
  },
  {
    "path": "docs/chapter7/Agent/src/tools.py",
    "chars": 3289,
    "preview": "import datetime\nimport wikipedia\nimport requests\n\n# 获取当前日期和时间\ndef get_current_datetime() -> str:\n    \"\"\"\n    获取真实的当前日期和时"
  },
  {
    "path": "docs/chapter7/Agent/src/utils.py",
    "chars": 1834,
    "preview": "import inspect\nfrom datetime import datetime\nimport pprint\n\ndef function_to_json(func) -> dict:\n    # 定义 Python 类型到 JSON"
  },
  {
    "path": "docs/chapter7/Agent/web_demo.py",
    "chars": 1711,
    "preview": "import streamlit as st\nfrom src.core import Agent\nfrom src.tools import add, count_letter_in_string, compare, get_curren"
  },
  {
    "path": "docs/chapter7/RAG/.env_example",
    "chars": 130,
    "preview": "# 此处默认使用国内可访问的硅基流动平台 https://cloud.siliconflow.cn/\n\nOPENAI_API_KEY='your api key'\nOPENAI_BASE_URL='https://api.siliconfl"
  },
  {
    "path": "docs/chapter7/RAG/Embeddings.py",
    "chars": 2904,
    "preview": "#!/usr/bin/env python\r\n# -*- coding: utf-8 -*-\r\n'''\r\n@File    :   Embedding.py\r\n@Time    :   2025/06/20 13:50:47\r\n@Autho"
  },
  {
    "path": "docs/chapter7/RAG/LLM.py",
    "chars": 1438,
    "preview": "#!/usr/bin/env python\r\n# -*- coding: utf-8 -*-\r\n'''\r\n@File    :   LLM.py\r\n@Time    :   2025/06/20 13:50:47\r\n@Author  :  "
  },
  {
    "path": "docs/chapter7/RAG/VectorBase.py",
    "chars": 1953,
    "preview": "#!/usr/bin/env python\r\n# -*- coding: utf-8 -*-\r\n'''\r\n@File    :   VectorBase.py\r\n@Time    :   2025/06/20 10:11:13\r\n@Auth"
  },
  {
    "path": "docs/chapter7/RAG/demo.py",
    "chars": 663,
    "preview": "from VectorBase import VectorStore\nfrom utils import ReadFiles\nfrom LLM import OpenAIChat\nfrom Embeddings import OpenAIE"
  },
  {
    "path": "docs/chapter7/RAG/requirements.txt",
    "chars": 473,
    "preview": "annotated-types==0.7.0\nanyio==4.9.0\nbeautifulsoup4==4.13.4\nbs4==0.0.2\ncertifi==2025.6.15\ncharset-normalizer==3.4.2\ndistr"
  },
  {
    "path": "docs/chapter7/RAG/utils.py",
    "chars": 6241,
    "preview": "#!/usr/bin/env python\r\n# -*- coding: utf-8 -*-\r\n'''\r\n@File    :   utils.py\r\n@Time    :   2025/06/20 13:50:47\r\n@Author  :"
  },
  {
    "path": "docs/chapter7/第七章 大模型应用.md",
    "chars": 27084,
    "preview": "# 大模型应用\n\n在前面的章节中，我们系统地介绍了大模型的基础知识、训练方法和微调技术。本章将重点探讨大模型在实际应用中的关键技术和框架，涵盖大模型评测、RAG（检索增强生成）以及Agent（智能体）等核心内容，帮助读者深入理解大模型的实际"
  },
  {
    "path": "docs/index.html",
    "chars": 8320,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n\n<head>\n    <meta charset=\"UTF-8\">\n    <title>Happy-LLM</title>\n    <meta http-equiv=\"X"
  },
  {
    "path": "docs/前言.md",
    "chars": 2889,
    "preview": "# 前言\n\n2022年底，ChatGPT 的横空出世改变了人们对人工智能的认知，也给自然语言处理（Natural Language Process，NLP）领域带来了阶段性的变革，以 GPT 系列模型为代表的大语言模型（Large Lang"
  }
]

About this extraction

This page contains the full source code of the datawhalechina/happy-llm GitHub repository, extracted and formatted as plain text. The extraction includes 76 files (1.1 MB), approximately 447.3k tokens, and a symbol index with 157 extracted functions, classes, methods, constants, and types.