Repository: OpenBMB/MiniCPM-CookBook
Branch: main
Commit: f93206aab7ec
Files: 190
Total size: 30.8 MB

Directory structure:
MiniCPM-CookBook/

├── 4G_memory_rag/
│   └── langchain_demo.py
├── AIPC/
│   ├── Mac_feature_and_shortcut.json
│   └── spider.py
├── MiniCPM-o-long_video_inference/
│   ├── README.md
│   ├── infer.py
│   └── requirements.txt
├── MiniCPMV2_6_awq/
│   ├── modeling_minicpmv.py
│   └── quantize.py
├── OCR_Multimodal_Search/
│   ├── asset/
│   │   └── README.md
│   ├── finetune/
│   │   ├── __init__.py
│   │   ├── app.log
│   │   ├── dataset.py
│   │   ├── dataset_original.py
│   │   ├── ds_config_zero2.json
│   │   ├── ds_config_zero3.json
│   │   ├── finetune.py
│   │   ├── finetune_ds.sh
│   │   ├── finetune_lora.sh
│   │   ├── readme.md
│   │   └── trainer.py
│   └── infer/
│       ├── app.py
│       ├── cli_demo.py
│       ├── dataset.py
│       ├── inference.py
│       └── utils.py
├── OCR_VG/
│   ├── chat.py
│   ├── data_demo/
│   │   ├── img_gt.json
│   │   └── train_data_demo.json
│   ├── gt_test.py
│   ├── merge_box.py
│   ├── omnilmm/
│   │   ├── __init__.py
│   │   ├── constants.py
│   │   ├── conversation.py
│   │   ├── model/
│   │   │   ├── __init__.py
│   │   │   ├── omnilmm.py
│   │   │   ├── resampler.py
│   │   │   └── utils.py
│   │   ├── train/
│   │   │   └── train_utils.py
│   │   └── utils.py
│   └── simsun.ttc
├── README.md
├── README_application.md
├── README_en.md
├── agent_auto_plan/
│   ├── README.md
│   ├── autoplan/
│   │   ├── all_param_inference.py
│   │   ├── bing_search.py
│   │   ├── fuctions.py
│   │   ├── load_model.py
│   │   ├── lora_inference_nomerge.py
│   │   ├── main.py
│   │   ├── prompt_plamte.py
│   │   └── tools_introduction.py
│   ├── finetune_language/
│   │   ├── README.md
│   │   ├── dataset.py
│   │   ├── ds_config_zero2.json
│   │   ├── ds_config_zero3.json
│   │   ├── finetune.py
│   │   ├── finetune_ds.sh
│   │   ├── finetune_lora.sh
│   │   ├── merge_lora.py
│   │   └── replace_file/
│   │       └── modeling_minicpmv.py
│   ├── qwen_vllm.py
│   ├── test_plan.json
│   └── test_react.json
├── agent_demo/
│   ├── agent_demo.py
│   ├── build_react_prompt.py
│   ├── get_react_data.py
│   └── react_qa_react.json
├── ft_language_replace_file/
│   └── finetune/
│       ├── __init__.py
│       ├── dataset.py
│       ├── ds_config_zero2.json
│       ├── ds_config_zero3.json
│       ├── finetune.py
│       ├── finetune_ds.sh
│       ├── finetune_lora.sh
│       ├── merge_lora.py
│       ├── only_language_web_demo.py
│       ├── readme.md
│       ├── replace_file/
│       │   ├── modeling_minicpmv.py
│       │   └── resampler.py
│       └── trainer.py
├── get_minicpmv2.6_embeding/
│   ├── dataset.py
│   ├── inference.py
│   ├── modeling_minicpmv.py
│   └── readme.md
├── mbti_role_play/
│   ├── mbti_demo.py
│   ├── mbti_sft_dpo_data/
│   │   └── get_rank_data.py
│   └── self_awareness/
│       └── get_all_awarness_data.py
├── md/
│   ├── finetune/
│   │   ├── minicpm2.0/
│   │   │   ├── llama_factory.md
│   │   │   ├── mlx_sft.md
│   │   │   └── sft.md
│   │   ├── minicpm3.0/
│   │   │   ├── llama_factory.md
│   │   │   ├── pip_list.md
│   │   │   └── sft.md
│   │   ├── minicpmv2.5/
│   │   │   ├── sft.md
│   │   │   └── swift.md
│   │   └── minicpmv2.6/
│   │       ├── pip_list.md
│   │       └── sft.md
│   ├── inference/
│   │   ├── minicpm2.0/
│   │   │   ├── llama.cpp_android.md
│   │   │   ├── llama.cpp_pc.md
│   │   │   ├── mlx.md
│   │   │   ├── ollama.md
│   │   │   ├── powerinfer_android.md
│   │   │   ├── powerinfer_pc.md
│   │   │   ├── transformers.md
│   │   │   └── vllm.md
│   │   ├── minicpm3.0/
│   │   │   ├── llamcpp.md
│   │   │   ├── ollama.md
│   │   │   ├── sglang.md
│   │   │   ├── transformers.md
│   │   │   └── vllm.md
│   │   ├── minicpmv2.5/
│   │   │   ├── LMdeploy.md
│   │   │   ├── llamacpp_pc.md
│   │   │   ├── ollama.md
│   │   │   ├── swift_commandline.md
│   │   │   ├── swift_python.md
│   │   │   ├── transformers_multi_gpu.md
│   │   │   ├── vllm.md
│   │   │   └── xinference.md
│   │   └── minicpmv2.6/
│   │       ├── llamacpp.md
│   │       ├── ollama.md
│   │       ├── transformers_mult_gpu.md
│   │       ├── vllm.md
│   │       └── vllm_api_server.md
│   ├── integrate/
│   │   ├── function_call.md
│   │   ├── langchain.md
│   │   └── openai_api.md
│   ├── md_en/
│   │   ├── finetune/
│   │   │   ├── minicpm2.0/
│   │   │   │   ├── llama_factory.md
│   │   │   │   ├── mlx_sft.md
│   │   │   │   └── sft.md
│   │   │   ├── minicpm3.0/
│   │   │   │   ├── llama_factory.md
│   │   │   │   ├── pip_list.md
│   │   │   │   └── sft.md
│   │   │   ├── minicpmv2.5/
│   │   │   │   ├── sft.md
│   │   │   │   └── swift.md
│   │   │   └── minicpmv2.6/
│   │   │       ├── pip_list.md
│   │   │       └── sft.md
│   │   ├── inegrate/
│   │   │   ├── function_call.md
│   │   │   ├── langchain.md
│   │   │   └── openai_api.md
│   │   ├── inference/
│   │   │   ├── minicpm2.0/
│   │   │   │   ├── llama.cpp_android.md
│   │   │   │   ├── llama.cpp_pc.md
│   │   │   │   ├── mlx.md
│   │   │   │   ├── ollama.md
│   │   │   │   ├── powerinfer_android.md
│   │   │   │   ├── powerinfer_pc.md
│   │   │   │   ├── transformers.md
│   │   │   │   └── vllm.md
│   │   │   ├── minicpm3.0/
│   │   │   │   ├── llamacpp.md
│   │   │   │   ├── sglang.md
│   │   │   │   ├── transfomers.md
│   │   │   │   └── vllm.md
│   │   │   ├── minicpmv2.5/
│   │   │   │   ├── LMdeploy.md
│   │   │   │   ├── llamacpp_pc.md
│   │   │   │   ├── ollama.md
│   │   │   │   ├── swift_commandline.md
│   │   │   │   ├── swift_python.md
│   │   │   │   ├── transformers_multi_gpu.md
│   │   │   │   ├── vllm.md
│   │   │   │   └── xinference.md
│   │   │   └── minicpmv2.6/
│   │   │       ├── llamacpp.md
│   │   │       ├── ollama.md
│   │   │       ├── transformers_mult_gpu.md
│   │   │       ├── vllm.md
│   │   │       └── vllm_api_server.md
│   │   └── quantize/
│   │       ├── minicpm2.0/
│   │       │   ├── awq.md
│   │       │   ├── bnb.md
│   │       │   └── gptq.md
│   │       ├── minicpm3.0/
│   │       │   ├── awq.md
│   │       │   ├── bnb.md
│   │       │   └── gptq.md
│   │       ├── minicpmv2.5/
│   │       │   └── bnb.md
│   │       └── minicpmv2.6/
│   │           ├── awq.md
│   │           └── bnb.md
│   └── quantize/
│       ├── minicpm2.0/
│       │   ├── awq.md
│       │   ├── bnb.md
│       │   └── gptq.md
│       ├── minicpm3.0/
│       │   ├── awq.md
│       │   ├── bnb.md
│       │   └── gptq.md
│       ├── minicpmv2.5/
│       │   └── bnb.md
│       └── minicpmv2.6/
│           ├── awq.md
│           └── bnb.md
└── windows_minicpm3.0_agent/
    ├── app.py
    ├── cli_demo.py
    ├── dataset.py
    ├── get_reponse.py
    ├── inference.py
    ├── utils.py
    └── windows_agent.py

================================================
FILE CONTENTS
================================================

================================================
FILE: 4G_memory_rag/langchain_demo.py
================================================
"""
你只需要最少6g显存(足够)的显卡就能在消费级显卡上体验流畅的rag。

使用方法:
1. 运行pull_request/rag/langchain_demo.py
2. 上传pdf/txt文件(同一目录下可传多个)
3. 输入问题。

极低显存(4g)使用方法:
1. 根据MiniCPM/quantize/readme.md进行量化,推荐量化MiniCPM-1B-sft-bf16
2. 将cpm_model_path修改为量化后模型地址
3. 保证encode_model_device设置为cpu
"""

from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
from argparse import ArgumentParser
from langchain.llms.base import LLM
from typing import Any, List, Optional
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch
from langchain.prompts import PromptTemplate
from pydantic.v1 import Field
import re
import gradio as gr

parser = ArgumentParser()
# LLM parameters
parser.add_argument(
    "--cpm_model_path",
    type=str,
    default="openbmb/MiniCPM-1B-sft-bf16",
)
parser.add_argument(
    "--cpm_device", type=str, default="cuda:0", choices=["auto", "cuda:0"]
)
parser.add_argument("--backend", type=str, default="torch", choices=["torch", "vllm"])

# Embedding model parameters
parser.add_argument(
    "--encode_model", type=str, default="BAAI/bge-base-zh"
)
parser.add_argument(
    "--encode_model_device", type=str, default="cpu", choices=["cpu", "cuda:0"]
)
parser.add_argument("--query_instruction", type=str, default="")
parser.add_argument(
    "--file_path", type=str, default="/root/ld/pull_request/rag/红楼梦.pdf"
)

# Generation parameters
parser.add_argument("--top_k", type=int, default=3)
parser.add_argument("--top_p", type=float, default=0.7)
parser.add_argument("--temperature", type=float, default=0.7)
parser.add_argument("--max_new_tokens", type=int, default=4096)
parser.add_argument("--repetition_penalty", type=float, default=1.02)

# Retriever parameters
parser.add_argument("--embed_top_k", type=int, default=5)
parser.add_argument("--chunk_size", type=int, default=256)
parser.add_argument("--chunk_overlap", type=int, default=50)
args = parser.parse_args()


def clean_text(text):
    """
    Clean the text, keeping Chinese/English characters, digits and common
    punctuation while stripping everything else.

    Args:
    text (str): the raw text to clean.

    Returns:
    str: the cleaned text.
    """
    # Characters to KEEP: Chinese, English letters, digits, common punctuation.
    # The original pattern removed exactly these characters, which would have
    # emptied the documents, so the character class is negated here.
    pattern = r'[^\u4e00-\u9fa5A-Za-z0-9.,;!?()"\'\s]'

    # Strip every character outside the whitelist
    cleaned_text = re.sub(pattern, "", text)

    # Collapse redundant whitespace
    cleaned_text = re.sub(r"\s+", " ", cleaned_text)

    return cleaned_text


class MiniCPM_LLM(LLM):
    tokenizer: Any = Field(default=None)
    model: Any = Field(default=None)

    def __init__(self, model_path: str):
        """
        继承langchain的MiniCPM模型
        
        参数:
        model_path (str): 需要加载的MiniCPM模型路径。

        返回:
        self.model: 加载的MiniCPM模型。
        self.tokenizer: 加载的MiniCPM模型的tokenizer。
        """
        super().__init__()
        if args.backend == "vllm":
            from vllm import LLM

            self.model = LLM(
                model=model_path, trust_remote_code=True, enforce_eager=True
            )
        else:
            self.tokenizer = AutoTokenizer.from_pretrained(
                model_path, trust_remote_code=True
            )
            self.model = AutoModelForCausalLM.from_pretrained(
                model_path, trust_remote_code=True, torch_dtype=torch.float16
            ).to(args.cpm_device)
            self.model = self.model.eval()

    def _call(self, prompt, stop: Optional[List[str]] = None):
        """
        The langchain.llm call

        Args:
        prompt (str): the prompt text passed in

        Returns:
        responds (str): the text generated by the model for this prompt
        """
        if args.backend == "torch":
            # "<AI>" is appended so the torch branch uses the same MiniCPM
            # chat template as the vllm branch below
            inputs = self.tokenizer("<用户>{}<AI>".format(prompt), return_tensors="pt")
            inputs = inputs.to(args.cpm_device)
            # Generate
            generate_ids = self.model.generate(
                inputs.input_ids,
                max_new_tokens=args.max_new_tokens,
                temperature=args.temperature,
                top_p=args.top_p,
                repetition_penalty=args.repetition_penalty,
            )
            responds = self.tokenizer.batch_decode(
                generate_ids,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=False,
            )[0]
            # responds, history = self.model.chat(self.tokenizer, prompt, temperature=args.temperature, top_p=args.top_p, repetition_penalty=1.02)
        else:
            from vllm import SamplingParams

            params_dict = {
                "n": 1,
                "best_of": 1,
                "presence_penalty": args.repetition_penalty,
                "frequency_penalty": 0.0,
                "temperature": args.temperature,
                "top_p": args.top_p,
                "top_k": args.top_k,
                "use_beam_search": False,
                "length_penalty": 1,
                "early_stopping": False,
                "stop": None,
                "stop_token_ids": None,
                "ignore_eos": False,
                "max_tokens": args.max_new_tokens,
                "logprobs": None,
                "prompt_logprobs": None,
                "skip_special_tokens": True,
            }
            sampling_params = SamplingParams(**params_dict)
            prompt = "<用户>{}<AI>".format(prompt)
            responds = self.model.generate(prompt, sampling_params)
            responds = responds[0].outputs[0].text

        return responds

    @property
    def _llm_type(self) -> str:
        return "MiniCPM_LLM"


# Load PDF and TXT files
def load_documents(file_paths):
    """
    Load the strings from txt and pdf files and apply a light cleanup

    Args:
    file_paths (str or list): a file path or a list of file paths

    Returns:
    documents (list): list of loaded documents
    """
    files_list = []
    if type(file_paths) == list:
        files_list = file_paths
    else:
        files_list = [file_paths]
    documents = []
    for file_path in files_list:
        if file_path.endswith(".pdf"):
            loader = PyPDFLoader(file_path)
        elif file_path.endswith(".txt"):
            loader = TextLoader(file_path)
        else:
            raise ValueError("Unsupported file type")
        doc = loader.load()
        doc[0].page_content = clean_text(doc[0].page_content)
        documents.extend(doc)

    return documents


def load_models():
    """
    加载模型和embedding模型
    
    返回:
    llm: MiniCPM模型
    embedding_models: embedding模型
    """
    llm = MiniCPM_LLM(model_path=args.cpm_model_path)
    embedding_models = HuggingFaceBgeEmbeddings(
        model_name=args.encode_model,
        model_kwargs={"device": args.encode_model_device},  # or 'cuda' if you have a GPU
        encode_kwargs={
            "normalize_embeddings": True,  # whether to normalize embeddings
            "show_progress_bar": True,  # whether to show a progress bar
            "convert_to_numpy": True,  # whether to convert outputs to numpy arrays
            "batch_size": 8,  # batch size
        },
        query_instruction=args.query_instruction,
    )
    return llm, embedding_models


# Split and embed the documents
def embed_documents(documents, embedding_models):
    """
    Split the documents into chunks and embed them

    Args:
    documents (list): list of loaded documents
    embedding_models: the embedding model

    Returns:
    vectorstore: the vector database
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap
    )
    texts = text_splitter.split_documents(documents)
    vectorstore = Chroma.from_documents(texts, embedding_models)
    return vectorstore


def create_prompt_template():
    """
    创建自定义的prompt模板
    
    返回:
    PROMPT:自定义的prompt模板
    """
    custom_prompt_template = """请使用以下内容片段对问题进行最终回复,如果内容中没有提到的信息不要瞎猜,严格按照内容进行回答,不要编造答案,如果无法从内容中找到答案,请回答“片段中未提及,无法回答”,不要编造答案。
    Context:
    {context}

    Question: {question}
    FINAL ANSWER:"""
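    # The Chinese template above tells the model to answer strictly from the
    # provided context fragments, never to guess or fabricate, and to reply
    # "片段中未提及,无法回答" ("not mentioned in the fragments, cannot answer")
    # when the context does not contain the answer.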
    PROMPT = PromptTemplate(
        template=custom_prompt_template, input_variables=["context", "question"]
    )
    return PROMPT


# Create the RAG chain
def create_rag_chain(llm, prompt):
    # qa=load_qa_with_sources_chain(llm, chain_type="stuff")
    qa = prompt | llm
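    # `prompt | llm` is the LangChain Expression Language pipe: the dict passed
    # to .invoke() is formatted by the PromptTemplate, and the rendered string
    # is then handed to MiniCPM_LLM._call above.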
    return qa


def analysis_links(docs):
    """
    分析链接
    
    参数:
    docs (list): 读取的文本列表
    
    返回:
    links_string:相关文档引用字符串,docname page content

    示例:
    >>> docs = [
    ...     {'source': 'Document1', 'page': 1, 'content': 'This is the first document.'},
    ...     {'source': 'Document2', 'page': 2, 'content': 'This is the second document.'}
    ... ]
    >>> extract_links(docs)
    'Document1 page:1 \n\nThis is the first document.\nDocument2 page:2 \n\nThis is the second document.'
    """
    links_string = ""
    for i in docs:
        i.metadata["source"] = i.metadata["source"].split("/")[-1]
        i.metadata["content"] = i.page_content
        links_string += f"{i.metadata['source']} page:{i.metadata['page']}\n\n{i.metadata['content']}\n\n"
    return links_string


# Main function
def main():
    # Load the documents
    documents = load_documents(args.file_path)

    # Embed the documents
    vectorstore = embed_documents(documents, embedding_models)

    # Build the custom prompt template
    Prompt = create_prompt_template()

    # Create the RAG chain
    rag_chain = create_rag_chain(llm, Prompt)

    # User query loop
    while True:
        query = input("请输入查询:")
        if query == "exit":
            break
        docs = vectorstore.similarity_search(query, k=args.embed_top_k)
        all_links = analysis_links(docs)
        final_result = rag_chain.invoke({"context": all_links, "question": query})
        # result = rag_chain({"input_documents": docs, "question": query}, return_only_outputs=True)
        print(final_result)


exist_file = None


def process_query(file, query):
    global exist_file, documents, vectorstore, rag_chain

    if file != exist_file:

        # Load the documents
        documents = load_documents(file if isinstance(file, list) else file.name)

        # Embed the documents
        vectorstore = embed_documents(documents, embedding_models)

        # Build the custom prompt template
        Prompt = create_prompt_template()

        # Create the RAG chain
        rag_chain = create_rag_chain(llm, Prompt)

        exist_file = file

    # Search and collect the results
    docs = vectorstore.similarity_search(query, k=args.embed_top_k)
    all_links = analysis_links(docs)
    final_result = rag_chain.invoke({"context": all_links, "question": query})
    # result = rag_chain({"input_documents": docs, "question": query}, return_only_outputs=False)
    print(final_result)
    final_result = final_result.split("FINAL ANSWER:")[-1]
    return final_result, all_links


if __name__ == "__main__":

    llm, embedding_models = load_models()

    # If you don't need the web UI, you can run main() directly
    #main()

    with gr.Blocks(css="#textbox { height: 380%; }") as demo:
        with gr.Row():
            with gr.Column():
                link_content = gr.Textbox(label="link_content", lines=30, max_lines=40)
            with gr.Column():
                file_input = gr.File(label="upload_files", file_count="multiple")
                final_answer = gr.Textbox(label="final_answer", lines=5, max_lines=10)
                query_input = gr.Textbox(
                    label="User",
                    placeholder="Input your query here!",
                    lines=5,
                    max_lines=10,
                )
                submit_button = gr.Button("Submit")
        submit_button.click(
            fn=process_query,
            inputs=[file_input, query_input],
            outputs=[final_answer, link_content],
        )
    demo.launch(share=True, show_error=True)


================================================
FILE: AIPC/Mac_feature_and_shortcut.json
================================================
{
    "复制": "Command ⌘ + C",
    "粘贴": "Command ⌘ + V",
    "剪切": "Option ⌥ + Command ⌘ + V",
    "打开文件": "Command ⌘ + O",
    "删除文件": "Command ⌘ + 删除 ⌫",
    "打印文件": "Command ⌘ + P",
    "存储文件": "Command ⌘ + S",
    "打开新标签页": "Command ⌘ + T",
    "撤销": "Command ⌘ + Z",
    "撤销恢复": "Shift ⇧ + Command ⌘ + Z",
    "全选": "Command ⌘ + A",
    "隐藏最前面的 App 的窗口": "Command ⌘ + H",
    "隐藏其他所有 App 的窗口": "Option ⌥ + Command ⌘ + H",
    "关闭最前面的 App 的窗口": "Command ⌘ + W",
    "关闭所有 App 的窗口": "Option ⌥ + Command ⌘ + W",
    "强制退出 App": "Option ⌥ + Command ⌘ + Esc",
    "显示/隐藏聚焦搜索": "Command ⌘ + 空格",
    "显示字符检视器(选择表情)": "Control ⌃ + Command ⌘ + 空格",
    "全屏 App": "Control ⌃ + Command ⌘ + F",
    "切换应用": "Command ⌘ + Tab",
    "切换 App 窗口": "Command ⌘ + 重音符 (`)",
    "截全屏图": "Shift ⇧ + Command ⌘ + 3",
    "画区域截图": "Shift ⇧ + Command ⌘ + 4",
    "截屏或录制屏幕": "Shift ⇧ + Command ⌘ + 5",
    "重命名文件": "选中具体文件后按 Enter ⏎ 键",
    "新建文件夹": "Shift ⇧ + Command ⌘ + N",
    "访达和系统快捷键": "Command ⌘ + D",
    "复制所选文件": "Command ⌘ + E",
    "推出磁盘或宗卷。": "Command ⌘ + F",
    "查找": "Command ⌘ + G",
    "再次查找": "Shift ⇧ + Command ⌘ + G",
    "查找上一个位置": "Shift ⇧ + Command ⌘ + C",
    "打开“电脑”窗口": "Shift ⇧ + Command ⌘ + D",
    "打开“桌面”文件夹": "Shift ⇧ + Command ⌘ + F",
    "打开“最近使用”窗口": "Shift ⇧ + Command ⌘ + G 后输入目录路径",
    "打开“前往文件夹”窗口": "Shift ⇧ + Command ⌘ + H",
    "打开“用户目录”文件夹": "Shift ⇧ + Command ⌘ + I",
    "打开 iCloud 云盘": "Shift ⇧ + Command ⌘ + K",
    "打开“网络”窗口": "Option ⌥ + Command ⌘ + L",
    "打开“下载”文件夹": "Shift ⇧ + Command ⌘ + O",
    "打开“文稿”文件夹": "Shift ⇧ + Command ⌘ + P",
    "显示或隐藏预览面板": "Shift ⇧ + Command ⌘ + R",
    "打开“隔空投送”窗口": "Shift ⇧ + Command ⌘ + T",
    "显示或隐藏窗口标签页栏": "Control ⌃ + Shift ⇧ + Command ⌘ + T",
    "将所选项目添加到“程序坞”": "Shift ⇧ + Command ⌘ + U",
    "打开“实用工具”文件夹": "Option ⌥ + Command ⌘ + D",
    "显示或隐藏“程序坞”": "Control ⌃ + Command ⌘ + T",
    "将所选项添加到边栏": "Option ⌥ + Command ⌘ + P",
    "隐藏或显示路径栏": "Option ⌥ + Command ⌘ + S",
    "隐藏或显示侧边栏": "Command ⌘ + 斜杠 (/)",
    "隐藏或显示状态栏": "Command ⌘ + J",
    "显示“显示”选项": "Command ⌘ + K",
    "打开“连接服务器”窗口": "Command ⌘ + N",
    "打开新的“访达”窗口": "Option ⌥ + Command ⌘ + V",
    "复制后剪切": "Command ⌘ + Y",
    "“快速查看”预览所选文件": "Option ⌥ + Command ⌘ + Y",
    "幻灯片显示所选文件": "Command ⌘ + 左中括号 ([)",
    "文件展示形式": "Command ⌘ + 右中括号 (])",
    "前往上一个文件夹": "Command ⌘ + 上箭头 ▴",
    "前往下一个文件夹": "Command ⌘ + Control ⌃ + 上箭头 ▴",
    "打开包含当前文件夹的文件夹": "Command ⌘ + 下箭头 ▾",
    "新窗口中打开包含当前文件夹的文件夹": "Command ⌘ + Delete",
    "打开所选项": "有确认对话框",
    "将所选项移到废纸篓": "Shift ⇧ + Command ⌘ + Delete",
    "清倒废纸篓": "无确认对话框",
    "立即清倒废纸篓": "Option ⌥ + Shift ⇧ + Command ⌘ + Delete",
    "锁定屏幕": "Control ⌃ + Command ⌘ + Q"
}

================================================
FILE: AIPC/spider.py
================================================
import requests
from bs4 import BeautifulSoup
import json
# Target URL
url = 'https://liubing.me/article/mac/mac-shortcut-keys.html#复制'

# Collected feature names and shortcuts; initialized up front so the
# zip at the bottom works even when the request fails
feature = []
Shortcut = []

# Send an HTTP request for the page content
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Detect the page's encoding
    response.encoding = response.apparent_encoding

    # Parse the HTML document
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the div tag at the specified path
    div_tag = soup.select_one('#main-content > div:nth-of-type(3)')
    if div_tag:
        # Collect the h tags and p tags among the div's first-level children
        first_level_h_tags = div_tag.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'], recursive=False)
        first_level_p_tags = div_tag.find_all('p', recursive=False)
        
        print(len(first_level_h_tags) == len(first_level_p_tags))
        print("找到了指定路径下的所有一级子标签中的h标签:")
        for tag in first_level_h_tags[2:-1]:
            print(tag.name)
            print(tag.text.strip())
            feature.append(tag.text.strip())
        
        print("\n找到了指定路径下的所有一级子标签中的p标签:")
        for tag in first_level_p_tags[1:]:
            print(tag.name)
            print(tag.text.strip())
            Shortcut.append(tag.text.strip())
    
    else:
        print("没有找到指定路径下的div标签")
else:
    print(f"请求失败,状态码: {response.status_code}")
dictionary = {key: value for key, value in zip(feature, Shortcut)}
with open('/Users/liudan/ai/MiniCPM_Series_Tutorial/AIPC/Mac_feature_and_shortcut.json', 'w') as json_file:
    json.dump(dictionary, json_file, indent=4, ensure_ascii=False)

================================================
FILE: MiniCPM-o-long_video_inference/README.md
================================================
# MiniCPM-o Long Video Inference Script

## 1. How to Run Long Video Inference with This Script

### Dependencies
Make sure the following dependencies are installed; they can be installed in one step with:

```bash
pip install -r requirements.txt
```

### Input Preparation
- Put the long video to analyze (e.g. `long_video.mp4`) in this directory, or change the `video_path` variable in `infer.py` to your video path.
- Make sure you have enough GPU memory (8 GB or more is recommended); otherwise, reduce `max_frames_per_chunk` or `max_slice_nums`.

### Running the Script
From this directory, run:

```bash
python infer.py
```

The script will then automatically:
- Read and analyze the specified video
- Process the visual and audio content
- Write `long_video_audio_result.json` to this directory and print the final summary to the terminal

### Output
- `long_video_audio_result.json`: per-chunk analysis results plus the final combined summary.
- The final natural-language summary is printed to the terminal.

### Notes
- The `openbmb/MiniCPM-o-2_6` model is used by default; to switch models, change `model_path` in `infer.py`.
- Both GPU and CPU are supported; GPU is strongly recommended.
- The longer the video and the higher its resolution, the more RAM and GPU memory are required.

---

## 2. How the Script Works

This script is built for multimodal (visual + audio) understanding of long videos. The core pipeline is as follows:

### 1) Chunked processing
- The script splits the long video into chunks by frame count (e.g. 64 frames per chunk, with overlap between adjacent chunks) and runs inference chunk by chunk to avoid GPU out-of-memory errors (see the sketch below).
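
A minimal sketch of the overlapped chunking, using the default `max_frames_per_chunk=64` and `overlap_frames=8` from `infer.py` (illustrative, not part of the script):

```python
# Split a frame list into overlapping chunks, mirroring the loop in
# LongVideoAudioProcessor.process_long_video.
def make_chunks(frames, max_frames_per_chunk=64, overlap_frames=8):
    step = max_frames_per_chunk - overlap_frames  # stride between chunk starts
    return [frames[i:i + max_frames_per_chunk] for i in range(0, len(frames), step)]

print([len(c) for c in make_chunks(list(range(200)))])  # [64, 64, 64, 32]
```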

### 2) Audio-video alignment
- For each sampled frame, the matching 1-second audio segment is extracted, so visual frames and audio are analyzed in sync.

### 3) Memory bank
- The script maintains "memory banks" of visual frames, audio segments and text summaries; after each chunk is processed, the key information is written into the banks.
- Sampling from the banks is weighted by time decay, so recent content is favored while global context is still kept (see the sketch below).
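
A minimal sketch of the time-decay weighting (mirroring `calculate_time_weights` and `weighted_sampling` in `infer.py`, with the default `time_decay_factor=0.8`):

```python
import numpy as np

def time_decay_weights(timestamps, current_time, decay=0.8):
    # weight = decay ** (seconds elapsed); older frames get smaller weights
    return np.array([decay ** (current_time - t) for t in timestamps])

timestamps = [0.0, 5.0, 10.0, 15.0]  # frame timestamps in seconds
w = time_decay_weights(timestamps, current_time=16.0)
p = w / w.sum()  # normalize into sampling probabilities
picks = np.random.choice(len(timestamps), size=2, p=p, replace=False)
print(sorted(picks))  # sampled indices, kept in temporal order
```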

### 4) Multi-round inference and summarization
- For each chunk, the model combines the historical summaries from the memory bank with the current frames and audio to produce a structured analysis.
- After all chunks are processed, the script calls the model once more to merge the per-chunk results into a final summary.

### 5) High resolution and adaptivity
- A high-resolution mode is supported (on by default); turn it off or lower the resolution if GPU memory is tight.

---

To customize parameters (sampling FPS, memory bank size, etc.), edit them directly in `infer.py`.

Questions and issues are welcome!


================================================
FILE: MiniCPM-o-long_video_inference/infer.py
================================================
import json
import math
import tempfile
from datetime import datetime

import numpy as np
import torch
import librosa
import soundfile as sf
from PIL import Image
from decord import VideoReader, cpu
from transformers import AutoModel, AutoTokenizer
from moviepy.editor import VideoFileClip
from tqdm import tqdm


# Video preprocessing
def extract_frames_and_audio(video_path, sample_fps=2, max_frames=None, audio_processor=None):
    """Extract frames and the matching audio segments from a video file"""
    # Read the video with decord
    vr = VideoReader(video_path, ctx=cpu(0))

    # Video info
    fps = vr.get_avg_fps()
    total_frames = len(vr)

    # Sampling interval
    sample_interval = int(fps / sample_fps)

    # Frame indices to extract
    frame_indices = list(range(0, total_frames, sample_interval))

    # Cap the number of frames
    if max_frames and len(frame_indices) > max_frames:
        # Sample uniformly down to the requested number of frames
        step = len(frame_indices) / max_frames
        indices = [int(i * step) for i in range(max_frames)]
        frame_indices = [frame_indices[i] for i in indices]

    # Extract the frames
    frames = vr.get_batch(frame_indices).asnumpy()

    # Convert to PIL images
    pil_frames = [Image.fromarray(frame) for frame in frames]

    # Extract the audio
    audio_segments = None
    if audio_processor:
        # Extract the full audio track
        audio, duration = audio_processor.extract_audio_from_video(video_path)

        # Split the audio into segments aligned with the video frames
        frame_times = [idx / fps for idx in frame_indices]
        audio_segments = []

        for time in frame_times:
            start_sample = int(time * audio_processor.sample_rate)
            end_sample = start_sample + int(1.0 * audio_processor.sample_rate)  # take 1 second of audio
            if end_sample <= len(audio):
                segment = audio[start_sample:end_sample]
            else:
                segment = np.pad(audio[start_sample:], (0, end_sample - len(audio)), 'constant')
            audio_segments.append(segment)

    return pil_frames, audio_segments

# Long-video processing
class LongVideoAudioProcessor:
    def __init__(self, 
                 model_path="openbmb/MiniCPM-o-2_6",
                 max_frames_per_chunk=64,  # max frames per chunk
                 max_slice_nums=9,         # max image slices per frame
                 scale_resolution=448,     # resolution of each slice
                 memory_bank_size=32,      # memory bank size
                 overlap_frames=8,         # overlapping frames between chunks
                 audio_sample_rate=16000,  # audio sample rate
                 time_decay_factor=0.8,    # time decay factor
                 sample_fps=2,             # video sampling rate
                 device="cuda" if torch.cuda.is_available() else "cpu",
                 high_res_mode=True        # high-resolution mode switch, on by default
                 ):
        
        # Load the model
        self.model = AutoModel.from_pretrained(
            model_path, 
            trust_remote_code=True,
            attn_implementation='sdpa', 
            torch_dtype=torch.bfloat16
        )
        self.model = self.model.eval().to(device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        
        self.max_frames_per_chunk = max_frames_per_chunk
        self.max_slice_nums = max_slice_nums
        self.scale_resolution = scale_resolution
        self.memory_bank_size = memory_bank_size
        self.overlap_frames = overlap_frames
        self.device = device
        self.sample_fps = sample_fps
        
        # Initialize the memory banks: video frames, audio and timestamps
        self.visual_memory_bank = []
        self.audio_memory_bank = []
        self.timestamps = []  # frame timestamps
        self.text_summaries = []  # historical text summaries
        
        self.time_decay_factor = time_decay_factor
        self.high_res_mode = high_res_mode  # keep the high-resolution switch
        
        # Initialize the audio processor
        self.audio_processor = AudioProcessor(sample_rate=audio_sample_rate)
    
    def preprocess_frame(self, frame):
        """Preprocess a single video frame"""
        width, height = frame.size
        # When high-resolution mode is off, first halve the resolution
        if not self.high_res_mode:
            frame = frame.resize((width // 2, height // 2), Image.LANCZOS)
            width, height = frame.size
        max_size = max(width, height)
        if max_size > self.scale_resolution:
            scale = self.scale_resolution / max_size
            new_width = int(width * scale)
            new_height = int(height * scale)
            frame = frame.resize((new_width, new_height), Image.LANCZOS)
        
        return frame
    
    def update_memory_bank(self, frames, audio_segments, current_time):
        """Update the memory banks: store video frames, audio and timestamps together"""
        # Append the new frames to the visual memory bank
        self.visual_memory_bank.extend(frames)
        
        # Update the timestamps, using the sampling rate stored on the instance
        new_timestamps = [current_time + i/self.sample_fps for i in range(len(frames))]
        self.timestamps.extend(new_timestamps)
        
        # If the visual memory bank exceeds its size limit, drop the oldest frames
        if len(self.visual_memory_bank) > self.memory_bank_size:
            self.visual_memory_bank = self.visual_memory_bank[-self.memory_bank_size:] 
            self.timestamps = self.timestamps[-self.memory_bank_size:]
        
        # If audio is present, update the audio memory bank too
        if audio_segments:
            self.audio_memory_bank.extend(audio_segments)
            if len(self.audio_memory_bank) > self.memory_bank_size:
                self.audio_memory_bank = self.audio_memory_bank[-self.memory_bank_size:]
    
    def calculate_time_weights(self, current_time):
        """Compute time-based decay weights"""
        weights = []
        for timestamp in self.timestamps:
            # Time difference in seconds
            time_diff = current_time - timestamp
            # Exponential decay weight
            weight = self.time_decay_factor ** time_diff
            weights.append(weight)
        return np.array(weights)
    
    def weighted_sampling(self, frames, weights, sample_count):
        """Weight-based sampling"""
        # Normalize the weights
        weights = weights / np.sum(weights)
        # Sample according to the weights
        indices = np.random.choice(len(frames), size=sample_count, p=weights, replace=False)
        return sorted(indices)  # return sorted indices to preserve temporal order
    
    def update_text_summary(self, new_summary):
        """Update the historical text summaries"""
        self.text_summaries.append(new_summary)
        # Keep the number of summaries bounded
        if len(self.text_summaries) > 5:  # keep at most 5 summaries
            self.text_summaries = self.text_summaries[-5:]
    
    def process_long_video(self, video_path, query):
        """Process a long video, audio included"""
        # Extract the video frames and audio
        video_frames, audio_segments = extract_frames_and_audio(
            video_path, 
            sample_fps=self.sample_fps,  # use the sampling rate stored on the instance
            max_frames=None, 
            audio_processor=self.audio_processor
        )
        
        # Split the video into chunks
        chunks = []
        audio_chunks = []
        
        for i in range(0, len(video_frames), self.max_frames_per_chunk - self.overlap_frames):
            end_idx = min(i + self.max_frames_per_chunk, len(video_frames))
            chunk = video_frames[i:end_idx]
            chunks.append(chunk)
            
            # If audio is present, chunk it too
            if audio_segments:
                audio_chunk = audio_segments[i:end_idx]
                audio_chunks.append(audio_chunk)
        
        # Process the video chunk by chunk
        all_results = []
        for i, chunk in enumerate(tqdm(chunks, desc="Processing chunks", unit="chunks", ncols=80)):
            # Preprocess the frames of the current chunk
            processed_frames = [self.preprocess_frame(frame) for frame in chunk]
            
            # Fetch the matching audio chunk
            audio_chunk = audio_chunks[i] if audio_segments else None
            
            # Run inference with the memory bank and the current chunk
            result = self._inference_with_memory(processed_frames, audio_chunk, query, i, len(chunks), datetime.now().timestamp())
            all_results.append(result)
            
            # Update the memory bank
            self.update_memory_bank(processed_frames, audio_chunk, datetime.now().timestamp())
        
        # Merge the results of all chunks
        final_result = self._merge_results(all_results, query)
        
        return final_result
    
    def _inference_with_memory(self, frames, audio_segments, query, chunk_idx, total_chunks, current_time):
        """Run inference over the memory bank plus the current frames, with audio and time weighting"""
        combined_frames = []
        combined_audio = []
        
        # Time-weighted sampling
        if self.visual_memory_bank:
            # Compute the time-decay weights
            time_weights = self.calculate_time_weights(current_time)
            # Draw a weighted sample from the memory bank
            memory_sample_count = min(16, len(self.visual_memory_bank))
            memory_indices = self.weighted_sampling(self.visual_memory_bank, time_weights, memory_sample_count)
            
            for idx in memory_indices:
                combined_frames.append(self.visual_memory_bank[idx])
                if idx < len(self.audio_memory_bank):
                    combined_audio.append(self.audio_memory_bank[idx])
        
        # Append the frames of the current chunk
        current_sample_count = min(self.max_frames_per_chunk - len(combined_frames), len(frames))
        if current_sample_count < len(frames):
            current_indices = np.linspace(0, len(frames)-1, current_sample_count, dtype=int)
            for idx in current_indices:
                combined_frames.append(frames[idx])
                if audio_segments and idx < len(audio_segments):
                    combined_audio.append(audio_segments[idx])
        else:
            combined_frames.extend(frames)
            if audio_segments:
                combined_audio.extend(audio_segments)
        
        # Build the system prompt
        system_prompt = """
        你是一个专业的视频理解助手,具有以下能力:
        1. 视觉分析:能够准确识别视频中的场景、物体、人物和动作
        2. 音频理解:能够识别背景音乐、对话、环境声音等
        3. 时序理解:能够理解视频中的时间顺序和事件发展
        4. 上下文关联:能够将当前内容与历史信息关联起来
        5. 回答要简洁、准确、客观
        6. 优先描述视觉和音频的关键信息
        7. 保持时间顺序的连贯性
        8. 如果信息不足,明确说明
        9. 不要编造或推测不存在的信息
        """

        # Build the message with video frames, audio and historical summaries
        content = []
        
        # Add the historical summaries in a structured format
        if self.text_summaries:
            content.append("=== 历史上下文 ===")
            for i, summary in enumerate(self.text_summaries):
                content.append(f"[时间点 {i+1}] {summary}")
            content.append("=== 当前内容 ===")
        
        # Add the current position information
        content.append(f"当前处理第 {chunk_idx + 1}/{total_chunks} 个视频块")
        
        # Interleave video frames and audio with explicit markers
        for i in range(len(combined_frames)):
            content.append("<visual_audio_unit>")
            content.append(f"[帧 {i+1}]")
            content.append(combined_frames[i])
            if i < len(combined_audio):
                content.append(f"[音频 {i+1}]")
                content.append(combined_audio[i])
        
        # Add the query in a structured format
        content.append("\n=== 用户查询 ===")
        content.append(query)
        content.append("\n请基于以上信息,按照以下格式回答:")
        content.append("1. 视觉内容:[描述当前视频块中的主要视觉内容]")
        content.append("2. 音频内容:[描述当前视频块中的主要音频内容]")
        content.append("3. 上下文关联:[说明与历史内容的关联]")
        content.append("4. 总结:[简要总结当前块的关键信息]")
        
        video_msg = [
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user", 
                "content": content
            }
        ]
        
        # Video processing parameters
        params = {
            "use_image_id": False,
            "max_slice_nums": 2,  # can be set to 1 if CUDA memory is tight
            "omni_input": True,   # enable omni-modal input processing
        }
        
        # Run model inference
        response = self.model.chat(
            msgs=video_msg,
            tokenizer=self.tokenizer,
            **params
        )
        
        # Update the text summaries
        self.update_text_summary(response)
        
        return {
            "chunk_idx": chunk_idx,
            "total_chunks": total_chunks,
            "frames_count": len(frames),
            "audio_count": len(audio_segments) if audio_segments else 0,
            "memory_bank_size": len(self.visual_memory_bank),
            "audio_memory_bank_size": len(self.audio_memory_bank),
            "query": query,
            "response": response
        }
    
    def _merge_results(self, all_results, query):
        """Merge the results of all chunks"""
        # Build the system prompt
        system_prompt = """
        你是一个专业的视频总结助手,负责整合多个视频块的分析结果。
        请遵循以下规则:
        1. 保持时间顺序的连贯性
        2. 突出重要事件和关键信息
        3. 确保信息的完整性和准确性
        4. 避免重复和冗余信息
        5. 如果信息有冲突,选择最可信的信息"""
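        # (The Chinese system prompt above instructs the model to merge the
        # per-chunk results in temporal order, highlight key events, keep the
        # information complete and accurate, avoid redundancy, and prefer the
        # most credible information when results conflict.)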

        content = []
        content.append("=== 视频块分析结果 ===")
        
        for i, result in enumerate(all_results):
            content.append(f"\n[视频块 {i+1}/{len(all_results)}]")
            content.append(result['response'])
        
        content.append("\n=== 用户查询 ===")
        content.append(query)
        content.append("\n请按照以下格式提供最终答案:")
        content.append("1. 时间线:[按时间顺序总结主要事件]")
        content.append("2. 关键信息:[提取最重要的信息点]")
        content.append("3. 综合分析:[将各个块的信息整合分析]")
        content.append("4. 直接回答:[针对用户查询的具体回答]")
        
        msgs = [
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": content
            }
        ]
        
        # Run model inference
        final_answer = self.model.chat(
            msgs=msgs,
            tokenizer=self.tokenizer
        )
        
        # Assemble the final result
        final_result = {
            "total_chunks": len(all_results),
            "visual_memory_bank_size": len(self.visual_memory_bank),
            "audio_memory_bank_size": len(self.audio_memory_bank),
            "query": query,
            "chunk_responses": [result["response"] for result in all_results],
            "final_answer": final_answer
        }
        
        return final_result

# Audio processing
class AudioProcessor:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
    
    def extract_audio_from_video(self, video_path):
        """Extract the audio track from a video file"""
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
            temp_audio_path = temp_audio_file.name
            video = VideoFileClip(video_path)
            video.audio.write_audiofile(temp_audio_path, codec="pcm_s16le", fps=self.sample_rate, verbose=False, logger=None)
            audio, _ = librosa.load(temp_audio_path, sr=self.sample_rate, mono=True)
        return audio, video.duration
    
    def segment_audio(self, audio, duration, segment_duration=1.0):
        """Split the audio into fixed-length segments"""
        num_segments = int(np.ceil(duration / segment_duration))
        segments = []
        samples_per_segment = int(self.sample_rate * segment_duration)
        
        for i in range(num_segments):
            start_idx = i * samples_per_segment
            end_idx = min(start_idx + samples_per_segment, len(audio))
            segment = audio[start_idx:end_idx]
            # Zero-pad the last segment if it is too short
            if len(segment) < samples_per_segment:
                segment = np.pad(segment, (0, samples_per_segment - len(segment)), 'constant')
            segments.append(segment)
        
        return segments
    
    def extract_audio_features(self, audio_segment):
        """Extract audio features (optional)"""
        # More elaborate feature extraction, e.g. MFCCs, could go here
        return audio_segment
    

def main():
    video_path = "long_video.mp4" # change to your video path

    processor = LongVideoAudioProcessor(
        model_path="openbmb/MiniCPM-o-2_6", # change to your model path
        max_frames_per_chunk=64,  # max frames per chunk
        max_slice_nums=9,         # max image slices per frame
        scale_resolution=448,     # resolution of each slice
        memory_bank_size=32,      # memory bank size
        overlap_frames=8,         # overlapping frames between chunks
        audio_sample_rate=16000,  # audio sample rate
        time_decay_factor=0.8,    # time decay factor
        sample_fps=2,             # video sampling rate
        high_res_mode=True        # high-resolution mode switch, on by default
    )
    
    query = "视频中发生了什么事情?请详细描述视觉内容和音频内容。"
    result = processor.process_long_video(video_path, query)

    with open("long_video_audio_result.json", "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

    print(result["final_answer"])

if __name__ == "__main__":
    main()

================================================
FILE: MiniCPM-o-long_video_inference/requirements.txt
================================================
decord==0.6.0
librosa==0.9.0
moviepy==1.0.3
numpy==2.3.0
Pillow==11.2.1
soundfile==0.12.1
torch==2.5.1
transformers==4.52.3


================================================
FILE: MiniCPMV2_6_awq/modeling_minicpmv.py
================================================
import math
from typing import List, Optional
import json
import torch
import torchvision
from threading import Thread
from copy import deepcopy
from PIL import Image
from transformers import AutoProcessor, Qwen2PreTrainedModel, Qwen2ForCausalLM, TextIteratorStreamer

from .configuration_minicpm import MiniCPMVConfig
from .modeling_navit_siglip import SiglipVisionTransformer
from .resampler import Resampler



class MiniCPMVPreTrainedModel(Qwen2PreTrainedModel):
    config_class = MiniCPMVConfig


class MiniCPMV(MiniCPMVPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.llm = Qwen2ForCausalLM(config)
        self.vpm = self.init_vision_module()
        self.vision_dim = self.vpm.embed_dim
        self.embed_dim = self.llm.config.hidden_size
        self.resampler = self.init_resampler(self.embed_dim, self.vision_dim)
        self.processor = None

        self.terminators = ['<|im_end|>', '<|endoftext|>']

    def init_vision_module(self):
        # same as HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit, with tgt_sizes added
        if self.config._attn_implementation == 'flash_attention_2':
            self.config.vision_config._attn_implementation = 'flash_attention_2'
        else:
            # sdpa is not supported
            self.config.vision_config._attn_implementation = 'eager'
        model = SiglipVisionTransformer(self.config.vision_config)
        if self.config.drop_vision_last_layer:
            model.encoder.layers = model.encoder.layers[:-1]

        setattr(model, 'embed_dim', model.embeddings.embed_dim)
        setattr(model, 'patch_size', model.embeddings.patch_size)

        return model

    def init_resampler(self, embed_dim, vision_dim):
        return Resampler(
            num_queries=self.config.query_num,
            embed_dim=embed_dim,
            num_heads=embed_dim // 128,
            kv_dim=vision_dim,
            adaptive=True
        )

    def get_input_embeddings(self):
        return self.llm.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.llm.embed_tokens = value

    def get_output_embeddings(self):
        return self.llm.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.llm.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.llm = decoder
    
    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        attention_mask=None,
        inputs_embeds=None,
        cache_position=None,
        position_ids=None,
        use_cache=True,
        **kwargs,
    ):
        # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
        # Exception 1: when passing input_embeds, input_ids may be missing entries
        # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
        if past_key_values is not None:
            if inputs_embeds is not None:  # Exception 1
                input_ids = input_ids[:, -cache_position.shape[0] :]
            elif input_ids.shape[1] != cache_position.shape[0]:  # Default case (the "else", a no op, is Exception 2)
                input_ids = input_ids[:, cache_position]

        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

                # This `clone` call is needed to avoid recapturing cuda graphs with `torch.compile`'s  `mode="reduce-overhead`, as otherwise the input `position_ids` would have various stride during the decoding. Here, simply using `.contiguous()` is not sufficient as in the batch size = 1 case, `position_ids` is already contiguous but with varying stride which retriggers a capture.
                position_ids = position_ids.clone(memory_format=torch.contiguous_format)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and cache_position[0] == 0:
            model_inputs = {"inputs_embeds": inputs_embeds, "input_ids": None}
        else:
            # The clone here is for the same reason as for `position_ids`.
            model_inputs = {"input_ids": input_ids.clone(memory_format=torch.contiguous_format), "inputs_embeds": None}
    

        model_inputs.update(
            {
                "position_ids": position_ids,
                "cache_position": cache_position,
                "past_key_values": past_key_values,
                "use_cache": use_cache,
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    def get_decoder(self):
        return self.llm

    def get_vllm_embedding(self, data):
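        # Overall flow: flatten every image's patches across the batch, pad
        # them to a common length, run them through the vision tower (in
        # slices of vision_batch_size), compress with the resampler, and
        # finally scatter the vision embeddings into the token-embedding
        # positions listed in image_bound.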
        if 'vision_hidden_states' not in data:
            dtype = self.llm.model.embed_tokens.weight.dtype
            device = self.llm.model.embed_tokens.weight.device
            tgt_sizes = data['tgt_sizes']
            pixel_values_list = data['pixel_values']
            vision_hidden_states = []
            all_pixel_values = []
            img_cnt = []
            for pixel_values in pixel_values_list:
                img_cnt.append(len(pixel_values))
                all_pixel_values.extend([i.flatten(end_dim=1).permute(1, 0) for i in pixel_values])

            # exist image
            if all_pixel_values:
                tgt_sizes = [tgt_size for tgt_size in tgt_sizes if isinstance(tgt_size, torch.Tensor)]
                tgt_sizes = torch.vstack(tgt_sizes).type(torch.int32)

                max_patches = torch.max(tgt_sizes[:, 0] * tgt_sizes[:, 1])

                all_pixel_values = torch.nn.utils.rnn.pad_sequence(all_pixel_values, batch_first=True,
                                                                   padding_value=0.0)
                B, L, _ = all_pixel_values.shape
                all_pixel_values = all_pixel_values.permute(0, 2, 1).reshape(B, 3, -1, L)

                patch_attn_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool, device=device)
                for i in range(B):
                    patch_attn_mask[i, 0, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True

                vision_batch_size = self.config.vision_batch_size
                all_pixel_values = all_pixel_values.type(dtype)
                if B > vision_batch_size:
                    hs = []
                    for i in range(0, B, vision_batch_size):
                        start_idx = i
                        end_idx = i + vision_batch_size
                        tmp_hs = self.vpm(all_pixel_values[start_idx:end_idx], patch_attention_mask=patch_attn_mask[start_idx:end_idx], tgt_sizes=tgt_sizes[start_idx:end_idx]).last_hidden_state
                        hs.append(tmp_hs)
                    vision_embedding = torch.cat(hs, dim=0)
                else:
                    vision_embedding = self.vpm(all_pixel_values, patch_attention_mask=patch_attn_mask, tgt_sizes=tgt_sizes).last_hidden_state
                vision_embedding = self.resampler(vision_embedding, tgt_sizes)

                start = 0
                for pixel_values in pixel_values_list:
                    img_cnt = len(pixel_values)
                    if img_cnt > 0:
                        vision_hidden_states.append(vision_embedding[start: start + img_cnt])
                        start += img_cnt
                    else:
                        vision_hidden_states.append([])
            else: # no image
                if self.training:
                    dummy_image = torch.zeros(
                        (1, 3, 224, 224),
                        device=device, dtype=dtype
                    )
                    tgt_sizes = torch.Tensor([[(224 // self.config.patch_size), math.ceil(224 / self.config.patch_size)]]).type(torch.int32)
                    dummy_feature = self.resampler(self.vpm(dummy_image).last_hidden_state, tgt_sizes)
                else:
                    dummy_feature = []
                for _ in range(len(pixel_values_list)):
                    vision_hidden_states.append(dummy_feature)

        else:
            vision_hidden_states = data['vision_hidden_states']

        if hasattr(self.llm.config, 'scale_emb'):
            vllm_embedding = self.llm.model.embed_tokens(data['input_ids']) * self.llm.config.scale_emb
        else:
            vllm_embedding = self.llm.model.embed_tokens(data['input_ids'])

        vision_hidden_states = [i.type(vllm_embedding.dtype) if isinstance(
            i, torch.Tensor) else i for i in vision_hidden_states]

        bs = len(data['input_ids'])
        for i in range(bs):
            cur_vs_hs = vision_hidden_states[i]
            if len(cur_vs_hs) > 0:
                cur_vllm_emb = vllm_embedding[i]
                cur_image_bound = data['image_bound'][i]
                if len(cur_image_bound) > 0:
                    image_indices = torch.stack(
                        [torch.arange(r[0], r[1], dtype=torch.long) for r in cur_image_bound]
                    ).to(vllm_embedding.device)

                    cur_vllm_emb.scatter_(0, image_indices.view(-1, 1).repeat(1, cur_vllm_emb.shape[-1]),
                                          cur_vs_hs.view(-1, cur_vs_hs.shape[-1]))
                elif self.training:
                    cur_vllm_emb += cur_vs_hs[0].mean() * 0

        return vllm_embedding, vision_hidden_states

    def forward(self, data, **kwargs):
        if isinstance(data, torch.Tensor):
            return self.llm(
            input_ids=data,
            **kwargs
        )
        else:
            vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)
            position_ids = data["position_ids"]
            if position_ids.dtype != torch.int64:
                position_ids = position_ids.long()

            return self.llm(
                input_ids=None,
                position_ids=position_ids,
                inputs_embeds=vllm_embedding,
                **kwargs
            )
    
    def _decode(self, inputs_embeds, tokenizer, attention_mask, decode_text=False, **kwargs):
        terminators = [tokenizer.convert_tokens_to_ids(i) for i in self.terminators]
        output = self.llm.generate(
            inputs_embeds=inputs_embeds,
            pad_token_id=0,
            eos_token_id=terminators,
            attention_mask=attention_mask,
            **kwargs
        )
        if decode_text:
            return self._decode_text(output, tokenizer)
        return output

    def _decode_stream(self, inputs_embeds, tokenizer, **kwargs):
        terminators = [tokenizer.convert_tokens_to_ids(i) for i in self.terminators]
        streamer = TextIteratorStreamer(tokenizer=tokenizer)
        generation_kwargs = {
            'inputs_embeds': inputs_embeds,
            'pad_token_id': 0,
            'eos_token_id': terminators,
            'streamer': streamer
        }
        generation_kwargs.update(kwargs)

        thread = Thread(target=self.llm.generate, kwargs=generation_kwargs)
        thread.start()
    
        return streamer

    def _decode_text(self, result_ids, tokenizer):
        terminators = [tokenizer.convert_tokens_to_ids(i) for i in self.terminators]
        result_text = []
        for result in result_ids:
            result = result[result != 0]
            if result[0] == tokenizer.bos_id:
                result = result[1:]
            if result[-1] in terminators:
                result = result[:-1]
            result_text.append(tokenizer.decode(result).strip())
        return result_text

    def generate(
        self,
        input_ids=None,
        pixel_values=None,
        tgt_sizes=None,
        image_bound=None,
        attention_mask=None,
        tokenizer=None,
        vision_hidden_states=None,
        return_vision_hidden_states=False,
        stream=False,
        decode_text=False,
        **kwargs
    ):
        assert input_ids is not None
        assert len(input_ids) == len(pixel_values)

        model_inputs = {
            "input_ids": input_ids,
            "image_bound": image_bound,
        }

        if vision_hidden_states is None:
            model_inputs["pixel_values"] = pixel_values
            model_inputs['tgt_sizes'] = tgt_sizes
        else:
            model_inputs["vision_hidden_states"] = vision_hidden_states

        with torch.inference_mode():
            (
                model_inputs["inputs_embeds"],
                vision_hidden_states,
            ) = self.get_vllm_embedding(model_inputs)

            if stream:
                result = self._decode_stream(model_inputs["inputs_embeds"], tokenizer, **kwargs)
            else:
                result = self._decode(model_inputs["inputs_embeds"], tokenizer, attention_mask, decode_text=decode_text, **kwargs)

        if return_vision_hidden_states:
            return result, vision_hidden_states
        
        return result

    def chat(
        self,
        image,
        msgs,
        tokenizer,
        processor=None,
        vision_hidden_states=None,
        max_new_tokens=2048,
        min_new_tokens=0,
        sampling=True,
        max_inp_length=8192,
        system_prompt='',
        stream=False,
        max_slice_nums=None,
        use_image_id=None,
        **kwargs
    ):
        if isinstance(msgs[0], list):
            batched = True
        else:
            batched = False
        msgs_list = msgs
        images_list = image
        
        if batched is False:
            images_list, msgs_list = [images_list], [msgs_list]
        else:
            assert images_list is None, "Please integrate image to msgs when using batch inference."
            images_list = [None] * len(msgs_list)
        assert len(images_list) == len(msgs_list), "The batch dim of images_list and msgs_list should be the same."

        if processor is None:
            if self.processor is None:
                self.processor = AutoProcessor.from_pretrained(self.config._name_or_path, trust_remote_code=True)
            processor = self.processor
        
        assert self.config.query_num == processor.image_processor.image_feature_size, "These two values should be the same. Check `config.json` and `preprocessor_config.json`."
        assert self.config.patch_size == processor.image_processor.patch_size, "These two values should be the same. Check `config.json` and `preprocessor_config.json`."
        assert self.config.use_image_id == processor.image_processor.use_image_id, "These two values should be the same. Check `config.json` and `preprocessor_config.json`."
        assert self.config.slice_config.max_slice_nums == processor.image_processor.max_slice_nums, "These two values should be the same. Check `config.json` and `preprocessor_config.json`."
        assert self.config.slice_mode == processor.image_processor.slice_mode, "These two values should be the same. Check `config.json` and `preprocessor_config.json`."

        prompts_lists = []
        input_images_lists = []
        for image, msgs in zip(images_list, msgs_list):
            if isinstance(msgs, str):
                msgs = json.loads(msgs)
            copy_msgs = deepcopy(msgs)

            assert len(msgs) > 0, "msgs is empty"
            assert sampling or not stream, "if use stream mode, make sure sampling=True"

            if image is not None and isinstance(copy_msgs[0]["content"], str):
                copy_msgs[0]["content"] = [image, copy_msgs[0]["content"]]

            images = []
            for i, msg in enumerate(copy_msgs):
                role = msg["role"]
                content = msg["content"]
                assert role in ["user", "assistant"]
                if i == 0:
                    assert role == "user", "The role of first msg should be user"
                if isinstance(content, str):
                    content = [content]
                cur_msgs = []
                for c in content:
                    if isinstance(c, Image.Image):
                        images.append(c)
                        cur_msgs.append("(<image>./</image>)")
                    elif isinstance(c, str):
                        cur_msgs.append(c)
                msg["content"] = "\n".join(cur_msgs)

            if system_prompt:
                sys_msg = {'role': 'system', 'content': system_prompt}
                copy_msgs = [sys_msg] + copy_msgs        

            prompts_lists.append(processor.tokenizer.apply_chat_template(copy_msgs, tokenize=False, add_generation_prompt=True))
            input_images_lists.append(images)

        inputs = processor(
            prompts_lists, 
            input_images_lists, 
            max_slice_nums=max_slice_nums,
            use_image_id=use_image_id,
            return_tensors="pt", 
            max_length=max_inp_length
        ).to(self.device)

        if sampling:
            generation_config = {
                "top_p": 0.8,
                "top_k": 100,
                "temperature": 0.7,
                "do_sample": True,
                "repetition_penalty": 1.05
            }
        else:
            generation_config = {
                "num_beams": 3,
                "repetition_penalty": 1.2,
            }
            
        if min_new_tokens > 0:
            generation_config['min_new_tokens'] = min_new_tokens

        generation_config.update(
            (k, kwargs[k]) for k in generation_config.keys() & kwargs.keys()
        )

        inputs.pop("image_sizes")
        with torch.inference_mode():
            res = self.generate(
                **inputs,
                tokenizer=tokenizer,
                max_new_tokens=max_new_tokens,
                vision_hidden_states=vision_hidden_states,
                stream=stream,
                decode_text=True,
                **generation_config
            )
        
        if stream:
            def stream_gen():
                for text in res:
                    for term in self.terminators:
                        text = text.replace(term, '')
                    yield text
            return stream_gen()

        else:
            if batched:
                answer = res
            else:
                answer = res[0]
            return answer
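
# ---------------------------------------------------------------------------
# Usage sketch (an illustration, not part of the upstream file): calling
# chat() on a loaded checkpoint. The model id, image path, and prompt are
# placeholders.
#
#   from PIL import Image
#   from transformers import AutoModel, AutoTokenizer
#
#   model = AutoModel.from_pretrained("openbmb/MiniCPM-V-2_6",
#                                     trust_remote_code=True).eval().cuda()
#   tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6",
#                                       trust_remote_code=True)
#   img = Image.open("example.jpg").convert("RGB")
#   msgs = [{"role": "user", "content": "Describe this image."}]
#   print(model.chat(image=img, msgs=msgs, tokenizer=tok))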


================================================
FILE: MiniCPMV2_6_awq/quantize.py
================================================


from datasets import load_dataset
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os
import shutil


model_path = '/root/ld/ld_model_pretrain/MiniCPM-V-2_6'
quant_path = '/root/ld/ld_model_pretrain/MiniCPM-V-2_6_awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, device_map={"": "cuda:0"})
# Make sure no non-weight file of the original model is left out of the quantized model directory
def copy_files_not_in_B(A_path, B_path):
    """
    Copies files from directory A to directory B if they exist in A but not in B.

    :param A_path: Path to the source directory (A).
    :param B_path: Path to the destination directory (B).
    """
    # Validate the source path and create the destination if needed
    if not os.path.exists(A_path):
        raise FileNotFoundError(f"The directory {A_path} does not exist.")
    if not os.path.exists(B_path):
        os.makedirs(B_path)

    # Collect all non-weight files in directory A
    files_in_A = os.listdir(A_path)
    files_in_A = set([file for file in files_in_A if not (".bin" in file or "safetensors" in file )])
    # List all files in directory B
    files_in_B = set(os.listdir(B_path))

    # Files that exist in A but not in B
    files_to_copy = files_in_A - files_in_B

    # Copy these files into B
    for file in files_to_copy:
        src_file = os.path.join(A_path, file)
        dst_file = os.path.join(B_path, file)
        shutil.copy2(src_file, dst_file)
# Define data loading methods
def load_alpaca():
    data = load_dataset('/root/ld/pull_request/MiniCPM/quantize/quantize_data/alpaca', split="train")

    # concatenate data
    def concatenate_data(x):
        msgs=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": x['input']},{"role": "system", "content": x['output']}]
        data=tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
        return {"text": data}
    
    concatenated = data.map(concatenate_data)
    return [text for text in concatenated["text"]][:1000]

def load_wikitext():
    data = load_dataset('wikitext', 'wikitext-2-raw-v1', split="train")
    return [text for text in data["text"] if text.strip() != '' and len(text.split(' ')) > 20]

# Quantize
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_alpaca())

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

copy_files_not_in_B(model_path,quant_path)

================================================
FILE: OCR_Multimodal_Search/asset/README.md
================================================
# MiniCPM_Series_Tutorial

I work on the open-source community at OpenBMB. ModelBest (面壁智能) has long been committed to lowering the barrier to using large models and raising model knowledge density, bringing large models into thousands of households.

To that end I wrote tutorials for MiniCPM and MiniCPM-V covering six topics: inference, quantization, edge deployment, fine-tuning, technical reports, and applications. Applications built on MiniCPM are uploaded to this repository.
The full companion tutorials are here:

https://modelbest.feishu.cn/wiki/D2tFw8Pcsi5CIzkaHNacLK64npg?from=from_copylink

Companion videos on Bilibili:

https://space.bilibili.com/669720247?spm_id_from=333.1007.0.0

Bilibili uploader:

面壁的车辆工程师

## Interesting Projects
All of the projects below are my own original work. Feel free to take them if useful, but please respect my intellectual property and leave a star if you do.

### OCR_VG
This project fuses OCR with visual grounding in a single model while taking page layout into account. It lives under the OCR_VG folder; the [text recognition and localization tutorial](https://modelbest.feishu.cn/wiki/HLRiwNgKEic6cckGyGucFvxQnJw?from=from_copylink) is free to take. Results:
![alt text](./OCR_VG/out/1.jpg)
![alt text](./OCR_VG/out/4.jpg)

### MBTI Role Play
Unlike the Peking University ChatLaw team, which trains one model per personality, this project uses a single 2B model to switch seamlessly among all 16 MBTI personalities (split-personality play included). Tutorial: [role play](https://modelbest.feishu.cn/docx/EcNjdGwvwoLkDrxpVrQcLwlknCg?from=from_copylink)
![ESTP](./mbti_role_play/demo_img/ESTP.PNG)
![INTJ](./mbti_role_play/demo_img/INTJ.PNG)
![ESTP1](./mbti_role_play/demo_img/ESTP1.PNG)
![INTJ1](./mbti_role_play/demo_img/INTJ1.PNG)
### Mixed-Modality Fine-Tuning
Official MiniCPM-V fine-tuning only opens up image-text training. This project modifies it into a mixed mode that trains on pure-text and image-text pairs together, placed under the MIniCPM_Series_Tutorial/ft_language_replace_file folder; the [mixed-modality fine-tuning tutorial](https://modelbest.feishu.cn/wiki/Y1NbwYijHiuiqvkSf0jcUOvFnTe?from=from_copylink) is free to take.
The language degradation caused by alignment training means the aligned multimodal model (MLLM) becomes worse at answering pure-text inputs, commonly called the alignment tax (arguably just another form of catastrophic forgetting).
A simple way to suppress catastrophic forgetting is to mix the original data back in; for the lost language ability of a multimodal model, that means mixing in language data. This raises the follow-up question of which language data to mix and in what proportion, which is not the focus here and is beyond me to settle.
For applications, though, an MLLM does not need all-around language skills: what matters is keeping basic Q&A plus strong answers in one professional domain on top of excellent multimodal ability. A minimal sketch of the mixing idea follows.
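A minimal sketch of the mixing idea (my illustration, not the repo's actual implementation; `mix_modalities` and `text_ratio` are made-up names, and records follow the "image"/"conversations" schema used by the finetune scripts):

```python
import random

def mix_modalities(image_text_data, text_only_data, text_ratio=0.3, seed=0):
    """Blend text-only records into an image-text SFT list at roughly text_ratio."""
    # Solve T / (I + T) = text_ratio for the number of text-only samples T.
    n_text = int(len(image_text_data) * text_ratio / (1 - text_ratio))
    rng = random.Random(seed)
    sampled = rng.sample(text_only_data, min(n_text, len(text_only_data)))
    mixed = list(image_text_data) + sampled
    rng.shuffle(mixed)  # interleave so batches see both modalities
    return mixed
```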

### RAG in 4 GB of VRAM
![alt text](./4G_memory_rag/image.png)
![alt text](./4G_memory_rag/image1.png)
Not much to explain here: RAG running under extremely low VRAM. Tutorial: [RAG](https://modelbest.feishu.cn/wiki/G5NlwYGGAiJWGmkCc4NcQ3sAnms?from=from_copylink)

### AWQ Quantization for MiniCPM-V 2.6
Because the bnb-quantized MiniCPM-V 2.6 cannot be loaded by vLLM, I adapted AutoAWQ and have submitted a PR upstream; once it is merged this works out of the box.
Usage:

1. Get my AutoAWQ fork
```bash
git clone https://github.com/LDLINGLINGLING/AutoAWQ
cd AutoAWQ
pip install -e .
```
2. Replace the file of the same name under your MiniCPM-V 2.6 model directory with MiniCPM_Series_Tutorial/MiniCPMV2_6_awq/modeling_minicpmv.py
3. Set model_path in MiniCPM_Series_Tutorial/MiniCPMV2_6_awq/quantize.py to your MiniCPM-V 2.6 path.
4. Run quantize.py

Once you have the AWQ model you can serve it with vLLM exactly as before; GPU memory drops from 16 GB to 7 GB. A deployment sketch follows the screenshot below.
![alt text](./MiniCPMV2_6_awq/image.png)
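
A minimal serving sketch, assuming a vLLM build that already supports MiniCPM-V 2.6 (the model path is the `quant_path` from quantize.py; the prompt and sampling settings are illustrative only):

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint produced by quantize.py.
llm = LLM(model="/root/ld/ld_model_pretrain/MiniCPM-V-2_6_awq",
          trust_remote_code=True,
          quantization="awq")
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
outputs = llm.generate(["Introduce MiniCPM-V 2.6 in one sentence."], params)
print(outputs[0].outputs[0].text)
```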

================================================
FILE: OCR_Multimodal_Search/finetune/__init__.py
================================================


================================================
FILE: OCR_Multimodal_Search/finetune/app.log
================================================
[File too large to display: 18.7 MB]

================================================
FILE: OCR_Multimodal_Search/finetune/dataset.py
================================================
import copy
import json
import logging
import math
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np
import torch
from PIL import Image
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset
from transformers import AutoProcessor, AutoTokenizer
# Create a logger
logger = logging.getLogger('my_logger')
logger.setLevel(logging.DEBUG)  # set the logger's level

# A handler that writes to a log file
fh = logging.FileHandler('app.log')
fh.setLevel(logging.DEBUG)

# A second handler that prints to the console
ch = logging.StreamHandler()
ch.setLevel(logging.ERROR)

# Shared output format for both handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)

# Attach the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)

# Emit a couple of test records
logger.info('This is an info message')
logger.error('This is an error message')
llama3_chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}"

class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(
        self,
        raw_data,
        transform,
        tokenizer,
        slice_config,
        llm_type="minicpm",
        patch_size=14,
        query_nums=64,
        batch_vision=False,
    ):
        super(SupervisedDataset, self).__init__()
        self.raw_data = raw_data
        self.tokenizer = tokenizer
        self.transform = transform
        self.slice_config = slice_config
        self.llm_type = llm_type
        self.patch_size = patch_size
        self.query_nums=query_nums
        self.batch_vision = batch_vision

    def __len__(self):
        return len(self.raw_data)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        image = Image.open(self.raw_data[i]["image"]).convert("RGB")
        query = self.raw_data[i]["query"]
        query_ids = conversation_to_ids_minicpm(conversation=[{
                "content": query,
                "role": "user"
            },{
                "content": "",
                "role": "assistant"
            }], tokenizer=self.tokenizer)[0]
        if "cpm" in self.llm_type:
            query_ids = [item for sublist in query_ids for item in sublist]
        ret = preprocess(
            image,
            self.raw_data[i]["conversations"],
            self.tokenizer,
            self.transform,
            query_nums=self.query_nums,
            slice_config=self.slice_config,
            llm_type=self.llm_type,
            patch_size=self.patch_size,
            batch_vision=self.batch_vision,
        )
        ret['query_ids'] = torch.tensor(query_ids)
        ret = dict(
            input_ids=ret["input_ids"],
            query_ids = ret['query_ids'],
            position_ids=ret["position_ids"],
            labels=ret["target"],
            attention_mask=torch.ones_like(ret["input_ids"], dtype=torch.bool),
            pixel_values=ret["pixel_values"],
            tgt_sizes=ret["tgt_sizes"],
            image_bound=ret["image_bound"],
        )

        return ret

def data_collator(examples, padding_value=0, max_length=2048):
    def trim_and_pad(seq, batch_first, padding_value):
        return pad_sequence([s[:max_length] for s in seq], batch_first=True, padding_value=padding_value)
    query_ids = trim_and_pad(
        [example["query_ids"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    )
    input_ids = trim_and_pad(
        [example["input_ids"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    )
    position_ids = trim_and_pad(
        [example["position_ids"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    )
    targets = trim_and_pad(
        [example["labels"] for example in examples],
        batch_first=True,
        padding_value=-100,
    )
    attention_mask = trim_and_pad(
        [example["attention_mask"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    )
    pixel_values = [example["pixel_values"] for example in examples]
    image_bound = [example["image_bound"] for example in examples]
    tgt_sizes = [example["tgt_sizes"] for example in examples]
    return {
        "input_ids": input_ids,
        "query_ids" : query_ids,
        "position_ids": position_ids,
        "labels": targets,
        "attention_mask": attention_mask,
        "image_bound": image_bound,
        "tgt_sizes": tgt_sizes,
        "pixel_values": pixel_values,
    }


def conversation_to_ids(conversation, tokenizer, llm_type=None, new_schema=False):
    """
    for single image multi-turn conversation
    conversation: [{'role': 'user', 'content': 'Describe this image'},
                   {'role': 'assistant', 'content': 'This is a cat.'}]
    """
    if llm_type == "llama3":
        input_ids, context, raw_msg = conversation_to_ids_llama3(
            conversation, tokenizer
        )
    elif llm_type == "qwen2":
        input_ids, context, raw_msg = conversation_to_ids_qwen2(
            conversation, tokenizer
        )
    else:
        input_ids, context, raw_msg = conversation_to_ids_minicpm(
            conversation, tokenizer
        )

    ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))
    context = torch.from_numpy(np.hstack(context, dtype=np.int8))

    # build target
    target = torch.full_like(ids, -100, dtype=torch.int32)
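    # context == 1 marks prompt tokens, context == 0 assistant tokens. Labels are
    # shifted left by one (position i-1 predicts token i), so only assistant
    # tokens are supervised, and the last token of each assistant span is
    # additionally labeled with EOS/EOT.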
    
    for i in range(1, len(ids)):
        if context[i] == 0:
            target[i - 1] = ids[i]
        if context[i] == 1 and context[i - 1] == 0:
            if hasattr(tokenizer, "eot_id"):
                target[i - 1] = tokenizer.eot_id
            else:
                target[i - 1] = tokenizer.eos_id
    
    # build image bound
    if new_schema:
        start_cond = (ids == tokenizer.im_start_id) | (ids == tokenizer.slice_start_id)
        end_cond = (ids == tokenizer.im_end_id) | (ids == tokenizer.slice_end_id)
        image_start_tokens = torch.where(start_cond)[0]
        image_start_tokens += 1
        image_end_tokens = torch.where(end_cond)[0]
    else:
        image_start_tokens = torch.where(ids == tokenizer.im_start_id)[0]
        image_start_tokens += 1
        image_end_tokens = torch.where(ids == tokenizer.im_end_id)[0]
    if len(image_start_tokens) != len(image_end_tokens):
        print("image start token != image end tokens")
    
    if len(image_start_tokens) > 0:
        image_bound = torch.hstack(
            [image_start_tokens.unsqueeze(-1), image_end_tokens.unsqueeze(-1)]
        )
    else:
        image_bound = []

    position_ids = torch.arange(ids.size(0)).long()
    return {
        "input_ids": ids,
        "target": target,
        "image_bound": image_bound,
        "raw_msg": raw_msg,
        "position_ids": position_ids
    }


def conversation_to_ids_minicpm(conversation, tokenizer):
    raw_msg = ""
    input_ids = []
    context = []
    for idx, msg in enumerate(conversation):
        role = msg["role"]
        message = msg["content"]
        assert role in ["user", "assistant"]
        if role == "user":
            prefix = "<用户>"
        else:
            prefix = "<AI>"
        # append eos
        if idx == len(conversation) - 1:
            message = message + tokenizer.eos_token
        prefix_ids = tokenizer.encode(prefix)[1:]  # remove bos
        message_ids = tokenizer.encode(message)[1:]

        input_ids.append(prefix_ids)
        input_ids.append(message_ids)

        context.append(np.ones((len(prefix_ids),), dtype=np.int8))
        if role == "assistant":
            context.append(np.zeros((len(message_ids),), dtype=np.int8))
        else:
            context.append(np.ones((len(message_ids),), dtype=np.int8))

        raw_msg += prefix + message

    return input_ids, context, raw_msg


def conversation_to_ids_llama3(conversation, tokenizer):
    raw_msg = ""
    input_ids = []
    context = []
    raw_msg = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=False, chat_template=llama3_chat_template,
    )
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, add_generation_prompt=False, chat_template=llama3_chat_template,
    )
    input_ids = np.array(input_ids)

    start_header_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|start_header_id|>")
    )[0]
    assistant_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("assistant")
    )[0]
    end_header_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|end_header_id|>")
    )[0]
    eot_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|eot_id|>"))[0]

    context = np.ones_like(input_ids, dtype=np.int8)

    for assistant_idx in assistant_idxs:
        if assistant_idx in set((start_header_idxs + end_header_idxs) / 2):
            st = assistant_idx + 3  # assistant<|end_header_id|>\n\n
            for eot_idx in eot_idxs:
                if eot_idx > st:
                    context[st: eot_idx + 1] = 0
                    break

    input_ids = np.hstack(input_ids)
    context = np.hstack(context)

    return input_ids, context, raw_msg


def conversation_to_ids_qwen2(conversation, tokenizer):
    raw_msg = ""
    chat = []
    context = []
    for idx, msg in enumerate(conversation):
        role = msg["role"]
        message = msg["content"]
        assert role in ["user", "assistant"]
        if role == "user":
            prefix = "user"
        else:
            prefix = "assistant"
        chat.append({"role":prefix, "content":message})
        raw_msg += prefix + message
    #assert set([i['role'] for i in chat]) & set(['assistant'])

    ret = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=False)
    input_ids = np.array(input_ids)

    start_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_start|>'))[0]
    assistant_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('assistant'))[0]
    end_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_end|>'))[0]

    context = np.ones_like(input_ids, dtype=np.int8)

    for assistant_idx in assistant_idxs:
        if assistant_idx-1 in set(start_idxs):
            st = assistant_idx + 1
            for end_idx in end_idxs:
                if end_idx > st:
                    context[st: end_idx + 1] = 0
                    break
                    
    input_ids = np.hstack(input_ids)
    context = np.hstack(context)
    return input_ids, context, raw_msg



def preprocess(
    image,
    conversation,
    tokenizer,
    transform,
    query_nums=64,
    slice_config=None,
    llm_type=None,
    patch_size=14,
    batch_vision=False,
):
    """
    single image preprocess, the image will be placed at the top of the conversation
    """
    conversation = copy.deepcopy(conversation)
    assert len(conversation) > 1, "conversation must contain at least 2 turns"
    assert conversation[0]["role"] == "user", "the first role must be user"

    if slice_config is not None:
        assert isinstance(slice_config, Dict)
        assert "patch_size" in slice_config
        assert "max_slice_nums" in slice_config
        assert "scale_resolution" in slice_config
    default_image_placeholder = (
        tokenizer.im_start + tokenizer.unk_token * query_nums + tokenizer.im_end
    )
    new_schema = False
    use_image_id = False
    if llm_type=='qwen2':
        new_schema = True
        use_image_id = True
    if slice_config:
        images = []
        image_id_cnt = 0 
        source_image, patches, best_grid = slice_image(
            image,
            slice_config["max_slice_nums"],
            slice_config["scale_resolution"],
            slice_config["patch_size"],
        )
        images.append(source_image)
        image_placeholder = default_image_placeholder
        if len(patches) > 0:
            for i in range(len(patches)):
                for j in range(len(patches[0])):
                    images.append(patches[i][j])
            if use_image_id:
                image_placeholder = f'{tokenizer.im_id_start}{image_id_cnt}{tokenizer.im_id_end}' + image_placeholder
                image_id_cnt += 1
            image_placeholder += get_grid_placeholder(
                tokenizer, best_grid, query_nums, new_schema = new_schema)
        images = [transform(i) for i in images]
    else:
        images = [transform(image)]
        image_placeholder = default_image_placeholder
    if "<image>" in conversation[0]["content"]:
        conversation[0]["content"] = conversation[0]["content"].replace(
            "<image>", image_placeholder
        )
    else:
        conversation[0]["content"] = (
            image_placeholder + "\n" + conversation[0]["content"]
        )

    input_dict = conversation_to_ids(conversation, tokenizer, llm_type, new_schema)

    if batch_vision:
        tgt_sizes = []
        reshape_images = []
        for image in images:
            H, W = image.shape[1:]
            reshape_image = reshape_by_patch(image, patch_size)
            reshape_images.append(reshape_image)
            tgt_sizes.append([H // patch_size, W // patch_size])
        if tgt_sizes:
            tgt_sizes = torch.Tensor(tgt_sizes).type(torch.int32)

        input_dict["pixel_values"] = reshape_images
        input_dict["tgt_sizes"] = tgt_sizes

    else:
        input_dict["pixel_values"] = images
        input_dict["tgt_sizes"] = []

    return input_dict


def slice_image(
    image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False
):
    original_size = image.size
    original_width, original_height = original_size
    log_ratio = math.log(original_width / original_height)
    ratio = original_width * original_height / \
        (scale_resolution * scale_resolution)
    multiple = min(math.ceil(ratio), max_slice_nums)

    source_image = None
    best_grid = None
    patches = []

    if multiple <= 1 or never_split:
        # dont need to slice, upsample
        best_size = find_best_resize(
            original_size, scale_resolution, patch_size, allow_upscale=True
        )
        source_image = image.resize(best_size, Image.Resampling.BICUBIC)
    else:
        candidate_split_grids_nums = []
        for i in [multiple - 1, multiple, multiple + 1]:
            if i == 1 or i > max_slice_nums:
                continue
            candidate_split_grids_nums.append(i)

        # source image, down-sampling and ensure divided by patch_size
        best_resize = find_best_resize(
            original_size, scale_resolution, patch_size)
        source_image = image.copy().resize(best_resize, Image.Resampling.BICUBIC)
        candidate_grids = []

        # find best grid
        for split_grids_nums in candidate_split_grids_nums:
            m = 1
            while m <= split_grids_nums:
                if split_grids_nums % m == 0:
                    candidate_grids.append([m, split_grids_nums // m])
                m += 1

        best_grid = [1, 1]
        min_error = float("inf")
        for grid in candidate_grids:
            error = abs(log_ratio - math.log(grid[0] / grid[1]))
            if error < min_error:
                best_grid = grid
                min_error = error

        refine_size = get_refine_size(
            original_size, best_grid, scale_resolution, patch_size, allow_upscale=True
        )

        refine_image = image.resize(refine_size, Image.Resampling.BICUBIC)
        patches = split_to_patches(refine_image, best_grid)

    return source_image, patches, best_grid


def ensure_divide(length, patch_size):
    return max(round(length / patch_size) * patch_size, patch_size)


def find_best_resize(original_size, scale_resolution, patch_size, allow_upscale=False):
    width, height = original_size
    if (width * height > scale_resolution * scale_resolution) or allow_upscale:
        r = width / height
        height = int(scale_resolution / math.sqrt(r))
        width = int(height * r)
    best_width = ensure_divide(width, patch_size)
    best_height = ensure_divide(height, patch_size)
    return (best_width, best_height)


def get_refine_size(
    original_size, grid, scale_resolution, patch_size, allow_upscale=False
):
    width, height = original_size
    grid_x, grid_y = grid

    refine_width = ensure_divide(width, grid_x)
    refine_height = ensure_divide(height, grid_y)

    grid_width = refine_width / grid_x
    grid_height = refine_height / grid_y

    best_grid_size = find_best_resize(
        (grid_width, grid_height),
        scale_resolution,
        patch_size,
        allow_upscale=allow_upscale,
    )

    refine_size = (best_grid_size[0] * grid_x, best_grid_size[1] * grid_y)

    return refine_size


def split_to_patches(image, grid):
    patches = []
    width, height = image.size
    grid_x = int(width / grid[0])
    grid_y = int(height / grid[1])

    for i in range(0, height, grid_y):
        images = []
        for j in range(0, width, grid_x):
            box = (j, i, j + grid_x, i + grid_y)
            patch = image.crop(box)
            images.append(patch)
        patches.append(images)

    return patches


def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
    image_placeholder = (
        tokenizer.im_start + tokenizer.unk_token * query_num + tokenizer.im_end
    )

    cols = grid[0]
    rows = grid[1]
    slices = []
    for i in range(rows):
        lines = []
        for j in range(cols):
            lines.append(image_placeholder)
        slices.append("".join(lines))
    if new_schema:
        slice_placeholder = '\n'.join(slices)
    else:
        slice_placeholder = tokenizer.slice_start + \
        "\n".join(slices) + tokenizer.slice_end
    return slice_placeholder


def reshape_by_patch(image_tensor, patch_size):
    """
    :param image_tensor: shape [3, H, W]
    :param patch_size:
    :return: [3, patch_size, HW/patch_size]
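    e.g. a [3, 28, 28] input with patch_size=14 gives [3, 14, 56]  (56 = 28*28/14)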
    """
    patches = torch.nn.functional.unfold(
        image_tensor, (patch_size, patch_size), stride=(patch_size, patch_size)
    )

    patches = patches.reshape(image_tensor.size(0), patch_size, patch_size, -1)
    patches = patches.permute(0, 1, 3, 2).reshape(
        image_tensor.size(0), patch_size, -1)
    return patches

================================================
FILE: OCR_Multimodal_Search/finetune/dataset_original.py
================================================
import copy
import json
import logging
import math
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np
import torch
from PIL import Image
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset
from transformers import AutoProcessor, AutoTokenizer

llama3_chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}"

class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(
        self,
        raw_data,
        transform,
        tokenizer,
        slice_config,
        llm_type="minicpm",
        patch_size=14,
        query_nums=64,
        batch_vision=False,
    ):
        super(SupervisedDataset, self).__init__()
        self.raw_data = raw_data
        self.tokenizer = tokenizer
        self.transform = transform
        self.slice_config = slice_config
        self.llm_type = llm_type
        self.patch_size = patch_size
        self.query_nums=query_nums
        self.batch_vision = batch_vision

    def __len__(self):
        return len(self.raw_data)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        image = Image.open(self.raw_data[i]["image"]).convert("RGB")
        ret = preprocess(
            image,
            self.raw_data[i]["conversations"],
            self.tokenizer,
            self.transform,
            query_nums=self.query_nums,
            slice_config=self.slice_config,
            llm_type=self.llm_type,
            patch_size=self.patch_size,
            batch_vision=self.batch_vision,
        )
        ret = dict(
            input_ids=ret["input_ids"],
            position_ids=ret["position_ids"],
            labels=ret["target"],
            attention_mask=torch.ones_like(ret["input_ids"], dtype=torch.bool),
            pixel_values=ret["pixel_values"],
            tgt_sizes=ret["tgt_sizes"],
            image_bound=ret["image_bound"],
        )

        return ret

def data_collator(examples, padding_value=0, max_length=2048):
    def trim_and_pad(seq, batch_first, padding_value):
        return pad_sequence([s[:max_length] for s in seq], batch_first=True, padding_value=padding_value)

    input_ids = trim_and_pad(
        [example["input_ids"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    )
    position_ids = trim_and_pad(
        [example["position_ids"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    )
    targets = trim_and_pad(
        [example["labels"] for example in examples],
        batch_first=True,
        padding_value=-100,
    )
    attention_mask = trim_and_pad(
        [example["attention_mask"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    )
    pixel_values = [example["pixel_values"] for example in examples]
    image_bound = [example["image_bound"] for example in examples]
    tgt_sizes = [example["tgt_sizes"] for example in examples]
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "labels": targets,
        "attention_mask": attention_mask,
        "image_bound": image_bound,
        "tgt_sizes": tgt_sizes,
        "pixel_values": pixel_values,
    }


def conversation_to_ids(conversation, tokenizer, llm_type=None, new_schema=False):
    """
    for single image multi-turn conversation
    conversation: [{'role': 'user', 'content': 'Describe this image'},
                   {'role': 'assistant', 'content': 'This is a cat.'}]
    """
    if llm_type == "llama3":
        input_ids, context, raw_msg = conversation_to_ids_llama3(
            conversation, tokenizer
        )
    elif llm_type == "qwen2":
        input_ids, context, raw_msg = conversation_to_ids_qwen2(
            conversation, tokenizer
        )
    else:
        input_ids, context, raw_msg = conversation_to_ids_minicpm(
            conversation, tokenizer
        )

    ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))
    context = torch.from_numpy(np.hstack(context, dtype=np.int8))

    # build target
    target = torch.full_like(ids, -100, dtype=torch.int32)
    
    for i in range(1, len(ids)):
        if context[i] == 0:
            target[i - 1] = ids[i]
        if context[i] == 1 and context[i - 1] == 0:
            if hasattr(tokenizer, "eot_id"):
                target[i - 1] = tokenizer.eot_id
            else:
                target[i - 1] = tokenizer.eos_id
    
    # build image bound
    if new_schema:
        start_cond = (ids == tokenizer.im_start_id) | (ids == tokenizer.slice_start_id)
        end_cond = (ids == tokenizer.im_end_id) | (ids == tokenizer.slice_end_id)
        image_start_tokens = torch.where(start_cond)[0]
        image_start_tokens += 1
        image_end_tokens = torch.where(end_cond)[0]
    else:
        image_start_tokens = torch.where(ids == tokenizer.im_start_id)[0]
        image_start_tokens += 1
        image_end_tokens = torch.where(ids == tokenizer.im_end_id)[0]
    if len(image_start_tokens) != len(image_end_tokens):
        print("image start token != image end tokens")
    
    if len(image_start_tokens) > 0:
        image_bound = torch.hstack(
            [image_start_tokens.unsqueeze(-1), image_end_tokens.unsqueeze(-1)]
        )
    else:
        image_bound = []

    position_ids = torch.arange(ids.size(0)).long()
    return {
        "input_ids": ids,
        "target": target,
        "image_bound": image_bound,
        "raw_msg": raw_msg,
        "position_ids": position_ids
    }


def conversation_to_ids_minicpm(conversation, tokenizer):
    raw_msg = ""
    input_ids = []
    context = []
    for idx, msg in enumerate(conversation):
        role = msg["role"]
        message = msg["content"]
        assert role in ["user", "assistant"]
        if role == "user":
            prefix = "<用户>"
        else:
            prefix = "<AI>"
        # append eos
        if idx == len(conversation) - 1:
            message = message + tokenizer.eos_token
        prefix_ids = tokenizer.encode(prefix)[1:]  # remove bos
        message_ids = tokenizer.encode(message)[1:]

        input_ids.append(prefix_ids)
        input_ids.append(message_ids)

        context.append(np.ones((len(prefix_ids),), dtype=np.int8))
        if role == "assistant":
            context.append(np.zeros((len(message_ids),), dtype=np.int8))
        else:
            context.append(np.ones((len(message_ids),), dtype=np.int8))

        raw_msg += prefix + message

    return input_ids, context, raw_msg


def conversation_to_ids_llama3(conversation, tokenizer):
    raw_msg = ""
    input_ids = []
    context = []
    raw_msg = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=False, chat_template=llama3_chat_template,
    )
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, add_generation_prompt=False, chat_template=llama3_chat_template,
    )
    input_ids = np.array(input_ids)

    start_header_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|start_header_id|>")
    )[0]
    assistant_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("assistant")
    )[0]
    end_header_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|end_header_id|>")
    )[0]
    eot_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|eot_id|>"))[0]

    context = np.ones_like(input_ids, dtype=np.int8)

    for assistant_idx in assistant_idxs:
        if assistant_idx in set((start_header_idxs + end_header_idxs) / 2):
            st = assistant_idx + 3  # assistant<|end_header_id|>\n\n
            for eot_idx in eot_idxs:
                if eot_idx > st:
                    context[st: eot_idx + 1] = 0
                    break

    input_ids = np.hstack(input_ids)
    context = np.hstack(context)

    return input_ids, context, raw_msg


def conversation_to_ids_qwen2(conversation, tokenizer):
    raw_msg = ""
    chat = []
    context = []
    for idx, msg in enumerate(conversation):
        role = msg["role"]
        message = msg["content"]
        assert role in ["user", "assistant"]
        if role == "user":
            prefix = "user"
        else:
            prefix = "assistant"
        chat.append({"role":prefix, "content":message})
        raw_msg += prefix + message
    assert set([i['role'] for i in chat]) & set(['assistant'])

    ret = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=False)
    input_ids = np.array(input_ids)

    start_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_start|>'))[0]
    assistant_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('assistant'))[0]
    end_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_end|>'))[0]

    context = np.ones_like(input_ids, dtype=np.int8)

    for assistant_idx in assistant_idxs:
        if assistant_idx-1 in set(start_idxs):
            st = assistant_idx + 1
            for end_idx in end_idxs:
                if end_idx > st:
                    context[st: end_idx + 1] = 0
                    break
                    
    input_ids = np.hstack(input_ids)
    context = np.hstack(context)
    return input_ids, context, raw_msg



def preprocess(
    image,
    conversation,
    tokenizer,
    transform,
    query_nums=64,
    slice_config=None,
    llm_type=None,
    patch_size=14,
    batch_vision=False,
):
    """
    single image preprocess, the image will be placed at the top of the conversation
    """
    conversation = copy.deepcopy(conversation)
    assert len(conversation) > 1, "conversation must contain at least 2 turns"
    assert conversation[0]["role"] == "user", "the first role must be user"

    if slice_config is not None:
        assert isinstance(slice_config, Dict)
        assert "patch_size" in slice_config
        assert "max_slice_nums" in slice_config
        assert "scale_resolution" in slice_config
    default_image_placeholder = (
        tokenizer.im_start + tokenizer.unk_token * query_nums + tokenizer.im_end
    )
    new_schema = False
    use_image_id = False
    if llm_type=='qwen2':
        new_schema = True
        use_image_id = True
    if slice_config:
        images = []
        image_id_cnt = 0 
        source_image, patches, best_grid = slice_image(
            image,
            slice_config["max_slice_nums"],
            slice_config["scale_resolution"],
            slice_config["patch_size"],
        )
        images.append(source_image)
        image_placeholder = default_image_placeholder
        if len(patches) > 0:
            for i in range(len(patches)):
                for j in range(len(patches[0])):
                    images.append(patches[i][j])
            if use_image_id:
                image_placeholder = f'{tokenizer.im_id_start}{image_id_cnt}{tokenizer.im_id_end}' + image_placeholder
                image_id_cnt += 1
            image_placeholder += get_grid_placeholder(
                tokenizer, best_grid, query_nums, new_schema = new_schema)
        images = [transform(i) for i in images]
    else:
        images = [transform(image)]
        image_placeholder = default_image_placeholder
    if "<image>" in conversation[0]["content"]:
        conversation[0]["content"] = conversation[0]["content"].replace(
            "<image>", image_placeholder
        )
    else:
        conversation[0]["content"] = (
            image_placeholder + "\n" + conversation[0]["content"]
        )

    input_dict = conversation_to_ids(conversation, tokenizer, llm_type, new_schema)

    if batch_vision:
        tgt_sizes = []
        reshape_images = []
        for image in images:
            H, W = image.shape[1:]
            reshape_image = reshape_by_patch(image, patch_size)
            reshape_images.append(reshape_image)
            tgt_sizes.append([H // patch_size, W // patch_size])
        if tgt_sizes:
            tgt_sizes = torch.Tensor(tgt_sizes).type(torch.int32)

        input_dict["pixel_values"] = reshape_images
        input_dict["tgt_sizes"] = tgt_sizes

    else:
        input_dict["pixel_values"] = images
        input_dict["tgt_sizes"] = []

    return input_dict


def slice_image(
    image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False
):
    original_size = image.size
    original_width, original_height = original_size
    log_ratio = math.log(original_width / original_height)
    ratio = original_width * original_height / \
        (scale_resolution * scale_resolution)
    multiple = min(math.ceil(ratio), max_slice_nums)

    source_image = None
    best_grid = None
    patches = []

    if multiple <= 1 or never_split:
        # dont need to slice, upsample
        best_size = find_best_resize(
            original_size, scale_resolution, patch_size, allow_upscale=True
        )
        source_image = image.resize(best_size, Image.Resampling.BICUBIC)
    else:
        candidate_split_grids_nums = []
        for i in [multiple - 1, multiple, multiple + 1]:
            if i == 1 or i > max_slice_nums:
                continue
            candidate_split_grids_nums.append(i)

        # source image, down-sampling and ensure divided by patch_size
        best_resize = find_best_resize(
            original_size, scale_resolution, patch_size)
        source_image = image.copy().resize(best_resize, Image.Resampling.BICUBIC)
        candidate_grids = []

        # find best grid
        for split_grids_nums in candidate_split_grids_nums:
            m = 1
            while m <= split_grids_nums:
                if split_grids_nums % m == 0:
                    candidate_grids.append([m, split_grids_nums // m])
                m += 1

        best_grid = [1, 1]
        min_error = float("inf")
        for grid in candidate_grids:
            error = abs(log_ratio - math.log(grid[0] / grid[1]))
            if error < min_error:
                best_grid = grid
                min_error = error

        refine_size = get_refine_size(
            original_size, best_grid, scale_resolution, patch_size, allow_upscale=True
        )

        refine_image = image.resize(refine_size, Image.Resampling.BICUBIC)
        patches = split_to_patches(refine_image, best_grid)

    return source_image, patches, best_grid


def ensure_divide(length, patch_size):
    return max(round(length / patch_size) * patch_size, patch_size)


def find_best_resize(original_size, scale_resolution, patch_size, allow_upscale=False):
    width, height = original_size
    if (width * height > scale_resolution * scale_resolution) or allow_upscale:
        r = width / height
        height = int(scale_resolution / math.sqrt(r))
        width = int(height * r)
    best_width = ensure_divide(width, patch_size)
    best_height = ensure_divide(height, patch_size)
    return (best_width, best_height)


def get_refine_size(
    original_size, grid, scale_resolution, patch_size, allow_upscale=False
):
    width, height = original_size
    grid_x, grid_y = grid

    refine_width = ensure_divide(width, grid_x)
    refine_height = ensure_divide(height, grid_y)

    grid_width = refine_width / grid_x
    grid_height = refine_height / grid_y

    best_grid_size = find_best_resize(
        (grid_width, grid_height),
        scale_resolution,
        patch_size,
        allow_upscale=allow_upscale,
    )

    refine_size = (best_grid_size[0] * grid_x, best_grid_size[1] * grid_y)

    return refine_size


def split_to_patches(image, grid):
    patches = []
    width, height = image.size
    grid_x = int(width / grid[0])
    grid_y = int(height / grid[1])

    for i in range(0, height, grid_y):
        images = []
        for j in range(0, width, grid_x):
            box = (j, i, j + grid_x, i + grid_y)
            patch = image.crop(box)
            images.append(patch)
        patches.append(images)

    return patches


def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
    image_placeholder = (
        tokenizer.im_start + tokenizer.unk_token * query_num + tokenizer.im_end
    )

    cols = grid[0]
    rows = grid[1]
    slices = []
    for i in range(rows):
        lines = []
        for j in range(cols):
            lines.append(image_placeholder)
        slices.append("".join(lines))
    if new_schema:
        slice_placeholder = '\n'.join(slices)
    else:
        slice_placeholder = tokenizer.slice_start + \
        "\n".join(slices) + tokenizer.slice_end
    return slice_placeholder


def reshape_by_patch(image_tensor, patch_size):
    """
    :param image_tensor: shape [3, H, W]
    :param patch_size:
    :return: [3, patch_size, HW/patch_size]
    """
    patches = torch.nn.functional.unfold(
        image_tensor, (patch_size, patch_size), stride=(patch_size, patch_size)
    )

    patches = patches.reshape(image_tensor.size(0), patch_size, patch_size, -1)
    patches = patches.permute(0, 1, 3, 2).reshape(
        image_tensor.size(0), patch_size, -1)
    return patches

================================================
FILE: OCR_Multimodal_Search/finetune/ds_config_zero2.json
================================================
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}


================================================
FILE: OCR_Multimodal_Search/finetune/ds_config_zero3.json
================================================

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}



================================================
FILE: OCR_Multimodal_Search/finetune/finetune.py
================================================
import glob
import json
import logging
import os
from dataclasses import dataclass, field
from functools import partial
from typing import Dict, List, Optional, Union, Literal, Tuple
from types import MethodType
from torchvision import transforms

import torch
import transformers

from accelerate.utils import DistributedType
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
from transformers.integrations import deepspeed

from dataset import SupervisedDataset, data_collator
from trainer import CPMTrainer

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="openbmb/MiniCPM-V-2")


@dataclass
class DataArguments:
    data_path: str = field(
        default=None, metadata={"help": "Path to the training data."}
    )
    eval_data_path: str = field(
        default=None, metadata={"help": "Path to the evaluation data."}
    )


@dataclass
class TrainingArguments(transformers.TrainingArguments):
    #gradient_checkpointing_kwarg={'use_reentrant':False}
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=2048,
        metadata={
            "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
        },
    )
    tune_vision: Optional[bool] = field(default=True)
    tune_llm: Optional[bool] = field(default=True)
    llm_type: str = field(default="minicpm")
    use_lora: Optional[bool] = field(default=False)
    max_slice_nums: Optional[int] = field(default=9)
    lr_scheduler_type: str = field(default="cosine")
    # optim: Union[str] = field(default="galore_adamw")
    # optim_target_modules: Union[str]=field(default=r"(.*\.self_attn\..*)|(.*\.mlp\..*)")


@dataclass
class LoraArguments:
    lora_r: int = 64
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    lora_target_modules: str = r"llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj)"
    lora_weight_path: str = ""
    lora_bias: str = "none"
    q_lora: bool = False
    lora_modules_to_save: str = ""
    lora_layer_replication: Optional[List[Tuple[int, int]]] = None
    lora_layers_to_transform: Optional[List[int]] = None
    lora_layers_pattern: Optional[str] = None

local_rank = None
def rank0_print(*args):
    if local_rank == 0:
        print(*args)


def safe_save_model_for_hf_trainer(trainer, output_dir: str, bias="none"):
    """Collects the state dict and dump to disk."""
    if trainer.args.should_save and trainer.args.local_rank == 0:
        trainer.save_model(output_dir,)


def make_supervised_data_module(
    tokenizer: transformers.PreTrainedTokenizer,
    data_args,
    transform,
    data_collator=None,
    llm_type="minicpm",
    slice_config=None,
    patch_size=14,
    query_nums=64,
    batch_vision=False,
    max_length=2048,
) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    dataset_cls = SupervisedDataset

    rank0_print("Loading data...")

    train_json = json.load(open(data_args.data_path, "r"))
    train_dataset = dataset_cls(
        train_json,
        transform,
        tokenizer,
        slice_config=slice_config,
        llm_type=llm_type,
        patch_size=patch_size,
        query_nums=query_nums,
        batch_vision=batch_vision,
    )

    if data_args.eval_data_path:
        eval_json = json.load(open(data_args.eval_data_path, "r"))
        eval_dataset = dataset_cls(
            eval_json,
            transform,
            tokenizer,
            slice_config=slice_config,
            llm_type=llm_type,
            patch_size=patch_size,
            query_nums=query_nums,
            batch_vision=batch_vision,
        )
    else:
        eval_dataset = None

    return dict(
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator= partial(data_collator, max_length=max_length),
    )


def build_transform():
    IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5) # timm.data.IMAGENET_INCEPTION_MEAN
    IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)  # timm.data.IMAGENET_INCEPTION_STD
    return transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD
                ),
            ]
        )

def get_parameter_number(model):
    trainable_params, all_param = 0, 0
    for param in model.parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
        
    return {'Total': all_param, 'Trainable': trainable_params}


local_rank = 0


def train():
    global local_rank
    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments, LoraArguments)
    )

    (
        model_args,
        data_args,
        training_args,
        lora_args,
    ) = parser.parse_args_into_dataclasses()

    if getattr(training_args, "deepspeed", None) : 
        training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED

    compute_dtype = (
        torch.float16
        if training_args.fp16
        else (torch.bfloat16 if training_args.bf16 else torch.float32)
    )

    local_rank = training_args.local_rank
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    ddp = world_size != 1
    device_map = None
    if lora_args.q_lora:
        device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,  # quantize to 4 bit
            load_in_8bit=False,  # do not quantize to 8 bit
            bnb_4bit_compute_dtype=torch.float16,  # compute dtype
            bnb_4bit_quant_storage=torch.uint8,  # storage dtype of the quantized weights
            bnb_4bit_quant_type="nf4",  # quantization format: normal-distribution-aware int4
            bnb_4bit_use_double_quant=True,  # double quantization: also quantize the zero-point and scaling parameters
            llm_int8_enable_fp32_cpu_offload=False,  # keep the LLM in int8 on GPU; CPU-offloaded parameters stay fp32
            llm_int8_has_fp16_weight=False,  # whether to enable mixed precision
            llm_int8_skip_modules=["text_proj", "image_proj"],  # modules excluded from quantization
            llm_int8_threshold=6.0,  # outlier threshold of the llm.int8() algorithm deciding what stays unquantized
        )
        model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            torch_dtype=compute_dtype,
            device_map=device_map,
            quantization_config=quantization_config,
        )
        if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
            logging.warning(
                "FSDP and ZeRO3 are incompatible with QLoRA."
            )
    else:
        model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            torch_dtype=compute_dtype,
            device_map=device_map,
        )

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=True
    )
    model.text_proj.requires_grad_(True)

    if not training_args.tune_vision:
        model.vpm.requires_grad_(False)
    if not training_args.tune_llm:
        model.llm.requires_grad_(False)
        
    if training_args.use_lora:
        if training_args.use_lora and training_args.tune_llm:
            raise ValueError("The model cannot simultaneously adjust LLM parameters and apply LoRA.")
            
        rank0_print("Currently using LoRA for fine-tuning the MiniCPM-V model.")
        for name, param in model.llm.named_parameters():
            param.requires_grad = False
        modules_to_save = ['embed_tokens','resampler']
        if training_args.tune_vision:
            modules_to_save.append('vpm')
        lora_config = LoraConfig(
            r=lora_args.lora_r,
            lora_alpha=lora_args.lora_alpha,
            target_modules=lora_args.lora_target_modules,
            lora_dropout=lora_args.lora_dropout,
            bias=lora_args.lora_bias,
            layers_to_transform=lora_args.lora_layers_to_transform,
            modules_to_save=modules_to_save,
        )
        if not hasattr(model, 'get_input_embeddings'):
            def get_input_embeddings(self):
                return self.llm.get_input_embeddings()
            model.get_input_embeddings = MethodType(get_input_embeddings, model)
        if lora_args.q_lora:
            model = prepare_model_for_kbit_training(
                model, use_gradient_checkpointing=training_args.gradient_checkpointing
            )
        model = get_peft_model(model, lora_config)
        if training_args.gradient_checkpointing:
            model.enable_input_require_grads()

    rank0_print(get_parameter_number(model))

    llm_type = training_args.llm_type    
    
    rank0_print(f'llm_type={llm_type}')

    
    # Load data
    if hasattr(model.config, "slice_config"):
        model.config.slice_config.max_slice_nums = training_args.max_slice_nums
        slice_config = model.config.slice_config.to_dict()
    else:
        model.config.max_slice_nums = training_args.max_slice_nums
        slice_config = model.config.to_dict()

    if hasattr(model.config, "batch_vision_input"):
        batch_vision = model.config.batch_vision_input
    else:
        batch_vision = False

    transform_func = build_transform()
    data_module = make_supervised_data_module(
        tokenizer=tokenizer,
        data_args=data_args,
        transform=transform_func,
        data_collator=data_collator,
        slice_config=slice_config,
        llm_type=llm_type,
        patch_size=model.config.patch_size,
        query_nums=model.config.query_num,
        batch_vision=batch_vision,
        max_length=training_args.model_max_length,
    )
    training_args.gradient_checkpointing_kwargs = {
        "use_reentrant": False
    }
    
    trainer = CPMTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        **data_module,
    )

    trainer.train()
    trainer.save_state()

    safe_save_model_for_hf_trainer(
        trainer=trainer,
        output_dir=training_args.output_dir,
        bias=lora_args.lora_bias)


if __name__ == "__main__":
    train()


================================================
FILE: OCR_Multimodal_Search/finetune/finetune_ds.sh
================================================
#!/bin/bash

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="openbmb/MiniCPM-V-2_6"
# or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path/to/trainging_data"
EVAL_DATA="path/to/test_data"
LLM_TYPE="qwen2" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3"



DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py  \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 true \
    --bf16_full_eval true \
    --fp16 false \
    --fp16_full_eval false \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm true \
    --model_max_length 2048 \
    --max_slice_nums 9 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/output_minicpmv26 \
    --logging_dir output/output_minicpmv26 \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero2.json \
    --report_to "tensorboard" 


================================================
FILE: OCR_Multimodal_Search/finetune/finetune_lora.sh
================================================
#!/bin/bash

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="/root/ld/ld_model_pretrained/minicpm-v" # or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/root/ld/ld_dataset/pdf_cn_30k_search_train1.json"
EVAL_DATA="/root/ld/ld_dataset/pdf_cn_30k_search_eval1.json"
LLM_TYPE="minicpm" 

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1 
# if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm
# if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE=llama3
DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py  \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --do_train \
    --do_eval \
    --tune_vision false \
    --tune_llm false \
    --use_lora true \
    --model_max_length 700 \
    --max_slice_nums 9 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/output_lora \
    --logging_dir output/output_lora \
    --logging_strategy "steps" \
    --per_device_train_batch_size 12 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed /root/ld/ld_project/pull_request/MiniCPM-V/finetune/ds_config_zero2.json \
    --report_to "tensorboard" # wandb


================================================
FILE: OCR_Multimodal_Search/finetune/readme.md
================================================
# MiniCPM-V Finetuning


We offer the official scripts for easy finetuning of the pretrained **MiniCPM-V-2_6**, **MiniCPM-Llama3-V 2.5** and **MiniCPM-V 2.0** on downstream tasks. Our finetune scripts use transformers Trainer and DeepSpeed by default.

### Data preparation

To prepare your finetuning data, format each sample as a dictionary consisting of an id, an image path, and a list of conversations, then save the samples in a JSON file.

For vision-language samples with an image, you are required to provide **\<image\>** to define the position where the image embeddings are inserted. If you don't provide \<image\>, the image will be placed at the front of the conversation.

<details>
  <summary>
    <b>vision-language example (vl_finetune_data.json) with 1 sample.</b>
  </summary>

```json
[
  {
    "id": "0",
    "image": "path/to/image_0.jpg",
    "conversations": [
      {
        "role": "user",
        "content": "<image>\nHow many desserts are on the white plate?"
      },
      {
        "role": "assistant",
        "content": "There are three desserts on the white plate."
      },
      {
        "role": "user",
        "content": "What type of desserts are they?"
      },
      {
        "role": "assistant",
        "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
      },
      {
        "role": "user",
        "content": "What is the setting of the image?"
      },
      {
        "role": "assistant",
        "content": "The image is set on a table top with a plate containing the three desserts."
      }
    ]
  }
]
```

</details>
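
A quick way to catch formatting mistakes before launching a run is to validate the JSON yourself. The snippet below is a minimal sketch (the file name `vl_finetune_data.json` is a placeholder, and the checks mirror the example above):

```python
import json

# Hypothetical path; point this at your own training file.
with open("vl_finetune_data.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    # Each sample needs an id, an image path, and a conversation list.
    assert {"id", "image", "conversations"} <= sample.keys(), sample
    roles = [turn["role"] for turn in sample["conversations"]]
    assert roles[0] == "user", f"sample {sample['id']} must start with a user turn"
    # <image> marks where the image embeddings are inserted; if the tag is
    # missing, the image is placed at the front of the conversation.
    if "<image>" not in sample["conversations"][0]["content"]:
        print(f"sample {sample['id']}: no <image> tag; image will be prepended")
```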

### Full-parameter finetuning

Full-parameter finetuning updates all LLM parameters throughout training. Please specify the correct MODEL path, DATA path and LLM_TYPE in the shell scripts.

```shell
MODEL="openbmb/MiniCPM-V-2_6" # or openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
DATA="path/to/trainging_data" # json file
EVAL_DATA="path/to/test_data" # json file
LLM_TYPE="qwen2" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3"
```

To launch your training, run the following script:

```
sh finetune_ds.sh
```

Note that Llama3 uses a different chat_template for training versus inference: we modified the chat_template for training, so take care to restore the original chat_template when running inference on the trained checkpoint.
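
For example, a minimal sketch for restoring the stock template at inference time (assuming you fine-tuned from `openbmb/MiniCPM-Llama3-V-2_5`) is to reload the tokenizer from the original model repo rather than from the training checkpoint:

```python
from transformers import AutoTokenizer

# Reload the tokenizer from the original model repo so the default
# chat_template is used instead of the modified training template.
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
)
```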

### LoRA finetuning

LoRA enables lightweight model tuning with only a small subset of parameters updated. We provide a LoRA implementation based on `peft`. To launch your training, run the following script:

```
sh finetune_lora.sh
```

After training, you can load the model with the path to the adapter. We advise using an absolute path for your pretrained model, because LoRA saves only the adapter, and the absolute path recorded in the adapter configuration JSON is used to locate the pretrained model to load.

```python
from peft import PeftModel
from transformers import AutoModel

model_type = "openbmb/MiniCPM-V-2_6"  # or openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
path_to_adapter = "path_to_your_fine_tuned_checkpoint"

model = AutoModel.from_pretrained(
    model_type,
    trust_remote_code=True
)

lora_model = PeftModel.from_pretrained(
    model,
    path_to_adapter,
    device_map="auto",
    trust_remote_code=True
).eval().cuda()
```
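
If you prefer a standalone checkpoint instead of a base model plus adapter, `peft` can fold the LoRA weights into the base model. A minimal sketch, continuing from the `lora_model` above (the output directory is a placeholder):

```python
# Merge the LoRA deltas into the base weights and drop the adapter wrappers.
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("path_to_merged_model")
```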


### Model Fine-tuning Memory Usage Statistics

The following table presents the memory usage of the model when fine-tuning on NVIDIA A100 (80GiB) GPUs with different numbers of GPUs. The fine-tuning was performed with DeepSpeed ZeRO-3 optimization, gradient checkpointing, and offloading of the optimizer state and parameters to CPU, with the maximum length set to 2048 and the batch size set to 1. You can refer to [deepspeed zero stage](https://huggingface.co/docs/transformers/v4.41.2/en/deepspeed#select-a-zero-stage) to further reduce memory cost.

| Fine-tuning Method | GPUs: 2 | GPUs: 4 | GPUs: 8 |
|--------------------|---------|---------|---------|
| LoRA Fine-tuning            | 14.4 GiB | 13.6 GiB | 13.1 GiB  |
| Full Parameters Fine-tuning | 16.0 GiB | 15.8 GiB | 15.63 GiB |

### Notes
- **Fine-tuning Method**: Displays two different fine-tuning strategies, LoRA fine-tuning and Full parameters fine-tuning.
- **Number of GPUs**: The table lists the memory usage for configurations with 2, 4, and 8 GPUs.
- **Memory Usage**: Expressed in GiB, this shows the required memory for each fine-tuning method under corresponding GPU configurations.
- **Out of memory**: Would mark configurations where memory was insufficient for fine-tuning; none of the configurations above hit this limit.

### Finetuning FAQs

<details>
<summary>Q: What can I do when I encounter Out of Memory (OOM) issues during training?</summary>

A: When you face Out of Memory (OOM) issues while training large models, the following strategies may help resolve or mitigate the problem:
#### Adjust Model Hyperparameters
- **Reduce `model_max_length`**: Decreasing the maximum sequence length the model processes can significantly reduce the memory required for each step. For example, reduce the maximum length from 2048 to 1200 or another value suited to your dataset.
```
--model_max_length 1200
```
- **Lower the batch size**: Reducing the amount of data processed in each batch helps decrease memory consumption.
```
--per_device_train_batch_size 1
```
- **Reduce the number of slices (`slice`)**: When handling large image files, reducing the number of slices processed for each image can lower memory requirements.
```
--max_slice_nums 9 
```

#### Reduce Training Model Parameters
- **Do not train VPM (Visual Processing Module)**: You can adjust hyperparameters in the finetune script to opt out of training the visual processing module to save memory.
```
--tune_vision false
```
- **Use LoRA finetuning**: Refer to the [LoRA finetuning](#LoRA-finetuning) section.

#### Optimize with DeepSpeed
- **Configure DeepSpeed Zero Stage 2**: Use the following configuration to offload optimizer parameters to the CPU, reducing memory pressure on the GPU:
  ```json
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
  ```

- **Configure DeepSpeed Zero Stage 3**: Further offload model parameters as well as optimizer state to the CPU, reducing GPU memory usage even more:
```json
"zero_optimization": {
  "stage": 3,
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  },
  "offload_param": {
    "device": "cpu",
    "pin_memory": true
  }
}
```
You can visit [huggingface deepspeed](https://huggingface.co/docs/transformers/deepspeed) to find out more about how to use DeepSpeed.
</details>
<details>
<summary>Q: I encounter an error when using `AutoPeftModelForCausalLM` to load a checkpoint that has undergone LoRA fine-tuning</summary>

A: The error described in [issue 168](https://github.com/OpenBMB/MiniCPM-V/issues/168) occurs because the model lacks the `get_input_embeddings` and `set_input_embeddings` methods. Follow these steps to resolve the issue:

1. **Reload the fine-tuned model:** Make sure you correctly load the checkpoint that was fine-tuned with LoRA. Use the following code example as a guide:

   ```python
   from peft import AutoPeftModel

   path_to_adapter = "path_to_your_fine_tuned_checkpoint"

   model = AutoPeftModel.from_pretrained(
       # path to the output directory
       path_to_adapter,
       device_map="auto",
       trust_remote_code=True
   ).eval().cuda()
   ```
2. **Update the `modeling_minicpmv.py` file:**
   - **Verification:** Make sure your `modeling_minicpmv.py` file is up to date.
   - **Update Hugging Face library code:** If the issue persists after updating the file, consider updating the related code in the Hugging Face library.
   - **Direct file copy:** For a quick resolution, download the latest `modeling_minicpmv.py` and copy it directly into your project. The file is available from the following sources:
     - [MiniCPM-Llama3-V-2_5 on Hugging Face](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/tree/main)
     - [MiniCPM-V-2 on Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2)
</details>

<details>
<summary>Q: How do I use the `flash_attention_2` implementation when loading a pretrained model?</summary>

A: If your environment supports `flash_attn2`, you can add an argument `_attn_implementation="flash_attention_2"` when using the `AutoModel.from_pretrained` method to load a model. For example:

```python
model = AutoModel.from_pretrained('model_name', _attn_implementation="flash_attention_2")
```
</details>

<details>
<summary>Q: What if our data is resized to 512? Can we use the original image size instead?</summary>

A: Our model supports up to 1344x1344 lossless encoding. If you are currently resizing your images to 512, you might want to try using the original image sizes instead. Our system automatically includes a high-definition image encoding scheme by default.

</details>

<details>
<summary>Q: What should we do if we encounter out-of-memory (OOM) errors?</summary>

A: If you experience OOM issues, consider reducing the batch size (`bs`). To maintain an equivalent total batch size, adjust `gradient_accumulation_steps` proportionally; for example, halving the per-device batch size from 12 to 6 while doubling `gradient_accumulation_steps` from 4 to 8 keeps the amount of data processed per optimizer step unchanged.
</details>

<details>
<summary>Q: How can we determine the maximum length for our training data, and what if we do not want to train the vision encoder?</summary>

A: I recommend using this function [here](https://github.com/OpenBMB/MiniCPM-V/blob/main/finetune/dataset.py#L220) to sample the length of your training data. Note that the `input_ids` length includes the image portion. Once you determine the maximum length, you can specify it in the startup command using `--model_max_length xxx`.

Additionally, if you prefer not to train the vision encoder, you can add `--tune_vision false` to your command.
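
As a rough starting point, you can also estimate the text-only token lengths of your data directly; note this undercounts the true `input_ids` length because the image placeholder tokens are added on top. A sketch (the model id and file path are placeholders):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True
)

with open("vl_finetune_data.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

lengths = []
for sample in samples:
    text = "".join(turn["content"] for turn in sample["conversations"])
    # Text tokens only; image slices add roughly query_num tokens each.
    lengths.append(len(tokenizer.encode(text)))

print("max text-only length:", max(lengths))
```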

</details>

<details>
<summary>Q: How can we adjust training hyperparameters when using LoRA to train our model?</summary>

A: You can refer to the [LoRA documentation](https://huggingface.co/docs/peft/en/package_reference/lora#peft.LoraConfig) for guidance on adjusting your training hyperparameters when using LoRA. This documentation provides detailed information on configuring various parameters specific to the LoRA adaptation technique.
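
For instance, a sketch of a `LoraConfig` with the commonly tuned fields (the values and the `target_modules` regex are illustrative, not the project's defaults):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # rank of the low-rank update matrices
    lora_alpha=64,      # scaling factor; the update is scaled by alpha / r
    lora_dropout=0.05,  # dropout on the LoRA branch during training
    # Regex assumed to match the attention projections of the inner LLM.
    target_modules=r"llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj|o_proj)",
    bias="none",
)
```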
</details>

#### Customizing Hyperparameters
To tailor the training process according to your specific requirements, you can adjust various hyperparameters. For comprehensive documentation on available hyperparameters and their functionalities, you can refer to the [official Transformers documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) and [LoRA documentation](https://huggingface.co/docs/peft/en/package_reference/lora#peft.LoraConfig). Experimentation and fine-tuning of these parameters are essential for achieving optimal model performance tailored to your specific task and dataset.


================================================
FILE: OCR_Multimodal_Search/finetune/trainer.py
================================================

import torch
import torch.nn as nn
import deepspeed
from transformers import Trainer
from transformers.trainer_pt_utils import nested_detach
from transformers.utils import is_sagemaker_mp_enabled
from transformers.trainer import *
from transformers.integrations import is_deepspeed_zero3_enabled
import torch.nn.functional as F
import logging

# Create a logger
logger = logging.getLogger('my_logger')
logger.setLevel(logging.DEBUG)  # set the logger's level

# Create a handler that writes to a log file
fh = logging.FileHandler('app.log')
fh.setLevel(logging.DEBUG)

# Create another handler that prints to the console
ch = logging.StreamHandler()
ch.setLevel(logging.ERROR)

# Define the output format for both handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)

# Attach the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)

# Log test messages
logger.info('This is an info message')
logger.error('This is an error message')


class CPMTrainer(Trainer):
    def original_loss(self, model, inputs, return_outputs=False):
        if "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        
        if not self.args.use_lora:
            outputs = self.model(data = inputs, use_cache=False)
        else:
            with self.model._enable_peft_forward_hooks(**inputs):
                outputs = self.model.base_model(data = inputs, use_cache=False)
                
        if labels is not None:
            # Flatten the tokens
            loss_fct = nn.CrossEntropyLoss()
            logits = outputs.logits.view(-1,
                                         self.model.config.vocab_size).contiguous()
            labels = labels.view(-1).long().contiguous()
            # Enable model parallelism
            labels = labels.to(logits.device)
            loss = loss_fct(logits, labels)
        else:
            if isinstance(outputs, dict) and "loss" not in outputs:
                raise ValueError(
                    "The model did not return a loss from the inputs, only the following keys: "
                    f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
                )
            # We don't use .loss here since the model may return tuples instead of ModelOutput.
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss
    def coloss(self, query_embeddings, doc_embeddings):
        """
        query_embeddings: (batch_size, num_query_tokens, dim)
        doc_embeddings: (batch_size, num_doc_tokens, dim)

        Positive scores are the diagonal of the scores matrix.
        """

        # Compute the ColBERT scores
        scores = (
            torch.einsum("bnd,csd->bcns", query_embeddings, doc_embeddings).max(dim=3)[0].sum(dim=2)
        )  # (batch_size, batch_size)

        # Positive scores are the diagonal of the scores matrix.
        pos_scores = scores.diagonal()  # (batch_size,)

        # Negative score for a given query is the maximum of the scores against all other pages.
        # NOTE: We exclude the diagonal by setting it to a very low value: since we know the maximum score is 1,
        # we can subtract 1 from the diagonal to exclude it from the maximum operation.
        neg_scores = scores - torch.eye(scores.shape[0], device=scores.device) * 1e6  # (batch_size, batch_size)
        neg_scores = neg_scores.max(dim=1)[0]  # (batch_size,)

        # Compute the loss
        # The loss is computed as the negative log of the softmax of the positive scores
        # relative to the negative scores.
        # This can be simplified to log-sum-exp of negative scores minus the positive score
        # for numerical stability.
        # torch.vstack((pos_scores, neg_scores)).T.softmax(1)[:, 0].log()*(-1)
        loss = F.softplus(neg_scores - pos_scores).mean()

        return loss
    def compute_loss(self, model, inputs):
        """
        Contrastive ColBERT-style loss: embed the queries and documents,
        score every query against every document, and treat the diagonal
        (matching query/document pairs) as positives.
        """
        if not self.args.use_lora:
            outputs = self.model(data = inputs['query_ids'], use_cache=False)
            query_embeddings=outputs.float()
            doc_embeddings = self.model(data = inputs, use_cache=False).float()
        else:
            with self.model._enable_peft_forward_hooks(**inputs):
                outputs = self.model.base_model(data = inputs['query_ids'], use_cache=False)
                query_embeddings=outputs.half()
                self.model.text_proj = self.model.text_proj.half()
                logger.info(f"query_embeddings.dtype:{query_embeddings.dtype}")
                # logger.info(f"doc_embeddings.dtype:{doc_embeddings.dtype}")
                query_embeddings=self.model.text_proj(query_embeddings)
                doc_embeddings = self.model.base_model(data = inputs, use_cache=False).half()
                doc_embeddings=self.model.text_proj(doc_embeddings)
        logger.info(f"outputs.shape:{outputs.shape}")
        logger.info(f"query_embeddings_shape:{query_embeddings.shape}")
        logger.info(f"doc_embeddings_shape:{doc_embeddings.shape}")

        # Compute the ColBERT scores
        # (unnormalized variant, kept for reference)
        # scores = (
        #     torch.einsum("bnd,csd->bcns", query_embeddings, doc_embeddings).max(dim=3)[0].sum(dim=2)
        # )  # (batch_size, batch_size)

        # # positive scores
        # pos_scores = scores.diagonal()  # (batch_size,)

        # negative scores
        # neg_scores = scores - torch.eye(scores.shape[0], device=scores.device) * 1e6  # (batch_size, batch_size)
        # neg_scores = neg_scores.max(dim=1)[0]  # (batch_size,)

        # L2-normalize query_embeddings and doc_embeddings before scoring
        query_embeddings_normalized = F.normalize(query_embeddings, p=2, dim=-1)
        doc_embeddings_normalized = F.normalize(doc_embeddings, p=2, dim=-1)

        # Recompute the scores on the normalized embeddings
        scores_normalized = (
            torch.einsum("bnd,csd->bcns", query_embeddings_normalized, doc_embeddings_normalized).max(dim=3)[0].sum(dim=2)
        )  # (batch_size, batch_size)

        # positive scores
        pos_scores_normalized = scores_normalized.diagonal()  # (batch_size,)

        # negative scores: mask the diagonal before taking the row-wise max
        neg_scores_normalized = scores_normalized - torch.eye(scores_normalized.shape[0], device=scores_normalized.device) * 1e6  # (batch_size, batch_size)
        neg_scores_normalized = neg_scores_normalized.max(dim=1)[0]  # (batch_size,)

        # Compute the loss with the numerically stable softplus form
        loss = F.softplus(neg_scores_normalized - pos_scores_normalized).mean()

        return loss

    def prediction_step(
        self,
        model: nn.Module,
        inputs: Dict[str, Union[torch.Tensor, Any]],
        prediction_loss_only: bool,
        ignore_keys: Optional[List[str]] = None,
    ) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]:
        """
        Perform an evaluation step on `model` using `inputs`.

        Subclass and override to inject custom behavior.

        Args:
            model (`nn.Module`):
                The model to evaluate.
            inputs (`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.

                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument `labels`. Check your model's documentation for all accepted arguments.
            prediction_loss_only (`bool`):
                Whether or not to return the loss only.
            ignore_keys (`List[str]`, *optional*):
                A list of keys in the output of your model (if it is a dictionary) that should be ignored when
                gathering predictions.

        Return:
            Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss,
            logits and labels (each being optional).
        """
        has_labels = (
            False
            if len(self.label_names) == 0
            else all(inputs.get(k) is not None for k in self.label_names)
        )
        # For CLIP-like models capable of returning loss values.
        # If `return_loss` is not specified or being `None` in `inputs`, we check if the default value of `return_loss`
        # is `True` in `model.forward`.
        return_loss = inputs.get("return_loss", None)
        if return_loss is None:
            return_loss = self.can_return_loss
        loss_without_labels = (
            True if len(self.label_names) == 0 and return_loss else False
        )

        inputs = self._prepare_inputs(inputs)
        if ignore_keys is None:
            if hasattr(self.model, "config"):
                ignore_keys = getattr(
                    self.model.config, "keys_to_ignore_at_inference", []
                )
            else:
                ignore_keys = []

        # labels may be popped when computing the loss (label smoothing for instance) so we grab them first.
        if has_labels or loss_without_labels:
            labels = nested_detach(tuple(inputs.get(name)
                                   for name in self.label_names))
            if len(labels) == 1:
                labels = labels[0]
        else:
            labels = None

        with torch.no_grad():
            if is_sagemaker_mp_enabled():
                raw_outputs = smp_forward_only(model, inputs)
                if has_labels or loss_without_labels:
                    if isinstance(raw_outputs, dict):
                        loss_mb = raw_outputs["loss"]
                        logits_mb = tuple(
                            v
                            for k, v in raw_outputs.items()
                            if k not in ignore_keys + ["loss"]
                        )
                    else:
                        loss_mb = raw_outputs[0]
                        logits_mb = raw_outputs[1:]

                    loss = loss_mb.reduce_mean().detach().cpu()
                    logits = smp_nested_concat(logits_mb)
                else:
                    loss = None
                    if isinstance(raw_outputs, dict):
                        logits_mb = tuple(
                            v for k, v in raw_outputs.items() if k not in ignore_keys
                        )
                    else:
                        logits_mb = raw_outputs
                    logits = smp_nested_concat(logits_mb)
            else:
                if has_labels or loss_without_labels:
                    with self.compute_loss_context_manager():
                        loss = self.compute_loss(model, inputs)
                    loss = loss.mean().detach()
                # This trainer only computes the contrastive loss during
                # evaluation, so no logits are produced in this branch.
                logits = None

        if prediction_loss_only or logits is None:
            return (loss, None, None)

        logits = nested_detach(logits)
        if len(logits) == 1:
            logits = logits[0]

        return (loss, logits, labels)
        
    def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
        """
        Perform a training step on a batch of inputs.

        Subclass and override to inject custom behavior.

        Args:
            model (`nn.Module`):
                The model to train.
            inputs (`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.

                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument `labels`. Check your model's documentation for all accepted arguments.

        Return:
            `torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
        inputs = self._prepare_inputs(inputs)

        if is_sagemaker_mp_enabled():
            loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps)
            return loss_mb.reduce_mean().detach().to(self.args.device)

        with self.compute_loss_context_manager():
            loss = self.compute_loss(model, inputs)

        del inputs
        torch.cuda.empty_cache()

        if self.args.n_gpu > 1:
            loss = loss.mean()  # mean() to average on multi-gpu parallel training

        if self.use_apex:
            with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            self.accelerator.backward(loss)

        return loss.detach() / self.args.gradient_accumulation_steps
    
    def _save(self, output_dir: Optional[str] = None, state_dict=None):
        # If we are executing this function, we are the process zero, so we don't check for that.
        output_dir = output_dir if output_dir is not None else self.args.output_dir
        os.makedirs(output_dir, exist_ok=True)
        logger.info(f"Saving model checkpoint to {output_dir}")
        text_proj = self.model.text_proj
        torch.save(text_proj.state_dict(), os.path.join(output_dir,'text_proj.pth'))

        supported_classes = (PreTrainedModel,) if not is_peft_available() else (PreTrainedModel, PeftModel)
        # Save a trained model and configuration using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        if not isinstance(self.model, supported_classes):
            if state_dict is None:
                state_dict = self.model.state_dict()

            if isinstance(unwrap_model(self.model), supported_classes):
                unwrap_model(self.model).save_pretrained(
                    output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors
                )
            else:
                logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
                if self.args.save_safetensors:
                    safetensors.torch.save_file(
                        state_dict, os.path.join(output_dir, SAFE_WEIGHTS_NAME), metadata={"format": "pt"}
                    )
                else:
                    torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
        else:
            
            self.model.save_pretrained(
                output_dir, state_dict=state_dict, safe_serialization=self.args.save_safetensors
            )

        if self.tokenizer is not None:
            self.tokenizer.save_pretrained(output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))


================================================
FILE: OCR_Multimodal_Search/infer/app.py
================================================
import os

import gradio as gr
import torch
from pdf2image import convert_from_path
from PIL import Image
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from transformers import AutoModel,AutoTokenizer
from peft import PeftModel
from dataset import ImageDataset,QueryDataset,load_from_pdf,data_collator,data_collator_query,load_from_json
from utils import build_transform,evaluate_colbert
import json


def search(query: str, ds, images):
    qs = []
    queries_dataset=QueryDataset([query],tokenizer)
    dataloader = DataLoader(
        queries_dataset,
        batch_size=1,
        shuffle=False,
        collate_fn=lambda x: data_collator_query(x),
    )
    for batch_query in tqdm(dataloader):
        with torch.no_grad():
            batch_query=batch_query.to("cuda")
            embeddings_query = model.base_model(data = batch_query, use_cache=False).half()
            embeddings_query = model.text_proj(embeddings_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # run evaluation
    scores = evaluate_colbert(qs, ds)
    best_page = int(scores.argmax(axis=1).item())
    return f"The most relevant page is {best_page}", images[best_page]


def index(file, ds):
    """Example script to run inference with ColPali"""
    images = []
    for f in file:
        if f.endswith(".json"):
            with open(f, 'r', encoding='utf-8') as fi:
                data = json.load(fi)
            images.extend([d['image'] for d in data])
        if f.endswith(".pdf"):
            images.extend(convert_from_path(f))
        if f.endswith('png') or f.endswith('jpg'):
            images.append(f)
    if hasattr(model.config, "slice_config"):
        slice_config = model.config.slice_config.to_dict()
    else:
        slice_config = model.config.to_dict()
    transform_func = build_transform()
    images_dataset=ImageDataset(images,transform_func,tokenizer,slice_config)
    image_dataloader = DataLoader(
        images_dataset,
        batch_size=16,
        shuffle=False,
        collate_fn=lambda x: data_collator(x,padding_value=0, max_length=2048,device=device),
    )
    ds = []
    for batch_doc in tqdm(image_dataloader):
        with torch.no_grad():
            embeddings_doc = model.base_model(data = batch_doc, use_cache=False).half()
            embeddings_doc = model.text_proj(embeddings_doc)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
    del embeddings_doc
    return f"Uploaded and converted {len(images)} pages", ds, images


COLORS = ["#4285f4", "#db4437", "#f4b400", "#0f9d58", "#e48ef1"]
# Load model
# Load model and lora
model_name = "/root/ld/ld_model_pretrained/minicpm-v"
lora_path = "/root/ld/ld_model_pretrained/adapter_minicpmv_search"
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="cuda",trust_remote_code=True).eval()
model = PeftModel.from_pretrained(model, lora_path)
tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True)
# Load the saved text_proj weights
text_proj_weights = torch.load(os.path.join(lora_path,"text_proj.pth"))

# Get the text_proj layer
text_proj_layer = model.text_proj

# Load the saved weights into the text_proj layer
text_proj_layer.load_state_dict(text_proj_weights)

device = model.device
mock_image = Image.new("RGB", (448, 448), (255, 255, 255))

with gr.Blocks() as demo:
    gr.Markdown("# ColCPM-V: Cross-modal HD retrieval ")
    gr.Markdown("## 1️⃣ Upload PDFs")
    file = gr.File(file_types=["pdf","json",'jpg','png'], file_count="multiple")

    gr.Markdown("## 2️⃣ Convert the PDFs and upload")
    convert_button = gr.Button("🔄 Convert and upload")
    message = gr.Textbox("Files not yet uploaded")
    embeds = gr.State(value=[])
    imgs = gr.State(value=[])

    # Define the actions
    convert_button.click(index, inputs=[file, embeds], outputs=[message, embeds, imgs])

    gr.Markdown("## 3️⃣ Search")
    query = gr.Textbox(placeholder="Enter your query here")
    search_button = gr.Button("🔍 Search")
    message2 = gr.Textbox("Query not yet set")
    output_img = gr.Image()

    search_button.click(search, inputs=[query, embeds, imgs], outputs=[message2, output_img])


if __name__ == "__main__":
    demo.queue(max_size=10).launch(debug=True)


================================================
FILE: OCR_Multimodal_Search/infer/cli_demo.py
================================================
import os

import torch
from pdf2image import convert_from_path
from PIL import Image
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoModel,AutoTokenizer
from peft import PeftModel
from dataset import ImageDataset,QueryDataset,data_collator,data_collator_query
from utils import build_transform,evaluate_colbert
import json


def search(query: str, ds, images):
    qs = []
    queries_dataset=QueryDataset([query],tokenizer)
    dataloader = DataLoader(
        queries_dataset,
        batch_size=1,
        shuffle=False,
        collate_fn=lambda x: data_collator_query(x),
    )
    for batch_query in tqdm(dataloader):
        with torch.no_grad():
            batch_query=batch_query.to("cuda")
            embeddings_query = model.base_model(data = batch_query, use_cache=False).half()
            embeddings_query = model.text_proj(embeddings_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # run evaluation
    scores = evaluate_colbert(qs, ds)
    best_page = int(scores.argmax(axis=1).item())
    return best_page


def index(file):
    """Example script to run inference with ColPali"""
    images = []
    for f in file:
        if f.endswith(".json"):
            with open(f, 'r', encoding='utf-8') as fi:
                data = json.load(fi)
            images.extend([d['image'] for d in data])
        if f.endswith(".pdf"):
            images.extend(convert_from_path(f))
        if f.endswith('png') or f.endswith('jpg'):
            images.append(f)
    if hasattr(model.config, "slice_config"):
        slice_config = model.config.slice_config.to_dict()
    else:
        slice_config = model.config.to_dict()
    transform_func = build_transform()
    images_dataset=ImageDataset(images,transform_func,tokenizer,slice_config)
    image_dataloader = DataLoader(
        images_dataset,
        batch_size=16,
        shuffle=False,
        collate_fn=lambda x: data_collator(x,padding_value=0, max_length=2048,device=device),
    )
    ds = []
    for batch_doc in tqdm(image_dataloader):
        with torch.no_grad():
            embeddings_doc = model.base_model(data = batch_doc, use_cache=False).half()
            embeddings_doc = model.text_proj(embeddings_doc)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
    del embeddings_doc
    return  ds, images


COLORS = ["#4285f4", "#db4437", "#f4b400", "#0f9d58", "#e48ef1"]
# Load model
# Load model and lora
model_name = "/root/ld/ld_model_pretrained/minicpm-v"
lora_path = "/root/ld/ld_model_pretrained/adapter_minicpmv_search"
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="cuda",trust_remote_code=True).eval()
model = PeftModel.from_pretrained(model, lora_path)
tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True)
# Load the saved text_proj weights
text_proj_weights = torch.load(os.path.join(lora_path,"text_proj.pth"))

# Get the text_proj layer
text_proj_layer = model.text_proj

# Load the saved weights into the text_proj layer
text_proj_layer.load_state_dict(text_proj_weights)

device = model.device
mock_image = Image.new("RGB", (448, 448), (255, 255, 255))
query = "我该怎么选择红外测温模块"
file_path = "/root/ld/ld_project/pull_request/MiniCPM_Series_Tutorial/OCR_Multimodal_Search/infer/eval_image"
file_name=os.listdir(file_path)
file_paths=[os.path.join(file_path,file) for file in file_name if file.endswith('jpg')]
ds, images = index(file_paths)
best_page=search(query,ds,images)
print(file_paths[best_page])



================================================
FILE: OCR_Multimodal_Search/infer/dataset.py
================================================
import copy
import json
import logging
import math
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import torch.nn.functional as F
from gradio.utils import NamedString
import numpy as np
import torch
from PIL import Image
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset
from transformers import AutoProcessor, AutoTokenizer

llama3_chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}"

class ImageDataset(Dataset):
    """Dataset for supervised fine-tuning."""
    def __init__(
        self,
        raw_path,
        transform,
        tokenizer,
        slice_config,
        llm_type="minicpm",
        patch_size=14,
        query_nums=64,
        batch_vision=False,
    ):
        super(ImageDataset, self).__init__()
        self.raw_path = raw_path
        self.tokenizer = tokenizer
        self.transform = transform
        self.slice_config = slice_config
        self.llm_type = llm_type
        self.patch_size = patch_size
        self.query_nums=query_nums
        self.batch_vision = batch_vision

    def __len__(self):
        return len(self.raw_path)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        if isinstance(self.raw_path[i], (str, NamedString)):
            image = Image.open(self.raw_path[i]).convert("RGB")
        else:
            image = self.raw_path[i]
        ret = preprocess(
            image,
            self.raw_path[i],
            self.tokenizer,
            self.transform,
            query_nums=self.query_nums,
            slice_config=self.slice_config,
            llm_type=self.llm_type,
            patch_size=self.patch_size,
            batch_vision=self.batch_vision,
        )
        ret = dict(
            input_ids=ret["input_ids"],
            position_ids=ret["position_ids"],
            labels=ret["target"],
            attention_mask=torch.ones_like(ret["input_ids"], dtype=torch.bool),
            pixel_values=ret["pixel_values"],
            tgt_sizes=ret["tgt_sizes"],
            image_bound=ret["image_bound"],
        )

        return ret
class QueryDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(
        self,
        raw_query,
        tokenizer,
        llm_type="minicpm",

    ):
        super(QueryDataset, self).__init__()
        self.raw_query = raw_query
        self.tokenizer = tokenizer
        self.llm_type = llm_type
    def __len__(self):
        return len(self.raw_query)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        query = self.raw_query[i]
        query_ids = conversation_to_ids_minicpm(conversation=[{
                "content": query,
                "role": "user"
            },{
                "content": "",
                "role": "assistant"
            }], tokenizer=self.tokenizer)[0]
        if "cpm" in self.llm_type:
            query_ids = [item for sublist in query_ids for item in sublist]
        ret = torch.tensor(query_ids)
        return ret

def data_collator_query(examples, padding_value=0, max_length=2048):
    def trim_and_pad(seq, batch_first, padding_value):
        return pad_sequence([s[:max_length] for s in seq], batch_first=batch_first, padding_value=padding_value)
    query_ids = trim_and_pad(
        list(examples),
        batch_first=True,
        padding_value=padding_value,
    )
    return query_ids

def data_collator(examples, padding_value=0, max_length=2048,device='cpu'):
    def trim_and_pad(seq, batch_first, padding_value):
        return pad_sequence([s[:max_length] for s in seq], batch_first=batch_first, padding_value=padding_value)
    input_ids = trim_and_pad(
        [example["input_ids"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    ).to(device)
    position_ids = trim_and_pad(
        [example["position_ids"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    ).to(device)
    targets = trim_and_pad(
        [example["labels"] for example in examples],
        batch_first=True,
        padding_value=-100,
    ).to(device)
    attention_mask = trim_and_pad(
        [example["attention_mask"] for example in examples],
        batch_first=True,
        padding_value=padding_value,
    ).to(device)
    pixel_values = [example["pixel_values"] for example in examples]
    image_bound = [example["image_bound"] for example in examples]
    tgt_sizes = [example["tgt_sizes"] for example in examples]
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "labels": targets,
        "attention_mask": attention_mask,
        "pixel_values": pixel_values,
        "image_bound": image_bound,
        "tgt_sizes": tgt_sizes,
    }

def load_from_pdf(pdf_path: str):
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path)
    return images
def load_from_json(json_path: str):
    with open(json_path, 'r') as file:
        data = json.load(file)
    images_path = [ i["image"]for i in data]
    queries = [i["query"] for i in data]
    return images_path,queries
def conversation_to_ids(conversation, tokenizer, llm_type=None, new_schema=False):
    """
    for single image multi-turn conversation
    conversation: [{'role': 'user', 'content': 'Describe this image'},
                   {'role': 'assistant', 'content': 'This is a cat.'}]
    """
    if llm_type == "llama3":
        input_ids, context, raw_msg = conversation_to_ids_llama3(
            conversation, tokenizer
        )
    elif llm_type == "qwen2":
        input_ids, context, raw_msg = conversation_to_ids_qwen2(
            conversation, tokenizer
        )
    else:
        input_ids, context, raw_msg = conversation_to_ids_minicpm(
            conversation, tokenizer
        )

    ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))
    context = torch.from_numpy(np.hstack(context, dtype=np.int8))

    # build target
    target = torch.full_like(ids, -100, dtype=torch.int32)
    
    for i in range(1, len(ids)):
        if context[i] == 0:
            target[i - 1] = ids[i]
        if context[i] == 1 and context[i - 1] == 0:
            if hasattr(tokenizer, "eot_id"):
                target[i - 1] = tokenizer.eot_id
            else:
                target[i - 1] = tokenizer.eos_id
    
    # build image bound
    if new_schema:
        start_cond = (ids == tokenizer.im_start_id) | (ids == tokenizer.slice_start_id)
        end_cond = (ids == tokenizer.im_end_id) | (ids == tokenizer.slice_end_id)
        image_start_tokens = torch.where(start_cond)[0]
        image_start_tokens += 1
        image_end_tokens = torch.where(end_cond)[0]
    else:
        image_start_tokens = torch.where(ids == tokenizer.im_start_id)[0]
        image_start_tokens += 1
        image_end_tokens = torch.where(ids == tokenizer.im_end_id)[0]
    if len(image_start_tokens) != len(image_end_tokens):
        print("image start token != image end tokens")
    
    if len(image_start_tokens) > 0:
        image_bound = torch.hstack(
            [image_start_tokens.unsqueeze(-1), image_end_tokens.unsqueeze(-1)]
        )
    else:
        image_bound = []

    position_ids = torch.arange(ids.size(0)).long()
    return {
        "input_ids": ids,
        "target": target,
        "image_bound": image_bound,
        "raw_msg": raw_msg,
        "position_ids": position_ids
    }


def conversation_to_ids_minicpm(conversation, tokenizer):
    raw_msg = ""
    input_ids = []
    context = []
    for idx, msg in enumerate(conversation):
        role = msg["role"]
        message = msg["content"]
        assert role in ["user", "assistant"]
        if role == "user":
            prefix = "<用户>"
        else:
            prefix = "<AI>"
        # append eos
        if idx == len(conversation) - 1:
            message = message + tokenizer.eos_token
        prefix_ids = tokenizer.encode(prefix)[1:]  # remove bos
        message_ids = tokenizer.encode(message)[1:]

        input_ids.append(prefix_ids)
        input_ids.append(message_ids)

        context.append(np.ones((len(prefix_ids),), dtype=np.int8))
        if role == "assistant":
            context.append(np.zeros((len(message_ids),), dtype=np.int8))
        else:
            context.append(np.ones((len(message_ids),), dtype=np.int8))

        raw_msg += prefix + message

    return input_ids, context, raw_msg


def conversation_to_ids_llama3(conversation, tokenizer):
    raw_msg = ""
    input_ids = []
    context = []
    raw_msg = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=False, chat_template=llama3_chat_template,
    )
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, add_generation_prompt=False, chat_template=llama3_chat_template,
    )
    input_ids = np.array(input_ids)

    start_header_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|start_header_id|>")
    )[0]
    assistant_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("assistant")
    )[0]
    end_header_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|end_header_id|>")
    )[0]
    eot_idxs = np.where(
        input_ids == tokenizer.convert_tokens_to_ids("<|eot_id|>"))[0]

    context = np.ones_like(input_ids, dtype=np.int8)

    for assistant_idx in assistant_idxs:
        if assistant_idx in set((start_header_idxs + end_header_idxs) / 2):
            st = assistant_idx + 3  # assistant<|end_header_id|>\n\n
            for eot_idx in eot_idxs:
                if eot_idx > st:
                    context[st: eot_idx + 1] = 0
                    break

    input_ids = np.hstack(input_ids)
    context = np.hstack(context)

    return input_ids, context, raw_msg


def conversation_to_ids_qwen2(conversation, tokenizer):
    raw_msg = ""
    chat = []
    context = []
    for idx, msg in enumerate(conversation):
        role = msg["role"]
        message = msg["content"]
        assert role in ["user", "assistant"]
        if role == "user":
            prefix = "user"
        else:
            prefix = "assistant"
        chat.append({"role":prefix, "content":message})
        raw_msg += prefix + message
    #assert set([i['role'] for i in chat]) & set(['assistant'])

    ret = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=False)
    input_ids = np.array(input_ids)

    start_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_start|>'))[0]
    assistant_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('assistant'))[0]
    end_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_end|>'))[0]

    context = np.ones_like(input_ids, dtype=np.int8)

    for assistant_idx in assistant_idxs:
        if assistant_idx-1 in set(start_idxs):
            st = assistant_idx + 1
            for end_idx in end_idxs:
                if end_idx > st:
                    context[st: end_idx + 1] = 0
                    break
                    
    input_ids = np.hstack(input_ids)
    context = np.hstack(context)
    return input_ids, context, raw_msg



def preprocess(
    image,
    conversation,
    tokenizer,
    transform,
    query_nums=64,
    slice_config=None,
    llm_type=None,
    patch_size=14,
    batch_vision=False,
):
    """
    single image preprocess, the image will be placed at the top of the conversation
    """
    conversation = [
            {
                "content": "<image>\n对这张图片进行准确的ocr",
                "role": "user"
            },
            {
                "content": "",
                "role": "assistant"
            }
        ]
    assert len(conversation) > 1, "conversation must contain at least two turns"
    assert conversation[0]["role"] == "user", "the first role must be user"

    if slice_config is not None:
        assert isinstance(slice_config, Dict)
        assert "patch_size" in slice_config
        assert "max_slice_nums" in slice_config
        assert "scale_resolution" in slice_config
    default_image_placeholder = (
        tokenizer.im_start + tokenizer.unk_token * query_nums + tokenizer.im_end
    )
    new_schema = False
    use_image_id = False
    if llm_type=='qwen2':
        new_schema = True
        use_image_id = True
    if slice_config:
        images = []
        image_id_cnt = 0 
        source_image, patches, best_grid = slice_image(
            image,
            slice_config["max_slice_nums"],
            slice_config["scale_resolution"],
            slice_config["patch_size"],
        )
        images.append(source_image)
        image_placeholder = default_image_placeholder
        if len(patches) > 0:
            for i in range(len(patches)):
                for j in range(len(patches[0])):
                    images.append(patches[i][j])
            if use_image_id:
                image_placeholder = f'{tokenizer.im_id_start}{image_id_cnt}{tokenizer.im_id_end}' + image_placeholder
                image_id_cnt += 1
            image_placeholder += get_grid_placeholder(
                tokenizer, best_grid, query_nums, new_schema = new_schema)
        images = [transform(i) for i in images]
    else:
        images = [transform(image)]
        image_placeholder = default_image_placeholder
    if "<image>" in conversation[0]["content"]:
        conversation[0]["content"] = conversation[0]["content"].replace(
            "<image>", image_placeholder
        )
    else:
        conversation[0]["content"] = (
            image_placeholder + "\n" + conversation[0]["content"]
        )

    input_dict = conversation_to_ids(conversation, tokenizer, llm_type, new_schema)

    if batch_vision:
        tgt_sizes = []
        reshape_images = []
        for image in images:
            H, W = image.shape[1:]
            reshape_image = reshape_by_patch(image, patch_size)
            reshape_images.append(reshape_image)
            tgt_sizes.append([H // patch_size, W // patch_size])
        if tgt_sizes:
            tgt_sizes = torch.Tensor(tgt_sizes).type(torch.int32)

        input_dict["pixel_values"] = reshape_images
        input_dict["tgt_sizes"] = tgt_sizes

    else:
        input_dict["pixel_values"] = images
        input_dict["tgt_sizes"] = []

    return input_dict
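
# Minimal usage sketch (illustrative; the file name, transform helper and the
# 448/14 values are assumptions, not values pulled from the repo's configs):
#   image = Image.open("page.jpg").convert("RGB")
#   slice_config = {"max_slice_nums": 9, "scale_resolution": 448, "patch_size": 14}
#   inputs = preprocess(image, conversation=[], tokenizer=tokenizer,
#                       transform=build_transform(), slice_config=slice_config,
#                       batch_vision=True)
#   # this variant overrides `conversation` with its fixed OCR prompt; `inputs`
#   # then holds the token ids, the loss mask, pixel_values and tgt_sizes.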


def slice_image(
    image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False
):
    original_size = image.size
    original_width, original_height = original_size
    log_ratio = math.log(original_width / original_height)
    ratio = original_width * original_height / \
        (scale_resolution * scale_resolution)
    multiple = min(math.ceil(ratio), max_slice_nums)

    source_image = None
    best_grid = None
    patches = []

    if multiple <= 1 or never_split:
        # no need to slice; just resize to the best size (upsampling allowed)
        best_size = find_best_resize(
            original_size, scale_resolution, patch_size, allow_upscale=True
        )
        source_image = image.resize(best_size, Image.Resampling.BICUBIC)
    else:
        candidate_split_grids_nums = []
        for i in [multiple - 1, multiple, multiple + 1]:
            if i == 1 or i > max_slice_nums:
                continue
            candidate_split_grids_nums.append(i)

        # source image, down-sampling and ensure divided by patch_size
        best_resize = find_best_resize(
            original_size, scale_resolution, patch_size)
        source_image = image.copy().resize(best_resize, Image.Resampling.BICUBIC)
        candidate_grids = []

        # find best grid
        for split_grids_nums in candidate_split_grids_nums:
            m = 1
            while m <= split_grids_nums:
                if split_grids_nums % m == 0:
                    candidate_grids.append([m, split_grids_nums // m])
                m += 1

        best_grid = [1, 1]
        min_error = float("inf")
        for grid in candidate_grids:
            error = abs(log_ratio - math.log(grid[0] / grid[1]))
            if error < min_error:
                best_grid = grid
                min_error = error

        refine_size = get_refine_size(
            original_size, best_grid, scale_resolution, patch_size, allow_upscale=True
        )

        refine_image = image.resize(refine_size, Image.Resampling.BICUBIC)
        patches = split_to_patches(refine_image, best_grid)

    return source_image, patches, best_grid
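
# Worked example (illustrative numbers, not from the repo): a 1024x768 image
# with scale_resolution=448 gives ratio = 1024*768 / 448^2 ~= 3.92, so
# multiple = 4 and the candidate split counts are {3, 4, 5}.  Among the factor
# grids ([1,3],[3,1],[1,4],[2,2],[4,1],[1,5],[5,1]) the aspect-ratio error
# |log(w/h) - log(cols/rows)| is smallest for [2,2], so the refined image is
# cut into a 2x2 grid of patches and returned alongside the downsized source.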


def ensure_divide(length, patch_size):
    return max(round(length / patch_size) * patch_size, patch_size)


def find_best_resize(original_size, scale_resolution, patch_size, allow_upscale=False):
    width, height = original_size
    if (width * height > scale_resolution * scale_resolution) or allow_upscale:
        r = width / height
        height = int(scale_resolution / math.sqrt(r))
        width = int(height * r)
    best_width = ensure_divide(width, patch_size)
    best_height = ensure_divide(height, patch_size)
    return (best_width, best_height)
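
# Example (illustrative): original_size=(1000, 800) with scale_resolution=448
# and patch_size=14 -> r=1.25, height=int(448/sqrt(1.25))=400,
# width=int(400*1.25)=500; ensure_divide then rounds both to multiples of 14,
# returning (504, 406).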


def get_refine_size(
    original_size, grid, scale_resolution, patch_size, allow_upscale=False
):
    width, height = original_size
    grid_x, grid_y = grid

    refine_width = ensure_divide(width, grid_x)
    refine_height = ensure_divide(height, grid_y)

    grid_width = refine_width / grid_x
    grid_height = refine_height / grid_y

    best_grid_size = find_best_resize(
        (grid_width, grid_height),
        scale_resolution,
        patch_size,
        allow_upscale=allow_upscale,
    )

    refine_size = (best_grid_size[0] * grid_x, best_grid_size[1] * grid_y)

    return refine_size


def split_to_patches(image, grid):
    patches = []
    width, height = image.size
    grid_x = int(width / grid[0])
    grid_y = int(height / grid[1])

    for i in range(0, height, grid_y):
        images = []
        for j in range(0, width, grid_x):
            box = (j, i, j + grid_x, i + grid_y)
            patch = image.crop(box)
            images.append(patch)
        patches.append(images)

    return patches


def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
    image_placeholder = (
        tokenizer.im_start + tokenizer.unk_token * query_num + tokenizer.im_end
    )

    cols = grid[0]
    rows = grid[1]
    slices = []
    for i in range(rows):
        lines = []
        for j in range(cols):
            lines.append(image_placeholder)
        slices.append("".join(lines))
    if new_schema:
        slice_placeholder = '\n'.join(slices)
    else:
        slice_placeholder = tokenizer.slice_start + \
        "\n".join(slices) + tokenizer.slice_end
    return slice_placeholder


def reshape_by_patch(image_tensor, patch_size):
    """
    :param image_tensor: shape [3, H, W]
    :param patch_size:
    :return: [3, patch_size, HW/patch_size]
    """
    patches = torch.nn.functional.unfold(
        image_tensor, (patch_size, patch_size), stride=(patch_size, patch_size)
    )

    patches = patches.reshape(image_tensor.size(0), patch_size, patch_size, -1)
    patches = patches.permute(0, 1, 3, 2).reshape(
        image_tensor.size(0), patch_size, -1)
    return patches
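
# Shape walk-through (illustrative): for image_tensor of shape [3, 28, 28] and
# patch_size=14, unfold yields [3*14*14, 4] = [588, 4] (four 14x14 patches);
# the reshape/permute steps then lay the patches side by side, returning a
# tensor of shape [3, 14, 56], i.e. [3, patch_size, H*W/patch_size].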


================================================
FILE: OCR_Multimodal_Search/infer/inference.py
================================================
import torch
import typer
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image
import sys,os,json
from transformers import AutoModel,AutoTokenizer
from peft import PeftModel
from utils import build_transform,evaluate_colbert
from dataset import ImageDataset,QueryDataset,load_from_pdf,data_collator,data_collator_query,load_from_json
def main() -> None:
    # Load model and lora
    model_name = "/root/ld/ld_model_pretrained/minicpm-v"
    lora_path = "/root/ld/ld_project/pull_request/MiniCPM-V/finetune/output/output__lora/checkpoint-2000"
    model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="cuda",trust_remote_code=True).eval()
    model = PeftModel.from_pretrained(model, lora_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True)
    # Load the saved text_proj weights
    text_proj_weights = torch.load(os.path.join(lora_path,"text_proj.pth"))

    # Get the text_proj layer
    text_proj_layer = model.text_proj

    # Update the text_proj layer's weights
    text_proj_layer.load_state_dict(text_proj_weights)

    # select images -> load_from_pdf(<pdf_path>),  load_from_image_urls(["<url_1>"]), load_from_dataset(<path>)
    
    images = load_from_json("/root/ld/ld_dataset/pdf_cn_30k_search_eval1.json")[0][:30]
    queries = load_from_json('/root/ld/ld_dataset/pdf_cn_30k_search_eval1.json')[1][:30]
    if hasattr(model.config, "slice_config"):
        slice_config = model.config.slice_config.to_dict()
    else:
        slice_config = model.config.to_dict()
    transform_func = build_transform()
    images_dataset=ImageDataset(images,transform_func,tokenizer,slice_config)
    queries_dataset=QueryDataset(queries,tokenizer)
    # run inference - docs
    image_dataloader = DataLoader(
        images_dataset,
        batch_size=16,
        shuffle=False,
        collate_fn=lambda x: data_collator(x),
    )
    ds = []
    for batch_doc in tqdm(image_dataloader):
        batch_doc = {k: (v.to("cuda") if isinstance(v, torch.Tensor) else v) for k, v in batch_doc.items()}
        with torch.no_grad():
            embeddings_doc = model.base_model(data = batch_doc, use_cache=False).half()
            embeddings_doc = model.text_proj(embeddings_doc)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
    del embeddings_doc
    # run inference - queries
    dataloader = DataLoader(
        queries_dataset,
        batch_size=32,
        shuffle=False,
        collate_fn=lambda x: data_collator_query(x),
    )

    qs = []
    for batch_query in tqdm(dataloader):
        with torch.no_grad():
            batch_query=batch_query.to("cuda")
            embeddings_query = model.base_model(data = batch_query, use_cache=False).half()
            embeddings_query = model.text_proj(embeddings_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # run evaluation
    scores = evaluate_colbert(qs, ds)
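    # scores has shape [num_queries, num_docs]; argmax over axis=1 gives, for
    # each query, the index of the highest-scoring document page.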
    print(scores)
    print(scores.argmax(axis=1))

if __name__ == "__main__":
    typer.run(main)


================================================
FILE: OCR_Multimodal_Search/infer/utils.py
================================================
from torchvision import transforms
import torch.nn.functional as F
import torch
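
# ColBERT-style late interaction (descriptive note, not repo text): the einsum
# "bnd,csd->bcns" below computes all query-token x doc-token dot products;
# .max(dim=3) takes MaxSim over document tokens and .sum(dim=2) sums over
# query tokens, yielding one relevance score per (query, document) pair.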
def evaluate_colbert(qs, ps, batch_size=128) -> torch.Tensor:
    scores = []
    for i in range(0, len(qs), batch_size):
        scores_batch = []
        qs_batch = torch.nn.utils.rnn.pad_sequence(qs[i : i + batch_size], batch_first=True, padding_value=0).to(
            "cuda"
        )
        qs_batch = F.normalize(qs_batch, p=2, dim=-1)
        for j in range(0, len(ps), batch_size):
            ps_batch = torch.nn.utils.rnn.pad_sequence(
                ps[j : j + batch_size], batch_first=True, padding_value=0
            ).to("cuda")
            ps_batch = F.normalize(ps_batch, p=2, dim=-1)
            scores_batch.append(torch.einsum("bnd,csd->bcns", qs_batch, ps_batch).max(dim=3)[0].sum(dim=2))
        scores_batch = torch.cat(scores_batch, dim=1).cpu()
        scores.append(scores_batch)
    scores = torch.cat(scores, dim=0)
    return scores


def build_transform():
    IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5) # timm.data.IMAGENET_INCEPTION_MEAN
    IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)  # timm.data.IMAGENET_INCEPTION_STD
    return transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD
                ),
            ]
        )

================================================
FILE: OCR_VG/chat.py
================================================
import os
import torch
import json
from PIL import Image
import base64
import io
from accelerate import load_checkpoint_and_dispatch, init_empty_weights
from transformers import AutoTokenizer, AutoModel
from omnilmm.utils import disable_torch_init
from omnilmm.model.omnilmm import OmniLMMForCausalLM
from omnilmm.model.utils import build_transform
from omnilmm.train.train_utils import omni_preprocess

DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"

    

def init_omni_lmm(model_path):
    torch.backends.cuda.matmul.allow_tf32 = True
    disable_torch_init()
    model_name = os.path.expanduser(model_path)
    print(f'Load omni_lmm model and tokenizer from {model_name}')
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, model_max_length=2048)

    use_multi_gpu = False  # set True to shard the model across two small-memory GPUs (e.g. Nvidia 3090 24G x2)
    if use_multi_gpu:
        with init_empty_weights():
            model = OmniLMMForCausalLM.from_pretrained(model_name, tune_clip=True, torch_dtype=torch.bfloat16)
        model = load_checkpoint_and_dispatch(model, model_name, dtype=torch.bfloat16, 
                    device_map="auto",  no_split_module_classes=['Eva','MistralDecoderLayer', 'ModuleList', 'Resampler']
        )
    else:
        model = OmniLMMForCausalLM.from_pretrained(
            model_name, tune_clip=True, torch_dtype=torch.bfloat16
        ).to(device='cuda', dtype=torch.bfloat16)

    image_processor = build_transform(
        is_train=False, input_size=model.model.config.image_size, std_mode='OPENAI_CLIP')

    mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
    assert mm_use_im_start_end

    tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN,
                         DEFAULT_IM_END_TOKEN], special_tokens=True)


    vision_config = model.model.vision_config
    vision_config.im_patch_token = tokenizer.convert_tokens_to_ids(
        [DEFAULT_IMAGE_PATCH_TOKEN])[0]
    vision_config.use_im_start_end = mm_use_im_start_end
    vision_config.im_start_token, vision_config.im_end_token = tokenizer.convert_tokens_to_ids(
        [DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN])
    image_token_len = model.model.config.num_query

    return model, image_processor, image_token_len, tokenizer

def expand_question_into_multimodal(question_text, image_token_len, im_st_token, im_ed_token, im_patch_token):
    if '<image>' in question_text[0]['content']:
        question_text[0]['content'] = question_text[0]['content'].replace(
            '<image>', im_st_token + im_patch_token * image_token_len + im_ed_token)
    else:
        question_text[0]['content'] = im_st_token + im_patch_token * \
            image_token_len + im_ed_token + '\n' + question_text[0]['content']
    return question_text
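
# Example (illustrative): with image_token_len=2 the user turn
# "<image>\nDescribe this" becomes
# "<im_start><im_patch><im_patch><im_end>\nDescribe this".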

def wrap_question_for_omni_lmm(question, image_token_len, tokenizer):
    question = expand_question_into_multimodal(
        question, image_token_len, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, DEFAULT_IMAGE_PATCH_TOKEN)

    conversation = question
    data_dict = omni_preprocess(sources=[conversation],
                                  tokenizer=tokenizer,
                                  generation=True)

    data_dict = dict(input_ids=data_dict["input_ids"][0],
                     labels=data_dict["labels"][0])
    return data_dict



class OmniLMM12B:
    def __init__(self, model_path) -> None:
        model, img_processor, image_token_len, tokenizer = init_omni_lmm(model_path)
        self.model = model
        self.image_token_len = image_token_len
        self.image_transform = img_processor
        self.tokenizer = tokenizer
        self.model.eval()

    def decode(self, image, input_ids):
        with torch.inference_mode():
            output = self.model.generate_vllm(
                input_ids=input_ids.unsqueeze(0).cuda(),
                images=image.unsqueeze(0).half().cuda(),
                temperature=0.6,
                max_new_tokens=1024,
                # num_beams=num_beams,
                do_sample=True,
                output_scores=True,
                return_dict_in_generate=True,
                repetition_penalty=1.1,
                top_k=30,
                top_p=0.9,
            )

            response = self.tokenizer.decode(
                output.sequences[0], skip_special_tokens=True)
            response = response.strip()
            return response

    def chat(self, input):
        try:
            image = Image.open(io.BytesIO(base64.b64decode(input['image']))).convert('RGB')
        except Exception as e:
            return "Image decode error"

        msgs = json.loads(input['question'])
        input_ids = wrap_question_for_omni_lmm(
            msgs, self.image_token_len, self.tokenizer)['input_ids']
        input_ids = torch.as_tensor(input_ids)
        #print('input_ids', input_ids)
        image = self.image_transform(image)

        out = self.decode(image, input_ids)

        return out
        

def img2base64(file_name):
    with open(file_name, 'rb') as f:
        encoded_string = base64.b64encode(f.read())
        return encoded_string
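
# Note: img2base64 returns the raw bytes from b64encode; base64.b64decode
# accepts bytes or str, so the value can be passed directly as input['image'].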

class MiniCPMV:
    def __init__(self, model_path) -> None:
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(dtype=torch.bfloat16)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.model.eval().cuda()

    def chat(self, input):
        try:
            image = Image.open(io.BytesIO(base64.b64decode(input['image']))).convert('RGB')
        except Exception as e:
            return "Image decode error"

        msgs = json.loads(input['question'])
        
        answer, context, _ = self.model.chat(
            image=image,
            msgs=msgs,
            context=None,
            tokenizer=self.tokenizer,
            sampling=True,
            temperature=0.7
        )
        return answer

class MiniCPMV2_5:
    def __init__(self, model_path) -> None:
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(dtype=torch.float16)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.model.eval().cuda()

    def chat(self, input):
        try:
            image = Image.open(io.BytesIO(base64.b64decode(input['image']))).convert('RGB')
        except Exception as e:
            return "Image decode error"

        msgs = json.loads(input['question'])
    
        answer = self.model.chat(
            image=image,
            msgs=msgs,
            tokenizer=self.tokenizer,
            sampling=False,
            temperature=0.7
        )
        return answer


class MiniCPMVChat:
    def __init__(self, model_path) -> None:
        if '12B' in model_path:
            self.model = OmniLMM12B(model_path)
        elif 'MiniCPM-Llama3-V' in model_path:
            self.model = MiniCPMV2_5(model_path)
        else:
            self.model = MiniCPMV(model_path)

    def chat(self, input):
        return self.model.chat(input)


if __name__ == '__main__':
    
    model_path = 'openbmb/OmniLMM-12B'
    chat_model = MiniCPMVChat(model_path)

    im_64 = img2base64('./assets/worldmap_ck.jpg')

    # first round chat 
    msgs = [{"role": "user", "content": "What is interesting about this image?"}]
    input = {"image": im_64, "question": json.dumps(msgs, ensure_ascii=True)}
    answer = chat_model.chat(input)
    print(msgs[-1]["content"]+'\n', answer)

    # second round chat 
    msgs.append({"role": "assistant", "content": answer})
    msgs.append({"role": "user", "content": "Where is China in the image"})
    input = {"image": im_64,"question": json.dumps(msgs, ensure_ascii=True)}
    answer = chat_model.chat(input)
    print(msgs[-1]["content"]+'\n', answer)



================================================
FILE: OCR_VG/data_demo/img_gt.json
================================================
{
    "data": {
        "/root/ld/ld_project/MIniCPM_Series_Tutorial/OCR_VG/data_demo/img/000001.jpg": {
            "gt": [
                {
                    "polygon": [
                        [
                            404.0, 33.0
                        ],
                        [
                            485.0,    33.0
                        ],
                        [
                            485.0,    103.0
                        ],
                        [
                            404.0, 103.0
                        ]
                    ],
                    "text": "猫"
                },
                {
                    "polygon": [
                        [
                            577.0,
                            121.0
                        ],
                        [
                            826.0,
                            121.0
                        ],
                        [
                            826.0,
                            222.0
                        ],
                        [
                            577.0,
                            222.0
                        ]
                    ],
                    "text": "博尔赫斯"
                },
                {
                    "polygon": [
                        [
                            211.0,
                            270.0
                        ],
                        [
                            739.0,
                            270.0
                        ],
                        [
                            739.0,
                            350.0
                        ],
                        [
                            211.0,
                            350.0
                        ]
                    ],
                    "text": "镜子没有这么更加沉默"
                },
                {
                    "polygon": [
                        [
                            210.0,
                            371.0
                        ],
                        [
                            885.0,
                            371.0
                        ],
                        [
                            885.0,
                            462.0
                        ],
                        [
                            210.0,
                            462.0
                        ]
                    ],
                    "text": "透进的曙光也不这么更为隐秘"
                },
                {
                    "polygon": [
                        [
                            206.0,
                            486.0
                        ],
                        [
                            881.0,
                            502.0
                        ],
                        [
                            879.0,
                            584.0
                        ],
                        [
                            211.0,
                            549.0
                        ]
                    ],
                    "text": "你在月光下豹子的模样"
                },
                {
                    "polygon": [
                        [
                            215.0,
                            571.0
                        ],
                        [
                            738.0,
                            589.0
                        ],
                        [
                            740.0,
                            642.0
                        ],
                        [
                            216.0,
                            637.0
                        ]
                    ],
                    "text": "只能让我们从远处窥视"
                },
                {
                    "polygon": [
                        [
                            215.0,
                            720.0
                        ],
                        [
                            734.0,
                            720.0
                        ],
                        [
                            734.0,
                            786.0
                        ],
                        [
                            215.0,
                            786.0
                        ]
                    ],
                    "text": "由于无法解释的神圣意旨"
                },
                {
                    "polygon": [
                        [
                            211.0,
                            805.0
                        ],
                        [
                            685.0,
                            805.0
                        ],
                        [
                            685.0,
                            871.0
                        ],
                        [
                            211.0,
                            871.0
                        ]
                    ],
                    "text": "我们徒然地到处找你"
                },
                {
                    "polygon": [
                        [
                            209.0,
                            924.0
                        ],
                        [
                            858.0,
                            943.0
                        ],
                        [
                            856.0,
                            1007.0
                        ],
                        [
                            210.0,
                            987.0
                        ]
                    ],
                    "text": "你就是孤独你就是神秘"
                },
                {
                    "polygon": [
                        [
                            206.0,
                            1001.0
                        ],
                        [
                            720.0,
                            1001.0
                        ],
                        [
                            720.0,
                            1068.0
                        ],
                        [
                            206.0,
                            1068.0
                        ]
                    ],
                    "text": "比恒河或者日落还要遥远"
                },
                {
                    "polygon": [
                        [
                            204.0,
                            1082.0
                        ],
                        [
                            1025.0,
                            1082.0
                        ],
                        [
                            1025.0,
                            1170.0
                        ],
                        [
                            204.0,
                            1170.0
                        ]
                    ],
                    "text": "你的脊背容忍了我的手慢条斯里的抚摸"
                },
                {
                    "polygon": [
                        [
                            202.0,
                            1204.0
                        ],
                        [
                            657.0,
                            1204.0
                        ],
                        [
                            657.0,
                            1277.0
                        ],
                        [
                            202.0,
                            1277.0
                        ]
                    ],
                    "text": "你自从早已遗忘的永恒"
                },
                {
                    "polygon": [
                        [
                            203.0,
                            1286.0
                        ],
                        [
                            829.0,
                            1286.0
                        ],
                        [
                            829.0,
                            1358.0
                        ],
                        [
                            203.0,
                            1358.0
                        ]
                    ],
                    "text": "已经允许人们犹豫的手的抚爱"
                },
                {
                    "polygon": [
                        [
                            187.0,
                            1367.0
                        ],
                        [
                            567.0,
                            1367.0
                        ],
                        [
                            567.0,
                            1443.0
                        ],
                        [
                            187.0,
                            1443.0
                        ]
                    ],
                    "text": "你是在另一个时代"
                },
                {
                    "polygon": [
                        [
                            184.0,
                            1445.0
                        ],
                        [
                            976.0,
                            1445.0
                        ],
                        [
                            976.0,
                            1543.0
                        ],
                        [
                            184.0,
                            1543.0
                        ]
                    ],
                    "text": "你是像梦一样隔绝的另一个区域的主宰"
                },
                {
                    "polygon": [
                        [
                            852.0,
                            1599.0
                        ],
                        [
                            1044.0,
                            1599.0
                        ],
                        [
                            1044.0,
                            1666.0
                        ],
                        [
                            852.0,
                            1666.0
                        ]
                    ],
                    "text": "2017.10.31"
                }
            ],
            "image_path": "/root/ld/ld_project/MIniCPM_Series_Tutorial/OCR_VG/data_demo/img/000001.jpg" 
        },
    "/root/ld/ld_project/MIniCPM_Series_Tutorial/OCR_VG/data_demo/img/000005.jpg": {
            "gt": [
                {
                    "polygon": [
                        [
                            97.0,
                            29.0
                        ],
                        [
                            794.0,
                            29.0
                        ],
                        [
                            794.0,
                            124.0
                        ],
                        [
                            97.0,
                            124.0
                        ]
                    ],
                    "text": "世界上最美好的事物"
                },
                {
                    "polygon": [
                        [
                            102.0,
                            137.0
                        ],
                        [
                            529.0,
                            137.0
                        ],
                        [
                            529.0,
                            226.0
                        ],
                        [
                            102.0,
                            226.0
                        ]
                    ],
                    "text": "莫过于“曾经”"
                },
                {
                    "polygon": [
                        [
                            85.0,
                            482.0
                        ],
                        [
                            987.0,
                            482.0
                        ],
                        [
                            987.0,
                            579.0
                        ],
                        [
                            85.0,
                            579.0
                        ]
                    ],
                    "text": "即悟得生命的真正意义和价值"
                },
                {
                    "polygon": [
                        [
                            79.0,
                            597.0
                        ],
                        [
                            699.0,
                            597.0
                        ],
                        [
                            699.0,
                            698.0
                        ],
                        [
                            79.0,
                            698.0
                        ]
                    ],
                    "text": "当你真的懂得以后"
                },
                {
                    "polygon": [
                        [
                            73.0,
                            717.0
                        ],
                        [
                            467.0,
                            717.0
                        ],
                        [
                            467.0,
                            813.0
                        ],
                        [
                            73.0,
                            813.0
                        ]
                    ],
                    "text": "不再有迷茫"
                },
                {
                    "polygon": [
                        [
                            78.0,
                            820.0
                        ],
                        [
                            495.0,
                            820.0
                        ],
                        [
                            495.0,
                            940.0
                        ],
                        [
                            78.0,
                            940.0
                        ]
                    ],
                    "text": "不再会伤感"
                },
                {
                    "polygon": [
                        [
                            79.0,
                            951.0
                        ],
                        [
                            451.0,
                            951.0
                        ],
                        [
                            451.0,
                            1061.0
                        ],
                        [
                            79.0,
                            1061.0
                        ]
                    ],
                    "text": "不再会畏惧"
                },
                {
                    "polygon": [
                        [
                            62.0,
                            1068.0
                        ],
                        [
                            452.0,
                            1068.0
                        ],
                        [
                            452.0,
                            1175.0
                        ],
                        [
                            62.0,
                            1175.0
                        ]
                    ],
                    "text": "不再会退缩"
                },
                {
                    "polygon": [
                        [
                            74.0,
                            1200.0
                        ],
                        [
                            566.0,
                            1200.0
                        ],
                        [
                            566.0,
                            1315.0
                        ],
                        [
                            74.0,
                            1315.0
                        ]
                    ],
                    "text": "网络励志语录"
                },
                {
                    "polygon": [
                        [
                            154.0,
                            1340.0
                        ],
                        [
                            421.0,
                            1345.0
                        ],
                        [
                            421.0,
                            1469.0
                        ],
                        [
                            164.0,
                            1466.0
                        ]
                    ],
                    "text": "学生党抄"
                },
                {
                    "polygon": [
                        [
                            83.0,
                            391.0
                        ],
                        [
                            882.0,
                            354.0
                        ],
                        [
                            881.0,
                            464.0
                        ],
                        [
                            85.0,
                            491.0
                        ]
                    ],
                    "text": "但有一种懂得就是彻悟"
                },
                {
                    "polygon": [
                        [
                            99.0,
                            297.0
                        ],
                        [
                            738.0,
                            264.0
                        ],
                        [
                            739.0,
                            361.0
                        ],
                        [
                            98.0,
                            369.0
                        ]
                    ],
                    "text": "人生该懂得的事很多"
                }
            ],
            "image_path": "/root/ld/ld_project/MIniCPM_Series_Tutorial/OCR_
SYMBOL INDEX (546 symbols across 55 files)

FILE: 4G_memory_rag/langchain_demo.py
  function clean_text (line 68) | def clean_text(text):
  class MiniCPM_LLM (line 90) | class MiniCPM_LLM(LLM):
    method __init__ (line 94) | def __init__(self, model_path: str):
    method _call (line 121) | def _call(self, prompt, stop: Optional[List[str]] = None):
    method _llm_type (line 178) | def _llm_type(self) -> str:
  function load_documents (line 183) | def load_documents(file_paths):
  function load_models (line 213) | def load_models():
  function embed_documents (line 237) | def embed_documents(documents, embedding_models):
  function create_prompt_template (line 256) | def create_prompt_template():
  function create_rag_chain (line 276) | def create_rag_chain(llm, prompt):
  function analysis_links (line 282) | def analysis_links(docs):
  function main (line 309) | def main():
  function process_query (line 337) | def process_query(file, query):

FILE: MiniCPM-o-long_video_inference/infer.py
  function extract_frames_and_audio (line 22) | def extract_frames_and_audio(video_path, sample_fps=2, max_frames=None, ...
  class LongVideoAudioProcessor (line 72) | class LongVideoAudioProcessor:
    method __init__ (line 73) | def __init__(self,
    method preprocess_frame (line 117) | def preprocess_frame(self, frame):
    method update_memory_bank (line 133) | def update_memory_bank(self, frames, audio_segments, current_time):
    method calculate_time_weights (line 153) | def calculate_time_weights(self, current_time):
    method weighted_sampling (line 164) | def weighted_sampling(self, frames, weights, sample_count):
    method update_text_summary (line 172) | def update_text_summary(self, new_summary):
    method process_long_video (line 179) | def process_long_video(self, video_path, query):
    method _inference_with_memory (line 224) | def _inference_with_memory(self, frames, audio_segments, query, chunk_...
    method _merge_results (line 339) | def _merge_results(self, all_results, query):
  class AudioProcessor (line 396) | class AudioProcessor:
    method __init__ (line 397) | def __init__(self, sample_rate=16000):
    method extract_audio_from_video (line 400) | def extract_audio_from_video(self, video_path):
    method segment_audio (line 409) | def segment_audio(self, audio, duration, segment_duration=1.0):
    method extract_audio_features (line 426) | def extract_audio_features(self, audio_segment):
  function main (line 432) | def main():

FILE: MiniCPMV2_6_awq/modeling_minicpmv.py
  class MiniCPMVPreTrainedModel (line 17) | class MiniCPMVPreTrainedModel(Qwen2PreTrainedModel):
  class MiniCPMV (line 21) | class MiniCPMV(MiniCPMVPreTrainedModel):
    method __init__ (line 22) | def __init__(self, config):
    method init_vision_module (line 33) | def init_vision_module(self):
    method init_resampler (line 49) | def init_resampler(self, embed_dim, vision_dim):
    method get_input_embeddings (line 58) | def get_input_embeddings(self):
    method set_input_embeddings (line 61) | def set_input_embeddings(self, value):
    method get_output_embeddings (line 64) | def get_output_embeddings(self):
    method set_output_embeddings (line 67) | def set_output_embeddings(self, new_embeddings):
    method set_decoder (line 70) | def set_decoder(self, decoder):
    method prepare_inputs_for_generation (line 73) | def prepare_inputs_for_generation(
    method get_decoder (line 121) | def get_decoder(self):
    method get_vllm_embedding (line 124) | def get_vllm_embedding(self, data):
    method forward (line 217) | def forward(self, data, **kwargs):
    method _decode (line 236) | def _decode(self, inputs_embeds, tokenizer, attention_mask, decode_tex...
    method _decode_stream (line 249) | def _decode_stream(self, inputs_embeds, tokenizer, **kwargs):
    method _decode_text (line 265) | def _decode_text(self, result_ids, tokenizer):
    method generate (line 277) | def generate(
    method chat (line 321) | def chat(

FILE: MiniCPMV2_6_awq/quantize.py
  function copy_files_not_in_B (line 18) | def copy_files_not_in_B(A_path, B_path):
  function load_alpaca (line 46) | def load_alpaca():
  function load_wikitext (line 58) | def load_wikitext():

FILE: OCR_Multimodal_Search/finetune/dataset.py
  class SupervisedDataset (line 43) | class SupervisedDataset(Dataset):
    method __init__ (line 46) | def __init__(
    method __len__ (line 67) | def __len__(self):
    method __getitem__ (line 70) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function data_collator (line 107) | def data_collator(examples, padding_value=0, max_length=2048):
  function conversation_to_ids (line 150) | def conversation_to_ids(conversation, tokenizer, llm_type=None, new_sche...
  function conversation_to_ids_minicpm (line 215) | def conversation_to_ids_minicpm(conversation, tokenizer):
  function conversation_to_ids_llama3 (line 247) | def conversation_to_ids_llama3(conversation, tokenizer):
  function conversation_to_ids_qwen2 (line 287) | def conversation_to_ids_qwen2(conversation, tokenizer):
  function preprocess (line 327) | def preprocess(
  function slice_image (line 414) | def slice_image(
  function ensure_divide (line 473) | def ensure_divide(length, patch_size):
  function find_best_resize (line 477) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 488) | def get_refine_size(
  function split_to_patches (line 512) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 529) | def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
  function reshape_by_patch (line 550) | def reshape_by_patch(image_tensor, patch_size):

FILE: OCR_Multimodal_Search/finetune/dataset_original.py
  class SupervisedDataset (line 18) | class SupervisedDataset(Dataset):
    method __init__ (line 21) | def __init__(
    method __len__ (line 42) | def __len__(self):
    method __getitem__ (line 45) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function data_collator (line 70) | def data_collator(examples, padding_value=0, max_length=2048):
  function conversation_to_ids (line 108) | def conversation_to_ids(conversation, tokenizer, llm_type=None, new_sche...
  function conversation_to_ids_minicpm (line 173) | def conversation_to_ids_minicpm(conversation, tokenizer):
  function conversation_to_ids_llama3 (line 205) | def conversation_to_ids_llama3(conversation, tokenizer):
  function conversation_to_ids_qwen2 (line 245) | def conversation_to_ids_qwen2(conversation, tokenizer):
  function preprocess (line 285) | def preprocess(
  function slice_image (line 372) | def slice_image(
  function ensure_divide (line 431) | def ensure_divide(length, patch_size):
  function find_best_resize (line 435) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 446) | def get_refine_size(
  function split_to_patches (line 470) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 487) | def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
  function reshape_by_patch (line 508) | def reshape_by_patch(image_tensor, patch_size):

FILE: OCR_Multimodal_Search/finetune/finetune.py
  class ModelArguments (line 30) | class ModelArguments:
  class DataArguments (line 35) | class DataArguments:
  class TrainingArguments (line 45) | class TrainingArguments(transformers.TrainingArguments):
  class LoraArguments (line 66) | class LoraArguments:
  function rank0_print (line 80) | def rank0_print(*args):
  function safe_save_model_for_hf_trainer (line 85) | def safe_save_model_for_hf_trainer(trainer, output_dir: str, bias="none"):
  function make_supervised_data_module (line 91) | def make_supervised_data_module(
  function build_transform (line 142) | def build_transform():
  function get_parameter_number (line 154) | def get_parameter_number(model):
  function train (line 172) | def train():

FILE: OCR_Multimodal_Search/finetune/trainer.py
  class CPMTrainer (line 41) | class CPMTrainer(Trainer):
    method original_loss (line 42) | def original_loss(self, model, inputs, return_outputs=False):
    method coloss (line 73) | def coloss(self, query_embeddings, doc_embeddings):
    method compute_loss (line 104) | def compute_loss(self,model, inputs):
    method prediction_step (line 165) | def prediction_step(
    method training_step (line 273) | def training_step(self, model: nn.Module, inputs: Dict[str, Union[torc...
    method _save (line 315) | def _save(self, output_dir: Optional[str] = None, state_dict=None):

FILE: OCR_Multimodal_Search/infer/app.py
  function search (line 17) | def search(query: str, ds, images):
  function index (line 39) | def index(file, ds):

FILE: OCR_Multimodal_Search/infer/cli_demo.py
  function search (line 15) | def search(query: str, ds, images):
  function index (line 38) | def index(file):

FILE: OCR_Multimodal_Search/infer/dataset.py
  class ImageDataset (line 19) | class ImageDataset(Dataset):
    method __init__ (line 21) | def __init__(
    method __len__ (line 42) | def __len__(self):
    method __getitem__ (line 45) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class QueryDataset (line 72) | class QueryDataset(Dataset):
    method __init__ (line 75) | def __init__(
    method __len__ (line 86) | def __len__(self):
    method __getitem__ (line 89) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function data_collator_query (line 103) | def data_collator_query(examples, padding_value=0, max_length=2048):
  function data_collator (line 113) | def data_collator(examples, padding_value=0, max_length=2048,device='cpu'):
  function load_from_pdf (line 150) | def load_from_pdf(pdf_path: str):
  function load_from_json (line 155) | def load_from_json(json_path: str):
  function conversation_to_ids (line 161) | def conversation_to_ids(conversation, tokenizer, llm_type=None, new_sche...
  function conversation_to_ids_minicpm (line 226) | def conversation_to_ids_minicpm(conversation, tokenizer):
  function conversation_to_ids_llama3 (line 258) | def conversation_to_ids_llama3(conversation, tokenizer):
  function conversation_to_ids_qwen2 (line 298) | def conversation_to_ids_qwen2(conversation, tokenizer):
  function preprocess (line 338) | def preprocess(
  function slice_image (line 434) | def slice_image(
  function ensure_divide (line 493) | def ensure_divide(length, patch_size):
  function find_best_resize (line 497) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 508) | def get_refine_size(
  function split_to_patches (line 532) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 549) | def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
  function reshape_by_patch (line 570) | def reshape_by_patch(image_tensor, patch_size):

FILE: OCR_Multimodal_Search/infer/inference.py
  function main (line 12) | def main() -> None:

FILE: OCR_Multimodal_Search/infer/utils.py
  function evaluate_colbert (line 4) | def evaluate_colbert( qs, ps, batch_size=128) -> torch.Tensor:
  function build_transform (line 22) | def build_transform():

FILE: OCR_VG/chat.py
  function init_omni_lmm (line 21) | def init_omni_lmm(model_path):
  function expand_question_into_multimodal (line 61) | def expand_question_into_multimodal(question_text, image_token_len, im_s...
  function wrap_question_for_omni_lmm (line 70) | def wrap_question_for_omni_lmm(question, image_token_len, tokenizer):
  class OmniLMM12B (line 85) | class OmniLMM12B:
    method __init__ (line 86) | def __init__(self, model_path) -> None:
    method decode (line 94) | def decode(self, image, input_ids):
    method chat (line 115) | def chat(self, input):
  function img2base64 (line 133) | def img2base64(file_name):
  class MiniCPMV (line 138) | class MiniCPMV:
    method __init__ (line 139) | def __init__(self, model_path) -> None:
    method chat (line 144) | def chat(self, input):
  class MiniCPMV2_5 (line 162) | class MiniCPMV2_5:
    method __init__ (line 163) | def __init__(self, model_path) -> None:
    method chat (line 168) | def chat(self, input):
  class MiniCPMVChat (line 186) | class MiniCPMVChat:
    method __init__ (line 187) | def __init__(self, model_path) -> None:
    method chat (line 195) | def chat(self, input):

FILE: OCR_VG/gt_test.py
  function parse_text (line 11) | def parse_text(input_str):
  function cv2ImgAddText (line 25) | def cv2ImgAddText(img, text, left, top, textColor=(0, 255, 0), textSize=...
  function draw (line 38) | def draw(final_box,final_text,img,height,width):
  function get_pic_path (line 56) | def get_pic_path(dir):

FILE: OCR_VG/merge_box.py
  function cv2ImgAddText (line 15) | def cv2ImgAddText(img, text, left, top, textColor=(0, 255, 0), textSize=...
  function calculate_intersection_percentage (line 31) | def calculate_intersection_percentage(rect1, rect2):
  function get_threshold (line 70) | def get_threshold(boxes, text_list, width, height):
  function find_min_bounding_rectangle (line 101) | def find_min_bounding_rectangle(polygons,four_pint=False):
  function merge_boxes (line 133) | def merge_boxes(boxes, text_list, width, height):
  function calculate_angle (line 184) | def calculate_angle(v1, v2):
  function is_quadrilateral_angles_in_range (line 193) | def is_quadrilateral_angles_in_range(points):
  function is_rectangle (line 204) | def is_rectangle(points):
  function get_query_answer (line 221) | def get_query_answer(boxes, text_list, height, width):
  function load_json (line 262) | def load_json(file_path):
  function draw (line 267) | def draw(final_box,final_text,img):
  function save_to_json (line 288) | def save_to_json(output_data, output_path):
  function main (line 300) | def main():

FILE: OCR_VG/omnilmm/conversation.py
  class SeparatorStyle (line 6) | class SeparatorStyle(Enum):
  class Conversation (line 13) | class Conversation:
    method get_prompt (line 26) | def get_prompt(self):
    method append_message (line 51) | def append_message(self, role, message):
    method get_images (line 54) | def get_images(self, return_pil=False):
    method to_gradio_chatbot (line 110) | def to_gradio_chatbot(self):
    method copy (line 142) | def copy(self):
    method dict (line 152) | def dict(self):

FILE: OCR_VG/omnilmm/model/omnilmm.py
  class OmniLMMConfig (line 23) | class OmniLMMConfig(MistralConfig):
  class Identity (line 27) | class Identity(torch.nn.Identity):
    method forward (line 28) | def forward(self, input: Tensor, **kwargs) -> Tensor:
  function create_vision_module (line 32) | def create_vision_module(config):
  class OmniLMMModel (line 56) | class OmniLMMModel(MistralModel):
    method __init__ (line 59) | def __init__(self, config: OmniLMMConfig, mm_vision_tower=None, mm_hid...
    method initialize_vision_modules (line 75) | def initialize_vision_modules(self, vision_tower, no_randaug, num_quer...
    method get_vision_embedding (line 108) | def get_vision_embedding(self, pixel_values):
    method get_vllm_embedding (line 123) | def get_vllm_embedding(self, data):
    method forward (line 184) | def forward(
  class OmniLMMForCausalLM (line 269) | class OmniLMMForCausalLM(MistralForCausalLM):
    method __init__ (line 272) | def __init__(self, config, mm_vision_tower=None, tune_clip=True):
    method forward (line 283) | def forward(
    method prepare_inputs_for_generation (line 350) | def prepare_inputs_for_generation(
    method generate_vllm (line 372) | def generate_vllm(
    method initialize_vision_tokenizer (line 400) | def initialize_vision_tokenizer(self, mm_use_im_start_end, tokenizer, ...

FILE: OCR_VG/omnilmm/model/resampler.py
  function get_abs_pos (line 23) | def get_abs_pos(abs_pos, tgt_size):
  function get_2d_sincos_pos_embed (line 43) | def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
  function get_2d_sincos_pos_embed_from_grid (line 62) | def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
  function get_1d_sincos_pos_embed_from_grid (line 75) | def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
  class Resampler (line 96) | class Resampler(nn.Module):
    method __init__ (line 104) | def __init__(
    method _init_weights (line 140) | def _init_weights(self, m):
    method forward (line 149) | def forward(self, x, attn_mask=None):
    method _repeat (line 170) | def _repeat(self, query, N: int):
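The Resampler compresses vision tokens through cross-attention against fixed sin/cos position codes. A minimal sketch of the 1D sincos embedding that the 2D helpers above are built from (the standard MAE-style construction; the function name is illustrative):

```python
import numpy as np

def sincos_pos_embed_1d(embed_dim, pos):
    # pos: (M,) array of positions -> (M, embed_dim) embedding, first half sin, second half cos
    assert embed_dim % 2 == 0
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2.0))
    out = np.einsum('m,d->md', pos.reshape(-1).astype(np.float64), omega)  # positions x frequencies
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)
```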

FILE: OCR_VG/omnilmm/model/utils.py
  function auto_upgrade (line 23) | def auto_upgrade(config):
  class KeywordsStoppingCriteria (line 42) | class KeywordsStoppingCriteria(StoppingCriteria):
    method __init__ (line 43) | def __init__(self, keywords, tokenizer, input_ids):
    method __call__ (line 49) | def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTe...
  function auto_upgrade (line 61) | def auto_upgrade(config):
  function identity_func (line 82) | def identity_func(img):
  function autocontrast_func (line 86) | def autocontrast_func(img, cutoff=0):
  function equalize_func (line 118) | def equalize_func(img):
  function rotate_func (line 142) | def rotate_func(img, degree, fill=(0, 0, 0)):
  function solarize_func (line 153) | def solarize_func(img, thresh=128):
  function color_func (line 163) | def color_func(img, factor):
  function contrast_func (line 185) | def contrast_func(img, factor):
  function brightness_func (line 198) | def brightness_func(img, factor):
  function sharpness_func (line 208) | def sharpness_func(img, factor):
  function shear_x_func (line 230) | def shear_x_func(img, factor, fill=(0, 0, 0)):
  function translate_x_func (line 238) | def translate_x_func(img, offset, fill=(0, 0, 0)):
  function translate_y_func (line 249) | def translate_y_func(img, offset, fill=(0, 0, 0)):
  function posterize_func (line 260) | def posterize_func(img, bits):
  function shear_y_func (line 268) | def shear_y_func(img, factor, fill=(0, 0, 0)):
  function cutout_func (line 276) | def cutout_func(img, pad_size, replace=(0, 0, 0)):
  function enhance_level_to_args (line 290) | def enhance_level_to_args(MAX_LEVEL):
  function shear_level_to_args (line 296) | def shear_level_to_args(MAX_LEVEL, replace_value):
  function translate_level_to_args (line 306) | def translate_level_to_args(translate_const, MAX_LEVEL, replace_value):
  function cutout_level_to_args (line 316) | def cutout_level_to_args(cutout_const, MAX_LEVEL, replace_value):
  function solarize_level_to_args (line 324) | def solarize_level_to_args(MAX_LEVEL):
  function none_level_to_args (line 331) | def none_level_to_args(level):
  function posterize_level_to_args (line 335) | def posterize_level_to_args(MAX_LEVEL):
  function rotate_level_to_args (line 342) | def rotate_level_to_args(MAX_LEVEL, replace_value):
  class RandomAugment (line 394) | class RandomAugment(object):
    method __init__ (line 396) | def __init__(self, N=2, M=10, isPIL=False, augs=[]):
    method get_random_ops (line 405) | def get_random_ops(self):
    method __call__ (line 409) | def __call__(self, img):
  function build_transform (line 421) | def build_transform(is_train, randaug=True, input_size=224, interpolatio...
  function img2b64 (line 465) | def img2b64(img_path):
  function str2b64 (line 475) | def str2b64(str):
  function b642str (line 479) | def b642str(b64):
  function is_dist_avail_and_initialized (line 483) | def is_dist_avail_and_initialized():
  function get_world_size (line 491) | def get_world_size():
  function get_rank (line 497) | def get_rank():
  function all_gather (line 503) | def all_gather(data):
  function mean (line 546) | def mean(lst):
  function stop_gradient_by_name (line 550) | def stop_gradient_by_name(name: str):
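Besides the RandAugment ops, the most reusable piece in this file is KeywordsStoppingCriteria. A minimal sketch of the idea against the standard transformers StoppingCriteria interface; the class body is illustrative and matches only the signatures listed above:

```python
import torch
from transformers import StoppingCriteria

class StopOnKeywords(StoppingCriteria):
    """Stop generation once any keyword appears in the newly generated text."""
    def __init__(self, keywords, tokenizer, input_ids):
        self.keywords = keywords
        self.tokenizer = tokenizer
        self.prompt_len = input_ids.shape[1]   # length of the prompt, skipped when decoding
    def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        new_text = self.tokenizer.decode(output_ids[0, self.prompt_len:], skip_special_tokens=True)
        return any(keyword in new_text for keyword in self.keywords)
```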

FILE: OCR_VG/omnilmm/train/train_utils.py
  function _tokenize_fn (line 22) | def _tokenize_fn(strings: Sequence[str],
  function omni_preprocess (line 50) | def omni_preprocess(sources,

FILE: OCR_VG/omnilmm/utils.py
  function build_logger (line 17) | def build_logger(logger_name, logger_filename):
  class StreamToLogger (line 60) | class StreamToLogger(object):
    method __init__ (line 65) | def __init__(self, logger, log_level=logging.INFO):
    method __getattr__ (line 71) | def __getattr__(self, attr):
    method write (line 74) | def write(self, buf):
    method flush (line 88) | def flush(self):
  function disable_torch_init (line 94) | def disable_torch_init():
  function violates_moderation (line 103) | def violates_moderation(text):
  function pretty_print_semaphore (line 124) | def pretty_print_semaphore(semaphore):

FILE: agent_auto_plan/autoplan/all_param_inference.py
  function format_outputs (line 12) | def format_outputs(outputs):
  function all_param_split_task (line 22) | def all_param_split_task(question,tokenizer,merge_model):
  function stopping_criteria (line 63) | def stopping_criteria(cur_len,output_so_far):

FILE: agent_auto_plan/autoplan/bing_search.py
  class SearchResult (line 24) | class SearchResult:
    method __init__ (line 25) | def __init__(self, title, url, snip) -> None:
    method dump (line 30) | def dump(self):
    method __str__ (line 37) | def __str__(self) -> str:
  class SearcherInterface (line 40) | class SearcherInterface:
    method search (line 41) | def search(self, query) -> List[SearchResult]:
  function generate_document (line 45) | def generate_document(url):
  function summarize_document (line 55) | def summarize_document(url, model_name):
  function fetch_webpage_content (line 67) | def fetch_webpage_content(url):
  function summarize_text (line 72) | def summarize_text(text,question,model,tokenizer):
  class Searcher (line 95) | class Searcher(SearcherInterface):
    method __init__ (line 96) | def __init__(self) -> None:
    method _parse (line 99) | def _parse(self, result) -> List[SearchResult]:
    method search (line 107) | def search(self, query) -> List[SearchResult]:
  function get_bing_search_raw_page (line 111) | def get_bing_search_raw_page(question: str):
  function query_bing (line 143) | def query_bing(question, max_tries=3,model=None,tokenizer=None):
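fetch_webpage_content and summarize_text form the retrieval half of the search tool. A minimal sketch of page fetching using requests plus BeautifulSoup, a common substitute for the playwright-based fetch in this file; the helper name, headers, and timeout are illustrative:

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url, timeout=10):
    # download a page and reduce it to whitespace-normalized visible text
    resp = requests.get(url, timeout=timeout, headers={'User-Agent': 'Mozilla/5.0'})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    for tag in soup(['script', 'style']):   # drop non-visible content
        tag.decompose()
    return ' '.join(soup.get_text(separator=' ').split())
```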

FILE: agent_auto_plan/autoplan/fuctions.py
  function gpt_35_api_stream (line 11) | def gpt_35_api_stream(messages: list):
  function bm25 (line 43) | def bm25(query, corpus,model=None):
  function get_check_text (line 65) | def get_check_text(sub_task_list):
  function get_tools_description (line 70) | def get_tools_description(tools):
  function get_task_and_question (line 80) | def get_task_and_question(path):
  function task_text_split (line 102) | def task_text_split(text,question_orgin):  # turn the task-decomposition output into a task list; uses qwen's first th...
  function distance (line 115) | def distance(query,map_dict):  # compute the distance
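The file's preview shows it importing BM25Okapi from rank_bm25, so the bm25 helper above is presumably a thin wrapper over that ranker. A minimal sketch of the pattern; the function name and whitespace tokenization are illustrative:

```python
from rank_bm25 import BM25Okapi

def bm25_best_match(query, corpus):
    # score every corpus document against the query and return the top-ranked one
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    return corpus[int(max(range(len(corpus)), key=scores.__getitem__))]
```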

FILE: agent_auto_plan/autoplan/load_model.py
  function get_model (line 6) | def get_model(args):

FILE: agent_auto_plan/autoplan/lora_inference_nomerge.py
  function get_merge_model (line 9) | def get_merge_model(base_path,adapter_path):
  function split_task (line 27) | def split_task(question,tokenizer,merge_model,prompt=None):
  function format_outputs (line 51) | def format_outputs(outputs):  # format the outputs
  function test (line 61) | def test():
  function stopping_criteria (line 94) | def stopping_criteria(cur_len,output_so_far):
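get_merge_model loads a LoRA adapter on top of the base weights without merging them, which is what lets this script swap adapters cheaply. A minimal sketch with peft, assuming a causal-LM base; the function name, paths, and dtype are placeholders:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_lora(base_path, adapter_path):
    # attach the LoRA adapter without folding it into the base weights
    tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
    base = AutoModelForCausalLM.from_pretrained(
        base_path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto')
    model = PeftModel.from_pretrained(base, adapter_path)
    return tokenizer, model.eval()
```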

FILE: agent_auto_plan/autoplan/main.py
  function _get_args (line 72) | def _get_args():
  function llm_with_plugin (line 126) | def llm_with_plugin(
  function build_input_text (line 315) | def build_input_text(chat_history, list_of_plugin_info) -> str:
  function text_completion (line 370) | def text_completion(input_text: str, stop_words) -> str:  # used as a plain text-continuation model
  function parse_latest_plugin_call (line 394) | def parse_latest_plugin_call(text):
  function test (line 426) | def test():
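parse_latest_plugin_call is the glue of the ReAct loop here: it pulls the most recent Action / Action Input pair out of the model's continuation so the tool can be dispatched. A minimal sketch of that parsing, assuming the standard ReAct markers:

```python
def parse_latest_plugin_call(text):
    # extract the last Action / Action Input pair from a ReAct-style continuation
    plugin_name, plugin_args = '', ''
    i = text.rfind('\nAction:')
    j = text.rfind('\nAction Input:')
    k = text.rfind('\nObservation:')
    if 0 <= i < j:
        if k < j:  # the model stopped before emitting "Observation:", so append it
            text = text.rstrip() + '\nObservation:'
            k = text.rfind('\nObservation:')
        plugin_name = text[i + len('\nAction:'):j].strip()
        plugin_args = text[j + len('\nAction Input:'):k].strip()
    return plugin_name, plugin_args
```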

FILE: agent_auto_plan/autoplan/prompt_plamte.py
  function check_action_inputs (line 97) | def check_action_inputs(question,action,list_of_plugin_info,history=None...
  function prompt_task_split (line 144) | def prompt_task_split(question,tools,args,write_file):  # task-decomposition function

FILE: agent_auto_plan/autoplan/tools_introduction.py
  function call_plugin (line 199) | def call_plugin(plugin_name: str, plugin_args: str,write_file,embeding_m...

FILE: agent_auto_plan/finetune_language/dataset.py
  class SupervisedDataset (line 23) | class SupervisedDataset(Dataset):
    method __init__ (line 26) | def __init__(
    method __len__ (line 49) | def __len__(self):
    method __getitem__ (line 52) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function data_collator (line 94) | def data_collator(examples, padding_value=0, max_length=2048):
  function conversation_to_ids (line 132) | def conversation_to_ids(conversation, tokenizer, llm_type=None, new_sche...
  function conversation_to_ids_minicpm (line 206) | def conversation_to_ids_minicpm(conversation, tokenizer):
  function conversation_to_ids_llama3 (line 238) | def conversation_to_ids_llama3(conversation, tokenizer):
  function conversation_to_ids_qwen2 (line 278) | def conversation_to_ids_qwen2(conversation, tokenizer):
  function preprocess (line 317) | def preprocess(
  function slice_image (line 418) | def slice_image(
  function ensure_divide (line 477) | def ensure_divide(length, patch_size):
  function find_best_resize (line 481) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 492) | def get_refine_size(
  function split_to_patches (line 516) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 533) | def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
  function reshape_by_patch (line 559) | def reshape_by_patch(image_tensor, patch_size):

FILE: agent_auto_plan/finetune_language/finetune.py
  class ModelArguments (line 27) | class ModelArguments:
  class DataArguments (line 32) | class DataArguments:
  class TrainingArguments (line 42) | class TrainingArguments(transformers.TrainingArguments):
  class LoraArguments (line 59) | class LoraArguments:
  function rank0_print (line 73) | def rank0_print(*args):
  function safe_save_model_for_hf_trainer (line 78) | def safe_save_model_for_hf_trainer(trainer, output_dir: str, bias="none"):
  function make_supervised_data_module (line 84) | def make_supervised_data_module(
  function build_transform (line 137) | def build_transform():
  function get_parameter_number (line 149) | def get_parameter_number(model):
  function train (line 167) | def train():

FILE: agent_auto_plan/finetune_language/merge_lora.py
  function copy_files_not_in_B (line 11) | def copy_files_not_in_B(A_path, B_path):

FILE: agent_auto_plan/finetune_language/replace_file/modeling_minicpmv.py
  class MiniCPMVPreTrainedModel (line 18) | class MiniCPMVPreTrainedModel(Qwen2PreTrainedModel):
  class MiniCPMV (line 22) | class MiniCPMV(MiniCPMVPreTrainedModel):
    method __init__ (line 23) | def __init__(self, config):
    method init_vision_module (line 37) | def init_vision_module(self):
    method init_resampler (line 53) | def init_resampler(self, embed_dim, vision_dim):
    method get_input_embeddings (line 62) | def get_input_embeddings(self):
    method set_input_embeddings (line 65) | def set_input_embeddings(self, value):
    method get_output_embeddings (line 68) | def get_output_embeddings(self):
    method set_output_embeddings (line 71) | def set_output_embeddings(self, new_embeddings):
    method set_decoder (line 74) | def set_decoder(self, decoder):
    method prepare_inputs_for_generation (line 77) | def prepare_inputs_for_generation(
    method get_decoder (line 125) | def get_decoder(self):
    method get_vllm_embedding (line 128) | def get_vllm_embedding(self, data):
    method forward (line 236) | def forward(self, data, **kwargs):
    method _decode (line 257) | def _decode(self, inputs_embeds, tokenizer, attention_mask, decode_tex...
    method _decode_stream (line 270) | def _decode_stream(self, inputs_embeds, tokenizer, **kwargs):
    method _decode_text (line 286) | def _decode_text(self, result_ids, tokenizer):
    method generate (line 298) | def generate(
    method chat (line 342) | def chat(

FILE: agent_auto_plan/qwen_vllm.py
  function save_to_jsonl (line 17) | def save_to_jsonl(data, output_file):
  function save_to_excel (line 21) | def save_to_excel(data, output_file):
  function process_prompt (line 30) | def process_prompt(prompt):

FILE: agent_demo/agent_demo.py
  class SequenceStoppingCriteria (line 29) | class SequenceStoppingCriteria(StoppingCriteria):
    method __init__ (line 30) | def __init__(self, sequence_ids):
    method check_sequences (line 33) | def check_sequences(self, current_tokens, sequences):
    method __call__ (line 46) | def __call__(self, input_ids, scores, **kwargs):
  function llm_with_plugin (line 114) | def llm_with_plugin(prompt: str, history, list_of_plugin_info=()):
  function build_input_text (line 141) | def build_input_text(chat_history, list_of_plugin_info) -> str:
  function text_completion (line 186) | def text_completion(input_text: str, stop_words) -> str:  # used as a plain text-continuation model
  function parse_latest_plugin_call (line 210) | def parse_latest_plugin_call(text):
  function call_plugin (line 235) | def call_plugin(plugin_name: str, plugin_args: str) -> str:
  function test (line 310) | def test():

FILE: agent_demo/build_react_prompt.py
  function build_input_text (line 24) | def build_input_text(chat_history, list_of_plugin_info) -> str:
  function parse_latest_plugin_call (line 71) | def parse_latest_plugin_call(text):
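build_input_text assembles the ReAct prompt from tool metadata. A minimal sketch of the standard template it follows; the tool dict keys and wording are illustrative, not copied from the file:

```python
def build_react_prompt(question, tools):
    # tools: list of dicts with 'name', 'description', 'parameters' (illustrative keys)
    descs = '\n'.join(f"{t['name']}: {t['description']} Parameters: {t['parameters']}" for t in tools)
    names = ','.join(t['name'] for t in tools)
    return (
        'Answer the following questions as best you can. You have access to the following tools:\n\n'
        f'{descs}\n\nUse the following format:\n\n'
        'Question: the input question you must answer\n'
        'Thought: you should always think about what to do\n'
        f'Action: the action to take, should be one of [{names}]\n'
        'Action Input: the input to the action\n'
        'Observation: the result of the action\n'
        '... (this Thought/Action/Action Input/Observation can be repeated zero or more times)\n'
        'Thought: I now know the final answer\n'
        'Final Answer: the final answer to the original input question\n\n'
        f'Begin!\n\nQuestion: {question}')
```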

FILE: agent_demo/get_react_data.py
  function save_cpm3_data (line 78) | def save_cpm3_data(cpm3_data_path,cpm3_data):
  function switch_cpm_tool (line 86) | def switch_cpm_tool(tools):
  function function_call (line 126) | def function_call(plugin_name, plugin_args):
  function split_react_data (line 152) | def split_react_data(react_str):
  function get_answer_from_output (line 170) | def get_answer_from_output(output):
  function get_tool_description (line 177) | def get_tool_description(tool):
  function get_question (line 191) | def get_question():
  function get_react_data (line 220) | def get_react_data():
  function get_cpm_function_call (line 259) | def get_cpm_function_call():

FILE: ft_language_replace_file/finetune/dataset.py
  class SupervisedDataset (line 23) | class SupervisedDataset(Dataset):
    method __init__ (line 26) | def __init__(
    method __len__ (line 47) | def __len__(self):
    method __getitem__ (line 50) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function data_collator (line 87) | def data_collator(examples, padding_value=0, max_length=2048):
  function conversation_to_ids (line 125) | def conversation_to_ids(conversation, tokenizer, llm_type=None):
  function conversation_to_ids_minicpm (line 178) | def conversation_to_ids_minicpm(conversation, tokenizer):
  function conversation_to_ids_llama3 (line 210) | def conversation_to_ids_llama3(conversation, tokenizer):
  function preprocess (line 252) | def preprocess(
  function slice_image (line 352) | def slice_image(
  function ensure_divide (line 411) | def ensure_divide(length, patch_size):
  function find_best_resize (line 415) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 426) | def get_refine_size(
  function split_to_patches (line 450) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 467) | def get_grid_placeholder(tokenizer, grid, query_num):
  function reshape_by_patch (line 485) | def reshape_by_patch(image_tensor, patch_size):

FILE: ft_language_replace_file/finetune/finetune.py
  class ModelArguments (line 44) | class ModelArguments:
  class DataArguments (line 49) | class DataArguments:
  class TrainingArguments (line 59) | class TrainingArguments(transformers.TrainingArguments):
  class LoraArguments (line 76) | class LoraArguments:
  function maybe_zero_3 (line 89) | def maybe_zero_3(param):
  function get_peft_state_maybe_zero_3 (line 100) | def get_peft_state_maybe_zero_3(named_params, bias):
  function rank0_print (line 126) | def rank0_print(*args):
  function safe_save_model_for_hf_trainer (line 131) | def safe_save_model_for_hf_trainer(trainer, output_dir: str, bias="none"):
  function make_supervised_data_module (line 147) | def make_supervised_data_module(
  function get_parameter_number (line 197) | def get_parameter_number(model):
  function train (line 215) | def train():

FILE: ft_language_replace_file/finetune/only_language_web_demo.py
  function create_component (line 110) | def create_component(params, comp='Slider'):
  function chat (line 134) | def chat(img, msgs, ctx, params=None, vision_hidden_states=None):
  function upload_img (line 175) | def upload_img(image, _chatbot, _app_session):
  function respond (line 185) | def respond(_question, _chat_bot, _app_cfg, params_form, num_beams, repe...
  function regenerate_button_clicked (line 227) | def regenerate_button_clicked(_question, _chat_bot, _app_cfg, params_for...

FILE: ft_language_replace_file/finetune/replace_file/modeling_minicpmv.py
  class MiniCPMVPreTrainedModel (line 25) | class MiniCPMVPreTrainedModel(LlamaPreTrainedModel):
  class MiniCPMV (line 29) | class MiniCPMV(MiniCPMVPreTrainedModel):
    method __init__ (line 30) | def __init__(self, config):
    method init_vision_module (line 40) | def init_vision_module(self):
    method init_resampler (line 51) | def init_resampler(self, embed_dim, vision_dim,):
    method init_transform (line 60) | def init_transform(self):
    method get_input_embeddings (line 70) | def get_input_embeddings(self):
    method set_input_embeddings (line 73) | def set_input_embeddings(self, value):
    method get_vllm_embedding (line 76) | def get_vllm_embedding(self, data):
    method forward (line 184) | def forward(self, data, **kwargs):
    method _convert_to_tensors (line 206) | def _convert_to_tensors(
    method _process_list (line 231) | def _process_list(
    method _decode (line 246) | def _decode(self, inputs_embeds, tokenizer, **kwargs):
    method _decode_stream (line 259) | def _decode_stream(self, inputs_embeds, tokenizer, **kwargs):
    method _decode_text (line 278) | def _decode_text(self, result_ids, tokenizer):
    method slice_image (line 289) | def slice_image(self, image):
    method get_slice_image_placeholder (line 297) | def get_slice_image_placeholder(self, image, tokenizer):
    method reshape_by_patch (line 327) | def reshape_by_patch(self, image_tensor):
    method generate (line 344) | def generate(
    method chat (line 396) | def chat(
  class PreTrainedTokenizerFastWrapper (line 509) | class PreTrainedTokenizerFastWrapper(PreTrainedTokenizerFast):
    method __init__ (line 510) | def __init__(self, **kwargs):
    method eos_id (line 525) | def eos_id(self):
    method bos_id (line 529) | def bos_id(self):
    method unk_id (line 533) | def unk_id(self):
    method eot_id (line 537) | def eot_id(self):
    method im_start_id (line 541) | def im_start_id(self):
    method im_end_id (line 545) | def im_end_id(self):
    method escape (line 549) | def escape(text: str) -> str:
    method unescape (line 553) | def unescape(text: str) -> str:
  function pad (line 557) | def pad(orig_items, key, max_length=None, padding_value=0, padding_side=...
  function slice_image (line 605) | def slice_image(
  function ensure_divide (line 662) | def ensure_divide(length, patch_size):
  function find_best_resize (line 666) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 677) | def get_refine_size(
  function split_to_patches (line 701) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 718) | def get_grid_placeholder(tokenizer, grid, query_num):

FILE: ft_language_replace_file/finetune/replace_file/resampler.py
  function get_2d_sincos_pos_embed (line 17) | def get_2d_sincos_pos_embed(embed_dim, image_size):
  function get_2d_sincos_pos_embed_from_grid (line 37) | def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
  function get_1d_sincos_pos_embed_from_grid_new (line 48) | def get_1d_sincos_pos_embed_from_grid_new(embed_dim, pos):
  class Resampler (line 68) | class Resampler(nn.Module):
    method __init__ (line 76) | def __init__(
    method _set_2d_pos_cache (line 109) | def _set_2d_pos_cache(self, max_size, device='cpu'):
    method _adjust_pos_cache (line 115) | def _adjust_pos_cache(self, tgt_sizes, device):
    method _init_weights (line 122) | def _init_weights(self, m):
    method forward (line 131) | def forward(self, x, tgt_sizes=None):
    method _repeat (line 172) | def _repeat(self, query, N: int):
  class MultiheadAttention (line 176) | class MultiheadAttention(nn.MultiheadAttention):
    method __init__ (line 177) | def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bi...
    method forward (line 184) | def forward(
    method multi_head_attention_forward (line 340) | def multi_head_attention_forward(
  function _mha_shape_check (line 624) | def _mha_shape_check(query: Tensor, key: Tensor, value: Tensor,
  function _canonical_mask (line 672) | def _canonical_mask(
  function _none_or_dtype (line 701) | def _none_or_dtype(input: Optional[Tensor]) -> Optional[DType]:
  function _in_projection_packed (line 708) | def _in_projection_packed(
  function _in_projection (line 768) | def _in_projection(

FILE: ft_language_replace_file/finetune/trainer.py
  class CPMTrainer (line 11) | class CPMTrainer(Trainer):
    method compute_loss (line 12) | def compute_loss(self, model, inputs, return_outputs=False):
    method prediction_step (line 44) | def prediction_step(
    method training_step (line 172) | def training_step(self, model: nn.Module, inputs: Dict[str, Union[torc...
    method _save (line 214) | def _save(self, output_dir: Optional[str] = None, state_dict=None):

FILE: get_minicpmv2.6_embeding/dataset.py
  class ImageDataset (line 19) | class ImageDataset(Dataset):
    method __init__ (line 21) | def __init__(
    method __len__ (line 42) | def __len__(self):
    method __getitem__ (line 45) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class QueryDataset (line 72) | class QueryDataset(Dataset):
    method __init__ (line 75) | def __init__(
    method __len__ (line 86) | def __len__(self):
    method __getitem__ (line 89) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function data_collator_query (line 116) | def data_collator_query(examples, padding_value=0, max_length=2048):
  function data_collator (line 126) | def data_collator(examples, padding_value=0, max_length=2048,device='cpu'):
  function load_from_pdf (line 162) | def load_from_pdf(pdf_path: str):
  function load_from_json (line 167) | def load_from_json(json_path: str):
  function conversation_to_ids (line 173) | def conversation_to_ids(conversation, tokenizer, llm_type=None, new_sche...
  function conversation_to_ids_minicpm (line 238) | def conversation_to_ids_minicpm(conversation, tokenizer):
  function conversation_to_ids_llama3 (line 270) | def conversation_to_ids_llama3(conversation, tokenizer):
  function conversation_to_ids_qwen2 (line 310) | def conversation_to_ids_qwen2(conversation, tokenizer):
  function preprocess (line 350) | def preprocess(
  function slice_image (line 446) | def slice_image(
  function ensure_divide (line 505) | def ensure_divide(length, patch_size):
  function find_best_resize (line 509) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 520) | def get_refine_size(
  function split_to_patches (line 544) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 561) | def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
  function reshape_by_patch (line 582) | def reshape_by_patch(image_tensor, patch_size):
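data_collator here right-pads variable-length conversations into one batch before embedding extraction. A minimal sketch of the padding logic, assuming each example carries a 1-D input_ids tensor; the keys and mask layout are illustrative:

```python
import torch

def pad_collate(examples, padding_value=0, max_length=2048):
    # right-pad 1-D input_ids tensors into [batch, max_len] plus an attention mask
    ids = [ex['input_ids'][:max_length] for ex in examples]
    max_len = max(x.size(0) for x in ids)
    input_ids = torch.full((len(ids), max_len), padding_value, dtype=torch.long)
    attention_mask = torch.zeros_like(input_ids)
    for i, x in enumerate(ids):
        input_ids[i, : x.size(0)] = x
        attention_mask[i, : x.size(0)] = 1
    return {'input_ids': input_ids, 'attention_mask': attention_mask}
```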

FILE: get_minicpmv2.6_embeding/inference.py
  function build_transform (line 9) | def build_transform():
  function main (line 20) | def main() -> None:

FILE: get_minicpmv2.6_embeding/modeling_minicpmv.py
  class MiniCPMVPreTrainedModel (line 18) | class MiniCPMVPreTrainedModel(Qwen2PreTrainedModel):
  class MiniCPMV (line 22) | class MiniCPMV(MiniCPMVPreTrainedModel):
    method __init__ (line 23) | def __init__(self, config):
    method init_vision_module (line 34) | def init_vision_module(self):
    method init_resampler (line 50) | def init_resampler(self, embed_dim, vision_dim):
    method get_input_embeddings (line 59) | def get_input_embeddings(self):
    method set_input_embeddings (line 62) | def set_input_embeddings(self, value):
    method get_output_embeddings (line 65) | def get_output_embeddings(self):
    method set_output_embeddings (line 68) | def set_output_embeddings(self, new_embeddings):
    method set_decoder (line 71) | def set_decoder(self, decoder):
    method get_decoder (line 74) | def get_decoder(self):
    method get_vllm_embedding (line 77) | def get_vllm_embedding(self, data):
    method forward (line 174) | def forward(self, data, **kwargs):
    method _decode (line 193) | def _decode(self, inputs_embeds, tokenizer, attention_mask, decode_tex...
    method _decode_stream (line 206) | def _decode_stream(self, inputs_embeds, tokenizer, **kwargs):
    method _decode_text (line 222) | def _decode_text(self, result_ids, tokenizer):
    method generate (line 234) | def generate(
    method chat (line 278) | def chat(

FILE: mbti_role_play/mbti_demo.py
  function check_model_v (line 44) | def check_model_v(img_file_path: str = None):
  function hf_gen (line 68) | def hf_gen(dialog: List, top_p: float, temperature: float, repetition_pe...
  function hf_v_gen (line 102) | def hf_v_gen(dialog: List, top_p: float, temperature: float, repetition_...
  function generate (line 134) | def generate(chat_history: List, query: str, I_E_choice: str, N_S_choice...
  function regenerate (line 174) | def regenerate(chat_history: List, I_E_choice, N_S_choice, T_F_choice, J...
  function clear_history (line 212) | def clear_history():
  function reverse_last_round (line 221) | def reverse_last_round(chat_history):
  function process_choice (line 232) | def process_choice(I_E, N_S, T_F, J_P):
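hf_gen streams tokens to the Gradio UI as they are generated. A minimal sketch of that pattern with transformers' TextIteratorStreamer and a background generate thread; the apply_chat_template call and kwargs are assumptions about this tokenizer, not taken from the file:

```python
from threading import Thread
from transformers import TextIteratorStreamer

def stream_chat(model, tokenizer, messages, **gen_kwargs):
    # run generate() on a worker thread and yield decoded chunks as they arrive
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors='pt').to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate,
           kwargs=dict(input_ids=input_ids, streamer=streamer, **gen_kwargs)).start()
    for chunk in streamer:
        yield chunk
```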

FILE: mbti_role_play/mbti_sft_dpo_data/get_rank_data.py
  function load_json (line 8) | def load_json(file_path):

FILE: windows_minicpm3.0_agent/app.py
  function search (line 17) | def search(query: str, ds, images):
  function index (line 39) | def index(file, ds):

FILE: windows_minicpm3.0_agent/cli_demo.py
  function get_relevant_image (line 13) | def get_relevant_image(query,file_path):

FILE: windows_minicpm3.0_agent/dataset.py
  class ImageDataset (line 19) | class ImageDataset(Dataset):
    method __init__ (line 21) | def __init__(
    method __len__ (line 42) | def __len__(self):
    method __getitem__ (line 45) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  class QueryDataset (line 72) | class QueryDataset(Dataset):
    method __init__ (line 75) | def __init__(
    method __len__ (line 86) | def __len__(self):
    method __getitem__ (line 89) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
  function data_collator_query (line 103) | def data_collator_query(examples, padding_value=0, max_length=2048):
  function data_collator (line 113) | def data_collator(examples, padding_value=0, max_length=2048,device='cpu'):
  function load_from_pdf (line 150) | def load_from_pdf(pdf_path: str):
  function load_from_json (line 155) | def load_from_json(json_path: str):
  function conversation_to_ids (line 161) | def conversation_to_ids(conversation, tokenizer, llm_type=None, new_sche...
  function conversation_to_ids_minicpm (line 226) | def conversation_to_ids_minicpm(conversation, tokenizer):
  function conversation_to_ids_llama3 (line 258) | def conversation_to_ids_llama3(conversation, tokenizer):
  function conversation_to_ids_qwen2 (line 298) | def conversation_to_ids_qwen2(conversation, tokenizer):
  function preprocess (line 338) | def preprocess(
  function slice_image (line 434) | def slice_image(
  function ensure_divide (line 493) | def ensure_divide(length, patch_size):
  function find_best_resize (line 497) | def find_best_resize(original_size, scale_resolution, patch_size, allow_...
  function get_refine_size (line 508) | def get_refine_size(
  function split_to_patches (line 532) | def split_to_patches(image, grid):
  function get_grid_placeholder (line 549) | def get_grid_placeholder(tokenizer, grid, query_num, new_schema=False):
  function reshape_by_patch (line 570) | def reshape_by_patch(image_tensor, patch_size):

FILE: windows_minicpm3.0_agent/inference.py
  function main (line 12) | def main() -> None:

FILE: windows_minicpm3.0_agent/utils.py
  function evaluate_colbert (line 4) | def evaluate_colbert( qs, ps, batch_size=128) -> torch.Tensor:
  function build_transform (line 22) | def build_transform():

FILE: windows_minicpm3.0_agent/windows_agent.py
  function text_to_image_search (line 20) | def text_to_image_search(text_description,images_path):
  function image_anwer_question (line 23) | def image_anwer_question(image_path,query):
  function fake_tool_execute (line 40) | def fake_tool_execute(toolcalls):
  function clear_cuda_variables (line 114) | def clear_cuda_variables():
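clear_cuda_variables frees GPU memory between tool calls so the search and VQA models can share one card. A minimal sketch of the usual pattern; dropping the Python references first is what actually makes the memory reclaimable:

```python
import gc
import torch

def clear_cuda():
    # collect dropped references, then return cached CUDA blocks to the driver
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
```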
Condensed preview — 190 files, each showing path, character count, and a content snippet (full structured content: 2,746K chars).
[
  {
    "path": "4G_memory_rag/langchain_demo.py",
    "chars": 11430,
    "preview": "\"\"\"\n你只需要最少6g显存(足够)的显卡就能在消费级显卡上体验流畅的rag。\n\n使用方法:\n1. 运行pull_request/rag/langchain_demo.py\n2. 上传pdf/txt文件(同一目录下可传多个)\n3. 输入问题"
  },
  {
    "path": "AIPC/Mac_feature_and_shortcut.json",
    "chars": 2661,
    "preview": "{\n    \"复制\": \"Command ⌘ + C\",\n    \"粘贴\": \"Command ⌘ + V\",\n    \"剪切\": \"Option ⌥ + Command ⌘ + V\",\n    \"打开文件\": \"Command ⌘ + O"
  },
  {
    "path": "AIPC/spider.py",
    "chars": 1526,
    "preview": "import requests\nfrom bs4 import BeautifulSoup\nimport json\n# 目标URL\nurl = 'https://liubing.me/article/mac/mac-shortcut-key"
  },
  {
    "path": "MiniCPM-o-long_video_inference/README.md",
    "chars": 1100,
    "preview": "# MiniCPM-o 长视频推理脚本说明\n\n## 1. 如何使用该脚本进行长视频推理\n\n### 环境依赖\n请先确保已安装以下依赖,可通过如下命令一键安装:\n\n```bash\npip install -r requirements.txt\n"
  },
  {
    "path": "MiniCPM-o-long_video_inference/infer.py",
    "chars": 15946,
    "preview": "import json\nimport math\nimport tempfile\nfrom datetime import datetime\n\nimport numpy as np\nimport torch\nimport librosa\nim"
  },
  {
    "path": "MiniCPM-o-long_video_inference/requirements.txt",
    "chars": 124,
    "preview": "decord==0.6.0\nlibrosa==0.9.0\nmoviepy==1.0.3\nnumpy==2.3.0\nPillow==11.2.1\nsoundfile==0.12.1\ntorch==2.5.1\ntransformers==4.5"
  },
  {
    "path": "MiniCPMV2_6_awq/modeling_minicpmv.py",
    "chars": 18460,
    "preview": "import math\nfrom typing import List, Optional\nimport json\nimport torch\nimport torchvision\nfrom threading import Thread\nf"
  },
  {
    "path": "MiniCPMV2_6_awq/quantize.py",
    "chars": 2545,
    "preview": "\n\nfrom datasets import load_dataset\nfrom awq import AutoAWQForCausalLM\nfrom transformers import AutoTokenizer\nimport os\n"
  },
  {
    "path": "OCR_Multimodal_Search/asset/README.md",
    "chars": 2136,
    "preview": "# MiniCPM_Series_Tutorial\n\n本人是openbmb负责开源社区的同学,modelbest(面壁智能)一直致力降低大模型使用门槛,提高模型知识密度,让大模型飞入千家万户。\n\n为此我写了MiniCPM和MiniCPMV的"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "OCR_Multimodal_Search/finetune/dataset.py",
    "chars": 18736,
    "preview": "import copy\nimport json\nimport logging\nimport math\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/dataset_original.py",
    "chars": 17432,
    "preview": "import copy\nimport json\nimport logging\nimport math\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/ds_config_zero2.json",
    "chars": 1229,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/ds_config_zero3.json",
    "chars": 1500,
    "preview": "\n{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/finetune.py",
    "chars": 10656,
    "preview": "import glob\nimport json\nimport logging\nimport os\nfrom dataclasses import dataclass, field\nfrom functools import partial\n"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/finetune_ds.sh",
    "chars": 1898,
    "preview": "#!/bin/bash\n\nGPUS_PER_NODE=8\nNNODES=1\nNODE_RANK=0\nMASTER_ADDR=localhost\nMASTER_PORT=6001\n\nMODEL=\"openbmb/MiniCPM-V-2_6\"\n"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/finetune_lora.sh",
    "chars": 2095,
    "preview": "#!/bin/bash\n\nGPUS_PER_NODE=8\nNNODES=1\nNODE_RANK=0\nMASTER_ADDR=localhost\nMASTER_PORT=6001\n\nMODEL=\"/root/ld/ld_model_pretr"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/readme.md",
    "chars": 14476,
    "preview": "# MiniCPM-V Finetuning\n\n\nWe offer the official scripts for easy finetuning of the pretrained **MiniCPM-V-2_6**, **MiniCP"
  },
  {
    "path": "OCR_Multimodal_Search/finetune/trainer.py",
    "chars": 14706,
    "preview": "\nimport torch\nimport torch.nn as nn\nimport deepspeed\nfrom transformers import Trainer\nfrom transformers.trainer_pt_utils"
  },
  {
    "path": "OCR_Multimodal_Search/infer/app.py",
    "chars": 4245,
    "preview": "import os\n\nimport gradio as gr\nimport torch\nfrom pdf2image import convert_from_path\nfrom PIL import Image\nfrom torch.uti"
  },
  {
    "path": "OCR_Multimodal_Search/infer/cli_demo.py",
    "chars": 3554,
    "preview": "import os\n\nimport torch\nfrom pdf2image import convert_from_path\nfrom PIL import Image\nfrom torch.utils.data import DataL"
  },
  {
    "path": "OCR_Multimodal_Search/infer/dataset.py",
    "chars": 19562,
    "preview": "import copy\nimport json\nimport logging\nimport math\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import"
  },
  {
    "path": "OCR_Multimodal_Search/infer/inference.py",
    "chars": 3049,
    "preview": "import torch\nimport typer\nfrom torch.utils.data import DataLoader\nfrom tqdm import tqdm\nfrom transformers import AutoPro"
  },
  {
    "path": "OCR_Multimodal_Search/infer/utils.py",
    "chars": 1374,
    "preview": "from torchvision import transforms\nimport torch.nn.functional as F\nimport torch\ndef evaluate_colbert( qs, ps, batch_size"
  },
  {
    "path": "OCR_VG/chat.py",
    "chars": 7891,
    "preview": "import os\nimport torch\nimport json\nfrom PIL import Image\nimport base64\nimport io\nfrom accelerate import load_checkpoint_"
  },
  {
    "path": "OCR_VG/data_demo/img_gt.json",
    "chars": 17771,
    "preview": "{\n    \"data\": {\n        \"/root/ld/ld_project/MIniCPM_Series_Tutorial/OCR_VG/data_demo/img/000001.jpg\": {\n            \"gt"
  },
  {
    "path": "OCR_VG/data_demo/train_data_demo.json",
    "chars": 1320,
    "preview": "[\n    {\n        \"id\": \"712\",\n        \"image\": \"/root/ld/ld_dataset/2022_12_SCUT-HCCDoc_Val/Chinese/img/007650.jpg\",\n    "
  },
  {
    "path": "OCR_VG/gt_test.py",
    "chars": 3471,
    "preview": "from chat import MiniCPMVChat, img2base64\nimport torch\nimport json\nimport cv2\nimport os\ntorch.manual_seed(0)\nimport re\nf"
  },
  {
    "path": "OCR_VG/merge_box.py",
    "chars": 11845,
    "preview": "import json\nfrom PIL import Image, ImageDraw\nimport time\nimport os\nfrom sklearn.cluster import KMeans\nimport numpy as np"
  },
  {
    "path": "OCR_VG/omnilmm/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "OCR_VG/omnilmm/constants.py",
    "chars": 84,
    "preview": "CONTROLLER_HEART_BEAT_EXPIRATION = 30\nWORKER_HEART_BEAT_INTERVAL = 15\n\nLOGDIR = \".\"\n"
  },
  {
    "path": "OCR_VG/omnilmm/conversation.py",
    "chars": 13290,
    "preview": "import dataclasses\nfrom enum import auto, Enum\nfrom typing import List, Tuple\n\n\nclass SeparatorStyle(Enum):\n    \"\"\"Diffe"
  },
  {
    "path": "OCR_VG/omnilmm/model/__init__.py",
    "chars": 39,
    "preview": "from .omnilmm import OmniLMMForCausalLM"
  },
  {
    "path": "OCR_VG/omnilmm/model/omnilmm.py",
    "chars": 20617,
    "preview": "\nimport gc\nimport math\nimport timm\nimport torch\nfrom torch import Tensor\nimport torch.nn as nn\nfrom torch.nn import Cros"
  },
  {
    "path": "OCR_VG/omnilmm/model/resampler.py",
    "chars": 5427,
    "preview": "# Copyright (c) Alibaba Cloud.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the roo"
  },
  {
    "path": "OCR_VG/omnilmm/model/utils.py",
    "chars": 16991,
    "preview": "from torchvision import transforms\nfrom timm.data.transforms import RandomResizedCropAndInterpolation\nfrom timm.data.con"
  },
  {
    "path": "OCR_VG/omnilmm/train/train_utils.py",
    "chars": 5722,
    "preview": "import os\nimport gc\nimport copy\nimport time\n\nimport torch\nimport warnings\nimport transformers\n\nimport numpy as np\n\nfrom "
  },
  {
    "path": "OCR_VG/omnilmm/utils.py",
    "chars": 3988,
    "preview": "import datetime\nimport logging\nimport logging.handlers\nimport os\nimport sys\n\nimport requests\n\nfrom omnilmm.constants imp"
  },
  {
    "path": "README.md",
    "chars": 8178,
    "preview": "# MiniCPM-Cookbook\n<div align=\"center\">\n<img src=\"./asset/logo.png\" width=\"500em\" ></img> \n\n本仓库是MiniCPM端侧系列模型的使用指南,包括推理、"
  },
  {
    "path": "README_application.md",
    "chars": 3176,
    "preview": "# MiniCPM_Series_Tutorial\n\n本人是openbmb负责开源社区的同学,modelbest(面壁智能)一直致力降低大模型使用门槛,提高模型知识密度,让大模型飞入千家万户。\n\n为此我写了MiniCPM和MiniCPMV的"
  },
  {
    "path": "README_en.md",
    "chars": 9352,
    "preview": "# MiniCPM_Series Cookbook\n<div align=\"center\">\n<img src=\"./asset/logo.png\" width=\"500em\" ></img> \n\nThis repository is a "
  },
  {
    "path": "agent_auto_plan/README.md",
    "chars": 482,
    "preview": "# finetune minicpmv with some pairs having image and some not\n\n1. git the code\n```bash\ngit clone https://github.com/LDLI"
  },
  {
    "path": "agent_auto_plan/autoplan/all_param_inference.py",
    "chars": 12309,
    "preview": "\"\"\"本文件是用来测试模型推理的文件\"\"\"\n\n\nimport torch\nfrom transformers import AutoModelForCausalLM,AutoTokenizer,GenerationConfig\nimport"
  },
  {
    "path": "agent_auto_plan/autoplan/bing_search.py",
    "chars": 5899,
    "preview": "from playwright.sync_api import sync_playwright\n#from searcher import *\nfrom typing import List, Dict, Tuple, Optional\n\n"
  },
  {
    "path": "agent_auto_plan/autoplan/fuctions.py",
    "chars": 4663,
    "preview": "import re\nfrom fastbm25 import fastbm25\nimport math\nfrom rank_bm25 import BM25Okapi  \nimport openai\nfrom sentence_transf"
  },
  {
    "path": "agent_auto_plan/autoplan/load_model.py",
    "chars": 2738,
    "preview": "from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig,AutoModel\nimport torch\nfrom sentence_tran"
  },
  {
    "path": "agent_auto_plan/autoplan/lora_inference_nomerge.py",
    "chars": 14436,
    "preview": "from peft import PeftModel, PeftConfig\nimport torch\nfrom transformers import AutoModelForCausalLM,AutoTokenizer,Generati"
  },
  {
    "path": "agent_auto_plan/autoplan/main.py",
    "chars": 16559,
    "preview": "\"\"\"本文件是主文件,修改48行到68行的参数,然后运行即可\"\"\"\n\nfrom prompt_plamte import (\n    task_check_template,\n    task_split_template,\n    tas"
  },
  {
    "path": "agent_auto_plan/autoplan/prompt_plamte.py",
    "chars": 9016,
    "preview": "import re\nfrom fuctions import get_tools_description,task_text_split\nfrom tools_introduction import tools\n\n\n#下面这个是根据工具进行"
  },
  {
    "path": "agent_auto_plan/autoplan/tools_introduction.py",
    "chars": 16402,
    "preview": "import json5\nimport bing_search as query_bing\nfrom fuctions import bm25,distance,task_text_split\nimport re\n\n#下面是工具的介绍和接口"
  },
  {
    "path": "agent_auto_plan/finetune_language/README.md",
    "chars": 482,
    "preview": "# finetune minicpmv with some pairs having image and some not\n\n1. git the code\n```bash\ngit clone https://github.com/LDLI"
  },
  {
    "path": "agent_auto_plan/finetune_language/dataset.py",
    "chars": 19137,
    "preview": "import copy\nimport json\nimport logging\nimport math\nimport os\nimport re\nimport random\nfrom dataclasses import dataclass, "
  },
  {
    "path": "agent_auto_plan/finetune_language/ds_config_zero2.json",
    "chars": 1230,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_"
  },
  {
    "path": "agent_auto_plan/finetune_language/ds_config_zero3.json",
    "chars": 1500,
    "preview": "\n{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial"
  },
  {
    "path": "agent_auto_plan/finetune_language/finetune.py",
    "chars": 9325,
    "preview": "import glob\nimport json\nimport logging\nimport os\nfrom dataclasses import dataclass, field\nfrom functools import partial\n"
  },
  {
    "path": "agent_auto_plan/finetune_language/finetune_ds.sh",
    "chars": 1996,
    "preview": "#!/bin/bash\n\nGPUS_PER_NODE=8\nNNODES=1\nNODE_RANK=0\nMASTER_ADDR=localhost\nMASTER_PORT=6001\n\nMODEL=\"openbmb/MiniCPM-V-2_6\"\n"
  },
  {
    "path": "agent_auto_plan/finetune_language/finetune_lora.sh",
    "chars": 2271,
    "preview": "#!/bin/bash\n\nGPUS_PER_NODE=8\nNNODES=1\nNODE_RANK=0\nMASTER_ADDR=localhost\nMASTER_PORT=6001\n\nMODEL=\"/root/ld/ld_model_pretr"
  },
  {
    "path": "agent_auto_plan/finetune_language/merge_lora.py",
    "chars": 1961,
    "preview": "from peft import PeftModel\nfrom transformers import AutoModel, AutoTokenizer\nimport os\nimport shutil\n\nmodel_type = \"/roo"
  },
  {
    "path": "agent_auto_plan/finetune_language/replace_file/modeling_minicpmv.py",
    "chars": 19534,
    "preview": "import math\nfrom typing import List, Optional\nimport json\nimport torch\nimport torchvision\n\nfrom threading import Thread\n"
  },
  {
    "path": "agent_auto_plan/qwen_vllm.py",
    "chars": 6958,
    "preview": "from vllm import LLM, SamplingParams\nimport argparse\nimport json\nimport pandas as pd\nfrom transformers import AutoTokeni"
  },
  {
    "path": "agent_auto_plan/test_plan.json",
    "chars": 506222,
    "preview": "[{\"conversations\": [{\"from\": \"user\", \"value\": \"\\nuser\\nAnswer the following questions as best you can. You have access t"
  },
  {
    "path": "agent_auto_plan/test_react.json",
    "chars": 577736,
    "preview": "[{\"conversations\": [{\"from\": \"user\", \"value\": \"\\nuser\\nAnswer the following questions as best you can. You have access t"
  },
  {
    "path": "agent_demo/agent_demo.py",
    "chars": 14384,
    "preview": "#\n# 相关材料:\n#   ReAct Prompting 原理简要介绍,不包含代码实现:\n#       https://github.com/QwenLM/Qwen-7B/blob/main/examples/react_prompt."
  },
  {
    "path": "agent_demo/build_react_prompt.py",
    "chars": 4046,
    "preview": "import json\nTOOL_DESC = \"\"\"{name_for_model}: Call this tool to interact with the {name_for_human} API. What is the {name"
  },
  {
    "path": "agent_demo/get_react_data.py",
    "chars": 14475,
    "preview": "import re\nfrom vllm import LLM, SamplingParams\nfrom build_react_prompt import build_input_text,TOOL_DESC,PROMPT_REACT,pa"
  },
  {
    "path": "agent_demo/react_qa_react.json",
    "chars": 20733,
    "preview": "[\n    {\n        \"instruction\": \"You are a helpful assistant.\",\n        \"input\": \"Answer the following questions as best "
  },
  {
    "path": "ft_language_replace_file/finetune/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "ft_language_replace_file/finetune/dataset.py",
    "chars": 16275,
    "preview": "import copy\nimport json\nimport logging\nimport math\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import"
  },
  {
    "path": "ft_language_replace_file/finetune/ds_config_zero2.json",
    "chars": 1229,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_"
  },
  {
    "path": "ft_language_replace_file/finetune/ds_config_zero3.json",
    "chars": 1500,
    "preview": "\n{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial"
  },
  {
    "path": "ft_language_replace_file/finetune/finetune.py",
    "chars": 11720,
    "preview": "import glob\nimport json\nimport logging\nimport os\nfrom dataclasses import dataclass, field\nfrom functools import partial\n"
  },
  {
    "path": "ft_language_replace_file/finetune/finetune_ds.sh",
    "chars": 1895,
    "preview": "#!/bin/bash\n\nGPUS_PER_NODE=8\nNNODES=1\nNODE_RANK=0\nMASTER_ADDR=localhost\nMASTER_PORT=6001\n\nMODEL=\"/root/ld/ld_model_pretr"
  },
  {
    "path": "ft_language_replace_file/finetune/finetune_lora.sh",
    "chars": 2047,
    "preview": "#!/bin/bash\n\nGPUS_PER_NODE=8\nNNODES=1\nNODE_RANK=0\nMASTER_ADDR=localhost\nMASTER_PORT=6001\n\nMODEL=\"/root/ld/ld_model_pretr"
  },
  {
    "path": "ft_language_replace_file/finetune/merge_lora.py",
    "chars": 937,
    "preview": "from peft import PeftModel\nimport torch\nfrom transformers import AutoModel,AutoTokenizer\nmodel_type=\"/root/ld/ld_model_p"
  },
  {
    "path": "ft_language_replace_file/finetune/only_language_web_demo.py",
    "chars": 8803,
    "preview": "#!/usr/bin/env python\n# encoding: utf-8\nimport gradio as gr\nfrom PIL import Image\nimport traceback\nimport re\nimport torc"
  },
  {
    "path": "ft_language_replace_file/finetune/readme.md",
    "chars": 4597,
    "preview": "# MiniCPM-V Finetuning\n\n\nWe offer the demo scripts for easy finetuning of the pretrained **MiniCPM-Llama3-V 2.5** on bot"
  },
  {
    "path": "ft_language_replace_file/finetune/replace_file/modeling_minicpmv.py",
    "chars": 26281,
    "preview": "import math\nfrom typing import List, Optional\nimport json\nimport torch\nimport torchvision\nfrom threading import Thread\nf"
  },
  {
    "path": "ft_language_replace_file/finetune/replace_file/resampler.py",
    "chars": 35860,
    "preview": "from functools import partial\nimport numpy as np\nimport warnings\nfrom typing import Optional, Tuple\nimport torch\nfrom to"
  },
  {
    "path": "ft_language_replace_file/finetune/trainer.py",
    "chars": 10606,
    "preview": "import torch\nimport torch.nn as nn\nimport deepspeed\nfrom transformers import Trainer\nfrom transformers.trainer_pt_utils "
  },
  {
    "path": "get_minicpmv2.6_embeding/dataset.py",
    "chars": 19936,
    "preview": "import copy\nimport json\nimport logging\nimport math\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import"
  },
  {
    "path": "get_minicpmv2.6_embeding/inference.py",
    "chars": 2399,
    "preview": "import torch\nimport typer\nfrom torch.utils.data import DataLoader\nfrom tqdm import tqdm\nfrom transformers import AutoMod"
  },
  {
    "path": "get_minicpmv2.6_embeding/modeling_minicpmv.py",
    "chars": 16089,
    "preview": "import math\nfrom typing import List, Optional\nimport json\nimport torch\nimport torchvision\n\nfrom threading import Thread\n"
  },
  {
    "path": "get_minicpmv2.6_embeding/readme.md",
    "chars": 3607,
    "preview": "### MiniCPM-V embeding Project Operation Guide\n\n#### 1. Download the Project Code\n\nFirst, you need to clone the `MiniCPM"
  },
  {
    "path": "mbti_role_play/mbti_demo.py",
    "chars": 11349,
    "preview": "from typing import List\nimport argparse\nimport gradio as gr\nimport torch\nfrom threading import Thread\nfrom PIL import Im"
  },
  {
    "path": "mbti_role_play/mbti_sft_dpo_data/get_rank_data.py",
    "chars": 4090,
    "preview": "import os\nimport json\nimport random\n# 获取mbti_sft_dpo_data目录下的所有文件和文件夹名称\npath='MIniCPM_Series_Tutorial/mbti_role_play/mbt"
  },
  {
    "path": "mbti_role_play/self_awareness/get_all_awarness_data.py",
    "chars": 687,
    "preview": "import os\nimport json\n\n# 写入MIniCPM_Series_Tutorial/mbti_role_play/self_awareness的绝对地址\ndirectory ='/root/ld/ld_project/mb"
  },
  {
    "path": "md/finetune/minicpm2.0/llama_factory.md",
    "chars": 4364,
    "preview": "\n\n# 安装LLaMA-Factory依赖\n\n首先,克隆LLaMA-Factory仓库,并安装依赖项:\n```sh\ngit clone https://github.com/hiyouga/LLaMA-Factory\ncd LLaMA-Fa"
  },
  {
    "path": "md/finetune/minicpm2.0/mlx_sft.md",
    "chars": 1718,
    "preview": "\n\n# 使用MLX训练指南(Mac推荐)\n\n## 设备要求\n- Mac OS 14以上版本\n\n## 步骤\n1. **下载LLAMA_FORMAT_MLX中的所有文件**\n   将所有文件下载到 `model_path` (路径可自定义)。\n"
  },
  {
    "path": "md/finetune/minicpm2.0/sft.md",
    "chars": 2874,
    "preview": "\n# MiniCPM 模型微调指南\n\n## 设备需求\n- 最少一张12GB显存,20系列以上显卡\n- 使用QLoRA时可尝试6-8GB显卡\n\n## 步骤\n1. **使用Git获取官方代码**\n   ```sh\n   git clone ht"
  },
  {
    "path": "md/finetune/minicpm3.0/llama_factory.md",
    "chars": 5044,
    "preview": "\n## Llama_factory\n\n使用方法与[【微调教程】MiniCPM2.0](../minicpm2.0/llama_factory.md)基本相同,仅需修改`template: cpm3`\n\n1. **首先安装Llama_fact"
  },
  {
    "path": "md/finetune/minicpm3.0/pip_list.md",
    "chars": 13785,
    "preview": "absl-py                           2.1.0\naccelerate                        0.30.1\naddict                            2.4.0"
  },
  {
    "path": "md/finetune/minicpm3.0/sft.md",
    "chars": 2769,
    "preview": "# 官方代码微调(SFT推荐)\n\n## 设备需求\n- 最少一张24G显存,20系列以上显卡\n- qlora可尝试12G显卡\n\n## 步骤\n\n### 1. 使用git获取官方代码\n```bash\ngit clone https://githu"
  },
  {
    "path": "md/finetune/minicpmv2.5/sft.md",
    "chars": 19975,
    "preview": "\n# 官方代码模型训练指南\n\n## 1. 安装依赖包\n\n首先,进入项目目录并安装依赖包:\n```sh\ncd MiniCPM-V\npip install -r requirements.txt\n```\n**注意**:推荐从DeepSpeed官"
  },
  {
    "path": "md/finetune/minicpmv2.5/swift.md",
    "chars": 1779,
    "preview": "# Swift 安装与训练指南\n\n## 安装Swift\n\n首先克隆Swift仓库:\n\n```sh\ngit clone https://github.com/modelscope/swift.git\ncd swift\npip install "
  },
  {
    "path": "md/finetune/minicpmv2.6/pip_list.md",
    "chars": 9294,
    "preview": "river Version: 535.104.05   CUDA Version: 12.2\nLinux version 5.4.0-48-generic (buildd@lcy01-amd64-010) (gcc version 9.3."
  },
  {
    "path": "md/finetune/minicpmv2.6/sft.md",
    "chars": 9400,
    "preview": "\n# MiniCPMV训练环境介绍及步骤\n\n## 1. 训练环境介绍\n笔者的训练环境为:[pip list](./pip_list.md)\n## 2. 获取MiniCPMV的GitHub代码\n通过Git克隆MiniCPMV项目到本地:\n``"
  },
  {
    "path": "md/inference/minicpm2.0/llama.cpp_android.md",
    "chars": 907,
    "preview": "# 部署llama.cpp到安卓端\n\n## 设备要求\n- 安卓手机\n- 推荐使用骁龙8系列及以上芯片的手机\n\n### 步骤1:获得ggml-model-Q4_K_M.gguf量化模型\n\n按照[部署llama.cpp到PC端](llama.c"
  },
  {
    "path": "md/inference/minicpm2.0/llama.cpp_pc.md",
    "chars": 2094,
    "preview": "# 部署llama.cpp到PC端\n\n## 支持设备\n- Linux\n- macOS\n\n### 步骤1:下载llama.cpp\n\n通过Git克隆llama.cpp仓库:\n```sh\ngit clone https://github.com/"
  },
  {
    "path": "md/inference/minicpm2.0/mlx.md",
    "chars": 1632,
    "preview": "\n# 更新macOS至13.5及以上版本\n\n为了使用`mlx-lm`进行推理,您需要将Mac设备的操作系统升级到至少13.5版本。可以通过以下步骤检查并安装更新:\n\n1. 打开mac的“设置”。\n2. 选择“通用”选项。\n3. 在“软件更新"
  },
  {
    "path": "md/inference/minicpm2.0/ollama.md",
    "chars": 507,
    "preview": "# 安装 Ollama\n\n前往Ollama的GitHub页面以获取安装指南:[https://github.com/ollama/ollama](https://github.com/ollama/ollama)\n\n## macOS\n\n下载"
  },
  {
    "path": "md/inference/minicpm2.0/powerinfer_android.md",
    "chars": 2443,
    "preview": "\n# 在安卓手机上部署PowerInfer\n\n## 设备需求\n- 安卓手机\n- 推荐使用骁龙8系列及以上芯片的手机\n\n### 步骤1:安装Termux\n\n在手机上下载并安装合适的Termux版本,推荐使用[v0.118.1版本](https"
  },
  {
    "path": "md/inference/minicpm2.0/powerinfer_pc.md",
    "chars": 1809,
    "preview": "\n# PowerInfer 简介\n\nPowerInfer是由上海交通大学开发的一个推理引擎,它可以在CPU/GPU上基于稀疏模型进行加速,据称能够获得最高达llama.cpp 11倍的推理性能。然而,PowerInfer目前仅适配了包括Mi"
  },
  {
    "path": "md/inference/minicpm2.0/transformers.md",
    "chars": 1333,
    "preview": "## 设置环境\n\n1. 在命令行中输入以下命令来克隆MiniCPM仓库:\n   ```bash\n   git clone https://github.com/OpenBMB/MiniCPM.git\n   ```\n\n2. 安装项目所需的依赖"
  },
  {
    "path": "md/inference/minicpm2.0/vllm.md",
    "chars": 1654,
    "preview": "\n### 安装vLLM\n\n首先确保安装了`vllm`库:\n\n```bash\npip install vllm\n```\n\n### Python 脚本示例\n\n接下来,在Python脚本中使用`vllm`进行文本生成:\n\n```python\nfr"
  },
  {
    "path": "md/inference/minicpm3.0/llamcpp.md",
    "chars": 1604,
    "preview": "\n## Llamacpp\n**设备:Linux,Mac**\n\n### 1. 下载llama.cpp的minicpm3分支\n```bash\ngit clone https://github.com/OpenBMB/llama.cpp.git\n"
  },
  {
    "path": "md/inference/minicpm3.0/ollama.md",
    "chars": 1046,
    "preview": "\n## Ollama\n\n1. **获取我们fork的分支代码**\n   PR暂未合并,请务必使用我们的分支\n   ```bash\n   git clone https://github.com/LDLINGLINGLING/ollama.g"
  },
  {
    "path": "md/inference/minicpm3.0/sglang.md",
    "chars": 1701,
    "preview": "```markdown\n# Sglang 安装与使用指南\n\n## 源码安装 Sglang\n\n1. 首先,从 GitHub 克隆 Sglang 项目仓库:\n\n   ```bash\n   git clone https://github.com"
  },
  {
    "path": "md/inference/minicpm3.0/transformers.md",
    "chars": 1582,
    "preview": "\n# MiniCPM 3.0 使用示例\n\n## Chat 方法\n\n下面的代码示例展示了如何使用 `transformers` 库来实现与 MiniCPM 3.0 模型的聊天功能:\n\n```python\nfrom transformers i"
  },
  {
    "path": "md/inference/minicpm3.0/vllm.md",
    "chars": 2604,
    "preview": "\n# VLLM 安装与使用指南\n\n## 安装 VLLM\n\n首先通过 Git 克隆 VLLM 仓库:\n\n```bash\ngit clone https://github.com/LDLINGLINGLING/vllm.git\ncd vllm\n"
  },
  {
    "path": "md/inference/minicpmv2.5/LMdeploy.md",
    "chars": 1012,
    "preview": "\n# LMdeploy并发推理\n\n## 设备要求\n- 单张24GB NVIDIA 20系以上显卡,或者两张以上12GB 20系以上显卡\n\n## 步骤1:安装deploy\n\n安装`deploy`库,注意不要使用源码编译:\n\n```sh\npip"
  },
  {
    "path": "md/inference/minicpmv2.5/llamacpp_pc.md",
    "chars": 1950,
    "preview": "\n# llama.cpp 在Linux或macOS上的部署\n\n## 配套视频\n- [llamacpp](https://www.bilibili.com/video/BV1tS42197NL/?spm_id_from=333.337.sea"
  },
  {
    "path": "md/inference/minicpmv2.5/ollama.md",
    "chars": 1477,
    "preview": "\n# Ollama 部署\n\n## 设备要求\n- 运行非量化版:内存超过19GB\n- 运行量化版:内存超过8GB\n\n## 步骤1:获取gguf模型\n\n按照[llama.cpp的教程](llamacpp_pc.md)获取gguf模型文件,语言模"
  },
  {
    "path": "md/inference/minicpmv2.5/swift_commandline.md",
    "chars": 1065,
    "preview": "\n# Swift 命令行推理\n\n## 设备要求\n- 所有显卡内存总共不低于24GB\n\n## 步骤1:安装Swift\n\n通过Git克隆Swift仓库,并安装依赖:\n\n```sh\ngit clone https://github.com/mod"
  },
  {
    "path": "md/inference/minicpmv2.5/swift_python.md",
    "chars": 1508,
    "preview": "# Swift 脚本推理\n\n## 设备要求\n- 多卡显存总共不超过24GB\n\n## 步骤1:安装Swift\n\n按照[swift安装教程](./swift_commandline.md)安装Swift。\n\n## 步骤2:执行脚本\n\n根据注释,"
  },
  {
    "path": "md/inference/minicpmv2.5/transformers_multi_gpu.md",
    "chars": 2229,
    "preview": "```python\nffrom PIL import Image\nimport torch\nfrom transformers import AutoConfig, AutoModel, AutoTokenizer\nfrom acceler"
  },
  {
    "path": "md/inference/minicpmv2.5/vllm.md",
    "chars": 2615,
    "preview": "\n# vLLM 部署(并发推荐)\n\n## 步骤1:获取vLLM代码\n\n通过Git克隆vLLM仓库:\n\n```sh\ngit clone https://github.com/vllm-project/vllm.git\n```\n\n## 步骤2:"
  },
  {
    "path": "md/inference/minicpmv2.5/xinference.md",
    "chars": 627,
    "preview": "\n# Xinference 本地模型部署指南\n\n## 步骤1:安装Xinference\n\n安装Xinference及其所有可选依赖:\n\n```sh\npip install \"xinference[all]\"\n```\n\n## 步骤2:运行Xi"
  },
  {
    "path": "md/inference/minicpmv2.6/llamacpp.md",
    "chars": 2326,
    "preview": "# Llama.cpp 推理\n\n## 设备要求\n\n- 运行非量化版本需要超过19GB内存\n- 运行量化版本需要超过8GB内存\n\n## 步骤1:下载依赖包\n\n使用Homebrew安装依赖包:\n\n```sh\nbrew install ffmpe"
  },
  {
    "path": "md/inference/minicpmv2.6/ollama.md",
    "chars": 2959,
    "preview": "\n# Ollama 推理\n## 设备要求\n- 运行非量化版本需要19GB以上内存\n- 运行量化版本需要8GB以上内存\n## ollama官方支持\n1. 官方已经合并我们的分支,可以直接用新版ollama\n```bash\nollama run"
  },
  {
    "path": "md/inference/minicpmv2.6/transformers_mult_gpu.md",
    "chars": 1835,
    "preview": "# transformers_mult_gpu\n```python\n#!/usr/bin/env python\n# encoding: utf-8\nimport torch\nfrom transformers import AutoMode"
  },
  {
    "path": "md/inference/minicpmv2.6/vllm.md",
    "chars": 6098,
    "preview": "\n# VLLM 推理\n\n## 笔者的pip list(awq,fp16,vllm都能跑)\n\n```plaintext\nvllm 0.5.4\ntransformers 4.44.0\ntorchvision 0.19.0\ntorch 2.4.0"
  },
  {
    "path": "md/inference/minicpmv2.6/vllm_api_server.md",
    "chars": 2857,
    "preview": "# VLLM API Server\n\n## 步骤1:使用Git下载并安装VLLM\n\n通过Git克隆VLLM仓库并安装依赖:\n\n```sh\ngit clone https://github.com/vllm-project/vllm.git\n"
  },
  {
    "path": "md/integrate/function_call.md",
    "chars": 2691,
    "preview": "\n# Function Call简易实现(Minicpm3.0)\n\n下面是一个简单的Python脚本,用于演示如何通过调用特定的函数来获取订单的配送日期。此脚本包含了一个函数`get_delivery_date`以及一个处理函数调用的函数`"
  },
  {
    "path": "md/integrate/langchain.md",
    "chars": 8799,
    "preview": "\n# RAG (Retrieval-Augmented Generation)简介\n## 代码地址与使用\n按照项目[4g显存玩转rag](https://github.com/OpenBMB/MiniCPM/blob/main/demo/m"
  },
  {
    "path": "md/integrate/openai_api.md",
    "chars": 1270,
    "preview": "# openai_api\n## 使用方法\n[项目地址](https://github.com/OpenBMB/MiniCPM/tree/main/demo/openai_api_demo)\n1. 首先获取MiniCPM官方代码\n```sh\n"
  },
  {
    "path": "md/md_en/finetune/minicpm2.0/llama_factory.md",
    "chars": 6161,
    "preview": "# Installing LLaMA-Factory Dependencies\n\nFirst, clone the LLaMA-Factory repository and install the dependencies:\n```sh\ng"
  },
  {
    "path": "md/md_en/finetune/minicpm2.0/mlx_sft.md",
    "chars": 3520,
    "preview": "# Training Guide Using MLX (Recommended for Mac)\n\n## System Requirements\n- Mac OS 14 or later\n\n## Steps\n1. **Download Al"
  },
  {
    "path": "md/md_en/finetune/minicpm2.0/sft.md",
    "chars": 3880,
    "preview": "# MiniCPM Model Fine-Tuning Guide\n\n## System Requirements\n- At least one GPU with 12GB VRAM, NVIDIA 20 series or better\n"
  },
  {
    "path": "md/md_en/finetune/minicpm3.0/llama_factory.md",
    "chars": 7590,
    "preview": "# LLaMA-Factory Usage Guide\n\nThe usage of LLaMA-Factory is largely similar to the [Fine-Tuning Tutorial for MiniCPM2.0]("
  },
  {
    "path": "md/md_en/finetune/minicpm3.0/pip_list.md",
    "chars": 13785,
    "preview": "absl-py                           2.1.0\naccelerate                        0.30.1\naddict                            2.4.0"
  },
  {
    "path": "md/md_en/finetune/minicpm3.0/sft.md",
    "chars": 3819,
    "preview": "# Official Code Fine-Tuning (SFT Recommended)\n\n## Device Requirements\n- At least one GPU with 24GB of VRAM, Series 20 or"
  },
  {
    "path": "md/md_en/finetune/minicpmv2.5/sft.md",
    "chars": 16373,
    "preview": "# Official Code Model Training Guide\n\n## 1. Install Dependencies\n\nFirst, enter the project directory and install the dep"
  },
  {
    "path": "md/md_en/finetune/minicpmv2.5/swift.md",
    "chars": 2703,
    "preview": "# Swift Installation and Training Guide\n\n## Installing Swift\n\nFirst, clone the Swift repository:\n\n```sh\ngit clone https:"
  },
  {
    "path": "md/md_en/finetune/minicpmv2.6/pip_list.md",
    "chars": 9294,
    "preview": "river Version: 535.104.05   CUDA Version: 12.2\nLinux version 5.4.0-48-generic (buildd@lcy01-amd64-010) (gcc version 9.3."
  },
  {
    "path": "md/md_en/finetune/minicpmv2.6/sft.md",
    "chars": 12170,
    "preview": "# MiniCPMV Training Environment Introduction and Steps\n\n## 1. Training Environment Overview\nThe training environment set"
  },
  {
    "path": "md/md_en/inegrate/function_call.md",
    "chars": 4378,
    "preview": "Below is a simple Python script translated into English and formatted as Markdown. This script demonstrates how to call "
  },
  {
    "path": "md/md_en/inegrate/langchain.md",
    "chars": 12098,
    "preview": "# RAG (Retrieval-Augmented Generation) Introduction\r\n## Code Location and Usage\r\nBy simply modifying the parameters at t"
  },
  {
    "path": "md/md_en/inegrate/openai_api.md",
    "chars": 1692,
    "preview": "\n# OpenAI API\n## Usage\n[Project Repository](https://github.com/OpenBMB/MiniCPM/tree/main/demo/openai_api_demo)\n\n1. First"
  },
  {
    "path": "md/md_en/inference/minicpm2.0/llama.cpp_android.md",
    "chars": 1732,
    "preview": "# Deploying llama.cpp on Android\n\n## Device Requirements\n- Android smartphone\n- It is recommended to use a phone with Sn"
  },
  {
    "path": "md/md_en/inference/minicpm2.0/llama.cpp_pc.md",
    "chars": 2837,
    "preview": "# Deploying llama.cpp on PC\n\n## Supported Devices\n- Linux\n- macOS\n\n### Step 1: Download llama.cpp\n\nClone the llama.cpp r"
  },
  {
    "path": "md/md_en/inference/minicpm2.0/mlx.md",
    "chars": 2269,
    "preview": "# Update macOS to Version 13.5 or Later\n\nTo use `mlx-lm` for inference, you need to upgrade your Mac's operating system "
  },
  {
    "path": "md/md_en/inference/minicpm2.0/ollama.md",
    "chars": 701,
    "preview": "# Installing Ollama\n\nVisit the Ollama GitHub page for installation instructions: [https://github.com/ollama/ollama](http"
  },
  {
    "path": "md/md_en/inference/minicpm2.0/powerinfer_android.md",
    "chars": 3740,
    "preview": "# Deploying PowerInfer on an Android Device\n\n## Device Requirements\n- Android smartphone\n- It is recommended to use a ph"
  },
  {
    "path": "md/md_en/inference/minicpm2.0/powerinfer_pc.md",
    "chars": 2644,
    "preview": "\n# Introduction to PowerInfer\n\nPowerInfer is an inference engine developed by Shanghai Jiao Tong University that acceler"
  },
  {
    "path": "md/md_en/inference/minicpm2.0/transformers.md",
    "chars": 1745,
    "preview": "## Setting Up the Environment\n\n1. Clone the MiniCPM repository by entering the following command in the terminal:\n   ```"
  },
  {
    "path": "md/md_en/inference/minicpm2.0/vllm.md",
    "chars": 2145,
    "preview": "\n### Install vLLM\n\nFirst, ensure that the `vllm` library is installed:\n\n```bash\npip install vllm\n```\n\n### Python Script "
  },
  {
    "path": "md/md_en/inference/minicpm3.0/llamacpp.md",
    "chars": 1994,
    "preview": "## Llamacpp\n**Device: Linux, Mac**\n\n### 1. Download the minicpm3 branch of llama.cpp\n```bash\ngit clone https://github.co"
  },
  {
    "path": "md/md_en/inference/minicpm3.0/sglang.md",
    "chars": 2126,
    "preview": "# Sglang Installation and Usage Guide\n\n## Source Code Installation of Sglang\n\n1. First, clone the Sglang project reposit"
  },
  {
    "path": "md/md_en/inference/minicpm3.0/transfomers.md",
    "chars": 2157,
    "preview": "# MiniCPM 3.0 Usage Examples\n\n## Chat Method\n\nThe following code example demonstrates how to implement a chat feature wi"
  },
  {
    "path": "md/md_en/inference/minicpm3.0/vllm.md",
    "chars": 3181,
    "preview": "# VLLM Installation and Usage Guide\n\n## Installing VLLM\n\nFirst, clone the VLLM repository via Git:\n\n```bash\ngit clone ht"
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/LMdeploy.md",
    "chars": 1403,
    "preview": "# Concurrent Inference Deployment\n\n## System Requirements\n- A single 24GB NVIDIA 20-series or higher GPU, or multiple 12"
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/llamacpp_pc.md",
    "chars": 2479,
    "preview": "# Deploying llama.cpp on Linux or macOS\n\n## Accompanying Video\n- [llamacpp](https://www.bilibili.com/video/BV1tS42197NL/"
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/ollama.md",
    "chars": 2365,
    "preview": "# Ollama Deployment\n\n## System Requirements\n- Running non-quantized version: More than 19GB of memory\n- Running quantize"
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/swift_commandline.md",
    "chars": 1639,
    "preview": "# Swift Command Line Inference\n\n## System Requirements\n- Total GPU memory across all cards must be at least 24GB\n\n## Ste"
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/swift_python.md",
    "chars": 1861,
    "preview": "# Swift Script Inference\n\n## System Requirements\n- Total GPU memory across multiple cards does not exceed 24GB\n\n## Step "
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/transformers_multi_gpu.md",
    "chars": 2856,
    "preview": "```python\nfrom PIL import Image\nimport torch\nfrom transformers import AutoConfig, AutoModel, AutoTokenizer\nfrom accelera"
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/vllm.md",
    "chars": 3113,
    "preview": "# vLLM Deployment (Recommended for Concurrency)\n\n## Step 1: Get the vLLM Code\n\nClone the vLLM repository using Git:\n\n```"
  },
  {
    "path": "md/md_en/inference/minicpmv2.5/xinference.md",
    "chars": 1060,
    "preview": "# Xinference Local Model Deployment Guide\n\n## Step 1: Install Xinference\n\nInstall Xinference along with all its optional"
  },
  {
    "path": "md/md_en/inference/minicpmv2.6/llamacpp.md",
    "chars": 3124,
    "preview": "# Llama.cpp Inference\n\n## System Requirements\n\n- Non-quantized version requires more than 19GB of memory\n- Quantized ver"
  },
  {
    "path": "md/md_en/inference/minicpmv2.6/ollama.md",
    "chars": 4437,
    "preview": "# Ollama Inference\n\n## System Requirements\n\n- Non-quantized version requires more than 19GB of memory\n- Quantized versio"
  },
  {
    "path": "md/md_en/inference/minicpmv2.6/transformers_mult_gpu.md",
    "chars": 1835,
    "preview": "# transformers_mult_gpu\n```python\n#!/usr/bin/env python\n# encoding: utf-8\nimport torch\nfrom transformers import AutoMode"
  },
  {
    "path": "md/md_en/inference/minicpmv2.6/vllm.md",
    "chars": 7045,
    "preview": "# VLLM Inference\n\n## Author's pip list (supports awq, fp16, and vllm)\n\n```plaintext\nvllm 0.5.4\ntransformers 4.44.0\ntorch"
  },
  {
    "path": "md/md_en/inference/minicpmv2.6/vllm_api_server.md",
    "chars": 3744,
    "preview": "# VLLM API Server\n\n## Step 1: Download and Install VLLM Using Git\n\nClone the VLLM repository and install dependencies:\n\n"
  },
  {
    "path": "md/md_en/quantize/minicpm2.0/awq.md",
    "chars": 2812,
    "preview": "# MiniCPM Model Quantization Guide - Using AutoAWQ\n\nTo perform quantization of the MiniCPM model, you need to follow the"
  },
  {
    "path": "md/md_en/quantize/minicpm2.0/bnb.md",
    "chars": 2036,
    "preview": "# MiniCPM Model Quantization Guide - Using bitsandbytes (BNB)\n\nTo perform quantization of the MiniCPM model, you need to"
  },
  {
    "path": "md/md_en/quantize/minicpm2.0/gptq.md",
    "chars": 1434,
    "preview": "# MiniCPM Model Quantization Guide - Using AutoGPTQ\n\nTo perform quantization of the MiniCPM model, you need to follow th"
  },
  {
    "path": "md/md_en/quantize/minicpm3.0/awq.md",
    "chars": 3059,
    "preview": "## AutoAWQ (Slightly Lower Speed)\n\n### Device Requirements\nAt least one NVIDIA 20-series or higher GPU is required, with"
  },
  {
    "path": "md/md_en/quantize/minicpm3.0/bnb.md",
    "chars": 1741,
    "preview": "## BNB Quantization\n\n### Device Requirements\n- At least one Nvidia 20-series or higher GPU;\n- Sufficient VRAM to load th"
  },
  {
    "path": "md/md_en/quantize/minicpm3.0/gptq.md",
    "chars": 1213,
    "preview": "## AutoGPTQ\r\n\r\n### Device Requirements\r\nAt least one NVIDIA 20-series or higher GPU with more than 12GB of VRAM is requi"
  },
  {
    "path": "md/md_en/quantize/minicpmv2.5/bnb.md",
    "chars": 4393,
    "preview": "# BNB Quantization\n## Quantization Script\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer, Bits"
  },
  {
    "path": "md/md_en/quantize/minicpmv2.6/awq.md",
    "chars": 2157,
    "preview": "# AutoAWQ Model Quantization Deployment Guide\n\n## Method 1 (Recommended)\n\n### 1. Directly Download the Pre-Quantized Mod"
  },
  {
    "path": "md/md_en/quantize/minicpmv2.6/bnb.md",
    "chars": 2531,
    "preview": "# bitsandbytes Quantization Script\nModify the following `model_path`, `save_path`, and ensure you have a GPU capable of "
  },
  {
    "path": "md/quantize/minicpm2.0/awq.md",
    "chars": 1729,
    "preview": "\n# MiniCPM 模型量化指南 - 使用AutoAWQ\n\n为了执行MiniCPM模型的量化,您需要遵循以下步骤,并确保您的设备满足以下要求:\n- 至少存在一张Nvidia 20系以上的显卡;\n- 量化2b需要6GB显存;\n- 量化1b需"
  },
  {
    "path": "md/quantize/minicpm2.0/bnb.md",
    "chars": 1222,
    "preview": "\n# MiniCPM 模型量化指南 - 使用bitsandbytes (BNB)\n\n为了执行MiniCPM模型的量化,您需要遵循以下步骤,并确保您的设备满足以下要求:\n- 至少存在一张Nvidia 20系以上的显卡;\n- 显存需足够加载模型"
  },
  {
    "path": "md/quantize/minicpm2.0/gptq.md",
    "chars": 797,
    "preview": "\n# MiniCPM 模型量化指南 - 使用AutoGPTQ\n\n为了执行MiniCPM模型的量化,您需要遵循以下步骤,并确保您的设备满足以下要求:\n- 至少存在一张Nvidia 20系以上的显卡;\n- 量化2b需要6GB显存;\n- 量化1b"
  },
  {
    "path": "md/quantize/minicpm3.0/awq.md",
    "chars": 2285,
    "preview": "\n## AutoAWQ(速度略低)\n\n### 设备要求\n至少存在一张NVIDIA 20系以上显卡,量化4B需要12GB显存。\n\n1. **获取MiniCPM开源代码**\n   ```bash\n   git clone https://git"
  },
  {
    "path": "md/quantize/minicpm3.0/bnb.md",
    "chars": 1123,
    "preview": "\n## BNB量化\n\n### 设备要求\n20以上显卡,显存能够加载模型。\n\n1. **安装bitsandbytes**\n   ```bash\n   pip install bitsandbytes\n   ```\n\n2. **修改量化脚本参数"
  },
  {
    "path": "md/quantize/minicpm3.0/gptq.md",
    "chars": 813,
    "preview": "## AutoGPTQ\n\n### 设备要求\n需要至少一个NVIDIA 20系列或更高版本的GPU,并且具有超过12GB的显存。\n\n### 方法1:直接获取量化后的GPTQ权重(推荐)\n```bash\ngit clone https://hu"
  },
  {
    "path": "md/quantize/minicpmv2.5/bnb.md",
    "chars": 2580,
    "preview": "# bnb量化\n## 量化脚本\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig\nfrom PIL im"
  },
  {
    "path": "md/quantize/minicpmv2.6/awq.md",
    "chars": 1501,
    "preview": "\n# AutoAWQ量化模型部署指南\n\n## 方法1(推荐)\n\n### 1. 直接下载量化好的模型\n\n通过Git克隆已量化的模型仓库:\n\n```bash\ngit clone https://www.modelscope.cn/models/"
  },
  {
    "path": "md/quantize/minicpmv2.6/bnb.md",
    "chars": 1851,
    "preview": "# bitsandbytes量化脚本\n修改以下model_path和save_path,以及需要能够加载未量化模型的显卡,大约17G显存\n```python\nimport torch\nfrom transformers import Aut"
  },
  {
    "path": "windows_minicpm3.0_agent/app.py",
    "chars": 4278,
    "preview": "import os\n\nimport gradio as gr\nimport torch\nfrom pdf2image import convert_from_path\nfrom PIL import Image\nfrom torch.uti"
  },
  {
    "path": "windows_minicpm3.0_agent/cli_demo.py",
    "chars": 3583,
    "preview": "import os\n\nimport torch\nfrom pdf2image import convert_from_path\nfrom PIL import Image\nfrom torch.utils.data import DataL"
  },
  {
    "path": "windows_minicpm3.0_agent/dataset.py",
    "chars": 19561,
    "preview": "import copy\nimport json\nimport logging\nimport math\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import"
  },
  {
    "path": "windows_minicpm3.0_agent/get_reponse.py",
    "chars": 484,
    "preview": "import requests\r\nimport json\r\nimport base64\r\nwith open(r'D:\\model_best\\minicpm\\infer\\images\\1564165156.jpg', 'rb') as im"
  },
  {
    "path": "windows_minicpm3.0_agent/inference.py",
    "chars": 3049,
    "preview": "import torch\nimport typer\nfrom torch.utils.data import DataLoader\nfrom tqdm import tqdm\nfrom transformers import AutoPro"
  },
  {
    "path": "windows_minicpm3.0_agent/utils.py",
    "chars": 1374,
    "preview": "from torchvision import transforms\nimport torch.nn.functional as F\nimport torch\ndef evaluate_colbert( qs, ps, batch_size"
  },
  {
    "path": "windows_minicpm3.0_agent/windows_agent.py",
    "chars": 5828,
    "preview": "#!/usr/bin/env python\n# encoding: utf-8\nimport re\nfrom transformers import AutoTokenizer,AutoModelForCausalLM\nfrom cli_d"
  }
]

// ... and 2 more files (download for full content)
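
The guides indexed above appear only as truncated previews; the sketches below illustrate, in Python, the main workflows they reference. Each is a minimal example under stated assumptions, not the cookbook's exact code. First, `md/inference/minicpm2.0/mlx.md` covers `mlx-lm` inference on a Mac; a sketch of that flow, assuming the `mlx_lm` `load`/`generate` API and a placeholder checkpoint id:

```python
# Minimal mlx-lm inference sketch (macOS 13.5+, Apple Silicon).
from mlx_lm import load, generate

# Placeholder model id; point this at an MLX-converted MiniCPM checkpoint.
model, tokenizer = load("mlx-community/MiniCPM-2B-sft-bf16")

text = generate(model, tokenizer,
                prompt="Introduce MiniCPM in one sentence.",
                max_tokens=128)
print(text)
```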
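
`md/inference/minicpm2.0/vllm.md` and `md/inference/minicpm3.0/vllm.md` walk through offline text generation with vLLM. A sketch of vLLM's offline `LLM` API; the model id is an assumption:

```python
from vllm import LLM, SamplingParams

# trust_remote_code is needed because MiniCPM ships custom modeling code.
llm = LLM(model="openbmb/MiniCPM-2B-sft-bf16", trust_remote_code=True)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["Which is the highest mountain in Shandong province?"], params)
for out in outputs:
    print(out.outputs[0].text)
```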
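
The various `ollama.md` entries end at an `ollama run ...` command. Once a model is being served, it can also be queried over Ollama's local REST API; a sketch assuming the default port and a placeholder model tag:

```python
import requests

# Ollama listens on port 11434 by default; "minicpm-v" is a placeholder tag.
# Use whatever name `ollama run` / `ollama create` registered.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "minicpm-v", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```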
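
`md/inference/minicpm3.0/transformers.md` shows chatting with MiniCPM 3.0 through `transformers`. A generic sketch of that flow using `apply_chat_template`, not necessarily the cookbook's exact snippet; the checkpoint id is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "openbmb/MiniCPM3-4B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Recommend five sights in Beijing."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```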
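
`md/inference/minicpmv2.5/LMdeploy.md` covers concurrent inference with lmdeploy. A sketch of lmdeploy's `pipeline` API for a vision-language model; the model id and image URL are placeholders:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Placeholder model id; point this at the MiniCPM-V checkpoint you deploy.
pipe = pipeline("openbmb/MiniCPM-Llama3-V-2_5")

image = load_image("https://example.com/demo.jpg")  # placeholder URL
response = pipe(("Describe this image.", image))
print(response.text)
```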
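
The `llamacpp` entries drive the llama.cpp binaries directly. As an alternative illustration of the same gguf workflow from Python, a sketch using the separate `llama-cpp-python` bindings (not part of the cookbook); the gguf path is a placeholder:

```python
from llama_cpp import Llama

# Placeholder path: any MiniCPM gguf produced by the llama.cpp conversion step.
llm = Llama(model_path="./MiniCPM-2B-q4_k_m.gguf", n_ctx=2048)

out = llm("Q: What is MiniCPM? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```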
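
`md/integrate/function_call.md` demonstrates function calling with a `get_delivery_date` helper and a handler that dispatches the model's tool call. A self-contained sketch of that pattern; the lookup logic is invented for illustration:

```python
import json
from datetime import date, timedelta

def get_delivery_date(order_id: str) -> str:
    """Hypothetical lookup: pretend every order ships three days from now."""
    return (date.today() + timedelta(days=3)).isoformat()

TOOLS = {"get_delivery_date": get_delivery_date}

def handle_function_call(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and dispatch it."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

# The model would emit something like this after seeing the tool schema:
print(handle_function_call(
    '{"name": "get_delivery_date", "arguments": {"order_id": "A123"}}'
))
```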
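
`md/integrate/langchain.md` describes the 4GB-VRAM RAG demo, which retrieves passages and feeds them to MiniCPM. A framework-free sketch of the retrieve-then-prompt core, using TF-IDF in place of the demo's embedding model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "MiniCPM is a family of small language models from OpenBMB.",
    "RAG retrieves relevant passages and adds them to the prompt.",
    "llama.cpp runs gguf models on CPU.",
]
query = "What does RAG do?"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)
scores = cosine_similarity(vec.transform([query]), doc_matrix)[0]
context = docs[scores.argmax()]  # best-matching passage

prompt = f"Answer using the context.\nContext: {context}\nQuestion: {query}\nAnswer:"
print(prompt)  # feed this prompt to any of the inference backends above
```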
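
`md/integrate/openai_api.md` and the `vllm_api_server.md` entries both expose an OpenAI-compatible endpoint. A client sketch; the base URL, key, and served model name are placeholders that must match how the server was launched:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="MiniCPM3-4B",  # must match the server's registered model name
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```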
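
The `awq.md` guides quantize MiniCPM with AutoAWQ. A sketch of AutoAWQ's standard quantize-and-save flow; the paths and the 4-bit GEMM config are typical defaults, not the cookbook's exact settings:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "openbmb/MiniCPM-2B-sft-bf16"  # assumed source checkpoint
quant_path = "minicpm-2b-awq"               # output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```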
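
The `bnb.md` guides quantize on load with bitsandbytes rather than rewriting weights offline. A sketch of the `BitsAndBytesConfig` flow; NF4 with fp16 compute is a common choice, not necessarily the cookbook's, and saving 4-bit weights needs a recent bitsandbytes/transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

path = "openbmb/MiniCPM-2B-sft-bf16"  # assumed checkpoint id
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    path, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model.save_pretrained("minicpm-2b-bnb-4bit")  # persist the quantized weights
```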
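
The `gptq.md` guides either download pre-quantized GPTQ weights or quantize locally with AutoGPTQ. A local-quantization sketch; the paths are placeholders, and a real run needs a few hundred calibration samples rather than the single one shown:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

path = "openbmb/MiniCPM-2B-sft-bf16"  # assumed source checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(path, quantize_config,
                                            trust_remote_code=True)

# One calibration sample shown for shape only; use a few hundred in practice.
examples = [tokenizer("MiniCPM is a family of small language models.",
                      return_tensors="pt")]
model.quantize(examples)
model.save_quantized("minicpm-2b-gptq")
```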

