Repository: infinigence/Infini-Megrez-Omni Branch: main Commit: 031e8160e384 Files: 21 Total size: 110.6 KB Directory structure: gitextract_nm6z7f7i/ ├── LICENSE ├── README.md ├── README_zh.md ├── data/ │ └── train/ │ └── records.jsonl ├── example_chat_hf.py ├── finetune/ │ ├── dataset.py │ ├── ds_config_zero2.json │ ├── finetune.py │ ├── finetune.sh │ ├── requirements.txt │ └── trainer.py ├── gradio_app.py ├── requirements.txt └── vllm_demo/ ├── example_infer_vllm.py ├── megrezo.py ├── requirements.txt ├── try_minicpm_v.py ├── try_qwen_vl.py ├── vllm_profling.py ├── vllm_profling_minicpm.py └── vllm_profling_qwen.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. 
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. 
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) 
The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright 2024 OpenBMB Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ================================================ FILE: README.md ================================================
# Megrez-3B-Omni: The First Open-Source End-Side Full Modality Understanding Model

📄 Paper   |   🤗 Huggingface   |   🤖 Modelscope   |   🖥️ Demo   |   📖 WeChat Official   |   💬 WeChat Groups

[中文](./README_zh.md) | English
## Introduction **Megrez-3B-Omni** is an on-device multimodal understanding LLM developed by [**Infinigence AI**](https://cloud.infini-ai.com/platform/ai). It extends the Megrez-3B-Instruct model and supports analysis of image, text, and audio modalities. The model achieves state-of-the-art accuracy in all three domains: - Image Understanding: Using SigLip-400M to construct image tokens, Megrez-3B-Omni outperforms models with more parameters such as LLaVA-NeXT-Yi-34B. It is among the strongest image understanding models on multiple mainstream benchmarks, including MME, MMMU, and OCRBench, and performs well on tasks such as scene understanding and OCR. - Language Understanding: Megrez-3B-Omni retains text understanding capability without significant trade-offs. Compared to its single-modal counterpart (Megrez-3B-Instruct), the accuracy variation is less than 2%, and it maintains state-of-the-art performance on benchmarks such as C-EVAL, MMLU/MMLU Pro, and AlignBench, still outperforming previous-generation 14B models. - Speech Understanding: Equipped with the encoder head of Qwen2-Audio/whisper-large-v3, the model supports both Chinese and English speech input, multi-turn conversations, and voice questions about input images. It can respond directly to voice commands with text and achieves leading results across multiple benchmarks. ## Evaluation - The left image compares the performance of Megrez-3B-Omni with other open-source models on mainstream image multimodal tasks. - The right image shows the performance of Megrez-3B-Omni on the OpenCompass test set; image source: [InternVL 2.5 Blog Post](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/). You can find detailed accuracy metrics on the [Megrez-3B-Omni-HF](https://huggingface.co/Infinigence/Megrez-3B-Omni) page.
*Left: Comparison of Image Understanding Capabilities. Right: OpenCompass Benchmark Performance.*
### Inference Speed | | image_tokens | prefill (tokens/s) | decode (tokens/s) | |----------------|:------------:|:------------------:|:-----------------:| | Megrez-3B-Omni | 448 | 6312.66 | 1294.9 | | Qwen2-VL-2B | 1378 | 7349.39 | 685.66 | | MiniCPM-V-2_6 | 448 | 2167.09 | 452.51 | Setup: - The testing environment utilizes an NVIDIA H100 GPU with vLLM. Each test includes 128 text tokens and a 720×1480 image as input, producing 128 output tokens, with `num_seqs` fixed at 8. - Under this setup, the decoding speed of Qwen2-VL-2B is slower than Megrez-3B-Omni, despite having a smaller base LLM. This is due to the larger number of image tokens generated when encoding images of the specified size, which impacts actual inference speed. ## Model Demo 【GIF】 ## Install Install runtime dependencies with the following command: ```shell pip install -r requirements.txt ``` The audio-related functionality relies on **FFmpeg** for audio processing. If you are using a Debian or Debian-based system, you can install FFmpeg with the following command: ```bash sudo apt-get install ffmpeg ``` For other operating systems, please refer to the [official FFmpeg documentation](https://ffmpeg.org/download.html) for installation instructions. ## Inference ### Conversation with Multimodal Data You can use the following script to chat with our model. Note that you should replace `PATH_TO_PRETRAINED_MODEL` with the path to the downloaded model checkpoint. ```python import torch from transformers import AutoModelForCausalLM path = "{{PATH_TO_PRETRAINED_MODEL}}" # Change this to the path of the model. model = ( AutoModelForCausalLM.from_pretrained( path, trust_remote_code=True, torch_dtype=torch.bfloat16, ) .eval() .cuda() ) messages = [ { "role": "user", "content": { "text": "Please describe the content of the image.", "image": "./data/sample_image.jpg", }, }, ] MAX_NEW_TOKENS = 100 response = model.chat( messages, sampling=False, max_new_tokens=MAX_NEW_TOKENS, ) print(response) ``` You can also find a complete script in [example_chat_hf.py](example_chat_hf.py). ### Inference with vLLM We provide a reference implementation of inference with vLLM framework. You can find the model definition in [vllm_demo/megrezo.py](vllm_demo/megrezo.py). 1. Install vLLM ```shell pip install vllm==0.6.3.post1 flash_attn==2.5.8 xformers==0.0.27.post2 ``` **Note**: To use vLLM for inference, it is essential to install specific versions of the dependencies. Other versions may lead to interface incompatibility risks. If you encounter any issues, feel free to [open an issue](https://github.com/infinigence/Infini-Megrez-Omni/issues/new). 2. Run the inference script Since vLLM does not officially support MegrezO yet, you need to import the module first: ```python from vllm import ModelRegistry from megrezo import MegrezOModel ModelRegistry.register_model("MegrezO", MegrezOModel) ``` Then, you can run inference with the following code: ```python from PIL import Image from vllm import LLM from vllm import SamplingParams # Load the model. model_path = "{{PATH_TO_HF_PRETRAINED_MODEL}}" # Change this to the path of the model. llm = LLM( model_path, trust_remote_code=True, gpu_memory_utilization=0.5, ) sampling_params = SamplingParams( temperature=0, max_tokens=1000, repetition_penalty=1.2, stop=["<|turn_end|>", "<|eos|>"], ) img = Image.open("../data/sample_image.jpg") conversation = [ { "role": "user", "content": { "text": "图片的内容是什么?", "image": img, }, }, ] # Convert the conversation to vLLM acceptable format. 
prompt = llm.get_tokenizer().apply_chat_template( conversation, tokenize=False, add_generation_prompt=True, ) vllm_inputs = [ { "prompt": prompt, "multi_modal_data": { "image": img, }, } ] # Generate the outputs. outputs = llm.generate( vllm_inputs, sampling_params, ) # Print the outputs. for output in outputs: print(output.outputs[0].text) ``` You can find a complete script in [vllm_demo/example_infer_vllm.py](vllm_demo/example_infer_vllm.py). ## Chat with MegrezO using Gradio We provide online and local demos powered by Hugging Face Gradio. ### WebUI Demonstration
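The local WebUI (see the Local WebUI Demo section below) streams partial responses by running `model.chat` in a background thread and reading incremental text from a `TextIteratorStreamer`, as implemented in [gradio_app.py](gradio_app.py). The snippet below is a minimal sketch of that streaming pattern, assuming the keyword-style `model.chat(input_msgs=..., streamer=...)` call used by the app; it is illustrative rather than a drop-in replacement for the demo.

```python
import threading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

path = "{{PATH_TO_PRETRAINED_MODEL}}"  # Change this to the path of the model.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = (
    AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True, torch_dtype=torch.bfloat16)
    .eval()
    .cuda()
)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [
    {
        "role": "user",
        "content": {
            "text": "Please describe the content of the image.",
            "image": "./data/sample_image.jpg",
        },
    },
]

# model.chat blocks until generation finishes, so run it in a worker thread
# and consume the streamer in the foreground to print tokens as they arrive.
worker = threading.Thread(
    target=model.chat,
    kwargs=dict(input_msgs=messages, sampling=True, streamer=streamer, max_new_tokens=1024),
)
worker.start()

for chunk in streamer:
    print(chunk, end="", flush=True)
worker.join()
```

The full app additionally converts the Gradio chat history into this message format (`history2messages` in `gradio_app.py`) and exposes the sampling parameters as sliders.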
### Online Demo Please try out our online demo here: [🤗Megrez-3B-Omni](https://huggingface.co/spaces/Infinigence/Megrez-3B-Omni) ### Local WebUI Demo You can easily deploy your own local WebUI to chat with MegrezO using Gradio. 1. Install dependencies: ```shell pip install -r requirements.txt ``` 2. Launch the Gradio app. You need to specify the `model_path` and `port` on the command line. The `model_path` is the path to the model checkpoint, and the `port` is the port number for the local server. By default, the `port` is `7860`. ```shell python gradio_app.py --model_path {model_path} --port {port} ``` Then, you can visit `http://localhost:7860` in your browser to interact with the model. Feel free to modify `gradio_app.py` to customize the input and output interfaces. For more information, please refer to the [Gradio documentation](https://gradio.app/docs). ## Fine-Tuning the Model We provide a [fine-tuning example](./finetune/) based on [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [accelerate](https://github.com/huggingface/accelerate). ### Data Preparation We have constructed a sample dataset based on the [ALLaVA-4V/allava_laion](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V/tree/main/allava_laion) dataset: - **Dialogue**: [data/train/records.jsonl](./data/train/records.jsonl) - **Images**: [data/train/images](./data/train/images) - **Audio**: [data/train/audio](./data/train/audio), created by converting the dialogue text into speech with TTS. You can also prepare your own dataset in the same format. ### Dependencies Installation Install the required dependencies with the following command: ```bash pip install deepspeed accelerate ``` ### Full-Parameter Fine-Tuning To run the fine-tuning example, execute the following commands. Be sure to replace the model path in the script with the path to your downloaded model. ```bash cd finetune sh finetune.sh ``` You can choose which modules to fine-tune via the parameters `tune_vision_encoder`, `tune_vision_proj`, `tune_llm`, `tune_audio_encoder`, and `tune_audio_proj`. ### Notes 1. **Recommended Hardware**: Please use at least two GPUs with 80GB of memory for fine-tuning. 2. **If GPU memory is insufficient**: - Adjust the `model_max_length` and `per_device_train_batch_size` parameters. - Disable fine-tuning of specific modules to reduce memory usage. - Tune the `zero_optimization` parameters in the DeepSpeed config to optimize memory consumption. 3. **For better inference results**: - We recommend placing images in the first round of the conversation; there is no such restriction for audio and text, which can be used freely in any round. - For Automatic Speech Recognition (ASR), simply set `content['text']` to "Convert speech to text." and pass the audio in `content['audio']`. - In OCR scenarios, enabling sampling may introduce language-model hallucinations that alter the recognized text, so consider disabling sampling at inference time (`sampling=False`). Note that disabling sampling may in turn cause repetitive output.
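For example, a minimal ASR request reuses `model.chat` from the inference example above and pairs the fixed instruction text with an audio file; the audio path follows the sample path used in [example_chat_hf.py](example_chat_hf.py) and should be replaced with your own recording.

```python
# Assumes `model` is loaded as in the "Conversation with Multimodal Data" example above.
messages = [
    {
        "role": "user",
        "content": {
            "text": "Convert speech to text.",
            "audio": "./data/sample_audio.m4a",  # replace with your own audio file
        },
    },
]

response = model.chat(messages, sampling=False, max_new_tokens=100)
print(response)
```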
## Open Source License and Usage Statement - **License**: The code in this repository is open-sourced under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license. - **Hallucination**: Large models inherently have hallucination issues. Users should not completely trust the content generated by the model. - **Values and Safety**: While we have made every effort to ensure compliance of the data used during training, the large volume and complexity of the data may still lead to unforeseen issues. We disclaim any liability for problems arising from the use of this open-source model, including but not limited to data security issues, public opinion risks, or risks and problems caused by misleading, misuse, propagation, or improper utilization of the model. ================================================ FILE: README_zh.md ================================================
# Megrez-3B-Omni: 首个端侧全模态理解开源模型

📄 Paper   |   🤗 Huggingface   |   🤖 Modelscope   |   🖥️ Demo   |   📖 WeChat Official   |   💬 WeChat Groups

中文 | [English](./README.md)
## 模型简介 Megrez-3B-Omni是由无问芯穹([Infinigence AI](https://cloud.infini-ai.com/platform/ai))研发的**端侧全模态**理解模型,基于无问大语言模型Megrez-3B-Instruct扩展,同时具备图片、文本、音频三种模态数据的理解分析能力,在三个方面均取得最优精度 - 在图像理解方面,基于SigLip-400M构建图像Token,在OpenCompass榜单上(综合8个主流多模态评测基准)平均得分66.2,超越LLaVA-NeXT-Yi-34B等更大参数规模的模型。Megrez-3B-Omni也是在MME、MMMU、OCRBench等测试集上目前精度最高的图像理解模型之一,在场景理解、OCR等方面具有良好表现。 - 在语言理解方面,Megrez-3B-Omni并未牺牲模型的文本处理能力,综合能力较单模态版本(Megrez-3B-Instruct)精度变化小于2%,保持在C-EVAL、MMLU/MMLU Pro、AlignBench等多个测试集上的最优精度优势,依然取得超越上一代14B模型的能力表现 - 在语音理解方面,采用Qwen2-Audio/whisper-large-v3的Encoder作为语音输入,支持中英文语音输入及多轮对话,支持对输入图片的语音提问,根据语音指令直接响应文本,在多项基准任务上取得了领先的结果 ## 评测结果 - 左图为Megrez-3B-Omni与其他开源模型在主流图片多模态任务上的性能比较 - 右图为Megrez-3B-Omni在OpenCompass测试集上表现,图片引用自: [InternVL 2.5 Blog Post](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/)*
详细精度见 [Megrez-3B-Omni-HF](https://huggingface.co/Infinigence/Megrez-3B-Omni) ### 推理速度 | | image_tokens | prefill (tokens/s) | decode (tokens/s) | |----------------|:------------:|:------------------:|:-----------------:| | Megrez-3B-Omni | 448 | 6312.66 | 1294.9 | | Qwen2-VL-2B | 1378 | 7349.39 | 685.66 | | MiniCPM-V-2_6 | 448 | 2167.09 | 452.51 | 实验设置: - 测试环境为NVIDIA H100下VLLM下输入128个Text token和一张 720*1480的图片,输出128个token,num_seqs固定为8。 - Qwen2-VL-2B的在此实验下的decode速度小于Megrez-3B-Omni,虽然其具备更小的基座LLM,但是编码上述大小图片后的image_token相较Megrez-3B-Omni较多,影响实际推理速度。 ## 模型演示 【GIF】 ## 安装 使用如下命令安装依赖: ```shell pip install -r requirements.txt ``` 音频功能依赖ffmpeg进行音频处理,如果您使用 Debian 相关的系统,可以通过以下命令安装: ```shell sudo apt-get install ffmpeg ``` 对于其他的操作系统,请参考 [ffmpeg 官方文档](https://ffmpeg.org/download.html) 进行安装。 ## 模型推理 ### 使用多模态数据进行多轮对话 请使用如下脚本进行推理。请将 `PATH_TO_PRETRAINED_MODEL` 替换为下载的模型权重的路径。 ```python import torch from transformers import AutoModelForCausalLM path = "{{PATH_TO_PRETRAINED_MODEL}}" # 更改为模型的路径 model = ( AutoModelForCausalLM.from_pretrained( path, trust_remote_code=True, torch_dtype=torch.bfloat16, ) .eval() .cuda() ) messages = [ { "role": "user", "content": { "text": "Please describe the content of the image.", "image": "./data/sample_image.jpg", }, }, ] MAX_NEW_TOKENS = 100 response = model.chat( messages, sampling=False, max_new_tokens=MAX_NEW_TOKENS, ) print(response) ``` 完整的示例见:[example_chat_hf.py](example_chat_hf.py). ### 使用 vLLM 进行推理 我们提供了一个基于 vLLM 框架的推理参考实现。您可以在 [vllm_demo/megrezo.py](vllm_demo/megrezo.py) 中找到模型定义。 推理步骤如下: 1. 安装 vLLM ```shell pip install vllm==0.6.3.post1 flash_attn==2.5.8 xformers==0.0.27.post2 ``` **注意**:使用 vLLM 推理需要安装特定版本的依赖,其他版本可能存在接口不一致的风险。有任何问题欢迎[提issue](https://github.com/infinigence/Infini-Megrez-Omni/issues/new)。 2. 运行推理脚本 vLLM 尚未正式支持 MegrezO,因此您需要先导入我们定义的模块: ```python from vllm import ModelRegistry from megrezo import MegrezOModel ModelRegistry.register_model("MegrezO", MegrezOModel) ``` 然后,您可以使用以下代码运行推理: ```python from PIL import Image from vllm import LLM from vllm import SamplingParams model_path = "{{PATH_TO_HF_PRETRAINED_MODEL}}" # 更改为模型的路径 llm = LLM( model_path, trust_remote_code=True, gpu_memory_utilization=0.5, ) sampling_params = SamplingParams( temperature=0, max_tokens=1000, repetition_penalty=1.2, stop=["<|turn_end|>", "<|eos|>"], ) img = Image.open("../data/sample_image.jpg") conversation = [ { "role": "user", "content": { "text": "图片的内容是什么?", "image": img, }, }, ] # 将对话转换为 vLLM 可接受的格式。 prompt = llm.get_tokenizer().apply_chat_template( conversation, tokenize=False, add_generation_prompt=True, ) vllm_inputs = [ { "prompt": prompt, "multi_modal_data": { "image": img, }, } ] # 生成输出 outputs = llm.generate( vllm_inputs, sampling_params, ) # 打印输出 for output in outputs: print(output.outputs[0].text) ``` 完整的示例见:[vllm_demo/example_infer_vllm.py](vllm_demo/example_infer_vllm.py). ## 使用 Gradio 与 MegrezO 对话 我们提供基于 Hugging Face Gradio 实现的在线和本地 Demo。 ### WeiUI 演示
### 在线 Demo 欢迎试用在线 Demo: [🤗Megrez-3B-Omni](https://huggingface.co/spaces/Infinigence/Megrez-3B-Omni)。 ### 本地 Demo 使用如下命令部署本地 Gradio 应用: 1. 安装依赖: ```shell pip install -r requirements.txt ``` 2. 启动 Gradio 应用 您需要在命令行中指定 `model_path` 和 `port`。`model_path` 是模型的路径,`port` 是本地服务器的端口号。默认情况下,`port` 是 `7860`。 ```shell python gradio_app.py --model_path {model_path} --port {port} ``` 然后,您可以在浏览器中访问 `http://localhost:7860` 与模型对话。 如需自定义输入和输出接口,请修改 `gradio_app.py`。更多信息请参考 [Gradio 文档](https://gradio.app/docs)。 ## 微调模型 我们提供了一个基于 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [accelerate](https://github.com/huggingface/accelerate) 的[微调示例](./finetune/)。 ### 数据准备 我们基于[ALLaVA-4V/allava_laion](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V/tree/main/allava_laion)构造了一个示例数据集: - **对话**:[data/train/records.jsonl](./data/train/records.jsonl) - **图片**:[data/train/images](./data/train/images) - **音频**:[data/train/audio](./data/train/audio),是通过将对话中的文本使用TTS转换为语音得到的。 您也可以按照上述格式准备自己的数据集。 ### 依赖安装 ```shell pip install deepspeed accelerate ``` ### 全参微调 使用如下命令运行我们的微调示例,请注意将脚本中的模型路径替换成您下载的模型路径。 ```shell cd finetune sh finetune.sh ``` 您可以通过设置`tune_vision_encoder`、`tune_vision_proj`、`tune_llm`、`tune_audio_encoder`、`tune_audio_proj`来选择需要微调的模块。 ### 注意事项 - 推荐使用至少2张拥有80G显存的GPU进行微调。 - 在显存不足的情况下: - 请尝试调整`model_max_length`和`per_device_train_batch_size`。 - 请尝试关闭需要微调的模块以便减少显存占用。 - 请尝试调整deepspeed的`zero_optimization`参数来优化显存使用。 - 使用时 - 请将图片尽量在首轮输入以保证推理效果,语音和文本无此限制,可以自由切换 - 语音识别(ASR)场景下,只需要将content['text']修改为“将语音转化为文字。” - OCR场景下开启采样可能会引入语言模型幻觉导致的文字变化,可考虑关闭采样进行推理(sampling=False),但关闭采样可能引入模型复读 ## 开源协议及使用声明 - **协议**:本仓库中代码依照 [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) 协议开源。 - **幻觉**:大模型天然存在幻觉问题,用户使用过程中请勿完全相信模型生成的内容。 - **价值观及安全性**:本模型已尽全力确保训练过程中使用的数据的合规性,但由于数据的大体量及复杂性,仍有可能存在一些无法预见的问题。如果出现使用本开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。 ================================================ FILE: data/train/records.jsonl ================================================ {"conversations":[{"role":"user","content":{"text":"Considering the items the figurine is holding, what scene or event from its source material might it be representing?","image":"images/100230144.jpeg","audio":"audios/0_0_0_chattts.wav"}},{"role":"assistant","content":{"text":"Based on the items the figurine is holding, which are a rifle and an iron, it likely represents a scene from the source material where the character is setting up traps to protect his home. These items are indicative of the resourceful and inventive methods the character uses in the movie \"Home Alone\" to create humorous but effective traps for intruders."}}]} {"conversations":[{"role":"user","content":{"text":"How does the inclusion of gemstones affect the overall aesthetic and possible value of the ring?","image":"images/100277305.jpeg","audio":"audios/0_1_0_chattts.wav"}},{"role":"assistant","content":{"text":"The inclusion of gemstones along the edges of the leaf design enhances the ring's aesthetic by adding a touch of brilliance and luxury. The small, clear gemstones catch the light, creating a sparkling effect that contrasts with the silver's polished surface and the leaf's textured details. This sparkling accent likely increases the ring's visual appeal, making it stand out more as an accessory. In terms of value, the presence of gemstones, which could be diamonds, would likely raise the ring's monetary worth. 
Gemstones, especially diamonds, are prized for their beauty and durability, and their addition to jewelry pieces often signifies a higher quality and class of the item, thereby increasing its market value."}}]} {"conversations":[{"role":"user","content":{"text":"What might the significance of the figure holding the 'O' in relation to the setting sun symbolize in the context of this album cover?","image":"images/100234216.jpeg","audio":"audios/0_2_0_chattts.wav"}},{"role":"assistant","content":{"text":"The silhouette of the figure holding the 'O' of the \"WOW\" so that it appears to be part of the setting sun likely symbolizes an element of enlightenment or revelation, which is often a theme in Christian music. It could represent the divine or an epiphany, echoing the 'wow' factor that the album title suggests. This imagery might be intended to convey a sense of awe or an extraordinary experience, possibly reflecting the emotional or spiritual impact that the songs aim to impart on the listener."}}]} {"conversations":[{"role":"user","content":{"text":"What might be the significance of the character's clothing and instrument choice in relation to the type of event being promoted?","image":"images/100223054.jpeg","audio":"audios/0_3_0_chattts.wav"}},{"role":"assistant","content":{"text":"The character's clothing and instrument choice are indicative of a relaxed and informal atmosphere, which aligns with the musical genres mentioned on the poster\u2014rock and hard rock. The gray cap and black shirt could be associated with a laid-back, perhaps slightly rebellious aesthetic that is often linked to rock music cultures. The red bass guitar signifies that music is central to the event and hints that the character may represent a member of the band \"PuBFlieS,\" suggesting they play rock or hard rock music, as bass guitars are fundamental in these genres. The cheerful demeanor of the character along with the casual attire suggests that the event aims to be welcoming and enjoyable, catering to fans of these music genres."}}]} {"conversations":[{"role":"user","content":{"text":"What material is the water bottle likely made from?","image":"images/100280844.jpeg","audio":"audios/0_4_0_chattts.wav"}},{"role":"assistant","content":{"text":"Based on the image alone, the water bottle is likely made from metal, inferred from the reflective and smooth surface, which is characteristic of metal water bottles. The metallic sheen and lack of any seams or typical plastic texturing support this assumption. Additionally, the way the light reflects off the surface suggests a material that is more reflective than plastic or glass, which is often the case with stainless steel or aluminum bottles."}}]} {"conversations":[{"role":"user","content":{"text":"What historical period might the attire of the person in the painting suggest?","image":"images/100284327.jpeg","audio":"audios/0_5_0_chattts.wav"}},{"role":"assistant","content":{"text":"The attire of the person in the painting suggests a historical period that could be the late Middle Ages or the Renaissance, specifically within Europe. This is indicated by the style of the hat, which resembles a wide-brimmed hat known as a Gainsborough hat, popular in different forms from the 17th century onwards, and the monastic-like robes, which were common in religious orders during the medieval period. 
However, the exact historical period cannot be definitively determined from the attire alone due to the painting's anachronistic nature."}}]} {"conversations":[{"role":"user","content":{"text":"Could the absence of any ports or connectors on the visible side suggest that this is a wireless device?","image":"images/100245026.jpeg","audio":"audios/0_6_0_chattts.wav"}},{"role":"assistant","content":{"text":"Based on the visible features in the image, the absence of any ports, buttons, or connectors on the surface of the device does suggest that it could be a wireless device. This characteristic is common in modern wireless technology, where devices are often designed to be sleek and unobtrusive, relying on wireless connectivity for their functionality. The simplicity and cleanliness of the design further imply that it could be a wireless charging pad, which typically requires no physical connections to the devices it charges."}}]} {"conversations":[{"role":"user","content":{"text":"Considering the individual's attire and the setting, what could be the possible theme or context of this photo shoot?","image":"images/10688.jpeg","audio":"audios/3_0_0_chattts.wav"}},{"role":"assistant","content":{"text":"The theme or context of the photo shoot could be a conceptual representation of rebellion or resistance. The military-style jacket with patches, the helmet, and the setting that resembles a prison all suggest a narrative of combatting authority or standing against confinement. The use of fashion to portray this narrative indicates a possible commentary on individualism and defiance."}}]} {"conversations":[{"role":"user","content":{"text":"What could be the possible association between the two logos presented in the image, and how might they relate to the content listed in the slide?","image":"images/104042.jpeg","audio":"audios/2_0_0_chattts.wav"}},{"role":"assistant","content":{"text":"The possible association between the two logos and the content of the slide suggests a partnership or a collaborative project focused on recycling and waste electrical and electronic equipment (WEEE). The \"LIFE +\" logo is associated with an EU environmental initiative, and \"RECYCLING SYMPRAXIS\" suggests a practice or a consortium working towards recycling. The date and word \"PHILOXENIA\" hint at an event, possibly a conference or seminar that took place in 2010. The second logo, which is less identifiable, likely represents the organization responsible for the content of the presentation, in this case, \"Q-PLAN Northern Greece\", which seems to be the coordinator or the main body overseeing the implementation of the state-of-the-art technologies and applications in WEEE recycling. The contents listed in the slide would be topics discussed in relation to these technologies and their applications."}}]} {"conversations":[{"role":"user","content":{"text":"What might the three stars above the team crest signify in the context of soccer achievements?","image":"images/100271334.jpeg","audio":"audios/0_7_0_chattts.wav"}},{"role":"assistant","content":{"text":"The three stars above the team crest traditionally represent major honors or championships won by the team. In many soccer leagues, a star is added to the team's crest for a set number of league or major tournament victories. For instance, a club might add a star for every ten league titles they win. 
Therefore, these stars are likely indicative of the team's historical success, possibly in their domestic league or international competitions."}}]} ================================================ FILE: example_chat_hf.py ================================================ # -*- encoding: utf-8 -*- # File: example_chat_hf.py # Description: None import torch from transformers import AutoModelForCausalLM path = "/mnt/algorithm/user_dir/zhoudong/workspace/models/megrez-o" # Change this to the path of the model. model = ( AutoModelForCausalLM.from_pretrained( path, trust_remote_code=True, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", ) .eval() .cuda() ) prompt = "hi" * (128 - 1) # Chat with text and image messages = [ { "role": "user", "content": { "text": prompt, "image": "./data/sample_image.jpg", }, }, ] # Chat with audio and image # messages = [ # { # "role": "user", # "content": { # "image": "./data/sample_image.jpg", # "audio": "./data/sample_audio.m4a", # }, # }, # ] MAX_NEW_TOKENS = 100 response = model.chat( messages, sampling=False, max_new_tokens=MAX_NEW_TOKENS, ) print(response) ================================================ FILE: finetune/dataset.py ================================================ # -*- encoding: utf-8 -*- # File: dataset.py # Description: None import os import numpy as np from regex import F import torch from torch.utils.data import Dataset class SupervisedDataset(Dataset): """Dataset for supervised fine-tuning.""" def __init__( self, raw_data_list, processor, process_func, dataset_prefix="", ): super(SupervisedDataset, self).__init__() self.raw_data_list = raw_data_list self.processor = processor self.process_func = process_func self.dataset_prefix = dataset_prefix def __len__(self): return len(self.raw_data_list) def check_ret(self, ret): flag = True for key in ret.keys(): value_list = ret[key] if not isinstance(value_list, list): value_list = [value_list] for value in value_list: if isinstance(value, torch.Tensor): if torch.isnan(value).any(): flag = False if torch.isinf(value).any(): flag = False return flag def check_audio(self, ret): flag = True for audio in ret["msgs_audio"]: if (audio["input_audio_lengths"][:, 1] == 0).any(): flag = False return flag def prepare_labels(self, data): def prepare_labels(tokenizer, input_ids, padding_value=-100): # <|role_start|>assistant<|role_end|> 后面的内容才是需要算loss的部分 def find_start_header_idxs(): start_header_tokens = tokenizer.encode("<|role_start|>assistant<|role_end|>", add_special_tokens=False) start_header_idxs = np.where(input_ids == start_header_tokens[-1])[0] kept_start_header_idxs = [] for start_header_idx in start_header_idxs: keep = True for i in range(1, len(start_header_tokens)): if start_header_tokens[-(i + 1)] != input_ids[start_header_idx - i]: keep = False break if keep: kept_start_header_idxs.append(start_header_idx) return kept_start_header_idxs turn_end_token_id = tokenizer.encode("<|turn_end|>")[0] start_header_idxs = find_start_header_idxs() end_header_idxs = np.where(input_ids == turn_end_token_id)[0] label_mask = np.zeros_like(input_ids, dtype=np.bool_) def find_next_greater_number(lst, num): next_greater = None for n in lst: if n > num: if next_greater is None or n < next_greater: next_greater = n return next_greater nr_tokens = len(input_ids) for start_head_idx in start_header_idxs: start_idx = start_head_idx + 1 end_idx = find_next_greater_number(end_header_idxs, start_head_idx) end_idx = min(end_idx + 1, nr_tokens) label_mask[start_idx:end_idx] = True labels = 
torch.ones(input_ids.shape[0] + 1) * padding_value labels[: input_ids.shape[0]] = input_ids labels[: input_ids.shape[0]][~label_mask] = padding_value labels = labels[1:] return labels.long() return prepare_labels(self.processor.tokenizer, data["input_ids"]) def add_dataset_prefix(self, item): conv = item["conversations"] for i in range(len(conv)): content = conv[i]["content"] if "image" in content: content["image"] = os.path.join(self.dataset_prefix, content["image"]) if "audio" in content: content["audio"] = os.path.join(self.dataset_prefix, content["audio"]) return conv def __getitem__(self, i): raw_data_item = self.raw_data_list[i] item = self.add_dataset_prefix(raw_data_item) processed_data = self.processor( item, add_generation_prompt=False, apply_data_collator=False, ) if "labels" not in processed_data: processed_data["labels"] = self.prepare_labels(processed_data) return processed_data ================================================ FILE: finetune/ds_config_zero2.json ================================================ { "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "none", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": 1, "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false } ================================================ FILE: finetune/finetune.py ================================================ # -*- encoding: utf-8 -*- # File: finetune.py # Description: None import glob import json import logging import os from dataclasses import dataclass from dataclasses import field from functools import partial from glob import glob from typing import Dict, List, Literal, Optional, Tuple, Union import torch import transformers from accelerate.utils import DistributedType from dataset import SupervisedDataset from deepspeed import zero from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus from trainer import MegrezOTrainer from transformers import AutoModelForCausalLM from transformers import AutoProcessor from transformers import AutoTokenizer from transformers.integrations import deepspeed @dataclass class ModelArguments: model_name_or_path: Optional[str] = field(default="openbmb/MiniCPM-V-2") @dataclass class DataArguments: data_path: str = field(default=None, metadata={"help": "Path to the training data."}) eval_data_path: str = field(default=None, metadata={"help": "Path to the evaluation data."}) dataset_prefix: str = field(default="data", metadata={"help": "Prefix for the multimodal data."}) @dataclass class TrainingArguments(transformers.TrainingArguments): cache_dir: Optional[str] = field(default=None) optim: str = field(default="adamw_torch") model_max_length: int = field( default=2048, metadata={"help": "Maximum sequence length. 
Sequences will be right padded (and possibly truncated)."}, ) tune_vision_encoder: Optional[bool] = field(default=True) tune_vision_proj: Optional[bool] = field(default=True) tune_llm: Optional[bool] = field(default=True) tune_audio_encoder: Optional[bool] = field(default=True) tune_audio_proj: Optional[bool] = field(default=True) use_lora: Optional[bool] = field(default=False) max_slice_nums: Optional[int] = field(default=9) scale_resolution: Optional[int] = field(default=448) remove_unused_columns: Optional[bool] = field(default=False) @dataclass class LoraArguments: lora_r: int = 64 lora_alpha: int = 64 lora_dropout: float = 0.05 lora_target_modules: str = r"llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj)" lora_weight_path: str = "" lora_bias: str = "none" q_lora: bool = False lora_modules_to_save: str = "" lora_layer_replication: Optional[List[Tuple[int, int]]] = None lora_layers_to_transform: Optional[List[int]] = None lora_layers_pattern: Optional[str] = None def maybe_zero_3(param): if hasattr(param, "ds_id"): assert param.ds_status == ZeroParamStatus.NOT_AVAILABLE with zero.GatheredParameters([param]): param = param.data.detach().cpu().clone() else: param = param.detach().cpu().clone() return param # Borrowed from peft.utils.get_peft_model_state_dict def get_peft_state_maybe_zero_3(named_params, bias): if bias == "none": to_return = {k: t for k, t in named_params if "lora_" in k} elif bias == "all": to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k} elif bias == "lora_only": to_return = {} maybe_lora_bias = {} lora_bias_names = set() for k, t in named_params: if "lora_" in k: to_return[k] = t bias_name = k.split("lora_")[0] + "bias" lora_bias_names.add(bias_name) elif "bias" in k: maybe_lora_bias[k] = t for k, t in maybe_lora_bias: if bias_name in lora_bias_names: to_return[bias_name] = t else: raise NotImplementedError to_return = {k: maybe_zero_3(v) for k, v in to_return.items()} return to_return local_rank = None def rank0_print(*args): if local_rank == 0: print(*args) def safe_save_model_for_hf_trainer(trainer, output_dir: str, bias="none"): """Collects the state dict and dump to disk.""" # check if zero3 mode enabled if deepspeed.is_deepspeed_zero3_enabled(): state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict() else: if trainer.args.use_lora: state_dict = get_peft_state_maybe_zero_3(trainer.model.named_parameters(), bias) else: state_dict = trainer.model.state_dict() if trainer.args.should_save and trainer.args.local_rank == 0: trainer._save(output_dir, state_dict=state_dict) def make_supervised_data_module( data_args, processor, process_func, data_collator=None, max_length=2048, ) -> Dict: """Make dataset and collator for supervised fine-tuning.""" rank0_print("Loading data...") with open(data_args.data_path, "r") as f: raw_data_list = [json.loads(line) for line in f] train_dataset = SupervisedDataset( raw_data_list, processor, process_func, data_args.dataset_prefix, ) eval_dataset = None return dict( train_dataset=train_dataset, eval_dataset=eval_dataset, data_collator=partial(data_collator, max_length=max_length, collate_labels=True), ) def get_parameter_number(model): trainable_params, all_param = 0, 0 for param in model.parameters(): num_params = param.numel() # if using DS Zero 3 and the weights are initialized empty if num_params == 0 and hasattr(param, "ds_numel"): num_params = param.ds_numel all_param += num_params if param.requires_grad: trainable_params += num_params return {"Total": all_param, "Trainable": 
trainable_params} local_rank = 0 def load_model_from_pretrained(model_path, dtype=torch.bfloat16): model = AutoModelForCausalLM.from_pretrained( model_path, _attn_implementation="flash_attention_2", trust_remote_code=True, torch_dtype=dtype ) return model def load_tokenizer_from_pretrained(model_path): tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True) return tokenizer def train(): global local_rank parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments, LoraArguments)) ( model_args, data_args, training_args, lora_args, ) = parser.parse_args_into_dataclasses() if getattr(training_args, "deepspeed", None): training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32) local_rank = training_args.local_rank world_size = int(os.environ.get("WORLD_SIZE", 1)) ddp = world_size != 1 device_map = None if lora_args.q_lora: device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled(): logging.warning("FSDP or ZeRO3 are not incompatible with QLoRA.") model = load_model_from_pretrained(model_args.model_name_or_path, dtype=compute_dtype) tokenizer = load_tokenizer_from_pretrained(model_args.model_name_or_path) processor = AutoProcessor.from_pretrained(model_args.model_name_or_path, trust_remote_code=True) model.tune_llm = training_args.tune_llm model.tune_vision = training_args.tune_vision_encoder or training_args.tune_vision_proj model.tune_audio = training_args.tune_audio_encoder or training_args.tune_audio_proj if not training_args.tune_vision_encoder: model.vision.vpm.requires_grad_(False) if not training_args.tune_vision_proj: model.vision.resampler.requires_grad_(False) if not training_args.tune_llm: model.llm.requires_grad_(False) if not training_args.tune_audio_encoder: model.audio.requires_grad_(False) model.audio.audio.proj.requires_grad_(True) if model.audio.audio.audio_bos_eos_token is not None: model.audio.audio.audio_bos_eos_token.requires_grad_(True) if not training_args.tune_audio_proj: model.audio.audio.proj.requires_grad_(False) if model.audio.audio.audio_bos_eos_token is not None: model.audio.audio.audio_bos_eos_token.requires_grad_(False) rank0_print(get_parameter_number(model)) data_module = make_supervised_data_module( data_args=data_args, processor=processor, process_func=None, data_collator=processor.data_collator, max_length=training_args.model_max_length, ) if training_args.lr_scheduler_type == "cosine_with_min_lr": training_args.lr_scheduler_kwargs = {"min_lr_rate": 0.1} trainer = MegrezOTrainer( model=model, tokenizer=tokenizer, args=training_args, **data_module, ) train_dataset = trainer.train_dataset nr_data = len(train_dataset) rank0_print("nr dataset: {}".format(nr_data)) checkpoint_path = os.path.join(training_args.output_dir, "checkpoint*") checkpoint_paths = sorted(list(glob(checkpoint_path))) valid_checkpoint_paths = [] for checkpoint_path in checkpoint_paths: checkpoint_num = checkpoint_path.split("-")[-1] if checkpoint_num.isdigit(): valid_checkpoint_paths.append(checkpoint_path) checkpoint_paths = sorted(list(valid_checkpoint_paths)) checkpoint_paths = sorted(checkpoint_paths, key=lambda x: int(x.split("-")[-1])) checkpoint_paths = list(checkpoint_paths) load_checkpoint = True if load_checkpoint and checkpoint_paths: checkpoint_path = checkpoint_paths[-1] rank0_print("Continue 
Checkpoint Training: {}".format(checkpoint_path)) trainer.train(checkpoint_path) else: trainer.train() trainer.save_state() final_path = os.path.join(training_args.output_dir, "final") os.makedirs(final_path, exist_ok=True) rank0_print("save final path to {}".format(final_path)) safe_save_model_for_hf_trainer(trainer, final_path) if __name__ == "__main__": train() ================================================ FILE: finetune/finetune.sh ================================================ DATA_PATH=$(pwd)/../data/train/records.jsonl DATASET_PREFIX=$(pwd)/../data/train/ CURRENT_TIME=$(date +%Y%m%d_%H%M%S) OUTPUT_DIR=$(pwd)/test_finetune/$CURRENT_TIME LOGGING_DIR=$(pwd)/test_finetune_log MODEL_PATH="" torchrun --nproc_per_node=2 finetune.py \ --data_path $DATA_PATH \ --dataset_prefix $DATASET_PREFIX \ --output_dir $OUTPUT_DIR \ --logging_dir $LOGGING_DIR \ --model_name_or_path $MODEL_PATH \ --learning_rate 1e-5 \ --num_train_epochs 10 \ --deepspeed ds_config_zero2.json \ --prediction_loss_only false \ --bf16 true \ --fp16 false \ --do_train \ --tune_vision_encoder true \ --tune_vision_proj true \ --tune_llm true \ --tune_audio_encoder false \ --tune_audio_proj true \ --model_max_length 2048 \ --max_slice_nums 9 \ --scale_resolution 448 \ --logging_strategy "steps" \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 1 \ --save_steps 1000 \ --save_total_limit 100 \ --learning_rate 1e-6 \ --weight_decay 0.1 \ --adam_beta2 0.98 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 ================================================ FILE: finetune/requirements.txt ================================================ deepspeed accelerate ================================================ FILE: finetune/trainer.py ================================================ # -*- encoding: utf-8 -*- # File: trainer.py # Description: None from typing import Any, Dict, List, Optional, Tuple, Union import deepspeed import torch import torch.nn as nn from transformers import Trainer from transformers.integrations import is_deepspeed_zero3_enabled from transformers.trainer_pt_utils import nested_detach class MegrezOTrainer(Trainer): def compute_loss(self, model, inputs, return_outputs=False): if "labels" in inputs: labels = inputs.pop("labels") else: labels = None self.model.vision.resampler.pos_embed = self.model.vision.resampler.pos_embed.to(self.model.device) if is_deepspeed_zero3_enabled(): with deepspeed.zero.GatheredParameters(self.model.vision.resampler.attn.parameters(), modifier_rank=0): if not self.args.use_lora: outputs = self.model(data=inputs, use_cache=False) else: outputs = self.model.base_model(data=inputs, use_cache=False) else: if not self.args.use_lora: outputs = self.model(data=inputs, use_cache=False) else: outputs = self.model.base_model(data=inputs, use_cache=False) if labels is not None: # Flatten the tokens loss_fct = nn.CrossEntropyLoss() logits = outputs.logits.view(-1, self.model.config.vocab_size).contiguous() labels = labels.view(-1).long().contiguous() # Enable model parallelism labels = labels.to(logits.device) loss = loss_fct(logits, labels) else: if isinstance(outputs, dict) and "loss" not in outputs: raise ValueError( "The model did not return a loss from the inputs, only the following keys: " f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}." ) # We don't use .loss here since the model may return tuples instead of ModelOutput. 
loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0] return (loss, outputs) if return_outputs else loss def prediction_step( self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]], prediction_loss_only: bool, ignore_keys: Optional[List[str]] = None, ) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]: """ Perform an evaluation step on `model` using `inputs`. Subclass and override to inject custom behavior. Args: model (`nn.Module`): The model to evaluate. inputs (`Dict[str, Union[torch.Tensor, Any]]`): The inputs and targets of the model. The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument `labels`. Check your model's documentation for all accepted arguments. prediction_loss_only (`bool`): Whether or not to return the loss only. ignore_keys (`List[str]`, *optional*): A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions. Return: Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, logits and labels (each being optional). """ has_labels = False if len(self.label_names) == 0 else all(inputs.get(k) is not None for k in self.label_names) # For CLIP-like models capable of returning loss values. # If `return_loss` is not specified or being `None` in `inputs`, we check if the default value of `return_loss` # is `True` in `model.forward`. return_loss = inputs.get("return_loss", None) if return_loss is None: return_loss = self.can_return_loss loss_without_labels = True if len(self.label_names) == 0 and return_loss else False inputs = self._prepare_inputs(inputs) if ignore_keys is None: if hasattr(self.model, "config"): ignore_keys = getattr(self.model.config, "keys_to_ignore_at_inference", []) else: ignore_keys = [] # labels may be popped when computing the loss (label smoothing for instance) so we grab them first. if has_labels or loss_without_labels: labels = nested_detach(tuple(inputs.get(name) for name in self.label_names)) if len(labels) == 1: labels = labels[0] else: labels = None with torch.no_grad(): if has_labels or loss_without_labels: with self.compute_loss_context_manager(): loss, outputs = self.compute_loss(model, inputs, return_outputs=True) loss = loss.mean().detach() if isinstance(outputs, dict): logits = tuple(v for k, v in outputs.items() if k not in ignore_keys + ["loss"]) else: logits = outputs[1:] else: loss = None with self.compute_loss_context_manager(): outputs = model(**inputs) if isinstance(outputs, dict): logits = tuple(v for k, v in outputs.items() if k not in ignore_keys) else: logits = outputs # TODO: this needs to be fixed and made cleaner later. if self.args.past_index >= 0: self._past = outputs[self.args.past_index - 1] if prediction_loss_only: return (loss, None, None) logits = nested_detach(logits) if len(logits) == 1: logits = logits[0] return (loss, logits, labels) def training_step( self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]], num_items_in_batch: int ) -> torch.Tensor: """ Perform a training step on a batch of inputs. Subclass and override to inject custom behavior. Args: model (`nn.Module`): The model to train. inputs (`Dict[str, Union[torch.Tensor, Any]]`): The inputs and targets of the model. The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument `labels`. Check your model's documentation for all accepted arguments. 
Return: `torch.Tensor`: The tensor with training loss on this batch. """ model.train() inputs = self._prepare_inputs(inputs) with self.compute_loss_context_manager(): loss = self.compute_loss(model, inputs) del inputs torch.cuda.empty_cache() if self.args.n_gpu > 1: loss = loss.mean() # mean() to average on multi-gpu parallel training if self.use_apex: from transformers.trainer import amp with amp.scale_loss(loss, self.optimizer) as scaled_loss: scaled_loss.backward() else: if is_deepspeed_zero3_enabled(): with deepspeed.zero.GatheredParameters(self.model.resampler.attn.parameters(), modifier_rank=0): self.accelerator.backward(loss) else: self.accelerator.backward(loss) return loss.detach() / self.args.gradient_accumulation_steps ================================================ FILE: gradio_app.py ================================================ # -*- encoding: utf-8 -*- # File: app.py # Description: None import threading from copy import deepcopy from typing import Dict, List import gradio as gr import torch from transformers import AutoModelForCausalLM from transformers import AutoTokenizer from transformers import TextIteratorStreamer IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp") VIDEO_EXTENSIONS = (".mp4", ".mkv", ".mov", ".avi", ".flv", ".wmv", ".webm", ".m4v") AUDIO_EXTENSIONS = (".mp3", ".wav") DEFAULT_SAMPLING_PARAMS = { "top_p": 0.8, "top_k": 100, "temperature": 0.7, "do_sample": True, "num_beams": 1, "repetition_penalty": 1.2, } MAX_NEW_TOKENS = 1024 def main(model_path: str, port: int): if gr.NO_RELOAD: tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) model = ( AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16) .eval() .cuda() ) iterable_streamer = TextIteratorStreamer( tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=30, ) def history2messages(history: List[Dict]) -> List[Dict]: """ Transform gradio history to chat messages. """ messages = [] cur_message = dict() for item in history: if item["role"] == "assistant": if len(cur_message) > 0: messages.append(deepcopy(cur_message)) cur_message = dict() messages.append(deepcopy(item)) continue if "role" not in cur_message: cur_message["role"] = "user" if "content" not in cur_message: cur_message["content"] = dict() if "metadata" not in item: item["metadata"] = {"title": None} if item["metadata"]["title"] is None: cur_message["content"]["text"] = item["content"] elif item["metadata"]["title"] == "image": cur_message["content"]["image"] = item["content"][0] elif item["metadata"]["title"] == "audio": cur_message["content"]["audio"] = item["content"][0] if len(cur_message) > 0: messages.append(cur_message) return messages def check_messages(history, message, audio): audios = [] images = [] for file_msg in message["files"]: if file_msg.endswith(AUDIO_EXTENSIONS) or file_msg.endswith(VIDEO_EXTENSIONS): audios.append(file_msg) elif file_msg.endswith(IMAGE_EXTENSIONS): images.append(file_msg) else: filename = file_msg.split("/")[-1] raise gr.Error(f"Unsupported file type: {filename}. 
It should be an image or audio file.") if len(audios) > 1: raise gr.Error("Please upload only one audio file.") if len(images) > 1: raise gr.Error("Please upload only one image file.") if audio is not None: if len(audios) > 0: raise gr.Error("Please upload only one audio file or record audio.") audios.append(audio) # Append the message to the history for image in images: history.append({"role": "user", "content": (image,), "metadata": {"title": "image"}}) for audio in audios: history.append({"role": "user", "content": (audio,), "metadata": {"title": "audio"}}) if message["text"] is not None: history.append({"role": "user", "content": message["text"]}) return history, gr.MultimodalTextbox(value=None, interactive=False) def bot( history: list, top_p: float, top_k: int, temperature: float, repetition_penalty: float, max_new_tokens: int = MAX_NEW_TOKENS, regenerate: bool = False, ): sampling_params = { "top_p": top_p, "top_k": top_k, "temperature": temperature, "repetition_penalty": repetition_penalty, } if regenerate: history = history[:-1] msgs = history2messages(history) th = threading.Thread( target=model.chat, kwargs=dict( input_msgs=msgs, sampling=True, streamer=iterable_streamer, max_new_tokens=max_new_tokens, **sampling_params, ), ) th.start() response = "" for subtext in iterable_streamer: response += subtext yield history + [{"role": "assistant", "content": response}] th.join() return response def change_state(state): return gr.update(visible=not state), not state with gr.Blocks() as demo: chatbot = gr.Chatbot(elem_id="chatbot", bubble_full_width=False, type="messages", height=800) sampling_params_group_hidden_state = gr.State(False) with gr.Row(equal_height=True): audio_input = gr.Audio( sources=["microphone", "upload"], type="filepath", scale=4, ) chat_input = gr.MultimodalTextbox( file_count="multiple", show_label=False, scale=10, file_types=["image", "audio"], # stop_btn=True, ) with gr.Column(scale=1, min_width=150): with gr.Row(equal_height=True): regenerate_btn = gr.Button("Regenerate", variant="primary") clear_btn = gr.ClearButton( [chat_input, audio_input, chatbot], ) with gr.Row(): sampling_params_toggle_btn = gr.Button("Sampling Parameters") with gr.Group(visible=False) as sampling_params_group: with gr.Row(): temperature = gr.Slider( minimum=0, maximum=1.2, value=DEFAULT_SAMPLING_PARAMS["temperature"], label="Temperature" ) repetition_penalty = gr.Slider( minimum=0, maximum=2, value=DEFAULT_SAMPLING_PARAMS["repetition_penalty"], label="Repetition Penalty", ) with gr.Row(): top_p = gr.Slider(minimum=0, maximum=1, value=DEFAULT_SAMPLING_PARAMS["top_p"], label="Top-p") top_k = gr.Slider(minimum=0, maximum=1000, value=DEFAULT_SAMPLING_PARAMS["top_k"], label="Top-k") with gr.Row(): max_new_tokens = gr.Slider( minimum=1, maximum=MAX_NEW_TOKENS, value=MAX_NEW_TOKENS, label="Max New Tokens", interactive=True, ) sampling_params_toggle_btn.click( change_state, sampling_params_group_hidden_state, [sampling_params_group, sampling_params_group_hidden_state], ) chat_msg = chat_input.submit( check_messages, [chatbot, chat_input, audio_input], [chatbot, chat_input], ) bot_msg = chat_msg.then( bot, inputs=[chatbot, top_p, top_k, temperature, repetition_penalty, max_new_tokens], outputs=chatbot, api_name="bot_response", ) bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input]) regenerate_btn.click( bot, inputs=[chatbot, top_p, top_k, temperature, repetition_penalty, max_new_tokens, gr.State(True)], outputs=chatbot, ) demo.launch(server_port=port) if __name__ == 
"__main__": import argparse parser = argparse.ArgumentParser() parser.add_argument("--model_path", type=str, required=True) parser.add_argument("--port", type=int, default=7680) args = parser.parse_args() main(args.model_path, args.port) ================================================ FILE: requirements.txt ================================================ transformers>=4.44.0 tokenizers>=0.20.3 accelerate datasets gradio ================================================ FILE: vllm_demo/example_infer_vllm.py ================================================ # -*- encoding: utf-8 -*- # File: example_infer_vllm.py # Description: None from PIL import Image from vllm import LLM from vllm import ModelRegistry from vllm import SamplingParams from megrezo import MegrezOModel ModelRegistry.register_model("MegrezO", MegrezOModel) # Load the model. # model_path = "{{PATH_TO_HF_PRETRAINED_MODEL}}" # Change this to the path of the model. model_path = "/mnt/algorithm/user_dir/zhoudong/workspace/models/megrez-o" # Change this to the path of the model. llm = LLM( model_path, trust_remote_code=True, gpu_memory_utilization=0.5, ) sampling_params = SamplingParams( temperature=0, max_tokens=1000, repetition_penalty=1.2, stop=["<|turn_end|>", "<|eos|>"], ) img = Image.open("../data/sample_image.jpg") conversation = [ { "role": "user", "content": { "text": "图片的内容是什么?", "image": img, }, }, ] # Convert the conversation to vLLM acceptable format. prompt = llm.get_tokenizer().apply_chat_template( conversation, tokenize=False, add_generation_prompt=True, ) vllm_inputs = [ { "prompt": prompt, "multi_modal_data": { "image": img, }, } ] # Generate the outputs. outputs = llm.generate( vllm_inputs, sampling_params, ) # Print the outputs. for output in outputs: print(output.outputs[0].text) ================================================ FILE: vllm_demo/megrezo.py ================================================ # coding=utf-8 # Adapted from # https://github.com/huggingface/transformers/blob/v4.28.0/src/transformers/models/llama/modeling_llama.py # Copyright 2023 The vLLM team. # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. # # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX # and OPT implementations in this library. It has been modified from its # original forms to accommodate minor architectural differences compared # to GPT-NeoX and OPT used by the Meta AI team that trained the model. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
"""Inference-only MegrezO model compatible with HuggingFace weights.""" from functools import lru_cache from functools import partial from typing import Any, Callable, Iterable, List, Literal, Mapping, Optional, Tuple, TypedDict, Union import numpy as np import torch import torch.nn.functional as F import torch.types from PIL import Image from torch import Tensor from torch import nn from torch.nn.init import trunc_normal_ from transformers import PretrainedConfig from vllm.attention import AttentionMetadata from vllm.config import CacheConfig from vllm.config import MultiModalConfig from vllm.inputs import INPUT_REGISTRY from vllm.inputs import DecoderOnlyInputs from vllm.inputs import InputContext from vllm.inputs import token_inputs from vllm.model_executor.layers.linear import ReplicatedLinear from vllm.model_executor.layers.logits_processor import LogitsProcessor from vllm.model_executor.layers.quantization import QuantizationConfig from vllm.model_executor.layers.resampler import get_2d_sincos_pos_embed from vllm.model_executor.layers.sampler import Sampler from vllm.model_executor.layers.sampler import SamplerOutput from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead from vllm.model_executor.model_loader.weight_utils import default_weight_loader from vllm.model_executor.models import VllmModelForTextGeneration from vllm.model_executor.models.idefics2_vision_model import Idefics2VisionTransformer from vllm.model_executor.models.interfaces import SupportsMultiModal from vllm.model_executor.models.interfaces import SupportsPP from vllm.model_executor.models.llama import LlamaModel from vllm.model_executor.models.module_mapping import MultiModelKeys from vllm.model_executor.models.utils import LLMWrapper from vllm.model_executor.models.utils import is_pp_missing_parameter from vllm.model_executor.sampling_metadata import SamplingMetadata from vllm.multimodal import MULTIMODAL_REGISTRY from vllm.multimodal.base import MultiModalInputs from vllm.multimodal.utils import cached_get_tokenizer from vllm.sequence import IntermediateTensors from vllm.sequence import SequenceData from vllm.transformers_utils.processor import get_processor RawImageType = Union[Image.Image, torch.Tensor] RawAudioType = Union[bytes, torch.Tensor] cached_get_processor = lru_cache(get_processor) class MegrezORawImageInput(TypedDict): """Input mapper input with auxiliary data for computing image bounds.""" image: RawImageType class MegrezOAudioInput(TypedDict): type: Literal["audio"] data: RawAudioType class MegrezOAudioTensorInput(TypedDict): type: Literal["audio_tensor"] input_audios: torch.Tensor input_audio_lengths: torch.Tensor audio_span_tokens: torch.Tensor class MegrezOImagePixelInputs(TypedDict): type: Literal["pixel_values"] pixel_values: torch.Tensor """ Shape: `(batch_size * num_images, num_channels, height, width)` Note that the image size may vary, so we pass it as a list instead of a batched tensor. """ tgt_sizes: torch.Tensor """ Shape: `(batch_size * num_images, 2)` This should be in `(height, width)` format. """ patch_attention_mask: torch.Tensor """ Shape: `(batch_size * num_images, num_patches, num_patches)` """ class MegrezOImageEmbeddingInputs(TypedDict): type: Literal["image_embeds"] data: torch.Tensor """ Shape: `(batch_size * num_images, image_feature_size, hidden_size)` `hidden_size` must match the hidden size of language model backbone. instead of a batched tensor. 
""" image_bounds: torch.Tensor """ Shape: `(batch_size * num_images, 2)` This should be in `(start, stop)` format. """ def insert_audio_embeddings(text_embeddings, inserted_embeddings, inserted_bounds): inserted_bounds = inserted_bounds.long() for idx in range(len(inserted_embeddings)): bid = inserted_bounds[idx][0] start_id = inserted_bounds[idx][1] end_id = inserted_bounds[idx][2] embedding = inserted_embeddings[idx] text_embeddings[start_id + 1 : end_id] = embedding return text_embeddings def insert_image_embeddings(text_embeddings, inserted_embeddings, inserted_bounds): inserted_bounds = inserted_bounds.long() for idx in range(len(inserted_embeddings)): bid = inserted_bounds[idx][0] start_id = inserted_bounds[idx][1] end_id = inserted_bounds[idx][2] embedding = inserted_embeddings[idx] text_embeddings[start_id:end_id] = embedding return text_embeddings MegrezOImageInputs = Union[MegrezOImagePixelInputs] MegrezOAudioInputs = Union[MegrezOAudioTensorInput] # region: Resampler DEFAULT_LN = partial(nn.LayerNorm, eps=1e-6) class Resampler(nn.Module): def __init__( self, num_queries: int, embed_dim: int, num_heads: int, kv_dim: Optional[int] = None, norm_layer: Callable[[int], nn.LayerNorm] = DEFAULT_LN, max_size: Tuple[int, int] = (70, 70), quant_config: Optional[QuantizationConfig] = None, prefix: str = "", ) -> None: super().__init__() self.num_queries = num_queries self.embed_dim = embed_dim self.num_heads = num_heads self.query = nn.Parameter(torch.zeros(self.num_queries, embed_dim)) trunc_normal_(self.query, std=0.02) if kv_dim is not None and kv_dim != embed_dim: self.kv_proj = ReplicatedLinear(kv_dim, embed_dim, bias=False, quant_config=quant_config, prefix=prefix) else: # Maintain the same return value with ReplicatedLinear.forward self.kv_proj = lambda *args, **kwargs: ( # type: ignore # noqa nn.Identity()(*args, **kwargs), None, ) self.attn = nn.MultiheadAttention(embed_dim, num_heads) self.ln_q = norm_layer(embed_dim) self.ln_kv = norm_layer(embed_dim) self.do_post_projection = True self.ln_post = norm_layer(embed_dim) self.proj = nn.Parameter((embed_dim**-0.5) * torch.randn(embed_dim, embed_dim)) self.max_size = max_size self._set_2d_pos_cache(self.max_size) self.apply(self._init_weights) def _init_weights(self, m: nn.Module) -> None: if isinstance(m, nn.Linear): trunc_normal_(m.weight, std=0.02) if isinstance(m, nn.Linear) and m.bias is not None: nn.init.constant_(m.bias, 0) elif isinstance(m, nn.LayerNorm): nn.init.constant_(m.bias, 0) nn.init.constant_(m.weight, 1.0) def _repeat(self, query, N: int): return query.unsqueeze(1).repeat(1, N, 1) def _set_2d_pos_cache(self, max_size: Tuple[int, int], device: torch.types.Device = "cpu") -> None: pos_embed_arr = get_2d_sincos_pos_embed(self.embed_dim, max_size, version=(2, 5)) pos_embed = torch.from_numpy(pos_embed_arr).float().to(device) self.register_buffer("pos_embed", pos_embed, persistent=False) def _adjust_pos_cache(self, tgt_sizes: torch.Tensor, device: torch.types.Device) -> None: max_h = tgt_sizes[:, 0].max().item() max_w = tgt_sizes[:, 1].max().item() assert isinstance(max_h, int) and isinstance(max_w, int) if max_h > self.max_size[0] or max_w > self.max_size[1]: self.max_size = ( max(max_h, self.max_size[0]), max(max_w, self.max_size[1]), ) self._set_2d_pos_cache(self.max_size, device) def forward(self, x: torch.Tensor, tgt_sizes: torch.Tensor) -> torch.Tensor: assert x.shape[0] == tgt_sizes.shape[0] bs = x.shape[0] device = x.device dtype = x.dtype patch_len = tgt_sizes[:, 0] * tgt_sizes[:, 1] 
self._adjust_pos_cache(tgt_sizes, device=device) max_patch_len = patch_len.max().item() assert isinstance(max_patch_len, int) key_padding_mask = torch.zeros((bs, max_patch_len), dtype=torch.bool, device=device) pos_embed = [] for i in range(bs): tgt_h, tgt_w = tgt_sizes[i].tolist() pos_embed.append(self.pos_embed[:tgt_h, :tgt_w, :].reshape((tgt_h * tgt_w, -1)).to(dtype)) # patches * D key_padding_mask[i, patch_len[i] :] = True pos_embed = torch.nn.utils.rnn.pad_sequence(pos_embed, batch_first=True, padding_value=0.0).permute( 1, 0, 2 ) # BLD => L * B * D x, _ = self.kv_proj(x) # B * L * D x = self.ln_kv(x).permute(1, 0, 2) # L * B * D q = self.ln_q(self.query) # Q * D out = self.attn( self._repeat(q, bs), # Q * B * D x + pos_embed, # L * B * D + L * B * D x, key_padding_mask=key_padding_mask, )[0] # out: Q * B * D x = out.permute(1, 0, 2) # B * Q * D x = self.ln_post(x) x = x @ self.proj return x # endregion # region: AudioEncoder class LayerNorm(nn.LayerNorm): def forward(self, x: Tensor) -> Tensor: # return super().forward(x.float()).type(x.dtype) return super().forward(x).type(x.dtype) class Linear(nn.Linear): def forward(self, x: Tensor) -> Tensor: return F.linear( x, self.weight.to(x.dtype), None if self.bias is None else self.bias.to(x.dtype), ) class Conv1d(nn.Conv1d): def _conv_forward(self, x: Tensor, weight: Tensor, bias: Optional[Tensor]) -> Tensor: return super()._conv_forward(x, weight.to(x.dtype), None if bias is None else bias.to(x.dtype)) def sinusoids(length, channels, max_timescale=10000): """Returns sinusoids for positional embedding""" assert channels % 2 == 0 log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1) inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2)) scaled_time = torch.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :] return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1) class MultiHeadAttention(nn.Module): def __init__(self, n_state: int, n_head: int): super().__init__() self.n_head = n_head self.query = Linear(n_state, n_state) self.key = Linear(n_state, n_state, bias=False) self.value = Linear(n_state, n_state) self.out = Linear(n_state, n_state) def forward( self, x: Tensor, xa: Optional[Tensor] = None, mask: Optional[Tensor] = None, kv_cache: Optional[dict] = None, ): q = self.query(x) if kv_cache is None or xa is None or self.key not in kv_cache: # hooks, if installed (i.e. kv_cache is not None), will prepend the cached kv tensors; # otherwise, perform key/value projections for self- or cross-attention as usual. k = self.key(x if xa is None else xa) v = self.value(x if xa is None else xa) else: # for cross-attention, calculate keys and values once and reuse in subsequent calls. 
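# (`kv_cache` is keyed by the projection modules themselves, so cached key/value
# tensors from earlier calls can be looked up directly here.)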
k = kv_cache[self.key] v = kv_cache[self.value] wv, qk = self.qkv_attention(q, k, v, mask) return self.out(wv), qk def qkv_attention(self, q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None): n_batch, n_ctx, n_state = q.shape scale = (n_state // self.n_head) ** -0.25 q = q.view(*q.shape[:2], self.n_head, -1).permute(0, 2, 1, 3) * scale k = k.view(*k.shape[:2], self.n_head, -1).permute(0, 2, 3, 1) * scale v = v.view(*v.shape[:2], self.n_head, -1).permute(0, 2, 1, 3) qk = q @ k if mask is not None: qk += mask w = F.softmax(qk, dim=-1).to(q.dtype) return (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2), qk.detach() class ResidualAttentionBlock(nn.Module): def __init__(self, n_state: int, n_head: int, cross_attention: bool = False): super().__init__() self.attn = MultiHeadAttention(n_state, n_head) self.attn_ln = LayerNorm(n_state) self.cross_attn = MultiHeadAttention(n_state, n_head) if cross_attention else None self.cross_attn_ln = LayerNorm(n_state) if cross_attention else None n_mlp = n_state * 4 self.mlp = nn.Sequential(Linear(n_state, n_mlp), nn.GELU(), Linear(n_mlp, n_state)) self.mlp_ln = LayerNorm(n_state) def forward( self, x: Tensor, xa: Optional[Tensor] = None, mask: Optional[Tensor] = None, kv_cache: Optional[dict] = None, ): x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)[0] if self.cross_attn: x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0] x = x + self.mlp(self.mlp_ln(x)) return x class AudioEncoder(nn.Module): def __init__( self, n_mels: int, n_ctx: int, n_state: int, n_head: int, n_layer: int, output_dim: int = 512, avg_pool: bool = True, add_audio_bos_eos_token: bool = True, **kwargs, ): super().__init__() self.conv1 = Conv1d(n_mels, n_state, kernel_size=3, padding=1) self.conv2 = Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1) # self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state)) self.positional_embedding = nn.Parameter(sinusoids(n_ctx, n_state), requires_grad=False) self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList( [ResidualAttentionBlock(n_state, n_head) for _ in range(n_layer)] ) self.ln_post = LayerNorm(n_state) if avg_pool: self.avg_pooler = nn.AvgPool1d(2, stride=2) else: self.avg_pooler = None self.proj = nn.Linear(n_state, output_dim) if add_audio_bos_eos_token: self.audio_bos_eos_token = nn.Embedding(2, output_dim) else: self.audio_bos_eos_token = None self.output_dim = output_dim self.n_head = n_head def forward(self, x: Tensor, padding_mask: Tensor = None, audio_lengths: Tensor = None): """ x : torch.Tensor, shape = (batch_size, n_mels, n_ctx) the mel spectrogram of the audio """ x = x.to(dtype=self.conv1.weight.dtype, device=self.conv1.weight.device) if audio_lengths is not None: input_mel_len = audio_lengths[:, 0] * 2 max_mel_len_in_batch = input_mel_len.max() x = x[:, :, :max_mel_len_in_batch] x = F.gelu(self.conv1(x)) x = F.gelu(self.conv2(x)) x = x.permute(0, 2, 1) # B, L, D bsz = x.size(0) src_len = x.size(1) self.input_positional_embedding = self.positional_embedding[:src_len] assert ( x.shape[1:] == self.input_positional_embedding.shape ), f"incorrect audio shape: {x.shape[1:], self.input_positional_embedding.shape}" x = (x + self.input_positional_embedding).to(x.dtype) if padding_mask is not None: padding_mask = padding_mask.to(dtype=self.conv1.weight.dtype, device=self.conv1.weight.device) batch_src_len = padding_mask.size(1) x = x[:, :batch_src_len, :] padding_mask = padding_mask.view(bsz, -1, batch_src_len) padding_mask_ = padding_mask.all(1) 
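# Zero out fully padded time steps, then expand the boolean mask into an
# additive attention bias (-inf at padded key positions) shared across heads.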
x[padding_mask_] = 0 key_padding_mask = ( padding_mask_.view(bsz, 1, 1, batch_src_len) .expand(-1, self.n_head, -1, -1) .reshape(bsz, self.n_head, 1, batch_src_len) ) new_padding_mask = torch.zeros_like(key_padding_mask, dtype=x.dtype) padding_mask = new_padding_mask.masked_fill(key_padding_mask, float("-inf")) for block in self.blocks: x = block(x, mask=padding_mask) if self.avg_pooler: x = x.permute(0, 2, 1) x = self.avg_pooler(x) x = x.permute(0, 2, 1) x = self.ln_post(x) x = self.proj(x) if self.audio_bos_eos_token is not None: bos = self.audio_bos_eos_token.weight[0][None, :] eos = self.audio_bos_eos_token.weight[1][None, :] else: bos, eos = None, None return x, bos, eos def encode( self, input_audios: Tensor, input_audio_lengths: Tensor, audio_span_tokens: List, ): real_input_audio_lens = input_audio_lengths[:, 0].tolist() max_len_in_batch = max(real_input_audio_lens) padding_mask = torch.ones([input_audios.size(0), max_len_in_batch]).to( dtype=self.conv1.weight.dtype, device=self.conv1.weight.device ) for index in range(len(input_audios)): padding_mask[index, : input_audio_lengths[index][0].item()] = 0 x, bos, eos = self(input_audios, padding_mask, input_audio_lengths) output_audios = [] for i in range(len(audio_span_tokens)): audio_span = audio_span_tokens[i] audio = x[i][: audio_span - 2] if bos is not None: audio = torch.concat([bos, audio, eos]) assert len(audio) == audio_span output_audios.append(audio) return output_audios class AudioModel(torch.nn.Module): def __init__(self, config): super(AudioModel, self).__init__() self.config = config self.audio = AudioEncoder(**config.audio_config.to_dict()) def forward(self, audio_info): audios = audio_info["input_audios"][0] input_audio_lengths = audio_info["input_audio_lengths"][0] audio_span_tokens = audio_info["audio_span_tokens"][0] audios_features = self.audio.encode(audios, input_audio_lengths, audio_span_tokens) return audios_features # endregion def get_max_megrezo_image_tokens(ctx: InputContext): hf_config = ctx.get_hf_config() return getattr(hf_config, "query_num", 64) * 10 def dummy_seq_data_for_minicpmv(seq_len: int, num_images: int): return SequenceData.from_prompt_token_counts((0, seq_len)) def dummy_image_for_minicpmv(ctx: InputContext, hf_config: PretrainedConfig, num_images: int): width = height = hf_config.vision_config.image_size imgs = [MegrezORawImageInput(image=Image.new("RGB", (width, height), color=0)) for _ in range(num_images)] return {"image": imgs} def dummy_data_for_minicpmv(ctx: InputContext, seq_len: int, mm_counts: Mapping[str, int]): hf_config = ctx.get_hf_config() num_images = mm_counts["image"] seq_data = dummy_seq_data_for_minicpmv(seq_len, num_images) mm_data = dummy_image_for_minicpmv(ctx, hf_config, num_images) # skip audio for now return (seq_data, mm_data) def input_processor_for_megrezo(ctx: InputContext, inputs: DecoderOnlyInputs): multi_modal_data = inputs.get("multi_modal_data") if multi_modal_data is None or ("image" not in multi_modal_data and "audio" not in multi_modal_data): return inputs model_config = ctx.model_config tokenizer = cached_get_tokenizer(model_config.tokenizer, trust_remote_code=model_config.trust_remote_code) processor = cached_get_processor(model_config.model, trust_remote_code=model_config.trust_remote_code) prompt = inputs.get("prompt") token_ids = inputs.get("prompt_token_ids") if prompt is None: prompt = tokenizer.decode(token_ids) images = multi_modal_data.get("image") audios = multi_modal_data.get("audio") prompt, multimodal_inputs = 
processor.process_multimodal_inputs( prompt, images=images, audios=audios, return_tensors="pt", ) text_encodings = tokenizer( prompt, return_tensors="pt", padding=True, padding_side="left", ) encodings = processor.merge_encodings(text_encodings, multimodal_inputs) data = processor.data_collator([encodings]) new_prompt = tokenizer.decode(data["input_ids"][0]) new_multi_modal_data = { "image": data["image_encoding"], "audio": data["audio_encoding"], } return token_inputs( prompt_token_ids=data["input_ids"][0], prompt=new_prompt, multi_modal_data=new_multi_modal_data, ) def input_mapper_for_megrezo(ctx: InputContext, data: object): return MultiModalInputs(data) @MULTIMODAL_REGISTRY.register_image_input_mapper(input_mapper_for_megrezo) @MULTIMODAL_REGISTRY.register_input_mapper("audio", input_mapper_for_megrezo) @MULTIMODAL_REGISTRY.register_max_multimodal_tokens("audio", 3000) @MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_megrezo_image_tokens) @INPUT_REGISTRY.register_input_processor(input_processor_for_megrezo) class MegrezOModel(nn.Module, VllmModelForTextGeneration, SupportsMultiModal, SupportsPP): packed_modules_mapping = { "qkv_proj": ["q_proj", "k_proj", "v_proj"], "gate_up_proj": ["gate_proj", "up_proj"], } def __init__( self, config: PretrainedConfig, multimodal_config: MultiModalConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, ): super().__init__() # All MiniCPM-V models disable `tie_word_embeddings` but # `PretrainedConfig.tie_word_embeddings` defaults to True; we cannot # check `tie_word_embeddings` until vLLM integrate MiniCPM-V model # and config class self.config = config self.multimodal_config = multimodal_config self.llm = self.init_llm(config, cache_config, quant_config, prefix="model") self.vision = self.init_vision_module(config, quant_config, prefix="vpm") param_dtype = torch.get_default_dtype() self.vision.to(dtype=param_dtype) self.audio = self.init_audio_module(config, quant_config) self.audio.to(dtype=param_dtype) self.vision_dim = self.vision.embeddings.embed_dim self.embed_dim = self.config.hidden_size self.resampler = self.init_resampler( self.embed_dim, self.vision_dim, quant_config=quant_config, prefix="vision.resampler" ) self.resampler.to(device="cuda", dtype=param_dtype) self.lm_head = ParallelLMHead( config.vocab_size, config.hidden_size, quant_config=quant_config, prefix="llm.lm_head" ) self.logits_processor = LogitsProcessor(config.vocab_size) self.sampler = Sampler() self.make_empty_intermediate_tensors = self.llm.make_empty_intermediate_tensors self._called_cnt = 0 def get_vision_hidden_states( self, pixel_values, tgt_sizes, patch_attn_mask, ) -> torch.Tensor: device = self.vision.embeddings.position_embedding.weight.device dtype = self.vision.embeddings.position_embedding.weight.dtype pixel_values = torch.stack([(image.to(device) - 127.5) / 127.5 for image in pixel_values]).type(dtype) vision_embedding = self.vision( pixel_values.type(dtype), patch_attention_mask=patch_attn_mask, tgt_sizes=tgt_sizes, ) return self.resampler(vision_embedding, tgt_sizes) def compose_embeddings(self, mini_batch): input_ids = mini_batch["input_ids"] image_encoding = mini_batch.get("image_encoding") audio_encoding = mini_batch.get("audio_encoding") embeddings_text = self.llm.model.embed_tokens(input_ids) input_embeds = embeddings_text if image_encoding: pixel_values = image_encoding["pixel_values"][0] tgt_sizes = image_encoding["tgt_sizes"][0] patch_attention_mask = image_encoding["patch_attention_mask"][0] 
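# insert_image_embeddings (above) reads each bounds row as (batch_id, start, end)
# and overwrites the text embeddings on [start, end) with the resampled image
# features.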
bounds_image = image_encoding["image_bounds"][0] device = self.vision.embeddings.position_embedding.weight.device dtype = self.vision.embeddings.position_embedding.weight.dtype embeddings_image = self.get_vision_hidden_states( pixel_values.to(device, dtype), tgt_sizes, patch_attention_mask.to(device), ) input_embeds = insert_image_embeddings(embeddings_text, embeddings_image, bounds_image) if audio_encoding: embeddings_audio = self.audio(audio_encoding) bounds_audio = audio_encoding["audio_bounds"][0] input_embeds = insert_audio_embeddings(embeddings_text, embeddings_audio, bounds_audio) return input_embeds def _parse_inputs(self, input_ids: torch.Tensor, **kwargs): if kwargs.get("pixel_values") is not None: image_encoding = { "pixel_values": kwargs.get("pixel_values"), "tgt_sizes": kwargs.get("tgt_sizes"), "patch_attention_mask": kwargs.get("patch_attention_mask"), "image_bounds": kwargs.get("image_bounds"), } else: image_encoding = None if kwargs.get("input_audios") is not None: audio_encoding = { "input_audios": kwargs.get("input_audios"), "input_audio_lengths": kwargs.get("input_audio_lengths"), "audio_span_tokens": kwargs.get("audio_span_tokens"), "audio_bounds": kwargs.get("audio_bounds"), } else: audio_encoding = None return { "input_ids": input_ids, "image_encoding": image_encoding, "audio_encoding": audio_encoding, } def forward( self, input_ids: torch.Tensor, positions: torch.Tensor, kv_caches: List[torch.Tensor], attn_metadata: AttentionMetadata, intermediate_tensors: Optional[IntermediateTensors] = None, **kwargs: Any, ) -> torch.Tensor: if intermediate_tensors is not None: embeddings = None else: mini_batch = self._parse_inputs(input_ids, **kwargs) embeddings = self.compose_embeddings(mini_batch) # always pass the input via `inputs_embeds` # to make sure the computation graph is consistent # for `torch.compile` integration input_ids = None output = self.llm( input_ids=input_ids, positions=positions, kv_caches=kv_caches, attn_metadata=attn_metadata, intermediate_tensors=intermediate_tensors, inputs_embeds=embeddings, ) self._called_cnt += 1 return output def compute_logits( self, hidden_states: torch.Tensor, sampling_metadata: SamplingMetadata, ) -> Optional[torch.Tensor]: logits = self.logits_processor(self.lm_head, hidden_states, sampling_metadata) return logits def sample( self, logits: torch.Tensor, sampling_metadata: SamplingMetadata, ) -> Optional[SamplerOutput]: next_tokens = self.sampler(logits, sampling_metadata) return next_tokens def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]): stacked_params_mapping = [ # (param_name, shard_name, shard_id) (".qkv_proj", ".q_proj", "q"), (".qkv_proj", ".k_proj", "k"), (".qkv_proj", ".v_proj", "v"), (".gate_up_proj", ".gate_proj", 0), (".gate_up_proj", ".up_proj", 1), ] keys_to_modify_mapping = { "llm.lm_head": "lm_head", "vision.resampler": "resampler", } params_dict = dict(self.named_parameters()) for name, loaded_weight in weights: for key_to_modify, new_key in keys_to_modify_mapping.items(): if key_to_modify in name: name = name.replace(key_to_modify, new_key) if "rotary_emb.inv_freq" in name: continue if "rotary_emb.cos_cached" in name or "rotary_emb.sin_cached" in name: # Models trained using ColossalAI may include these tensors in # the checkpoint. Skip them. continue # if "audio.positional_embedding" in name: # continue for param_name, weight_name, shard_id in stacked_params_mapping: if weight_name not in name: continue name = name.replace(weight_name, param_name) # Skip loading extra bias for GPTQ models. 
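# (quantized checkpoints may carry bias tensors for layers that have no matching
# parameter in this model; unmatched ".bias" names are simply dropped)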
if name.endswith(".bias") and name not in params_dict: continue if is_pp_missing_parameter(name, self): continue if name in params_dict: param = params_dict[name] weight_loader = param.weight_loader weight_loader(param, loaded_weight, shard_id) else: print(f"Skipping loading of {name}") break else: # Skip loading extra bias for GPTQ models. if name.endswith(".bias") and name not in params_dict: continue if name is None: continue if is_pp_missing_parameter(name, self): continue if name in params_dict: param = params_dict[name] weight_loader = getattr(param, "weight_loader", default_weight_loader) weight_loader(param, loaded_weight) else: print(f"Skipping loading of {name}") def get_mm_mapping(self) -> MultiModelKeys: """ Get the module prefix in multimodal models """ return MultiModelKeys.from_string_field(language_model="llm", connector="resampler", tower_model="vpm") def init_llm( self, config: PretrainedConfig, cache_config: Optional[CacheConfig] = None, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", ) -> nn.Module: return LLMWrapper( LlamaModel( config, cache_config=cache_config, quant_config=quant_config, prefix=prefix, ), name=prefix, ) def init_audio_module( self, config: PretrainedConfig, quant_config: Optional[QuantizationConfig], prefix: str = "", ) -> nn.Module: return AudioModel(config) def init_vision_module( self, config: PretrainedConfig, quant_config: Optional[QuantizationConfig], prefix: str = "", ) -> nn.Module: model = LLMWrapper( Idefics2VisionTransformer(config.vision_config), name=prefix, ) if self.config.drop_vision_last_layer: model.encoder.layers = model.encoder.layers[:-1] return model def init_resampler( self, embed_dim: int, vision_dim: int, quant_config: Optional[QuantizationConfig] = None, prefix: str = "", ) -> nn.Module: resampler = Resampler( num_queries=self.config.query_num, embed_dim=embed_dim, num_heads=embed_dim // 128, kv_dim=vision_dim, quant_config=quant_config, prefix=prefix, ) return resampler ================================================ FILE: vllm_demo/requirements.txt ================================================ vllm==0.6.3.post1 flash_attn==2.5.8 xformers==0.0.27.post2 ================================================ FILE: vllm_demo/try_minicpm_v.py ================================================ from transformers import AutoTokenizer from PIL import Image from vllm import LLM, SamplingParams MODEL_NAME = "/mnt/public/algm/models/MiniCPM-V-2_6/" image = Image.open("../data/sample_image.jpg").convert("RGB") tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True) llm = LLM( model=MODEL_NAME, trust_remote_code=True, gpu_memory_utilization=1, max_model_len=2048 ) messages = [{ "role": "user", "content": # Number of images "(./)" + \ "\nWhat is the content of this image?" 
}] prompt = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) # Single Inference inputs = { "prompt": prompt, "multi_modal_data": { "image": image # Multi images, the number of images should be equal to that of `(./)` # "image": [image, image] }, } # Batch Inference # inputs = [{ # "prompt": prompt, # "multi_modal_data": { # "image": image # }, # } for _ in 2] # 2.6 stop_tokens = ['<|im_end|>', '<|endoftext|>'] stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens] sampling_params = SamplingParams( stop_token_ids=stop_token_ids, use_beam_search=True, temperature=0, best_of=3, max_tokens=1024 ) outputs = llm.generate(inputs, sampling_params=sampling_params) print(outputs[0].outputs[0].text) ================================================ FILE: vllm_demo/try_qwen_vl.py ================================================ from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor from qwen_vl_utils import process_vision_info # default: Load the model on the available device(s) model = Qwen2VLForConditionalGeneration.from_pretrained( "/mnt/public/algm/models/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto" ) # default processer processor = AutoProcessor.from_pretrained("/mnt/public/algm/models/Qwen2-VL-2B-Instruct") prompt = "hi" * (128 - 1) messages = [ { "role": "user", "content": [ { "type": "image", "image": "../data/sample_image.jpg", }, {"type": "text", "text": prompt}, ], } ] # Preparation for inference text = processor.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) image_inputs, video_inputs = process_vision_info(messages) import pdb;pdb.set_trace() inputs = processor( text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt", ) inputs = inputs.to("cuda") import pdb;pdb.set_trace() # Inference: Generation of the output generated_ids = model.generate(**inputs, max_new_tokens=128) generated_ids_trimmed = [ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] output_text = processor.batch_decode( generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False ) print(output_text) ================================================ FILE: vllm_demo/vllm_profling.py ================================================ # -*- encoding: utf-8 -*- # File: example_infer_vllm.py # Description: None from PIL import Image from vllm import LLM from vllm import ModelRegistry from vllm import SamplingParams from megrezo import MegrezOModel ModelRegistry.register_model("MegrezO", MegrezOModel) # Load the model. model_path = "/mnt/algorithm/user_dir/zhoudong/workspace/models/megrez-o" # Change this to the path of the model. llm = LLM( model_path, trust_remote_code=True, gpu_memory_utilization=0.9, max_num_seqs=8, ) num_requests = 100 input_len = 128 output_length = 128 # prepare data prompt = "hi" * (input_len - 1) sampling_params = SamplingParams( temperature=0, max_tokens=output_length, repetition_penalty=1.2, stop=["<|turn_end|>", "<|eos|>"], ignore_eos=True, ) img = Image.open("../data/sample_image.jpg") conversation = [ { "role": "user", "content": { "text": prompt, "image": img, }, }, ] # Convert the conversation to vLLM acceptable format. prompt = llm.get_tokenizer().apply_chat_template( conversation, tokenize=False, add_generation_prompt=True, ) vllm_inputs = [ { "prompt": prompt, "multi_modal_data": { "image": img, }, } for _ in range(num_requests) ] # Generate the outputs. 
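# Note: with ignore_eos=True each request keeps decoding up to `output_length`
# tokens (only the explicit stop strings can end it early), so the wall time of
# this call approximates steady-state throughput for `num_requests` identical
# multimodal prompts.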
outputs = llm.generate( vllm_inputs, sampling_params, ) # Print the outputs. # for output in outputs: # print(output.outputs[0].text) ================================================ FILE: vllm_demo/vllm_profling_minicpm.py ================================================ from transformers import AutoTokenizer from PIL import Image from vllm import LLM, SamplingParams model_path = "/mnt/public/algm/models/MiniCPM-V-2_6/" image = Image.open("../data/sample_image.jpg").convert("RGB") tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) llm = LLM( model=model_path, gpu_memory_utilization=0.9, max_num_seqs=8, trust_remote_code=True, max_model_len=4096 ) num_requests = 100 input_len = 128 output_length = 128 # prepare data prompt = "hi" * (input_len - 1) sampling_params = SamplingParams( temperature=0, max_tokens=output_length, repetition_penalty=1.2, ignore_eos=True, ) messages = [{ "role": "user", "content": # Number of images "(./)" + \ prompt }] prompt = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) # Single Inference llm_inputs = [{ "prompt": prompt, "multi_modal_data": { "image": image }, } for _ in range(num_requests)] outputs = llm.generate(llm_inputs, sampling_params=sampling_params) ================================================ FILE: vllm_demo/vllm_profling_qwen.py ================================================ from transformers import AutoProcessor from vllm import LLM, SamplingParams from qwen_vl_utils import process_vision_info # Load the model. model_path = "/mnt/public/algm/models/Qwen2-VL-2B-Instruct" # Change this to the path of the model. llm = LLM( model=model_path, limit_mm_per_prompt={"image": 10, "video": 10}, gpu_memory_utilization=0.9, max_num_seqs=8, ) num_requests = 100 input_len = 128 output_length = 128 # prepare data prompt = "hi" * (input_len - 1) sampling_params = SamplingParams( temperature=0, max_tokens=output_length, repetition_penalty=1.2, ignore_eos=True, ) messages = [ {"role": "system", "content": "You are a helpful assistant."}, { "role": "user", "content": [ { "type": "image", "image": "../data/sample_image.jpg", "min_pixels": 224 * 224, "max_pixels": 1024 * 1024, }, {"type": "text", "text": prompt}, ], }, ] processor = AutoProcessor.from_pretrained(model_path) prompt = processor.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, ) image_inputs, video_inputs = process_vision_info(messages) mm_data = {} if image_inputs is not None: mm_data["image"] = image_inputs if video_inputs is not None: mm_data["video"] = video_inputs llm_inputs = [ { "prompt": prompt, "multi_modal_data": mm_data, } for _ in range(num_requests) ] outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
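
# A minimal post-run sketch, assuming the standard RequestOutput/CompletionOutput
# fields of this vLLM version: report how many tokens were actually generated so
# the three profiling scripts can be compared on equal footing.
total_generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"requests: {len(outputs)}  generated tokens: {total_generated}")
print(f"avg tokens/request: {total_generated / max(len(outputs), 1):.1f}")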