Repository: QwenLM/Qwen-VL Branch: master Commit: aa00ed04091e Files: 61 Total size: 375.0 KB Directory structure: gitextract_e0pgouhu/ ├── .github/ │ └── ISSUE_TEMPLATE/ │ ├── bug_report.yaml │ ├── config.yaml │ └── feature_request.yaml ├── .gitignore ├── BUILD.md ├── Dockerfile.qwendemo ├── Dockerfile.qwenint4openai ├── Dockerfile.qwenopenai ├── FAQ.md ├── FAQ_ja.md ├── FAQ_ko.md ├── FAQ_zh.md ├── LICENSE ├── NOTICE ├── README.md ├── README_CN.md ├── README_JA.md ├── README_KO.md ├── TUTORIAL.md ├── TUTORIAL_ja.md ├── TUTORIAL_ko.md ├── TUTORIAL_zh.md ├── assets/ │ └── mm_tutorial/ │ └── TUTORIAL.ipynb ├── eval_mm/ │ ├── EVALUATION.md │ ├── evaluate_caption.py │ ├── evaluate_grounding.py │ ├── evaluate_multiple_choice.py │ ├── evaluate_vqa.py │ ├── infographicsvqa_eval.py │ ├── mmbench/ │ │ ├── MMBENCH.md │ │ ├── evaluate_multiple_choice_mmbench.py │ │ ├── mmbench_converter_dev.py │ │ ├── mmbench_converter_test.py │ │ ├── mmbench_evaluation.py │ │ ├── mmbench_evaluation_tricky.py │ │ └── mmbench_predict_to_submission.py │ ├── mme/ │ │ ├── EVAL_MME.md │ │ ├── eval.py │ │ └── get_images.py │ ├── seed_bench/ │ │ ├── EVAL_SEED.md │ │ ├── eval.py │ │ └── trans.py │ ├── vqa.py │ └── vqa_eval.py ├── finetune/ │ ├── ds_config_zero2.json │ ├── ds_config_zero3.json │ ├── finetune_ds.sh │ ├── finetune_lora_ds.sh │ ├── finetune_lora_single_gpu.sh │ ├── finetune_qlora_ds.sh │ └── finetune_qlora_single_gpu.sh ├── finetune.py ├── openai_api.py ├── requirements.txt ├── requirements_openai_api.txt ├── requirements_web_demo.txt ├── touchstone/ │ ├── README.md │ ├── README_CN.md │ ├── README_JA.md │ └── README_KO.md └── web_demo_mm.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/ISSUE_TEMPLATE/bug_report.yaml ================================================ name: 🐞 Bug description: 提交错误报告 | File a bug/issue title: "[BUG]
中文  |  English   |  日本語 |  한국어 
Qwen-VL
🤗
🤖  |
Qwen-VL-Chat
🤗
🤖 
(Int4:
🤗
🤖 ) |
Qwen-VL-Plus
🤗
🤖  |
Qwen-VL-Max
🤗
🤖 
Web   |   
APP   |   
API   |   
WeChat   |   
Discord   |   
Paper   |   
Tutorial
| Model | DocVQA | ChartQA | AI2D | TextVQA | MMMU | MathVista | MM-Bench-CN |
|---|---|---|---|---|---|---|---|
| Other Best Open-source LVLM |
81.6% (CogAgent) |
68.4% (CogAgent) |
73.7% (Fuyu-Medium) |
76.1% (CogAgent) |
45.9% (Yi-VL-34B) |
36.7% (SPHINX-V2) |
72.4% (InternLM-XComposer-VL) |
| Gemini Pro | 88.1% | 74.1% | 73.9% | 74.6% | 47.9% | 45.2% | 74.3% |
| Gemini Ultra | 90.9% | 80.8% 1 | 79.5% 1 | 82.3% 1 | 59.4% 1 | 53.0% 1 | - |
| GPT-4V | 88.4% | 78.5% | 78.2% | 78.0% | 56.8% | 49.9% | 73.9% |
| Qwen-VL-Plus | 91.4% | 78.1% | 75.9% | 78.9% | 45.2% | 43.3% | 68.0% |
| Qwen-VL-Max | 93.1% 1 | 79.8% 2 | 79.3% 2 | 79.5% 2 | 51.4% 3 | 51.0% 2 | 75.1% 1 |
We release two models of the Qwen-VL series:
- Qwen-VL: The pre-trained LVLM model uses Qwen-7B as the initialization of the LLM, and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) as the initialization of the visual encoder. And connects them with a randomly initialized cross-attention layer.
- Qwen-VL-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities.
## Evaluation
We evaluated the model's abilities from three perspectives:
1. **Standard Benchmarks**: We evaluate the model's basic task capabilities on four major categories of multimodal tasks:
- Zero-shot Captioning: Evaluate model's zero-shot image captioning ability on unseen datasets;
- General VQA: Evaluate the general question-answering ability of pictures, such as the judgment, color, number, category, etc;
- Text-based VQA: Evaluate the model's ability to recognize text in pictures, such as document QA, chart QA, etc;
- Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.
2. **TouchStone**: To evaluate the overall text-image dialogue capability and alignment level with humans, we have constructed a benchmark called [TouchStone](https://github.com/OFA-Sys/TouchStone), which is based on scoring with GPT4 to evaluate the LVLM model.
- The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories. Such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, math problem solving, etc;
- In order to break the current limitation of GPT4 in terms of direct image input, TouchStone provides fine-grained image annotations by human labeling. These detailed annotations, along with the questions and the model's output, are then presented to GPT4 for scoring.
- The benchmark includes both English and Chinese versions.
3. **Other Multimodal Benchmarks**: We also evaluated our model's capabilities in other multimodal benchmarks:
- [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), a comprehensive evaluation benchmark for multimodal large language models. Qwen-VL-Chat achieves SOTAs on both perception and cognition tracks.
- [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs. Qwen series achieves SOTAs on this benchmark.
The results of the evaluation are as follows:
Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and has a more comprehensive coverage in terms of capability range.
### Zero-shot Captioning & General VQA
| Model type | Model | Zero-shot Captioning | General VQA | |||||
|---|---|---|---|---|---|---|---|---|
| NoCaps | Flickr30K | VQAv2dev | OK-VQA | GQA | SciQA-Img (0-shot) |
VizWiz (0-shot) |
||
| Generalist Models |
Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8 |
| Flamingo-80B | - | 67.2 | 56.3 | 50.6 | - | - | 31.6 | |
| Unified-IO-XL | 100.0 | - | 77.9 | 54.0 | - | - | - | |
| Kosmos-1 | - | 67.1 | 51.0 | - | - | - | 29.2 | |
| Kosmos-2 | - | 80.5 | 51.1 | - | - | - | - | |
| BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6 | |
| InstructBLIP (Vicuna-13B) | 121.9 | 82.8 | - | - | 49.5 | 63.1 | 33.4 | |
| Shikra (Vicuna-13B) | - | 73.9 | 77.36 | 47.16 | - | - | - | |
| Qwen-VL (Qwen-7B) | 121.4 | 85.8 | 78.8 | 58.6 | 59.3 | 67.1 | 35.2 | |
| Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 68.2 | 38.9 | |
| Previous SOTA (Per Task Fine-tuning) |
- | 127.0 (PALI-17B) |
84.5 (InstructBLIP -FlanT5-XL) |
86.1 (PALI-X -55B) |
66.1 (PALI-X -55B) |
72.1 (CFR) |
92.53 (LLaVa+ GPT-4) |
70.9 (PALI-X -55B) |
| Model type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA |
|---|---|---|---|---|---|---|
| Generalist Models | BLIP-2 (Vicuna-13B) | 42.4 | - | - | - | - |
| InstructBLIP (Vicuna-13B) | 50.7 | - | - | - | - | |
| mPLUG-DocOwl (LLaMA-7B) | 52.6 | 62.2 | 57.4 | - | - | |
| Pix2Struct-Large (1.3B) | - | 76.6 | 58.6 | 42.1 | 71.3 | |
| Qwen-VL (Qwen-7B) | 63.8 | 65.1 | 65.7 | 62.3 | 75.7 | |
| Specialist SOTAs (Specialist/Finetuned) |
PALI-X-55B (Single-task FT) (Without OCR Pipeline) |
71.44 | 80.0 | 70.0 | 81.2 | 75.0 |
| Model type | Model | RefCOCO | RefCOCO+ | RefCOCOg | GRIT | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| val | test-A | test-B | val | test-A | test-B | val-u | test-u | refexp | ||
| Generalist Models | GPV-2 | - | - | - | - | - | - | - | - | 51.50 |
| OFA-L* | 79.96 | 83.67 | 76.39 | 68.29 | 76.00 | 61.75 | 67.57 | 67.58 | 61.70 | |
| Unified-IO | - | - | - | - | - | - | - | - | 78.61 | |
| VisionLLM-H | 86.70 | - | - | - | - | - | - | - | ||
| Shikra-7B | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 69.34 | |
| Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 82.64 | 83.16 | 69.03 | |
| Qwen-VL-7B | 89.36 | 92.26 | 85.34 | 83.12 | 88.25 | 77.21 | 85.58 | 85.48 | 78.22 | |
| Qwen-VL-7B-Chat | 88.55 | 92.27 | 84.51 | 82.82 | 88.59 | 76.79 | 85.96 | 86.32 | - | |
| Specialist SOTAs (Specialist/Finetuned) |
G-DINO-L | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 | - |
| UNINEXT-H | 92.64 | 94.33 | 91.46 | 85.24 | 89.63 | 79.79 | 88.73 | 89.37 | - | |
| ONE-PEACE | 92.58 | 94.18 | 89.26 | 88.77 | 92.21 | 83.23 | 89.22 | 89.27 | - | |
#### SEED-Bench [SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both **image** and **video** understanding. See more details on [HERE](eval_mm/seed_bench/EVAL_SEED.md). Qwen-VL and Qwen-VL-Chat achieve SOTAs on this benchmark.
## Requirements
* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above are recommended
* CUDA 11.4 and above are recommended (this is for GPU users)
## Quickstart
Below, we provide simple examples to show how to use Qwen-VL and Qwen-VL-Chat with 🤖 ModelScope and 🤗 Transformers.
Before running the code, make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.
```bash
pip install -r requirements.txt
```
Now you can start with ModelScope or Transformers. More usage aboue vision encoder, please refer to the [tutorial](TUTORIAL.md).
#### 🤗 Transformers
To use Qwen-VL-Chat for the inference, all you need to do is to input a few lines of codes as demonstrated below. However, **please make sure that you are using the latest code.**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 1st dialogue turn
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': '这是什么?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名女子在沙滩上和狗玩耍,旁边是一只拉布拉多犬,它们处于沙滩上。
# 2nd dialogue turn
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# 击掌
Running Qwen-VL
Running Qwen-VL pretrained base model is also simple.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()
# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)
# https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpegGenerate the caption in English with grounding: Woman
tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍,狗的品种是拉布拉多。她们坐在沙滩上,狗的前腿抬起来,与人互动。
# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# "击掌"
## Quantization
### Usage
We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-VL-Chat, Qwen-VL-Chat-Int4 [Click here](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4), which achieves nearly lossless model effects but improved performance on both memory costs and inference speed.
Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:
```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git & cd AutoGPTQ
pip install -v .
```
If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.
Then you can load the quantized model easily and run inference as same as usual:
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-VL-Chat-Int4",
device_map="auto",
trust_remote_code=True
).eval()
# Either a local path or an url between tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
```
### Performance
We illustrate the model performance of both BF16 and Int4 models on the benchmark **[TouchStone](https://github.com/OFA-Sys/TouchStone)**, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
| Quantization | ZH | EN |
| ------------ | :--------: | :-----------: |
| BF16 | 401.2 | 645.2 |
| Int4 | 386.6 | 651.4 |
### Inference Speed
We measured the average inference speed (tokens/s) of generating 1792 (2048-258) and 7934 (8192-258) tokens with the context of an image (which takes 258 tokens) under BF16 precision and Int4 quantization, respectively.
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16 | 28.87 | 24.32 |
| Int4 | 37.79 | 34.34 |
The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.
### GPU Memory Usage
We also profile the peak GPU memory usage for encoding 1792 (2048-258) tokens (including an image) as context (and generating single token) and generating 7934 (8192-258) tokens (with an image as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 22.60GB | 28.01GB |
| Int4 | 11.82GB | 17.23GB |
The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py).
## Finetuning
Now we provide the official training script, `finetune.py`, for users to finetune the pretrained model for downstream applications in a simple fashion. Additionally, we provide shell scripts to launch finetuning with no worries. This script supports the training with DeepSpeed and FSDP. The shell scripts that we provide use DeepSpeed, and thus we advise you to install DeepSpeed before you start:
```bash
pip install deepspeed
```
### Data preparation
To prepare your training data, you need to put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list for conversation. Below is a simple example list with 1 sample:
```json
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是Qwen-VL,一个支持视觉输入的大模型。"
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "user",
"value": "Picture 1: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg\n图中的狗是什么品种?"
},
{
"from": "assistant",
"value": "图中是一只拉布拉多犬。"
},
{
"from": "user",
"value": "框出图中的格子衬衫"
},
{
"from": "assistant",
"value": "格子衬衫
assets/mm_tutorial/Chongqing.jpeg\nPicture 2:
assets/mm_tutorial/Beijing.jpeg\n图中都是哪"
},
{
"from": "assistant",
"value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
}
]
}
]
```
For the VL tasks, there are special tokens that are used, including `
img_path\n{your prompt}`, where `id` indicates the position of the image in the conversation, starting from 1. The "img_path" can be a local file path or a web link.
The coordinate box is expressed as `
| Method | Sequence Length | |||
|---|---|---|---|---|
| 384 | 512 | 1024 | 2048 | |
| LoRA (Base) | 37.1G / 2.3s/it | 37.3G / 2.4s/it | 38.7G / 3.6s/it | 38.7G / 6.1s/it |
| LoRA (Chat) | 23.3G / 2.2s/it | 23.6G / 2.3s/it | 25.1G / 3.5s/it | 27.3G / 5.9s/it |
| Q-LoRA | 17.0G / 4.2s/it | 17.2G / 4.5s/it | 18.2G / 5.5s/it | 19.3G / 7.9s/it |
中文  |  English   |  日本語 |  한국어 
Qwen-VL
🤗
🤖  |
Qwen-VL-Chat
🤗
🤖 
(Int4:
🤗
🤖 ) |
Qwen-VL-Plus
🤗
🤖  |
Qwen-VL-Max
🤗
🤖 
Web   |   
APP   |   
API   |   
WeChat   |   
Discord   |   
Paper   |   
Tutorial
| Model | DocVQA (文档理解) |
ChartQA (图表理解) |
AI2D (科学图例) |
TextVQA (文字阅读) |
MMMU (多学科问题) |
MathVista (数学推理) |
MM-Bench-CN (中文问答) |
|---|---|---|---|---|---|---|---|
| Other Best Open-source LVLM |
81.6% (CogAgent) |
68.4% (CogAgent) |
73.7% (Fuyu-Medium) |
76.1% (CogAgent) |
45.9% (Yi-VL-34B) |
36.7% (SPHINX-V2) |
72.4% (InternLM-XComposer-VL) |
| Gemini Pro | 88.1% | 74.1% | 73.9% | 74.6% | 47.9% | 45.2% | 74.3% |
| Gemini Ultra | 90.9% | 80.8% 1 | 79.5% 1 | 82.3% 1 | 59.4% 1 | 53.0% 1 | - |
| GPT-4V | 88.4% | 78.5% | 78.2% | 78.0% | 56.8% | 49.9% | 73.9% |
| Qwen-VL-Plus | 91.4% | 78.1% | 75.9% | 78.9% | 44.0% | 43.3% | 68.0% |
| Qwen-VL-Max | 92.5% 1 | 79.8% 2 | 79.3% 2 | 79.5% 2 | 51.4% 3 | 51.0% 2 | 75.1% 1 |
目前,我们提供了 Qwen-VL 系列的两个模型:
- Qwen-VL: Qwen-VL 以 Qwen-7B 的预训练模型作为语言模型的初始化,并以 [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) 作为视觉编码器的初始化,中间加入单层随机初始化的 cross-attention,经过约1.5B的图文数据训练得到。最终图像输入分辨率为448。
- Qwen-VL-Chat: 在 Qwen-VL 的基础上,我们使用对齐机制打造了基于大语言模型的视觉AI助手Qwen-VL-Chat,它支持更灵活的交互方式,包括多图、多轮问答、创作等能力。
## 评测
我们从三个角度评测了模型的能力:
1. 在**英文标准 Benchmark** 上评测模型的基础任务能力。目前评测了四大类多模态任务:
- Zero-shot Captioning: 评测模型在未见过数据集上的零样本图片描述能力;
- General VQA: 评测模型的通用问答能力,例如判断题、颜色、个数、类目等问答能力;
- Text-based VQA:评测模型对于图片中文字相关的识别/问答能力,例如文档问答、图表问答、文字问答等;
- Referring Expression Compression:评测模型给定物体描述画检测框的能力;
2. **试金石 (TouchStone)**:为了评测模型整体的图文对话能力和人类对齐水平。我们为此构建了一个基于 GPT4 打分来评测 LVLM 模型的 Benchmark:TouchStone。在 TouchStone-v0.1 中:
- 评测基准总计涵盖 300+张图片、800+道题目、27个类别。包括基础属性问答、人物地标问答、影视作品问答、视觉推理、反事实推理、诗歌创作、故事写作,商品比较、图片解题等**尽可能广泛的类别**。
- 为了弥补目前 GPT4 无法直接读取图片的缺陷,我们给所有的带评测图片提供了**人工标注的充分详细描述**,并且将图片的详细描述、问题和模型的输出结果一起交给 GPT4 打分。
- 评测同时包含英文版本和中文版本。
3. **其它多模态通用模型榜单**:我们也在其它多模态通用模型榜单中评测了模型的能力:
- MME Benchmark: 是一个多模态大型语言模型的综合评价基准。它在总共14个子任务上评测**感知和认知**能力,Qwen-VL-Chat在这两个总维度上都实现了当前最好结果。
- SEED-Bench: 是一个包含1.9万选择题的多模态基准测评,通过人工注释的结果评估多模态大模型,涵盖12个评估维度,包括**图像和视频理解**,Qwen-VL和Qwen-VL-chat在这个基准上实现了当前最好结果。
评测结果如下:
Qwen-VL在多个VL任务上相比目前SOTA的Generalist Models都有明显优势,并且在能力范围也覆盖更加全面。
### 零样本图像描述生成(Zero-shot Image Caption) 及 通用视觉问答(General VQA)
| Model type | Model | Zero-shot Captioning | General VQA | |||||
|---|---|---|---|---|---|---|---|---|
| NoCaps | Flickr30K | VQAv2dev | OK-VQA | GQA | SciQA-Img (0-shot) |
VizWiz (0-shot) |
||
| Generalist Models |
Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8 |
| Flamingo-80B | - | 67.2 | 56.3 | 50.6 | - | - | 31.6 | |
| Unified-IO-XL | 100.0 | - | 77.9 | 54.0 | - | - | - | |
| Kosmos-1 | - | 67.1 | 51.0 | - | - | - | 29.2 | |
| Kosmos-2 | - | 66.7 | 45.6 | - | - | - | - | |
| BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6 | |
| InstructBLIP (Vicuna-13B) | 121.9 | 82.8 | - | - | 49.5 | 63.1 | 33.4 | |
| Shikra (Vicuna-13B) | - | 73.9 | 77.36 | 47.16 | - | - | - | |
| Qwen-VL (Qwen-7B) | 121.4 | 85.8 | 78.8 | 58.6 | 59.3 | 67.1 | 35.2 | |
| Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 68.2 | 38.9 | |
| Previous SOTA (Per Task Fine-tuning) |
- | 127.0 (PALI-17B) |
84.5 (InstructBLIP -FlanT5-XL) |
86.1 (PALI-X -55B) |
66.1 (PALI-X -55B) |
72.1 (CFR) |
92.53 (LLaVa+ GPT-4) |
70.9 (PALI-X -55B) |
| Model type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA |
|---|---|---|---|---|---|---|
| Generalist Models | BLIP-2 (Vicuna-13B) | 42.4 | - | - | - | - |
| InstructBLIP (Vicuna-13B) | 50.7 | - | - | - | - | |
| mPLUG-DocOwl (LLaMA-7B) | 52.6 | 62.2 | 57.4 | - | - | |
| Pix2Struct-Large (1.3B) | - | 76.6 | 58.6 | 42.1 | 71.3 | |
| Qwen-VL (Qwen-7B) | 63.8 | 65.1 | 65.7 | 62.3 | 75.7 | |
| Specialist SOTAs (Specialist/Finetuned) |
PALI-X-55B (Single-task FT) (Without OCR Pipeline) |
71.44 | 80.0 | 70.0 | 81.2 | 75.0 |
| Model type | Model | RefCOCO | RefCOCO+ | RefCOCOg | GRIT | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| val | test-A | test-B | val | test-A | test-B | val-u | test-u | refexp | ||
| Generalist Models | GPV-2 | - | - | - | - | - | - | - | - | 51.50 |
| OFA-L* | 79.96 | 83.67 | 76.39 | 68.29 | 76.00 | 61.75 | 67.57 | 67.58 | 61.70 | |
| Unified-IO | - | - | - | - | - | - | - | - | 78.61 | |
| VisionLLM-H | 86.70 | - | - | - | - | - | - | - | ||
| Shikra-7B | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 69.34 | |
| Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 82.64 | 83.16 | 69.03 | |
| Qwen-VL-7B | 89.36 | 92.26 | 85.34 | 83.12 | 88.25 | 77.21 | 85.58 | 85.48 | 78.22 | |
| Qwen-VL-7B-Chat | 88.55 | 92.27 | 84.51 | 82.82 | 88.59 | 76.79 | 85.96 | 86.32 | - | |
| Specialist SOTAs (Specialist/Finetuned) |
G-DINO-L | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 | - |
| UNINEXT-H | 92.64 | 94.33 | 91.46 | 85.24 | 89.63 | 79.79 | 88.73 | 89.37 | - | |
| ONE-PEACE | 92.58 | 94.18 | 89.26 | 88.77 | 92.21 | 83.23 | 89.22 | 89.27 | - | |
#### SEED-Bench SEED-Bench是一个包含1.9万选择题的多模态基准测评,通过人工注释的结果评估多模态大模型,涵盖12个评估维度,包括**图像和视频理解**。Qwen-VL和Qwen-VL-chat在这个基准上实现了SOTAs。完整复现[见此](eval_mm/seed_bench/EVAL_SEED.md)。
## 部署要求
* python 3.8及以上版本
* pytorch 1.12及以上版本,推荐2.0及以上版本
* 建议使用CUDA 11.4及以上(GPU用户需考虑此选项)
## 快速使用
我们提供简单的示例来说明如何利用 🤖 ModelScope 和 🤗 Transformers 快速使用 Qwen-VL 和 Qwen-VL-Chat。
在开始前,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库。
```bash
pip install -r requirements.txt
```
接下来你可以开始使用Transformers或者ModelScope来使用我们的模型。关于视觉模块的更多用法,请参考[教程](TUTORIAL_zh.md)。
#### 🤗 Transformers
如希望使用 Qwen-VL-chat 进行推理,所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码。**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
# 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# 使用CPU进行推理,需要约32GB内存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# 默认gpu进行推理,需要约24GB显存
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
# 可指定不同的生成长度、top_p等相关超参(transformers 4.32.0及以上无需执行此操作)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 第一轮对话
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': '这是什么?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名女子在沙滩上和狗玩耍,旁边是一只拉布拉多犬,它们处于沙滩上。
# 第二轮对话
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# 击掌
运行Qwen-VL同样非常简单。
https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpegGenerate the caption in English with grounding: Woman
若在使用上述代码时由于各种原因无法从 HuggingFace 拉取模型和代码,可以先从 ModelScope 下载模型及代码至本地,再从本地加载模型:
```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-VL')
model_dir = snapshot_download('qwen/Qwen-VL-Chat')
# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_dir,
device_map="cuda",
trust_remote_code=True
).eval()
```
#### 🤖 ModelScope
魔搭(ModelScope)是开源的模型即服务共享平台,为泛AI开发者提供灵活、易用、低成本的一站式模型服务产品。使用ModelScope同样非常简单,代码如下所示:
```python
from modelscope import (
snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch
model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'
model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
tokenizer.model_dir = model_dir
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# 使用CPU进行推理,需要约32GB内存
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# 默认gpu进行推理,需要约24GB显存
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
# 指定生成超参数(transformers 4.32.0及以上无需执行此操作)
# model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)
# 第一轮对话
# Either a local path or an url between tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍,狗的品种是拉布拉多。她们坐在沙滩上,狗的前腿抬起来,与人互动。
# 第二轮对话
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# "击掌"
## 量化
### 用法
当前我们提供了基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化方案,并提供了Qwen-VL-Chat的Int4量化版本Qwen-VL-Chat-Int4 [点击此处](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)。该模型在效果评测上几乎无损,并在显存占用和推理速度上具有明显优势。
下文说明如何使用该量化模型。开始之前,请确保你满足要求(如torch2.0及以上、transformers 4.32.0及以上,等)并安装所需的代码库:
```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git & cd AutoGPTQ
pip install -v .
```
如遇到安装 `auto-gptq` 的问题,建议您前往官方[repo](https://github.com/PanQiWei/AutoGPTQ) 寻找合适的wheel。
随后你便可以按照上述用法****,轻松调用量化模型:
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-VL-Chat-Int4",
device_map="auto",
trust_remote_code=True
).eval()
# Either a local path or an url between tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
```
### 效果评测
我们列出不同精度下模型在评测基准 **[TouchStone](https://github.com/OFA-Sys/TouchStone)** 上的表现,并发现量化模型并没有显著性能损失。结果如下所示:
| Quantization | ZH | EN |
| ------------ | :--------: | :-----------: |
| BF16 | 401.2 | 645.2 |
| Int4 | 386.6 | 651.4 |
### 推理速度
我们测算了在输入一张图片(即258个token)的条件下BF16和Int4的模型生成1792 (2048-258) 和 7934 (8192-258) 个token的平均速度。
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16 | 28.87 | 24.32 |
| Int4 | 37.79 | 34.34 |
推理速度测算是在单卡 A100-SXM4-80G GPU上运行,使用PyTorch 2.0.1及CUDA 11.4。
### GPU显存占用
我们还测算了在一张图片输入的条件下BF16和Int4模型生成1792 (2048-258) 和 7934 (8192-258) 个token所需显存。结果如下所示:
| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 22.60GB | 28.01GB |
| Int4 | 11.82GB | 17.23GB |
上述速度和显存测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py)完成。
## 微调
我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能,以接入下游任务。此外,我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed,因此建议您确保已经安装DeepSpeed。
首先,你需要准备你的训练数据。你需要将所有样本放到一个列表中并存入json文件中。每个样本对应一个字典,包含id和conversation,其中后者为一个列表。示例如下所示:
```json
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是Qwen-VL,一个支持视觉输入的大模型。"
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "user",
"value": "Picture 1: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg\n图中的狗是什么品种?"
},
{
"from": "assistant",
"value": "图中是一只拉布拉多犬。"
},
{
"from": "user",
"value": "框出图中的格子衬衫"
},
{
"from": "assistant",
"value": "格子衬衫
assets/mm_tutorial/Chongqing.jpeg\nPicture 2:
assets/mm_tutorial/Beijing.jpeg\n图中都是哪"
},
{
"from": "assistant",
"value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
}
]
}
]
```
为针对多样的VL任务,我们增加了一下的特殊tokens: `
img_path\n{your prompt}`,其中`id`表示对话中的第几张图片。"img_path"可以是本地的图片或网络地址。
对话中的检测框可以表示为`
| Method | Sequence Length | |||
|---|---|---|---|---|
| 384 | 512 | 1024 | 2048 | |
| LoRA (Base) | 37.1G / 2.3s/it | 37.3G / 2.4s/it | 38.7G / 3.6s/it | 38.7G / 6.1s/it |
| LoRA (Chat) | 23.3G / 2.2s/it | 23.6G / 2.3s/it | 25.1G / 3.5s/it | 27.3G / 5.9s/it |
| Q-LoRA | 17.0G / 4.2s/it | 17.2G / 4.5s/it | 18.2G / 5.5s/it | 19.3G / 7.9s/it |
中文  |   English  |  日本語 
Qwen-VL 🤖 | 🤗  | Qwen-VL-Chat 🤖 | 🤗  | Qwen-VL-Chat-Int4 🤗
WeChat   |   Discord   |   Demo  |  Paper   |   Colab   |   Tutorial
日本語ドキュメントメンテナー: Ikko Eltociear Ashimine
Qwen-VL シリーズの 2 つのモデルを公開します:
- Qwen-VL: LLM の初期化に Qwen-7B を、視覚エンコーダの初期化に [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) を用いた学習済み LVLM モデル。そして、それらをランダムに初期化されたクロスアテンションレイヤーで接続する。
- Qwen-VL-Chat: マルチモーダルな LLM ベースの AI アシスタント。Qwen-VL-Chat は、複数の画像入力、複数ラウンドの質問応答、クリエイティブな機能など、より柔軟なインタラクションをサポートします。
## ニュースとアップデート
* 2023.11.28 Qwen-VL は、GPT4V、PALI-X を凌駕する最高レベルの [DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) をシングルモデルで達成し、直接画像を入力するだけで様々なタスクを分析理解できる汎用モデルであり。 https://qianwen.aliyun.com のマルチモーダルタブで直接新しいモデルを体験できます。
* 2023.9.25 Qwen-VL-Chat モデルが更新され、中国語コマンドのフォローがより堅牢になり、Web ページと表の画像の理解と質問と回答の機能が向上し、対話のパフォーマンスが向上しました (タッチストーン: 中国語: 401.2->481.7、英語: 645.2->711.6)。
* 2023.9.12 フルパラメータ微調整、LoRA、Q-LoRA を含む、Qwen-VL モデルの微調整をサポートするようになりました。
* 2023.9.8 [Colab](https://github.com/camenduru/Qwen-VL-Chat-colab) のサンプルを提供してくれた [camenduru](https://github.com/camenduru) に感謝します。これをチュートリアルとして使用して、12G GPU でローカルまたはオンラインのデモを行うことができます。
* 2023.9.4 Qwen-VL シリーズは、画像とビデオの両方の理解を含むマルチモーダル LLM を評価するための、正確な人による注釈を備えた 19,000 個の多肢選択質問のマルチモーダル ベンチマークである [Seed-Bench](eval_mm/seed_bench/EVAL_SEED.md) で SOTA を達成します。
* 2023.9.1 基本的な認識と理解だけでなく、文学創作までを含むマルチモーダル言語モデルの包括的な評価である [TouchStone](https://github.com/OFA-Sys/TouchStone) 評価をリリースします 。 強力な LLM を判定者として使用し、マルチモーダルな情報をテキストに変換します。
* 2023.8.31 低メモリコストでありながら推論速度の向上を実現する Qwen-VL-Chat 用の Int4 量子化モデル **Qwen-VL-Chat-Int4** をリリースしました。 また、ベンチマーク評価においても大きなパフォーマンスの低下はありません。
* 2023.8.22 ModelScope と Hugging Face で **Qwen-VL** と **Qwen-VL-Chat** をリリースしました。 また、トレーニングの詳細やモデルのパフォーマンスなど、モデルの詳細については [論文](https://arxiv.org/abs/2308.12966) も提供しています。
## 評価
モデルの能力を2つの観点から評価しました:
1. **標準ベンチマーク**: マルチモーダルなタスクの 4 つの主要カテゴリーについて、モデルの基本的なタスク能力を評価する:
- ゼロショットキャプション: 未見のデータセットに対して、モデルのゼロショット画像キャプション能力を評価する;
- 一般的な VQA: 判定、色、数、カテゴリなど、画像の一般的な質問応答能力を評価する;
- テキストベース VQA: 文書 QA、図表 QAなど、写真内のテキストを認識するモデルの能力を評価する;
- 参照表現理解: 参照表現理解: 参照表現で記述された画像内の対象物を特定する能力を評価する。
2. **TouchStone**: 総合的なテキスト画像対話能力と人間とのアライメントレベルを評価するために、GPT4 によるスコアリングに基づく TouchStone と呼ばれるベンチマークを構築し、LVLM モデルを評価しました。
- TouchStone ベンチマークは、合計 300 以上の画像、800 以上の質問、27 のカテゴリをカバーしています。例えば、属性ベースの Q&A、有名人の認識、詩の作文、複数の画像の要約、商品比較、数学の問題解決などです;
- 画像の直接入力という GPT4 の現在の制限を打ち破るため、TouchStone は人間のラベル付けによるきめ細かい画像注釈を提供します。これらの詳細な注釈は、質問とモデルの出力と共に、採点のために GPT4 に提示されます。
- ベンチマークには英語版と中国語版があります。
評価結果は以下の通りです:
Qwen-VL は、複数の VL タスクにおいて、現行の SOTA ジェネラリストモデルを上回り、また、能力範囲の点でより包括的なカバレッジを持ちます。
### ゼロショットキャプションと一般的な VQA
| Model type | Model | Zero-shot Captioning | General VQA | |||||
|---|---|---|---|---|---|---|---|---|
| NoCaps | Flickr30K | VQAv2dev | OK-VQA | GQA | SciQA-Img (0-shot) |
VizWiz (0-shot) |
||
| Generalist Models |
Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8 |
| Flamingo-80B | - | 67.2 | 56.3 | 50.6 | - | - | 31.6 | |
| Unified-IO-XL | 100.0 | - | 77.9 | 54.0 | - | - | - | |
| Kosmos-1 | - | 67.1 | 51.0 | - | - | - | 29.2 | |
| Kosmos-2 | - | 80.5 | 51.1 | - | - | - | - | |
| BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6 | |
| InstructBLIP (Vicuna-13B) | 121.9 | 82.8 | - | - | 49.5 | 63.1 | 33.4 | |
| Shikra (Vicuna-13B) | - | 73.9 | 77.36 | 47.16 | - | - | - | |
| Qwen-VL (Qwen-7B) | 121.4 | 85.8 | 78.8 | 58.6 | 59.3 | 67.1 | 35.2 | |
| Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 68.2 | 38.9 | |
| Previous SOTA (Per Task Fine-tuning) |
- | 127.0 (PALI-17B) |
84.5 (InstructBLIP -FlanT5-XL) |
86.1 (PALI-X -55B) |
66.1 (PALI-X -55B) |
72.1 (CFR) |
92.53 (LLaVa+ GPT-4) |
70.9 (PALI-X -55B) |
| Model type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA |
|---|---|---|---|---|---|---|
| Generalist Models | BLIP-2 (Vicuna-13B) | 42.4 | - | - | - | - |
| InstructBLIP (Vicuna-13B) | 50.7 | - | - | - | - | |
| mPLUG-DocOwl (LLaMA-7B) | 52.6 | 62.2 | 57.4 | - | - | |
| Pix2Struct-Large (1.3B) | - | 76.6 | 58.6 | 42.1 | 71.3 | |
| Qwen-VL (Qwen-7B) | 63.8 | 65.1 | 65.7 | 62.3 | 75.7 | |
| Specialist SOTAs (Specialist/Finetuned) |
PALI-X-55B (Single-task FT) (Without OCR Pipeline) |
71.44 | 80.0 | 70.0 | 81.2 | 75.0 |
| Model type | Model | RefCOCO | RefCOCO+ | RefCOCOg | GRIT | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| val | test-A | test-B | val | test-A | test-B | val-u | test-u | refexp | ||
| Generalist Models | GPV-2 | - | - | - | - | - | - | - | - | 51.50 |
| OFA-L* | 79.96 | 83.67 | 76.39 | 68.29 | 76.00 | 61.75 | 67.57 | 67.58 | 61.70 | |
| Unified-IO | - | - | - | - | - | - | - | - | 78.61 | |
| VisionLLM-H | 86.70 | - | - | - | - | - | - | - | ||
| Shikra-7B | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 69.34 | |
| Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 82.64 | 83.16 | 69.03 | |
| Qwen-VL-7B | 89.36 | 92.26 | 85.34 | 83.12 | 88.25 | 77.21 | 85.58 | 85.48 | 78.22 | |
| Qwen-VL-7B-Chat | 88.55 | 92.27 | 84.51 | 82.82 | 88.59 | 76.79 | 85.96 | 86.32 | - | |
| Specialist SOTAs (Specialist/Finetuned) |
G-DINO-L | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 | - |
| UNINEXT-H | 92.64 | 94.33 | 91.46 | 85.24 | 89.63 | 79.79 | 88.73 | 89.37 | - | |
| ONE-PEACE | 92.58 | 94.18 | 89.26 | 88.77 | 92.21 | 83.23 | 89.22 | 89.27 | - | |
## 必要条件
* python 3.8 以上
* pytorch 1.12 以上、2.0 以上を推奨
* CUDA 11.4 以上を推奨(GPU ユーザー向けです)
## クイックスタート
以下では、Qwen-VL と Qwen-VL-Chat を 🤖 ModelScope と 🤗 Transformers とともに使う方法を、簡単な例で示します。
コードを実行する前に、環境のセットアップと必要なパッケージのインストールが済んでいることを 確認してください。上記の要件を満たしていることを確認してから、依存するライブラリをインストールしてください。
```bash
pip install -r requirements.txt
```
これで ModelScope や Transformers を使い始めることができます。ビジョンエンコーダについての詳しい使い方は、[チュートリアル](TUTORIAL_ja.md)を参照してください。
#### 🤗 Transformers
Qwen-VL-Chat を推論に使用するために必要なのは、以下に示す数行のコードを入力することだけです。ただし、**最新のコードを使用していることを確認してください。**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
# Note: デフォルトの動作では、インジェクション攻撃防止機能がオフになりました。
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# bf16 の使用
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# fp16 の使用
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# cpu のみの使用
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# cuda デバイスの使用
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
# 生成のためのハイパーパラメータの指定
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 第 1 回 対話ターン
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # ローカルパスまたは url
{'text': '这是什么?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 写真はビーチでラブラドールの隣で愛犬と戯れる女性が写っており、彼らは砂の中にいる。
# 第 2 回 対話ターン
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# 击掌
Running Qwen-VL
Running Qwen-VL pretrained base model is also simple.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
# bf16 の使用
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval()
# fp16 の使用
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval()
# cpu のみの使用
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval()
# cuda デバイスの使用
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()
# 生成のためのハイパーパラメータの指定
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # ローカルパスまたは url
{'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)
# https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpegGenerate the caption in English with grounding: Woman
tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
# 写真は、若い女性がビーチで愛犬のラブラドール種と戯れているところ。 二人は浜辺に座り、犬の前脚を上げて触れ合っている。
# 第 2 回 対話ターン
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# "击掌"
## 量子化
### 使用方法
私たちは、[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)に基づいた新しいソリューションを提供し、Qwen-VL-ChatのためのInt4量子化モデル、Qwen-VL-Chat-Int4[Click here](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)をリリースします。このモデルは、ほぼ無損失なモデル効果を達成しながら、メモリコストと推論速度の両方のパフォーマンスを向上させます。
ここでは、量子化されたモデルを推論に使用する方法を説明します。始める前に、必要な要件(torch 2.0以上、transformers 4.32.0以上など)を満たしていることを確認し、必要なパッケージをインストールしてください:
```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git & cd AutoGPTQ
pip install -v .
```
`auto-gptq`のインストールに問題がある場合は、公式の[repo](https://github.com/PanQiWei/AutoGPTQ)をチェックして、ホイールを見つけることをお勧めする。
そうすれば、量子化されたモデルを簡単にロードすることができ、いつもと同じように推論を実行することができる:
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-VL-Chat-Int4",
device_map="auto",
trust_remote_code=True
).eval()
# Either a local path or an url between tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
```
### 性能
ベンチマーク **[TouchStone](https://github.com/OFA-Sys/TouchStone)** において、BF16 モデルと Int4 モデルの両方のモデル性能を例示し、量子化モデルが大きな性能劣化に悩まされないことを見出しました。結果を以下に示します:
| Quantization | ZH | EN |
| ------------ | :--------: | :-----------: |
| BF16 | 401.2 | 645.2 |
| Int4 | 386.6 | 651.4 |
### 推論スピード
BF16 精度と Int4 量子化の下で、画像(258 トークンを要する)のコンテキストで 1792(2048-258)トークンと 7934(8192-258)トークンを生成する平均推論速度(トークン/秒)をそれぞれ測定した。
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16 | 28.87 | 24.32 |
| Int4 | 37.79 | 34.34 |
プロファイリングは、PyTorch 2.0.1 と CUDA 11.4 を搭載したシングル A100-SXM4-80G GPU で実行されます。
### GPU メモリ使用量
また、1792 (2048-258) 個のトークン (画像を含む) をコンテキストとしてエンコードする場合 (および単一のトークンを生成する場合) と、7934 (8192-258) 個のトークン (画像をコンテキストとして生成する場合) をそれぞれ BF16 または Int4 量子化レベルでエンコードする場合の GPU メモリ使用量のピーク値をプロファイリングしました。結果を以下に示します。
| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 22.60GB | 28.01GB |
| Int4 | 11.82GB | 17.23GB |
上記のスピードとメモリーのプロファイリングは、[このスクリプト](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py)を使用しています。
## ファインチューニング
現在、公式のトレーニングスクリプト `finetune.py` を提供しています。さらに、finetune.py のシェルスクリプトを提供し、finetune.py を実行することで、finetune.py を起動することができる。さらに、安心してファインチューニングを開始するためのシェルスクリプトも提供しています。このスクリプトは、[DeepSpeed](https://github.com/microsoft/DeepSpeed) および [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) を使用したトレーニングをサポートします。弊社が提供するシェル・スクリプトは DeepSpeed を使用するため、事前に DeepSpeed をインストールすることをお勧めします:
学習データを準備するには、すべてのサンプルをリストにまとめ、json ファイルに保存する必要があります。各サンプルは id と会話リストで構成される辞書です。以下は 1 つのサンプルを含む単純なリストの例です:
```json
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是Qwen-VL,一个支持视觉输入的大模型。"
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "user",
"value": "Picture 1: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg\n图中的狗是什么品种?"
},
{
"from": "assistant",
"value": "图中是一只拉布拉多犬。"
},
{
"from": "user",
"value": "框出图中的格子衬衫"
},
{
"from": "assistant",
"value": "格子衬衫
assets/mm_tutorial/Chongqing.jpeg\nPicture 2:
assets/mm_tutorial/Beijing.jpeg\n图中都是哪"
},
{
"from": "assistant",
"value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
}
]
}
]
```
VL タスクの場合、`
img_path\n{your prompt}`」として表されます。ここで、「id」は会話内の画像の位置を 1 から示します。「img_path」は ローカル ファイル パスまたは Web リンク。
座標ボックスは `
| Method | Sequence Length | |||
|---|---|---|---|---|
| 384 | 512 | 1024 | 2048 | |
| LoRA (Base) | 37.1G / 2.3s/it | 37.3G / 2.4s/it | 38.7G / 3.6s/it | 38.7G / 6.1s/it |
| LoRA (Chat) | 23.3G / 2.2s/it | 23.6G / 2.3s/it | 25.1G / 3.5s/it | 27.3G / 5.9s/it |
| Q-LoRA | 17.0G / 4.2s/it | 17.2G / 4.5s/it | 18.2G / 5.5s/it | 19.3G / 7.9s/it |
中文  |  English   |  日本語  |  한국어 
Qwen-VL 🤖 | 🤗  | Qwen-VL-Chat 🤖 | 🤗  | Qwen-VL-Chat-Int4 🤗
WeChat   |   Discord   |   Demo  |  Paper   |   Colab   |   Tutorial
Qwen-VL 시리즈의 두 모델을 출시합니다.
- Qwen-VL: 사전 훈련된 LVLM 모델로, Qwen-7B를 LLM의 초기화에 사용하며, 시각 인코더의 초기화로는 [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip)를 사용하여, 무작위로 초기화된 교차 어텐션 레이어(randomly initialized cross-attention layer)에 연결합니다.
- Qwen-VL-Chat: 정렬 기술로 훈련된 멀티모달 LLM 기반 AI 어시스턴트입니다. Qwen-VL-Chat은 여러 이미지 입력, 다중 라운드 질문 응답, 창의적 능력과 같은 더 유연한 상호작용을 지원합니다.
## 뉴스 및 업데이트
* ```2023.9.25``` 🚀🚀🚀 Qwen-VL-Chat을 더욱 강력한 중국어 지시 수행 능력, 웹페이지 및 표 이미지에 대한 개선된 이해력, 더 나은 대화 성능(TouchStone: CN: 401.2->481.7, EN: 645.2->711.6)으로 업데이트 되었습니다.
* ```2023.9.12``` 😃😃😃 이제 Qwen-VL 모델에 대한 파인튜닝을 지원합니다. 이에는 전체 파라미터 파인튜닝, LoRA 및 Q-LoRA가 포함됩니다.
* ```2023.9.8``` 👍👍👍 camenduru가 멋진 Colab을 기여해 주셔서 감사합니다. 모두가 12G GPU에서 로컬 또는 온라인 Qwen-VL-Chat-Int4 데모 튜토리얼로 사용할 수 있습니다.
* ```2023.9.5``` 👏👏👏 Qwen-VL-Chat은 MME Benchmark, 멀티모달 대형 언어 모델을 위한 종합적인 평가 벤치마크에서 SOTAs를 달성했습니다. 이는 총 14개의 하위 과제에서 인식과 인지 능력을 모두 측정합니다.
* ```2023.9.4``` ⭐⭐⭐ Qwen-VL 시리즈는 Seed-Bench, 이미지 및 비디오 이해를 평가하는 19K 다중 선택 질문의 멀티모달 벤치마크에서 SOTAs를 달성했습니다. 이는 정확한 인간 주석을 갖추고 있습니다.
* ```2023.9.1``` 🔥🔥🔥 기본적인 인식과 이해력뿐만 아니라 문학 창작까지 아우르는 복합 언어 모델에 대한 종합적인 평가인 [TouchStone](https://github.com/OFA-Sys/TouchStone) 평가를 출시합니다. 강력한 LLM을 심사위원으로 활용하고, 멀티모달 정보를 텍스트로 변환하여 평가합니다.
* ```2023.8.31``` 🌟🌟🌟 Qwen-VL-Chat용 Int4 양자화 모델인 **Qwen-VL-Chat-Int4**를 출시하여 메모리 비용은 낮추고 추론 속도는 향상시켰습니다. 또한 벤치마크 평가에서도 성능 저하가 크지 않습니다.
* ```2023.8.22``` 🎉🎉🎉 모델스코프와 허깅페이스에 **Qwen-VL**과 **Qwen-VL-Chat**을 모두 출시합니다. 학습 내용 및 모델 성능 등 모델에 대한 자세한 내용은 [논문](https://arxiv.org/abs/2308.12966)을 통해 확인할 수 있습니다.
## Evaluation
세 가지 관점에서 모델의 기능을 평가했습니다:
1. **표준 벤치마크**: 멀티모달 작업의 네 가지 주요 범주에 대한 모델의 기본 작업 기능을 평가합니다:
- 제로 샷 캡션: 보이지 않는 데이터 세트에 대한 모델의 제로샷 이미지 캡션 능력을 평가합니다.
- 일반 VQA: 판단, 색상, 숫자, 카테고리 등과 같은 사진의 일반적인 질문에 대한 답변 능력을 평가합니다.
- 텍스트 기반 VQA: 문서 QA, 차트 QA 등과 같이 사진 속 텍스트를 인식하는 모델의 능력을 평가합니다.
- 참조 표현 이해: 참조 표현식으로 설명된 이미지에서 대상 객체를 찾아내는 능력을 평가합니다.
2. **터치스톤**: 전반적인 텍스트-이미지 대화 능력과 사람과의 일치도를 평가하기 위해 [TouchStone](https://github.com/OFA-Sys/TouchStone)이라는 벤치마크를 구축했으며, 이 벤치마크는 GPT4로 채점하여 LVLM 모델을 평가합니다.
- 터치스톤 벤치마크는 총 300개 이상의 이미지, 800개 이상의 질문, 27개 카테고리를 다룹니다. 속성 기반 Q&A, 유명인 인식, 시 쓰기, 여러 이미지 요약, 제품 비교, 수학 문제 풀이 등이 포함됩니다.
- 직접 이미지 입력이라는 현재 GPT4의 한계를 극복하기 위해 TouchStone은 사람이 직접 라벨을 지정하여 세분화된 이미지 주석을 제공합니다. 이러한 세부 주석은 문제 및 모델의 출력과 함께 채점을 위해 GPT4에 제공됩니다.
- 벤치마크에는 영어와 중국어 버전이 모두 포함되어 있습니다.
3. **기타 멀티모달 벤치마크**: 다른 멀티모달 벤치마크에서도 모델의 성능을 평가했습니다:
- 멀티모달 대규모 언어 모델에 대한 종합적인 평가 벤치마크인 [MME 벤치마크](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation). Qwen-VL-Chat은 지각과 인지 트랙 모두에서 SOTA를 달성했습니다.
- [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard)는 멀티모달 LLM을 평가하기 위한 정확한 인간 주석이 포함된 19K 객관식 질문으로 구성된 멀티모달 벤치마크입니다. 큐원 시리즈는 이 벤치마크에서 SOTA를 달성했습니다.
평가 결과는 다음과 같습니다.
Qwen-VL은 여러 VL 작업에서 현재 SOTA 제너럴리스트 모델보다 성능이 뛰어나며, 기능 범위 측면에서 더 포괄적인 기능을 지원합니다.
### Zero-shot Captioning & General VQA
| Model type | Model | Zero-shot Captioning | General VQA | |||||
|---|---|---|---|---|---|---|---|---|
| NoCaps | Flickr30K | VQAv2dev | OK-VQA | GQA | SciQA-Img (0-shot) |
VizWiz (0-shot) |
||
| Generalist Models |
Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8 |
| Flamingo-80B | - | 67.2 | 56.3 | 50.6 | - | - | 31.6 | |
| Unified-IO-XL | 100.0 | - | 77.9 | 54.0 | - | - | - | |
| Kosmos-1 | - | 67.1 | 51.0 | - | - | - | 29.2 | |
| Kosmos-2 | - | 80.5 | 51.1 | - | - | - | - | |
| BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6 | |
| InstructBLIP (Vicuna-13B) | 121.9 | 82.8 | - | - | 49.5 | 63.1 | 33.4 | |
| Shikra (Vicuna-13B) | - | 73.9 | 77.36 | 47.16 | - | - | - | |
| Qwen-VL (Qwen-7B) | 121.4 | 85.8 | 78.8 | 58.6 | 59.3 | 67.1 | 35.2 | |
| Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 68.2 | 38.9 | |
| Previous SOTA (Per Task Fine-tuning) |
- | 127.0 (PALI-17B) |
84.5 (InstructBLIP -FlanT5-XL) |
86.1 (PALI-X -55B) |
66.1 (PALI-X -55B) |
72.1 (CFR) |
92.53 (LLaVa+ GPT-4) |
70.9 (PALI-X -55B) |
| Model type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA |
|---|---|---|---|---|---|---|
| Generalist Models | BLIP-2 (Vicuna-13B) | 42.4 | - | - | - | - |
| InstructBLIP (Vicuna-13B) | 50.7 | - | - | - | - | |
| mPLUG-DocOwl (LLaMA-7B) | 52.6 | 62.2 | 57.4 | - | - | |
| Pix2Struct-Large (1.3B) | - | 76.6 | 58.6 | 42.1 | 71.3 | |
| Qwen-VL (Qwen-7B) | 63.8 | 65.1 | 65.7 | 62.3 | 75.7 | |
| Specialist SOTAs (Specialist/Finetuned) |
PALI-X-55B (Single-task FT) (Without OCR Pipeline) |
71.44 | 80.0 | 70.0 | 81.2 | 75.0 |
| Model type | Model | RefCOCO | RefCOCO+ | RefCOCOg | GRIT | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| val | test-A | test-B | val | test-A | test-B | val-u | test-u | refexp | ||
| Generalist Models | GPV-2 | - | - | - | - | - | - | - | - | 51.50 |
| OFA-L* | 79.96 | 83.67 | 76.39 | 68.29 | 76.00 | 61.75 | 67.57 | 67.58 | 61.70 | |
| Unified-IO | - | - | - | - | - | - | - | - | 78.61 | |
| VisionLLM-H | 86.70 | - | - | - | - | - | - | - | ||
| Shikra-7B | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 69.34 | |
| Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 82.64 | 83.16 | 69.03 | |
| Qwen-VL-7B | 89.36 | 92.26 | 85.34 | 83.12 | 88.25 | 77.21 | 85.58 | 85.48 | 78.22 | |
| Qwen-VL-7B-Chat | 88.55 | 92.27 | 84.51 | 82.82 | 88.59 | 76.79 | 85.96 | 86.32 | - | |
| Specialist SOTAs (Specialist/Finetuned) |
G-DINO-L | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 | - |
| UNINEXT-H | 92.64 | 94.33 | 91.46 | 85.24 | 89.63 | 79.79 | 88.73 | 89.37 | - | |
| ONE-PEACE | 92.58 | 94.18 | 89.26 | 88.77 | 92.21 | 83.23 | 89.22 | 89.27 | - | |
#### SEED-Bench [SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard)는 **이미지** 및 **동영상** 이해도를 포함한 12가지 평가 차원을 포괄하는 멀티모달 LLM을 평가하기 위한 정확한 사람의 주석이 포함된 19K 개의 객관식 문항으로 구성된 멀티모달 벤치마크입니다. 자세한 내용은 [여기](eval_mm/seed_bench/EVAL_SEED.md)에서 확인할 수 있습니다. 이 벤치마크에서 Qwen-VL과 Qwen-VL-Chat은 SOTA를 달성했습니다.
## Requirements
* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above are recommended
* CUDA 11.4 and above are recommended (this is for GPU users)
## Quickstart
아래에서는 🤖 모델스코프 및 🤗 트랜스포머와 함께 Qwen-VL 및 Qwen-VL-Chat을 사용하는 방법을 보여주는 간단한 예제를 제공합니다.
코드를 실행하기 전에 환경을 설정하고 필요한 패키지를 설치했는지 확인하세요. 위의 요구 사항을 충족하는지 확인한 다음 종속 라이브러리를 설치하세요.
```bash
pip install -r requirements.txt
```
이제 모델스코프 또는 트랜스포머로 시작할 수 있습니다. 비전 인코더에 대한 자세한 사용법은 [튜토리얼](TUTORIAL.md)을 참조하세요.
#### 🤗 Transformers
추론에 Qwen-VL-Chat을 사용하려면 아래에 설명된 대로 몇 줄의 코드를 입력하기만 하면 됩니다. 단, **최신 코드를 사용하고 있는지 확인하세요**.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 1st dialogue turn
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': '这是什么?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名女子在沙滩上和狗玩耍,旁边是一只拉布拉多犬,它们处于沙滩上。
# 2nd dialogue turn
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# 击掌
Running Qwen-VL
Qwen-VL pretrained base model을 실행하는 것도 매우 간단합니다.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()
# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
{'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)
# https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpegGenerate the caption in English with grounding: Woman
tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍,狗的品种是拉布拉多。她们坐在沙滩上,狗的前腿抬起来,与人互动。
# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# "击掌"
## Quantization
### Usage
[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)를 기반으로 하는 새로운 솔루션을 제공하고, 거의 무손실 모델 효과를 달성하면서도 메모리 비용과 추론 속도 모두에서 성능이 향상된 Qwen-VL-Chat용 Int4 양자화 모델인 [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)를 출시했습니다.
여기에서는 제공된 양자화된 모델을 추론에 사용하는 방법을 보여줍니다. 시작하기 전에 요구 사항(예: torch 2.0 이상, transformers 4.32.0 이상 등) 및 필요한 패키지를 제대로 설치했는지 확인하세요.
```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git & cd AutoGPTQ
pip install -v .
```
만약 'auto-gptq' 설치에 문제가 있다면, 공식 [repo](https://github.com/PanQiWei/AutoGPTQ)에서 휠을 찾아보시길 권장합니다.
그러면 정량화된 모델을 쉽게 로드하고 평소와 동일하게 추론을 실행할 수 있습니다.
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-VL-Chat-Int4",
device_map="auto",
trust_remote_code=True
).eval()
# Either a local path or an url between tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'
{image_path}这是什么', history=None)
print(response)
```
### Performance
[TouchStone](https://github.com/OFA-Sys/TouchStone)벤치마크에서 BF16 및 Int4 모델의 모델 성능을 살펴본 결과, 양자화된 모델에서 성능 저하가 크지 않은 것으로 나타났습니다. 결과는 아래와 같습니다.
| Quantization | ZH | EN |
| ------------ | :--------: | :-----------: |
| BF16 | 401.2 | 645.2 |
| Int4 | 386.6 | 651.4 |
### Inference Speed
이미지의 컨텍스트(258개의 토큰이 필요한)를 가지고 각각 1792개(2048-258개), 7934개(8192-258개)의 토큰을 생성하는 평균 추론 속도(토큰/초)를 BF16 정밀도와 Int4 양자화 하에서 측정했습니다.
| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16 | 28.87 | 24.32 |
| Int4 | 37.79 | 34.34 |
프로파일링은 PyTorch 2.0.1 및 CUDA 11.4가 탑재된 단일 A100-SXM4-80G GPU에서 실행됩니다.
### GPU Memory Usage
또한 1792개(2048-258개)의 토큰(이미지 포함)을 컨텍스트로 인코딩하고 단일 토큰을 생성할 때와 7934개(8192-258개)의 토큰(이미지가 컨텍스트로 포함)을 생성할 때 각각 BF16 또는 Int4 양자화 수준에서 최대 GPU 메모리 사용량을 프로파일링했습니다. 결과는 아래와 같습니다.
| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 22.60GB | 28.01GB |
| Int4 | 11.82GB | 17.23GB |
위의 속도 및 메모리 프로파일링은 [이 스크립트](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py)를 사용하여 수행되었습니다.
## Finetuning
이제 사용자가 다운스트림 애플리케이션을 위해 사전 학습된 모델을 간단한 방식으로 미세 조정할 수 있도록 공식 학습 스크립트인 `finetune.py`를 제공합니다. 또한, 걱정 없이 미세 조정을 시작할 수 있는 셸 스크립트도 제공합니다. 이 스크립트는 딥스피드와 FSDP를 통한 학습을 지원합니다. 제공되는 셸 스크립트는 DeepSpeed를 사용하므로 시작하기 전에 DeepSpeed를 설치하는 것이 좋습니다.
```bash
pip install deepspeed
```
### Data preparation
학습 데이터를 준비하려면 모든 샘플을 목록에 넣고 json 파일에 저장해야 합니다. 각 샘플은 ID와 대화 목록으로 구성된 사전입니다. 아래는 샘플 1개가 포함된 간단한 예제 목록입니다.
```json
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是Qwen-VL,一个支持视觉输入的大模型。"
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "user",
"value": "Picture 1: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg\n图中的狗是什么品种?"
},
{
"from": "assistant",
"value": "图中是一只拉布拉多犬。"
},
{
"from": "user",
"value": "框出图中的格子衬衫"
},
{
"from": "assistant",
"value": "格子衬衫
assets/mm_tutorial/Chongqing.jpeg\nPicture 2:
assets/mm_tutorial/Beijing.jpeg\n图中都是哪"
},
{
"from": "assistant",
"value": "第一张图片是重庆的城市天际线,第二张图片是北京的天际线。"
}
]
}
]
```
VL 작업에서는 `
img_path\n{your prompt}`로 표시되며, 여기서 `id`는 대화에서 이미지의 위치(1부터 시작)를 나타냅니다. `img_path`는 로컬 파일 경로 또는 웹 링크일 수 있습니다.
박스의 좌표는 `
| Method | Sequence Length | |||
|---|---|---|---|---|
| 384 | 512 | 1024 | 2048 | |
| LoRA (Base) | 37.1G / 2.3s/it | 37.3G / 2.4s/it | 38.7G / 3.6s/it | 38.7G / 6.1s/it |
| LoRA (Chat) | 23.3G / 2.2s/it | 23.6G / 2.3s/it | 25.1G / 3.5s/it | 27.3G / 5.9s/it |
| Q-LoRA | 17.0G / 4.2s/it | 17.2G / 4.5s/it | 18.2G / 5.5s/it | 19.3G / 7.9s/it |
#### How to get the caption without any box-like annotations
Sometimes you may expect no box-like annotations in the response. In the case, you can stably get the cleaned text by the following post-processing.
```
# response = ' Two apples
#### How to get the caption without any box-like annotations
때로는 응답에 박스형 주석이 없을 수도 있습니다. 이 경우 다음과 같은 후처리를 통해 안정적으로 정리된 텍스트를 얻을 수 있습니다.
```
# response = ' Two apples
## How To Process Video by Qwen-VL
Qwen-VL and Qwen-VL-Chat didn't train any video data or tasks during training, but they can understand some videos in a zero-shot way. For the video question-answering task, we utilize four uniformly sampled frames per video sample. These frames are treated as separate images and are stitched into the context. For example:
```
{
"question_id": "v0",
"prompt": "
中文  |  English |  日本語|  한국어
We comprehensively evaluate the model's ability from five dimensions. As shown in the figure above, an example of 27 subtasks is given. From perception to cognition to creativity, as the difficulty increases, the requirements for models are also getting higher and higher. Currently, LVLM capabilities are in their early stages. Our dataset contains 800+ questions and 27 categories.
## Methods
We apply a powerful LLM as a judge to enable automated evaluation. To effectively comprehend the contents of an image, we manually substitute the actual image input with fine-grained textual annotations. By inputting these annotations and corresponding questions to a powerful LLM like GPT4, we obtain reference answers.
For the evaluation of the LVLMs, we provide actual images and questions as input and obtain their respective answers. Finally, we employ GPT4 to score the answers generated by the LVLMs based on the fine-grained annotations and questions. The scoring instructions require the model to assess the usefulness, relevance, and accuracy of the answers, considering the annotations as the content of the images. To ensure fairness in the evaluation, each model's answer is compared against a consistent reference answer from GPT4. The average score of the model in all questions is taken as the final score.
To eliminate the influence of answer position, we perform a second scoring round by swapping the positions of the answers and then compute the average of the two scores obtained. This approach aims to mitigate any bias introduced by the placement of the answers.
### Evaluation
#### Evaluation in English-based Multimodal Dialogue
| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |
#### Evaluation in Chinese-based Multimodal Dialogue
| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |
================================================
FILE: touchstone/README_CN.md
================================================
中文  |  English |  日本語
我们从五个维度综合评估了模型的能力。 如上图所示,给出了27个子任务的示例。 从感知到认知,再到创造力,随着难度的增加,对模型的要求也越来越高。 目前,LVLM的能力还处于早期阶段。 我们的数据集包含800+道题目、27个类别。
## 测评方式
我们应用SOTA的LLM进行自动化评估。 为了有效地理解图像的内容,我们人工用细粒度的文本注释替换实际的图像输入。 通过将这些注释和相应的问题输入到像GPT4这样强LLM中,我们可以获得参考答案。
对于待测评的LVLM,我们提供实际图像和问题作为输入并获得各自的答案。 最后,我们使用GPT4根据细粒度注释和问题对LVLM生成的答案进行评分。 评分指令要求模型评估答案的有用性、相关性和准确性,并将人工注解视为图像的内容。 为了确保评估的公平性,每个模型的答案都会与 GPT4生成的参考答案进行比较。 模型在所有问题上的平均得分作为最终得分。
为了消除答案位置的影响,我们通过交换答案的位置来进行第二轮评分,然后计算获得的两次分数的平均值。
## 测评结果
#### 英文版本测评
| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |
#### 中文版本测评
| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |
================================================
FILE: touchstone/README_JA.md
================================================
中文  |  English|  日本語
モデルの能力を 5 つの次元から総合的に評価する。上図のように、27 のサブタスクの例を示す。知覚から認知、創造性まで、難易度が上がるにつれて、モデルに求められる要件もどんどん高くなっている。現在、LVLM の機能は初期段階にある。我々のデータセットには 800 以上の質問と 27 のカテゴリーが含まれている。
## 方法
自動評価を可能にするために、強力な LLM を判定器として適用する。画像の内容を効果的に理解するために、実際の画像入力をきめ細かいテキスト注釈に手動で置き換える。これらの注釈と対応する質問を GPT4 のような強力な LLM に入力することで、参照解答を得る。
LVLMの評価には、実際の画像と質問を入力として与え、それぞれの回答を得る。最後に、GPT4を用いて、LVLMが生成した回答を、細かいアノテーションと質問に基づいてスコアリングする。スコアリングの指示は、注釈を画像の内容とみなして、回答の有用性、関連性、正確性を評価するようモデルに要求する。評価の公平性を確保するため、各モデルの回答はGPT4の一貫した参照回答と比較されます。全問題におけるモデルの平均スコアを最終スコアとする。
解答位置の影響を排除するために、解答位置を入れ替えて2回目の採点ラウンドを行い、得られた2つのスコアの平均を計算します。このアプローチは、解答の配置によって生じるバイアスを軽減することを目的としています。
### 評価
#### 英語ベースのマルチモーダル対話における評価
| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |
#### 中国語ベースのマルチモーダル対話における評価
| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |
================================================
FILE: touchstone/README_KO.md
================================================
中文  |  English |  日本語 |  한국어
5가지 측면에서 모델의 능력을 종합적으로 평가합니다. 위 그림과 같이 27개의 하위 과제를 예로 들었습니다. 지각부터 인지, 창의력까지 난이도가 높아질수록 모델에 대한 요구 사항도 점점 더 높아지고 있습니다. 현재 LVLM 기능은 초기 단계에 있습니다. 데이터 세트에는 800개 이상의 질문과 27개 카테고리가 포함되어 있습니다.
## Methods
당사는 자동화된 평가를 위해 강력한 LLM을 심사자로 적용합니다. 이미지의 내용을 효과적으로 이해하기 위해 실제 이미지 입력을 세분화된 텍스트 주석으로 수동으로 대체합니다. 이러한 주석과 해당 질문을 GPT4와 같은 강력한 LLM에 입력하면 참조 답변을 얻을 수 있습니다.
LVLM의 평가를 위해 실제 이미지와 질문을 입력으로 제공하고 각각의 답변을 얻습니다. 마지막으로, 세분화된 주석과 질문을 기반으로 LVLM이 생성한 답변에 GPT4를 사용하여 점수를 매깁니다. 채점 지침에 따라 모델은 주석을 이미지의 콘텐츠로 간주하여 답변의 유용성, 관련성 및 정확성을 평가해야 합니다. 평가의 공정성을 보장하기 위해 각 모델의 답변은 GPT4의 일관된 참조 답변과 비교됩니다. 모든 문제에서 모델의 평균 점수가 최종 점수로 사용됩니다.
답안 위치의 영향을 제거하기 위해 답안 위치를 바꿔서 두 번째 채점 라운드를 수행한 다음 얻은 두 점수의 평균을 계산합니다. 이 접근 방식은 답안 배치로 인해 발생하는 편향을 완화하는 것을 목표로 합니다.
### Evaluation
#### Evaluation in English-based Multimodal Dialogue
| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |
#### Evaluation in Chinese-based Multimodal Dialogue
| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |
================================================
FILE: web_demo_mm.py
================================================
# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""A simple web interactive chat demo based on gradio."""
from argparse import ArgumentParser
from pathlib import Path
import copy
import gradio as gr
import os
import re
import secrets
import tempfile
from modelscope import (
snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
DEFAULT_CKPT_PATH = 'qwen/Qwen-VL-Chat'
BOX_TAG_PATTERN = r" """)
gr.Markdown("""Data Preparation
```bash
mkdir -p data/flickr && cd data/flickr
# download images from https://bryanplummer.com/Flickr30kEntities/
# karpathy split annotations can be downloaded from https://cs.stanford.edu/people/karpathy/deepimagesent/
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/flickr30k/flickr30k_karpathy_test.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/flickr30k/flickr30k_karpathy_train.json
cd ../..
```
Evaluate
```bash
ds="flickr"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_caption.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/nocaps && cd data/nocaps
# download images from https://nocaps.org/download
# original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/nocaps/nocaps_val.json
cd ../..
```
Evaluate
```bash
ds="nocaps"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_caption.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/coco && cd data/coco
# download coco2014 images
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip
cd ../..
```
Data Preparation
```bash
mkdir -p data/vqav2 && cd data/vqav2
# make sure you have downloaded COCO images
# download questions and annotations
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip && unzip v2_Annotations_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip && unzip v2_Questions_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip && unzip v2_Annotations_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip && unzip v2_Questions_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip && unzip v2_Questions_Test_mscoco.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_testdev.jsonl
```
Evaluate
```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "vqav2_val" "vqav2_testdev"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/okvqa && cd data/okvqa
# download annotations and questions
wget https://okvqa.allenai.org/static/data/mscoco_train2014_annotations.json.zip && unzip mscoco_train2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip && unzip OpenEnded_mscoco_train2014_questions.json.zip
wget https://okvqa.allenai.org/static/data/mscoco_val2014_annotations.json.zip && unzip mscoco_val2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_val2014_questions.json.zip && unzip OpenEnded_mscoco_val2014_questions.json.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_val.jsonl
cd ../..
```
Evaluate
```bash
ds="okvqa_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/textvqa && cd data/textvqa
# download images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip train_val_images.zip
# download annotations and questions
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val.jsonl
cd ../..
```
Evaluate
```bash
ds="textvqa_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/vizwiz && cd data/vizwiz
# download images
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/train.zip && unzip train.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/val.zip && unzip val.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip && unzip test.zip
# download annotations
wget https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip && unzip Annotations.zip
# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_test.jsonl
cd ../..
```
Evaluation
```bash
# evaluate vqa score on vizwiz val split
ds="vizwiz_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/docvqa && cd data/docvqa
# download images and annotations from https://www.docvqa.org/datasets
# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/test.jsonl
cd ../..
```
Evaluation
```bash
# evaluate vqa score on docvqa val split
ds="docvqa_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/chartqa && cd data/chartqa
# download images from https://drive.google.com/file/d/1Lm_w6zeET1Hyl_9ks6w5nEsgpoyPHalV/view
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_augmented.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_augmented.jsonl
cd ../..
```
Evaluate
```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "chartqa_test_human" "chartqa_test_augmented"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/gqa && cd data/gqa
# download images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip images.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/testdev_balanced.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/train_balanced.jsonl
cd ../..
```
Evaluate
```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="gqa_testdev"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/ocrvqa && cd data/ocrvqa
# download images by following instructions at https://ocr-vqa.github.io/kvqa_ProjectFiles/README.txt
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_test.jsonl
cd ../..
```
Evaluate
```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="ocrvqa_test"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/ai2diagram && cd data/ai2diagram
# download images
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/test.jsonl
cd ../..
```
Evaluate
```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="ai2diagram_test"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_vqa.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/scienceqa/images && cd data/scienceqa/images
# download images
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip && unzip test.zip
cd ..
# download original questions
wget https://github.com/lupantech/ScienceQA/blob/main/data/scienceqa/problems.json
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/scienceqa/scienceqa_test_img.jsonl
cd ../..
```
Evaluate
```bash
ds="scienceqa_test_img"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_multiple_choice.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/refcoco && cd data/refcoco
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testB.jsonl
cd ../..
```
Evaluation
```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcoco_val" "refcoco_testA" "refcoco_testB"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_grounding.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/refcoco+ && cd data/refcoco+
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testB.jsonl
cd ../..
```
Data Preparation
```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcoco+_val" "refcoco+_testA" "refcoco+_testB"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_grounding.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
Data Preparation
```bash
mkdir -p data/refcocog && data/refcocog
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_test.jsonl
cd ../..
```
Evaluate
```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcocog_val" "refcocog_test"
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_grounding.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 8 \
--num-workers 2
```
{}Describe the image in English:'
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint,
trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id
random.seed(args.seed)
dataset = CaptionDataset(
train=ds_collections[args.dataset]['train'],
test=ds_collections[args.dataset]['test'],
prompt=prompt,
few_shot=args.few_shot,
)
coco_karpathy_test_loader = torch.utils.data.DataLoader(
dataset=dataset,
sampler=InferenceSampler(len(dataset)),
batch_size=args.batch_size,
num_workers=args.num_workers,
pin_memory=True,
drop_last=False,
collate_fn=partial(collate_fn, tokenizer=tokenizer),
)
image_ids = []
captions = []
for _, (ids, input_ids,
attention_mask) in tqdm(enumerate(coco_karpathy_test_loader)):
pred = model.generate(
input_ids=input_ids.cuda(),
attention_mask=attention_mask.cuda(),
do_sample=False,
num_beams=1,
max_new_tokens=30,
min_new_tokens=8,
length_penalty=0,
num_return_sequences=1,
use_cache=True,
pad_token_id=tokenizer.eod_id,
eos_token_id=tokenizer.eod_id,
)
image_ids.extend(ids)
captions.extend([
tokenizer.decode(_[input_ids.size(1):].cpu(),
skip_special_tokens=True).strip() for _ in pred
])
torch.distributed.barrier()
world_size = torch.distributed.get_world_size()
merged_ids = [None for _ in range(world_size)]
merged_captions = [None for _ in range(world_size)]
torch.distributed.all_gather_object(merged_ids, image_ids)
torch.distributed.all_gather_object(merged_captions, captions)
merged_ids = [_ for _ in itertools.chain.from_iterable(merged_ids)]
merged_captions = [
_ for _ in itertools.chain.from_iterable(merged_captions)
]
if torch.distributed.get_rank() == 0:
print(f"Evaluating {args.dataset} ...")
results = []
for image_id, caption in zip(merged_ids, merged_captions):
results.append({
'image_id': int(image_id),
'caption': caption,
})
time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime())
results_file = f'{args.dataset}_{time_prefix}.json'
json.dump(results, open(results_file, 'w'))
coco = COCO(ds_collections[args.dataset]['test'])
coco_result = coco.loadRes(results_file)
coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.evaluate()
print(coco_eval.eval.items())
torch.distributed.barrier()
================================================
FILE: eval_mm/evaluate_grounding.py
================================================
import argparse
import itertools
import json
import os
import re
from functools import partial
import torch
from torchvision.ops.boxes import box_area
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
ds_collections = {
'refcoco_val': 'data/refcoco/refcoco_val.jsonl',
'refcoco_testA': 'data/refcoco/refcoco_testA.jsonl',
'refcoco_testB': 'data/refcoco/refcoco_testB.jsonl',
'refcoco+_val': 'data/refcoco+/refcoco+_val.jsonl',
'refcoco+_testA': 'data/refcoco+/refcoco+_testA.jsonl',
'refcoco+_testB': 'data/refcoco+/refcoco+_testB.jsonl',
'refcocog_val': 'data/refcocog/refcocog_val.jsonl',
'refcocog_test': 'data/refcocog/refcocog_test.jsonl',
}
def box_iou(boxes1, boxes2):
area1 = box_area(boxes1)
area2 = box_area(boxes2)
lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2]
rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2]
wh = (rb - lt).clamp(min=0) # [N,M,2]
inter = wh[:, :, 0] * wh[:, :, 1] # [N,M]
union = area1[:, None] + area2 - inter
iou = inter / union
return iou, union
def collate_fn(batches, tokenizer):
texts = [_['text'] for _ in batches]
bboxes = [_['bbox'] for _ in batches]
hws = [_['hw'] for _ in batches]
input_ids = tokenizer(texts, return_tensors='pt', padding='longest')
return input_ids.input_ids, input_ids.attention_mask, bboxes, hws
class RefCOCODataset(torch.utils.data.Dataset):
def __init__(self, test, tokenizer, prompt):
self.datas = open(test).readlines()
self.tokenizer = tokenizer
self.prompt = prompt
def __len__(self):
return len(self.datas)
def __getitem__(self, idx):
data = json.loads(self.datas[idx].strip())
image = data['image']
text = data['sent']
bbox = data['bbox']
w, h = data['width'], data['height']
return {
'text': self.prompt.format(image, text),
'bbox': bbox,
'hw': (h, w),
}
class InferenceSampler(torch.utils.data.sampler.Sampler):
def __init__(self, size):
self._size = int(size)
assert size > 0
self._rank = torch.distributed.get_rank()
self._world_size = torch.distributed.get_world_size()
self._local_indices = self._get_local_indices(size, self._world_size,
self._rank)
@staticmethod
def _get_local_indices(total_size, world_size, rank):
shard_size = total_size // world_size
left = total_size % world_size
shard_sizes = [shard_size + int(r < left) for r in range(world_size)]
begin = sum(shard_sizes[:rank])
end = min(sum(shard_sizes[:rank + 1]), total_size)
return range(begin, end)
def __iter__(self):
yield from self._local_indices
def __len__(self):
return len(self._local_indices)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--checkpoint', type=str, default='')
parser.add_argument('--dataset', type=str, default='')
parser.add_argument('--batch-size', type=int, default=1)
parser.add_argument('--num-workers', type=int, default=1)
args = parser.parse_args()
torch.distributed.init_process_group(
backend='nccl',
world_size=int(os.getenv('WORLD_SIZE', '1')),
rank=int(os.getenv('RANK', '0')),
)
torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0)))
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint,
trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id
prompt = '
{}{}
{}Context: {}\nQuestion: {}\nOptions: {}\nAnswer:'
dataset = MultipleChoiceDataste(test=ds_collections[args.dataset]['test'],
prompt=prompt,
tokenizer=tokenizer)
dataloader = torch.utils.data.DataLoader(
dataset=dataset,
sampler=InferenceSampler(len(dataset)),
batch_size=args.batch_size,
num_workers=args.num_workers,
pin_memory=True,
drop_last=False,
collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id),
)
results = []
with torch.no_grad():
for _, (input_tokens, attention_mask, target_lengths, answer,
chunk_sizes) in tqdm(enumerate(dataloader)):
outputs = model(
input_ids=input_tokens[:, :-1].cuda(),
attention_mask=attention_mask[:, :-1].cuda(),
return_dict=True,
)
losses = torch.nn.functional.cross_entropy(outputs.logits.permute(
0, 2, 1),
input_tokens[:,
1:].cuda(),
reduction='none')
losses = losses.split(chunk_sizes, dim=0)
for loss, target_length, answer in zip(losses, target_lengths,
answer):
target_loss = loss.mean(-1)
for _ in range(len(target_length)):
target_loss[_] = loss[_, -target_length[_]:].mean()
pred = target_loss.argmin().item()
if pred == answer:
results.append(1)
else:
results.append(0)
torch.distributed.barrier()
world_size = torch.distributed.get_world_size()
merged_results = [None for _ in range(world_size)]
torch.distributed.all_gather_object(merged_results, results)
merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)]
if torch.distributed.get_rank() == 0:
print(f"Evaluating {args.dataset} ...")
print(f'Acc@1: {sum(merged_results) / len(merged_results)}')
torch.distributed.barrier()
================================================
FILE: eval_mm/evaluate_vqa.py
================================================
import argparse
import itertools
import json
import os
import random
import time
from functools import partial
from typing import Optional
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from vqa import VQA
from vqa_eval import VQAEval
ds_collections = {
'vqav2_val': {
'train': 'data/vqav2/vqav2_train.jsonl',
'test': 'data/vqav2/vqav2_val.jsonl',
'question': 'data/vqav2/v2_OpenEnded_mscoco_val2014_questions.json',
'annotation': 'data/vqav2/v2_mscoco_val2014_annotations.json',
'metric': 'vqa_score',
'max_new_tokens': 10,
},
'vqav2_testdev': {
'train': 'data/vqav2/vqav2_train.jsonl',
'test': 'data/vqav2/vqav2_testdev.jsonl',
'metric': None,
'max_new_tokens': 10,
},
'okvqa_val': {
'train': 'data/okvqa/okvqa_train.jsonl',
'test': 'data/okvqa/okvqa_val.jsonl',
'question': 'data/okvqa/OpenEnded_mscoco_val2014_questions.json',
'annotation': 'data/okvqa/mscoco_val2014_annotations.json',
'metric': 'vqa_score',
'max_new_tokens': 10,
},
'textvqa_val': {
'train': 'data/textvqa/textvqa_train.jsonl',
'test': 'data/textvqa/textvqa_val.jsonl',
'question': 'data/textvqa/textvqa_val_questions.json',
'annotation': 'data/textvqa/textvqa_val_annotations.json',
'metric': 'vqa_score',
'max_new_tokens': 10,
},
'vizwiz_val': {
'train': 'data/vizwiz/vizwiz_train.jsonl',
'test': 'data/vizwiz/vizwiz_val.jsonl',
'question': 'data/vizwiz/vizwiz_val_questions.json',
'annotation': 'data/vizwiz/vizwiz_val_annotations.json',
'metric': 'vqa_score',
'max_new_tokens': 10,
},
'vizwiz_test': {
'train': 'data/vizwiz/vizwiz_train.jsonl',
'test': 'data/vizwiz/vizwiz_test.jsonl',
'metric': None,
'max_new_tokens': 10,
},
'docvqa_val': {
'train': 'data/docvqa/train.jsonl',
'test': 'data/docvqa/val.jsonl',
'annotation': 'data/docvqa/val/val_v1.0.json',
'metric': 'anls',
'max_new_tokens': 100,
},
'docvqa_test': {
'train': 'data/docvqa/train.jsonl',
'test': 'data/docvqa/test.jsonl',
'metric': None,
'max_new_tokens': 100,
},
'chartqa_test_human': {
'train': 'data/chartqa/train_human.jsonl',
'test': 'data/chartqa/test_human.jsonl',
'metric': 'relaxed_accuracy',
'max_new_tokens': 100,
},
'chartqa_test_augmented': {
'train': 'data/chartqa/train_augmented.jsonl',
'test': 'data/chartqa/test_augmented.jsonl',
'metric': 'relaxed_accuracy',
'max_new_tokens': 100,
},
'gqa_testdev': {
'train': 'data/gqa/train.jsonl',
'test': 'data/gqa/testdev_balanced.jsonl',
'metric': 'accuracy',
'max_new_tokens': 10,
},
'ocrvqa_val': {
'train': 'data/ocrvqa/ocrvqa_train.jsonl',
'test': 'data/ocrvqa/ocrvqa_val.jsonl',
'metric': 'accuracy',
'max_new_tokens': 100,
},
'ocrvqa_test': {
'train': 'data/ocrvqa/ocrvqa_train.jsonl',
'test': 'data/ocrvqa/ocrvqa_test.jsonl',
'metric': 'accuracy',
'max_new_tokens': 100,
},
'ai2diagram_test': {
'train': 'data/ai2diagram/train.jsonl',
'test': 'data/ai2diagram/test.jsonl',
'metric': 'accuracy',
'max_new_tokens': 10,
}
}
# https://github.com/google-research/pix2struct/blob/main/pix2struct/metrics.py#L81
def relaxed_correctness(target: str,
prediction: str,
max_relative_change: float = 0.05) -> bool:
"""Calculates relaxed correctness.
The correctness tolerates certain error ratio defined by max_relative_change.
See https://arxiv.org/pdf/2203.10244.pdf, end of section 5.1:
“Following Methani et al. (2020), we use a relaxed accuracy measure for the
numeric answers to allow a minor inaccuracy that may result from the automatic
data extraction process. We consider an answer to be correct if it is within
5% of the gold answer. For non-numeric answers, we still need an exact match
to consider an answer to be correct.”
Args:
target: Target string.
prediction: Predicted string.
max_relative_change: Maximum relative change.
Returns:
Whether the prediction was correct given the specified tolerance.
"""
def _to_float(text: str) -> Optional[float]:
try:
if text.endswith('%'):
# Convert percentages to floats.
return float(text.rstrip('%')) / 100.0
else:
return float(text)
except ValueError:
return None
prediction_float = _to_float(prediction)
target_float = _to_float(target)
if prediction_float is not None and target_float:
relative_change = abs(prediction_float -
target_float) / abs(target_float)
return relative_change <= max_relative_change
else:
return prediction.lower() == target.lower()
def evaluate_relaxed_accuracy(entries):
scores = []
for elem in entries:
if isinstance(elem['annotation'], str):
elem['annotation'] = [elem['annotation']]
score = max([
relaxed_correctness(elem['answer'].strip(), ann)
for ann in elem['annotation']
])
scores.append(score)
return sum(scores) / len(scores)
def evaluate_exact_match_accuracy(entries):
scores = []
for elem in entries:
if isinstance(elem['annotation'], str):
elem['annotation'] = [elem['annotation']]
score = max([
(1.0 if
(elem['answer'].strip().lower() == ann.strip().lower()) else 0.0)
for ann in elem['annotation']
])
scores.append(score)
return sum(scores) / len(scores)
def collate_fn(batches, tokenizer):
questions = [_['question'] for _ in batches]
question_ids = [_['question_id'] for _ in batches]
annotations = [_['annotation'] for _ in batches]
input_ids = tokenizer(questions, return_tensors='pt', padding='longest')
return question_ids, input_ids.input_ids, input_ids.attention_mask, annotations
class VQADataset(torch.utils.data.Dataset):
def __init__(self, train, test, prompt, few_shot):
self.test = open(test).readlines()
self.prompt = prompt
self.few_shot = few_shot
if few_shot > 0:
self.train = open(train).readlines()
def __len__(self):
return len(self.test)
def __getitem__(self, idx):
data = json.loads(self.test[idx].strip())
image, question, question_id, annotation = data['image'], data[
'question'], data['question_id'], data.get('answer', None)
few_shot_prompt = ''
if self.few_shot > 0:
few_shot_samples = random.sample(self.train, self.few_shot)
for sample in few_shot_samples:
sample = json.loads(sample.strip())
few_shot_prompt += self.prompt.format(
sample['image'],
sample['question']) + f" {sample['answer']}"
return {
'question': few_shot_prompt + self.prompt.format(image, question),
'question_id': question_id,
'annotation': annotation
}
class InferenceSampler(torch.utils.data.sampler.Sampler):
def __init__(self, size):
self._size = int(size)
assert size > 0
self._rank = torch.distributed.get_rank()
self._world_size = torch.distributed.get_world_size()
self._local_indices = self._get_local_indices(size, self._world_size,
self._rank)
@staticmethod
def _get_local_indices(total_size, world_size, rank):
shard_size = total_size // world_size
left = total_size % world_size
shard_sizes = [shard_size + int(r < left) for r in range(world_size)]
begin = sum(shard_sizes[:rank])
end = min(sum(shard_sizes[:rank + 1]), total_size)
return range(begin, end)
def __iter__(self):
yield from self._local_indices
def __len__(self):
return len(self._local_indices)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--checkpoint', type=str, default='')
parser.add_argument('--dataset', type=str, default='')
parser.add_argument('--batch-size', type=int, default=1)
parser.add_argument('--num-workers', type=int, default=1)
parser.add_argument('--few-shot', type=int, default=0)
parser.add_argument('--seed', type=int, default=0)
args = parser.parse_args()
torch.distributed.init_process_group(
backend='nccl',
world_size=int(os.getenv('WORLD_SIZE', '1')),
rank=int(os.getenv('RANK', '0')),
)
torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0)))
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint,
trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id
prompt = '
{}{} Answer:'
random.seed(args.seed)
dataset = VQADataset(
train=ds_collections[args.dataset]['train'],
test=ds_collections[args.dataset]['test'],
prompt=prompt,
few_shot=args.few_shot,
)
dataloader = torch.utils.data.DataLoader(
dataset=dataset,
sampler=InferenceSampler(len(dataset)),
batch_size=args.batch_size,
num_workers=args.num_workers,
pin_memory=True,
drop_last=False,
collate_fn=partial(collate_fn, tokenizer=tokenizer),
)
outputs = []
for _, (question_ids, input_ids, attention_mask,
annotations) in tqdm(enumerate(dataloader)):
pred = model.generate(
input_ids=input_ids.cuda(),
attention_mask=attention_mask.cuda(),
do_sample=False,
num_beams=1,
max_new_tokens=ds_collections[args.dataset]['max_new_tokens'],
min_new_tokens=1,
length_penalty=1,
num_return_sequences=1,
output_hidden_states=True,
use_cache=True,
pad_token_id=tokenizer.eod_id,
eos_token_id=tokenizer.eod_id,
)
answers = [
tokenizer.decode(_[input_ids.size(1):].cpu(),
skip_special_tokens=True).strip() for _ in pred
]
for question_id, answer, annotation in zip(question_ids, answers,
annotations):
if args.dataset in ['vqav2_val', 'vqav2_testdev', 'okvqa_val', 'textvqa_val', 'vizwiz_val']:
outputs.append({
'question_id': question_id,
'answer': answer,
})
elif args.dataset in ['docvqa_val', 'infographicsvqa', 'gqa_testdev', 'ocrvqa_val', 'ocrvqa_test']:
outputs.append({
'questionId': question_id,
'answer': answer,
'annotation': annotation,
})
elif args.dataset in ['ai2diagram_test']:
outputs.append({
'image': question_id,
'answer': answer,
'annotation': annotation,
})
elif args.dataset in ['chartqa_test_human', 'chartqa_test_augmented']:
outputs.append({
'answer': answer,
'annotation': annotation,
})
elif args.dataset in ['docvqa_test']:
outputs.append({
'questionId': question_id,
'answer': answer,
})
elif args.dataset in ['vizwiz_test']:
outputs.append({
'image': question_id,
'answer': answer,
})
else:
raise NotImplementedError
torch.distributed.barrier()
world_size = torch.distributed.get_world_size()
merged_outputs = [None for _ in range(world_size)]
torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs))
merged_outputs = [json.loads(_) for _ in merged_outputs]
merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)]
if torch.distributed.get_rank() == 0:
print(f"Evaluating {args.dataset} ...")
time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime())
results_file = f'{args.dataset}_{time_prefix}_fs{args.few_shot}_s{args.seed}.json'
json.dump(merged_outputs, open(results_file, 'w'), ensure_ascii=False)
if ds_collections[args.dataset]['metric'] == 'vqa_score':
vqa = VQA(ds_collections[args.dataset]['annotation'],
ds_collections[args.dataset]['question'])
results = vqa.loadRes(
resFile=results_file,
quesFile=ds_collections[args.dataset]['question'])
vqa_scorer = VQAEval(vqa, results, n=2)
vqa_scorer.evaluate()
print(vqa_scorer.accuracy)
elif ds_collections[args.dataset]['metric'] == 'anls':
json.dump(merged_outputs,
open(results_file, 'w'),
ensure_ascii=False)
print('python infographicsvqa_eval.py -g ' +
ds_collections[args.dataset]['annotation'] + ' -s ' +
results_file)
os.system('python infographicsvqa_eval.py -g ' +
ds_collections[args.dataset]['annotation'] + ' -s ' +
results_file)
elif ds_collections[args.dataset]['metric'] == 'relaxed_accuracy':
print({
'relaxed_accuracy': evaluate_relaxed_accuracy(merged_outputs)
})
elif ds_collections[args.dataset]['metric'] == 'accuracy':
if 'gqa' in args.dataset:
for entry in merged_outputs:
response = entry['answer']
response = response.strip().split('.')[0].split(
',')[0].split('!')[0].lower()
if 'is ' in response:
response = response.split('is ')[1]
if 'are ' in response:
response = response.split('are ')[1]
if 'a ' in response:
response = response.split('a ')[1]
if 'an ' in response:
response = response.split('an ')[1]
if 'the ' in response:
response = response.split('the ')[1]
if ' of' in response:
response = response.split(' of')[0]
response = response.strip()
entry['answer'] = response
print({'accuracy': evaluate_exact_match_accuracy(merged_outputs)})
torch.distributed.barrier()
================================================
FILE: eval_mm/infographicsvqa_eval.py
================================================
# This file can be downloaded from: https://www.docvqa.org/datasets/infographicvqa and https://rrc.cvc.uab.es/?ch=17&com=introduction
import os, json
import argparse
question_ids_to_exclude = []
# answer_types = {'image span': 'Image-Span', 'question span': 'Question-Span', 'multiple spans': 'Multi-Span', 'non span': 'None span', 'list': 'List'}
answer_types = {'image span': 'Image-Span', 'question span': 'Question-Span', 'multiple spans': 'Multi-Span', 'non span': 'None span'}
evidence_types = {'table/list': 'Table/list', 'textual': 'Text', 'photo/pciture/visual_objects': 'Visual/Layout', 'figure': 'Figure', 'map': 'Map'}
reasoning_requirements = {'comparison': 'Sorting', 'arithmetic': 'Arithmetic', 'counting':'Counting'}
def save_json(file_path, data):
with open(file_path, 'w+') as json_file:
json.dump(data, json_file)
def levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]
def validate_data(gtFilePath, submFilePath):
"""
Method validate_data: validates that all files in the results folder are correct (have the correct name contents).
Validates also that there are no missing files in the folder.
If some error detected, the method raises the error
"""
gtJson = json.load(open(gtFilePath,'rb'));
submJson = json.load(open(submFilePath,'rb'));
if not 'data' in gtJson:
raise Exception("The GT file is not valid (no data key)")
if not 'dataset_name' in gtJson:
raise Exception("The GT file is not valid (no dataset_name key)")
if isinstance(submJson, list) == False :
raise Exception("The Det file is not valid (root item must be an array)")
if len(submJson) != len(gtJson['data']) :
raise Exception("The Det file is not valid (invalid number of answers. Expected:" + str(len(gtJson['data'])) + " Found:" + str(len(submJson)) + ")")
gtQuestions = sorted([r['questionId'] for r in gtJson['data']])
res_id_to_index = {int(r['questionId']): ix for ix, r in enumerate(submJson)}
detQuestions = sorted([r['questionId'] for r in submJson])
if( (gtQuestions == detQuestions) == False ):
raise Exception("The Det file is not valid. Question IDs must much GT")
for gtObject in gtJson['data']:
try:
q_id = int(gtObject['questionId']);
res_ix = res_id_to_index[q_id];
except:
raise Exception("The Det file is not valid. Question " + str(gtObject['questionId']) + " not present")
else:
detObject = submJson[res_ix];
# if detObject['questionId'] != gtObject['questionId'] :
# raise Exception("Answer #" + str(i) + " not valid (invalid question ID. Expected:" + str(gtObject['questionId']) + "Found:" + detObject['questionId'] + ")")
if not 'answer' in detObject:
raise Exception("Question " + str(gtObject['questionId']) + " not valid (no answer key)")
if isinstance(detObject['answer'], list) == True :
raise Exception("Question " + str(gtObject['questionId']) + " not valid (answer key has to be a single string)")
def evaluate_method(gtFilePath, submFilePath, evaluationParams):
"""
Method evaluate_method: evaluate method and returns the results
Results. Dictionary with the following values:
- method (required) Global method metrics. Ex: { 'Precision':0.8,'Recall':0.9 }
- samples (optional) Per sample metrics. Ex: {'sample1' : { 'Precision':0.8,'Recall':0.9 } , 'sample2' : { 'Precision':0.8,'Recall':0.9 }
"""
show_scores_per_answer_type = evaluationParams.answer_types
gtJson = json.load(open(gtFilePath,'rb'));
submJson = json.load(open(submFilePath,'rb'));
res_id_to_index = {int(r['questionId']): ix for ix, r in enumerate(submJson)}
perSampleMetrics = {}
totalScore = 0
row = 0
if show_scores_per_answer_type:
answerTypeTotalScore = {x:0 for x in answer_types.keys()}
answerTypeNumQuestions = {x:0 for x in answer_types.keys()}
evidenceTypeTotalScore = {x:0 for x in evidence_types.keys()}
evidenceTypeNumQuestions = {x:0 for x in evidence_types.keys()}
reasoningTypeTotalScore = {x:0 for x in reasoning_requirements.keys()}
reasoningTypeNumQuestions = {x:0 for x in reasoning_requirements.keys()}
for gtObject in gtJson['data']:
q_id = int(gtObject['questionId']);
res_ix = res_id_to_index[q_id];
detObject = submJson[res_ix];
if q_id in question_ids_to_exclude:
question_result = 0
info = 'Question EXCLUDED from the result'
else:
info = ''
values = []
for answer in gtObject['answers']:
# preprocess both the answers - gt and prediction
gt_answer = ' '.join(answer.strip().lower().split())
det_answer = ' '.join(detObject['answer'].strip().lower().split())
#dist = levenshtein_distance(answer.lower(), detObject['answer'].lower())
dist = levenshtein_distance(gt_answer,det_answer)
length = max( len(answer.upper()), len(detObject['answer'].upper()) )
values.append( 0.0 if length == 0 else float(dist) / float(length) )
question_result = 1 - min(values)
if (question_result < evaluationParams.anls_threshold) :
question_result = 0
totalScore += question_result
if show_scores_per_answer_type:
for q_type in gtObject["answer_type"]:
answerTypeTotalScore[q_type] += question_result
answerTypeNumQuestions[q_type] += 1
for q_type in gtObject["evidence"]:
evidenceTypeTotalScore[q_type] += question_result
evidenceTypeNumQuestions[q_type] += 1
for q_type in gtObject["operation/reasoning"]:
reasoningTypeTotalScore[q_type] += question_result
reasoningTypeNumQuestions[q_type] += 1
perSampleMetrics[str(gtObject['questionId'])] = {
'score':question_result,
'question':gtObject['question'],
'gt':gtObject['answers'],
'det':detObject['answer'],
'info': info
}
row = row + 1
methodMetrics = {
'score': 0 if len(gtJson['data']) == 0 else totalScore/ (len(gtJson['data']) - len(question_ids_to_exclude) )
}
answer_types_scores = {}
evidence_types_scores = {}
operation_types_scores = {}
if show_scores_per_answer_type:
for a_type, ref in answer_types.items():
answer_types_scores[ref] = 0 if len(gtJson['data']) == 0 else answerTypeTotalScore[a_type] / (answerTypeNumQuestions[a_type] )
for e_type, ref in evidence_types.items():
evidence_types_scores[ref] = 0 if len(gtJson['data']) == 0 else evidenceTypeTotalScore[e_type] / (evidenceTypeNumQuestions[e_type] )
for r_type, ref in reasoning_requirements.items():
operation_types_scores[ref] = 0 if len(gtJson['data']) == 0 else reasoningTypeTotalScore[r_type] / (reasoningTypeNumQuestions[r_type] )
resDict = {
'result': methodMetrics,
'scores_by_types': {'answer_types': answer_types_scores, 'evidence_types': evidence_types_scores, 'operation_types': operation_types_scores},
'per_sample_result':perSampleMetrics
}
return resDict;
def display_results(results, show_answer_types):
print("\nOverall ANLS: {:2.4f}".format(results['result']['score']))
if show_answer_types:
print("\nAnswer types:")
for a_type in answer_types.values():
print("\t{:12s} {:2.4f}".format(a_type, results['scores_by_types']['answer_types'][a_type]))
print("\nEvidence types:")
for e_type in evidence_types.values():
print("\t{:12s} {:2.4f}".format(e_type, results['scores_by_types']['evidence_types'][e_type]))
print("\nOperation required:")
for r_type in reasoning_requirements.values():
print("\t{:12s} {:2.4f}".format(r_type, results['scores_by_types']['operation_types'][r_type]))
if __name__=='__main__':
parser = argparse.ArgumentParser(description="InfographVQA evaluation script.")
parser.add_argument('-g', '--ground_truth', type=str, help="Path of the Ground Truth file.", required=True)
parser.add_argument('-s', '--submission_file', type=str, help="Path of your method's results file.", required=True)
parser.add_argument('-t', '--anls_threshold', type=float, default=0.5, help="ANLS threshold to use (See Scene-Text VQA paper for more info.).", required=False)
parser.add_argument('-a', '--answer_types', type=bool, default=False, help="Score break down by answer types (special gt file required).", required=False)
parser.add_argument('-o', '--output', type=str, help="Path to a directory where to copy the file 'results.json' that contains per-sample results.", required=False)
args = parser.parse_args()
# Validate the format of ground truth and submission files.
validate_data(args.ground_truth, args.submission_file)
# Evaluate method
results = evaluate_method(args.ground_truth, args.submission_file, args)
display_results(results, args.answer_types)
if args.output:
output_dir = args.output
if not os.path.exists(output_dir):
os.makedirs(output_dir)
resultsOutputname = os.path.join(output_dir, 'results.json')
save_json(resultsOutputname, results)
print("All results including per-sample result has been correctly saved!")
================================================
FILE: eval_mm/mmbench/MMBENCH.md
================================================
# MMBench Evaluation
## Data
```bash
/cpfs01/shared/public/shusheng.yss/workspace/23082502_qwenvl_eval_test/eval_mm/data/mmbench
```
## Dev
```bash
checkpoint=/PATH/TO/CHECKPOINT
ds=mmbench_dev_20230712
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_multiple_choice_mmbench.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 2 \
--num-workers 2
# the results will be saved to mmbench_dev_20230712.json
# without consistency constrain
python mmbench_evaluation.py
# with consistency constrain
python mmbench_evaluation_tricky.py
```
## Test
```bash
checkpoint=/PATH/TO/CHECKPOINT
ds=mmbench_test_20230712
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
evaluate_multiple_choice_mmbench.py \
--checkpoint $checkpoint \
--dataset $ds \
--batch-size 2 \
--num-workers 2
# the results will be saved to mmbench_test_20230712.json
# convert to submission format with consistency constrain
python mmbench_predict_to_submission.py
```
================================================
FILE: eval_mm/mmbench/evaluate_multiple_choice_mmbench.py
================================================
import argparse
import itertools
import json
import os
from functools import partial
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
multiple_choices = ['A', 'B', 'C', 'D', 'E']
ds_collections = {
'mmbench_dev_20230712': {
'test': 'data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.jsonl',
},
'mmbench_test_20230712': {
'test': 'data/mmbench/mmbench_test_20230712/mmbench_test_20230712.jsonl',
}
}
def collate_fn(batches, pad_token_id):
indexes = [_['index'] for _ in batches]
input_tokens = [_['input_tokens'] for _ in batches]
target_lengths = [_['target_lengths'] for _ in batches]
chunk_sizes = [len(_) for _ in input_tokens]
input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)]
max_lengths = max([len(_) for _ in input_tokens])
input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _
for _ in input_tokens]
input_tokens = torch.LongTensor(input_tokens)
attention_mask = 1 - input_tokens.eq(pad_token_id).float()
return input_tokens, attention_mask, target_lengths, chunk_sizes, indexes
class MultipleChoiceDataste(torch.utils.data.Dataset):
def __init__(self, test, prompt, tokenizer):
self.datas = open(test).readlines()
self.prompt = prompt
self.tokenizer = tokenizer
def __len__(self):
return len(self.datas)
def __getitem__(self, idx):
data = json.loads(self.datas[idx].strip())
index = data['index']
image = data['image']
hint = data['hint'] if data['hint'] else 'N/A'
question = data['question']
choices = data['choices']
choice_list = []
for i, c in enumerate(choices):
choice_list.append('{}. {}'.format(multiple_choices[i], c))
choice_txt = '\n'.join(choice_list)
prompt = self.prompt.format(image, hint, question, choice_txt)
prompt_tokens = self.tokenizer(prompt).input_ids
target_tokens = [
self.tokenizer(' ' + _).input_ids
for _ in multiple_choices[:len(choices)]
]
return {
'index': index,
'input_tokens': [prompt_tokens + _ for _ in target_tokens],
'target_lengths': [len(_) for _ in target_tokens],
# 'answer': data['answer'],
}
class InferenceSampler(torch.utils.data.sampler.Sampler):
def __init__(self, size):
self._size = int(size)
assert size > 0
self._rank = torch.distributed.get_rank()
self._world_size = torch.distributed.get_world_size()
self._local_indices = self._get_local_indices(size, self._world_size,
self._rank)
@staticmethod
def _get_local_indices(total_size, world_size, rank):
shard_size = total_size // world_size
left = total_size % world_size
shard_sizes = [shard_size + int(r < left) for r in range(world_size)]
begin = sum(shard_sizes[:rank])
end = min(sum(shard_sizes[:rank + 1]), total_size)
return range(begin, end)
def __iter__(self):
yield from self._local_indices
def __len__(self):
return len(self._local_indices)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--checkpoint', type=str, default='')
parser.add_argument('--dataset', type=str, default='')
parser.add_argument('--batch-size', type=int, default=1)
parser.add_argument('--num-workers', type=int, default=1)
args = parser.parse_args()
torch.distributed.init_process_group(
backend='nccl',
world_size=int(os.getenv('WORLD_SIZE', '1')),
rank=int(os.getenv('RANK', '0')),
)
torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0)))
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint,
trust_remote_code=True)
prompt = '
{}Context: {}\nQuestion: {}\nOptions: {}\nAnswer:'
dataset = MultipleChoiceDataste(test=ds_collections[args.dataset]['test'],
prompt=prompt,
tokenizer=tokenizer)
dataloader = torch.utils.data.DataLoader(
dataset=dataset,
sampler=InferenceSampler(len(dataset)),
batch_size=args.batch_size,
num_workers=args.num_workers,
pin_memory=True,
drop_last=False,
collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id),
)
results = []
with torch.no_grad():
for _, (input_tokens, attention_mask, target_lengths,
chunk_sizes, indexes) in tqdm(enumerate(dataloader)):
outputs = model(
input_ids=input_tokens[:, :-1].cuda(),
attention_mask=attention_mask[:, :-1].cuda(),
return_dict=True,
)
losses = torch.nn.functional.cross_entropy(outputs.logits.permute(
0, 2, 1),
input_tokens[:,
1:].cuda(),
reduction='none')
losses = losses.split(chunk_sizes, dim=0)
for loss, target_length, index in zip(losses, target_lengths, indexes):
target_loss = loss.mean(-1)
for _ in range(len(target_length)):
target_loss[_] = loss[_, -target_length[_]:].mean()
pred = target_loss.argmin().item()
results.append({
"index": index,
"prediction": pred,
})
torch.distributed.barrier()
world_size = torch.distributed.get_world_size()
merged_results = [None for _ in range(world_size)]
torch.distributed.all_gather_object(merged_results, results)
merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)]
if torch.distributed.get_rank() == 0:
json.dump(merged_results, open(f"{args.dataset}.json", "w"))
torch.distributed.barrier()
================================================
FILE: eval_mm/mmbench/mmbench_converter_dev.py
================================================
import pandas as pd
import io
import base64
import json
from PIL import Image
'''
This scripts convert mmbench_dev tsv file to jsonl
'''
datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t')
global_choices = ['A', 'B', 'C', 'D']
def decode_base64_to_image(base64_string):
image_data = base64.b64decode(base64_string)
image = Image.open(io.BytesIO(image_data))
return image
with open('./data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.jsonl', 'w') as f:
for idx in range(len(datas)):
data = datas.iloc[idx]
index = int(data['index'])
question = data['question']
hint = data['hint'] if not pd.isna(data['hint']) else 'N/A'
choices = []
for opt in global_choices:
if pd.isna(data[opt]):
continue
choices.append(data[opt])
answer = global_choices.index(data['answer'])
image = decode_base64_to_image(data['image'])
image.save("data/mmbench/mmbench_dev_20230712/images/%d.jpg" % index)
f.write(json.dumps({
"index": index,
"image": "data/mmbench/mmbench_dev_20230712/images/%d.jpg" % index,
"hint": hint,
"question": question,
"choices": choices,
"answer": answer,
}) + "\n")
================================================
FILE: eval_mm/mmbench/mmbench_converter_test.py
================================================
import pandas as pd
import io
import base64
import json
from PIL import Image
'''
This script convert mmbench_test tsv file to jsonl
This script is very similar to mmbench_converter_dev except there's no answer for accuracy calculation
'''
datas = pd.read_csv("data/mmbench/mmbench_test_20230712/mmbench_test_20230712.tsv", sep='\t')
global_choices = ['A', 'B', 'C', 'D']
def decode_base64_to_image(base64_string):
image_data = base64.b64decode(base64_string)
image = Image.open(io.BytesIO(image_data))
return image
with open('./data/mmbench/mmbench_test_20230712/mmbench_test_20230712.jsonl', 'w') as f:
for idx in range(len(datas)):
data = datas.iloc[idx]
index = int(data['index'])
question = data['question']
hint = data['hint'] if not pd.isna(data['hint']) else 'N/A'
choices = []
for opt in global_choices:
if pd.isna(data[opt]):
continue
choices.append(data[opt])
# answer = global_choices.index(data['answer'])
image = decode_base64_to_image(data['image'])
image.save("data/mmbench/mmbench_test_20230712/images/%d.jpg" % index)
f.write(json.dumps({
"index": index,
"image": "data/mmbench/mmbench_test_20230712/images/%d.jpg" % index,
"hint": hint,
"question": question,
"choices": choices,
# "answer": answer,
}) + "\n")
================================================
FILE: eval_mm/mmbench/mmbench_evaluation.py
================================================
import pandas as pd
import json
'''
This script provides `global top-1 accuracy` metric calculation for mmbench_dev.
'''
predictions = json.load(open('mmbench_dev_20230712.json'))
index2predictions = {}
for pred in predictions:
index2predictions[pred['index']] = pred['prediction']
datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t')
glb_opts = ['A', 'B', 'C', 'D']
index2answer = {}
for idx in range(len(datas)):
data = datas.iloc[idx]
index2answer[data['index']] = glb_opts.index(data['answer'])
identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()]))
correct = 0
total = 0
for index in identity_indexes:
for _ in range(4):
cycle_index = int(_ * 1e6 + index)
if index2predictions.get(cycle_index, None) is not None:
if index2predictions[cycle_index] == index2answer[cycle_index]:
continue
else:
print(cycle_index)
break
else:
correct += 1
total += 1
print(correct, total)
================================================
FILE: eval_mm/mmbench/mmbench_evaluation_tricky.py
================================================
import pandas as pd
import json
import random
'''
This script provides metric calculation for mmbench_dev with the same accuarcy algo as OpenCompass server
'''
predictions = json.load(open('mmbench_dev_20230712.json'))
index2predictions = {}
for pred in predictions:
index2predictions[pred['index']] = pred['prediction']
from collections import Counter
def most_common_elements(lst):
counter = Counter(lst)
max_count = max(counter.values())
most_common = [element for element, count in counter.items() if count == max_count]
return random.choice(most_common) # random sample from random choice
datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t')
glb_opts = ['A', 'B', 'C', 'D']
index2answer = {}
index2choices = {}
index2rawanswer = {}
for idx in range(len(datas)):
data = datas.iloc[idx]
choices = []
for opt in glb_opts:
if not pd.isna(data[opt]):
choices.append(data[opt])
index2choices[data['index']] = choices
index2answer[data['index']] = glb_opts.index(data['answer'])
index2rawanswer[data['index']] = choices[glb_opts.index(data['answer'])]
identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()]))
correct = 0
total = 0
for index in identity_indexes:
raw_preds = []
raw_answer = []
for _ in range(4):
cycle_index = int(_ * 1e6 + index)
if index2predictions.get(cycle_index, None) is not None:
raw_answer = index2rawanswer[cycle_index]
raw_pred = index2choices[cycle_index][index2predictions[cycle_index]]
raw_preds.append(raw_pred)
if len(set(raw_preds)) == 1:
if raw_preds[0] == raw_answer:
correct += 1
else:
result = most_common_elements(raw_preds)
if result == raw_answer:
correct += 1
total += 1
print(correct, total, correct / total * 100.)
================================================
FILE: eval_mm/mmbench/mmbench_predict_to_submission.py
================================================
import pandas as pd
import json
import random
'''
This script convert the output file of our inference processor to target formation of OpenCompass evaluator server
'''
predictions = json.load(open('mmbench_test_20230712.json'))
index2predictions = {}
for pred in predictions:
index2predictions[pred['index']] = pred['prediction']
from collections import Counter
def most_common_elements(lst):
counter = Counter(lst)
max_count = max(counter.values())
most_common = [element for element, count in counter.items() if count == max_count]
print(most_common)
return random.choice(most_common)
# return most_common
datas = pd.read_csv("data/mmbench/mmbench_test_20230712/mmbench_test_20230712.tsv", sep='\t')
datas = datas.drop('image', axis=1)
glb_opts = ['A', 'B', 'C', 'D']
index2choices = {}
for idx in range(len(datas)):
data = datas.iloc[idx]
choices = []
for opt in glb_opts:
if not pd.isna(data[opt]):
choices.append(data[opt])
index2choices[data['index']] = choices
identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()]))
processed_index2predictions = {}
for index in identity_indexes:
raw_preds = []
for _ in range(4):
cycle_index = int(_ * 1e6 + index)
if index2predictions.get(cycle_index, None) is not None:
raw_pred = index2choices[cycle_index][index2predictions[cycle_index]]
raw_preds.append(raw_pred)
if len(set(raw_preds)) == 1:
pred_answer = raw_preds[0]
else:
pred_answer = most_common_elements(raw_preds)
print(index, pred_answer)
for _ in range(4):
cycle_index = int(_ * 1e6 + index)
if index2predictions.get(cycle_index, None) is not None:
processed_index2predictions[cycle_index] = index2choices[cycle_index].index(pred_answer)
predictions = []
for idx in range(len(datas)):
data = datas.iloc[idx]
index = data['index']
prediction = glb_opts[processed_index2predictions[index]]
predictions.append(prediction)
datas['prediction'] = predictions
datas.to_excel("mmbench_test_20230712_230831_constrained.xlsx", index=False)
# constrained means we force the model predict same answer when tested on a question for multiple times
================================================
FILE: eval_mm/mme/EVAL_MME.md
================================================
# MME Benchmark
[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
Qwen-VL-Chat achieves SOTAs on both perception and cognition evaluation.
Perception Evaluation
| Rank | Model | Version | Score |
|:----:|:---------------:|:------------------------:|:-------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)**| **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **1487.57** |
| 2 | Skywork-MM | Skywork-MM-13B | 1419.08 |
| 3 | MMICL | FlanT5xxl | 1376.00 |
| 4 | Lynx | vicuna-7b | 1373.23 |
| 5 | BLIVA | FlanT5xxl | 1337.73 |
Cognition Evaluation
| Rank | Model | Version | Score |
|:----:|:----------------:|:--------------:|:----------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** | **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **360.71** |
| 2 | MMICL | FlanT5xxl | 360.36 |
| 3 | Skywork-MM | Skywork-MM-13B | 356.43 |
| 4 | BLIVA | FlanT5xxl | 331.43 |
| 5 | LRV-Instruction | LRV-7B | 328.21 |
Full Metrics
```
=========== Perception ===========
total score: 1487.576330532213
existence score: 158.33333333333331
count score: 150.0
position score: 128.33333333333334
color score: 170.0
posters score: 178.57142857142856
celebrity score: 120.58823529411764
scene score: 152.25
landmark score: 164.0
artwork score: 125.5
OCR score: 140.0
=========== Cognition ===========
total score: 360.71428571428567
commonsense_reasoning score: 130.7142857142857
numerical_calculation score: 40.0
text_translation score: 147.5
code_reasoning score: 42.5
```
## How To Reproduce Results of MME Benchmark
1. Download MME images and eval_tool from the [MME repo](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md)
2. Rearrange images by executing `python get_images.py`
3. Evaluate Qwen-VL-Chat results by executing `python eval.py`
4. Calculate MME results by executing `python calculation.py --results_dir Qwen-VL-Chat`, which the calculation script comes from the MME eval_tool.
================================================
FILE: eval_mm/mme/eval.py
================================================
import os
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
checkpoint = 'Qwen/Qwen-VL-Chat'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
checkpoint, device_map='cuda', trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(checkpoint, trust_remote_code=True)
model.generation_config.top_p = 0.01
root = 'Your_Results'
output = 'Qwen-VL-Chat'
os.makedirs(output, exist_ok=True)
for filename in os.listdir(root):
with open(os.path.join(root, filename), 'r') as fin, open(os.path.join(output, filename), 'w') as fout:
lines = fin.read().splitlines()
filename = filename.replace('.txt', '')
for line in tqdm(lines):
img, question, gt = line.strip().split('\t')
img_path = os.path.join('images', filename, img)
assert os.path.exists(img_path), img_path
query = f'
{img_path}\n{question}'
response, _ = model.chat(tokenizer, query=query, history=None)
print(img, question, gt, response, sep='\t', file=fout)
================================================
FILE: eval_mm/mme/get_images.py
================================================
import os
from tqdm import tqdm
os.system('rm -rf images')
os.system('mkdir images')
os.system('cp -r ../MME_Benchmark_release/OCR images/')
os.system('mkdir images/artwork')
os.system('cp ../MME_Benchmark_release/artwork/questions_answers_YN/* images/artwork/')
with open('LaVIN/artwork.txt') as fin:
paths = [ line.strip().split('\t', 1)[0] for line in fin ]
paths = list(set(paths))
for path in tqdm(paths):
os.system(f'cp ../MME_Benchmark_release/artwork/images/toy_dataset/{path} images/artwork/{path}')
os.system('mkdir images/celebrity')
os.system('cp ../MME_Benchmark_release/celebrity/images/* images/celebrity/')
os.system('cp ../MME_Benchmark_release/celebrity/questions_answers_YN/* images/celebrity/')
os.system('cp -r ../MME_Benchmark_release/code_reasoning images/')
os.system('cp -r ../MME_Benchmark_release/color images/')
os.system('cp -r ../MME_Benchmark_release/commonsense_reasoning images/')
os.system('cp -r ../MME_Benchmark_release/count images/')
os.system('cp -r ../MME_Benchmark_release/existence images/')
os.system('mkdir images/landmark')
os.system('cp ../MME_Benchmark_release/landmark/images/* images/landmark/')
os.system('cp ../MME_Benchmark_release/landmark/questions_answers_YN/* images/landmark/')
os.system('cp -r ../MME_Benchmark_release/numerical_calculation images/')
os.system('cp -r ../MME_Benchmark_release/position images/')
os.system('mkdir images/posters')
os.system('cp ../MME_Benchmark_release/posters/images/* images/posters/')
os.system('cp ../MME_Benchmark_release/posters/questions_answers_YN/* images/posters/')
os.system('mkdir images/scene')
os.system('cp ../MME_Benchmark_release/scene/images/* images/scene/')
os.system('cp ../MME_Benchmark_release/scene/questions_answers_YN/* images/scene/')
os.system('cp -r ../MME_Benchmark_release/text_translation images/')
================================================
FILE: eval_mm/seed_bench/EVAL_SEED.md
================================================
# Seed-Bench Evaluation
[SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both **image** and **video** understanding.
Qwen-VL and Qwen-VL-Chat achieve SOTAs on this benchmark.
video_imgs_4/v0_0.jpg\n
video_imgs_4/v0_1.jpg\n
video_imgs_4/v0_2.jpg\n
video_imgs_4/v0_3.jpg\nQuestion: Can you identify the action taking place in the video?\nOptions: A. pretending to take something out of something\nB. pretending to take something from somewhere\nC. feigning to insert something into something\nD. simulating putting something onto something\nAnswer:"
}
```
The above JSON line can be used as the input by `eval_mm/seed_bench/eval.py` and output the following results:
```
{"question_id": "v0", "prediction": "B"}
```
Please see [eval_mm/seed_bench/eval.py](eval.py) for more inference details.
## How To Reproduce Results of Seed-Bench
1. Download all images and videos by following the [instruction](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md). Then modify the root path in `eval_mm/seed_bench/trans.py` with your customized path.
```
# path of SEED-Bench.json, download from https://huggingface.co/datasets/AILab-CVC/SEED-Bench/blob/main/SEED-Bench.json
seed_bench_input_path = 'SEED-Bench.json'
# root directory of evaluation dimension 1-9, following https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md
cc3m_dir = "/YOUR_PATH_TO/seed_bench_image"
# root directory of evaluation dimension 10
dimension10_dir = "/YOUR_PATH_TO/SSV2/videos"
# root directory of evaluation dimension 11
dimension11_dir = "/YOUR_PATH_TO/EPIC-KITCHENS/3h91syskeag572hl6tvuovwv4d/videos/test"
# root directory of evaluation dimension 12
dimension12_dir = "/YOUR_PATH_TO/BreakfastII_15fps_qvga_sync"
```
2. Generate input files of Qwen-VL with the JSON formatting.
```
cd eval_mm/seed_bench/
python trans.py
```
This script will output two JSONL files and one directory. `image_input.jsonl` is the input file of image evaluation and `video_input_4.jsonl` is the input file of video evaluation by 4 frames. The directory `video_imgs_4` contains all 4-framed images extracted from videos. We provide our [image_input.jsonl](http://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/seed_bench/image_input.jsonl) and [video_input_4.jsonl](http://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/seed_bench/video_input_4.jsonl) here for reference.
3. Produce the results of Seed-Bench.
```
# The number of available GPUs
export NPROC_PER_NODE=8
# Produce the Qwen-VL-Chat results of image understanding
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
eval.py \
--checkpoint Qwen/Qwen-VL-Chat \
--dataset image_input.jsonl \
--batch-size 4 \
--num-workers 2
# Collect the result files
cat result_?.jsonl >results_chat_img.jsonl
rm result_?.jsonl
# Produce the results of video understanding
python -m torch.distributed.launch --use-env \
--nproc_per_node ${NPROC_PER_NODE:-8} \
--nnodes ${WORLD_SIZE:-1} \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-12345} \
eval.py \
--checkpoint Qwen/Qwen-VL-Chat \
--dataset video_input_4.jsonl \
--batch-size 2 \
--num-workers 1
# Collect the result files
cat result_?.jsonl >results_chat_vid.jsonl
rm result_?.jsonl
# The file `results_chat.jsonl` can be submitted to the leaderboard
cat results_chat_img.jsonl results_chat_vid.jsonl >results_chat.jsonl
```
You can reproduce the Seed-Bench results of Qwen-VL by replacing `Qwen/Qwen-VL-Chat` with `Qwen/Qwen-VL` on the above script.
================================================
FILE: eval_mm/seed_bench/eval.py
================================================
import argparse
import itertools
import json
import os
from functools import partial
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
def collate_fn(batches, pad_token_id):
input_tokens = [_['input_tokens'] for _ in batches]
target_lengths = [_['target_lengths'] for _ in batches]
answers = [_['answer'] for _ in batches]
question_id = [_['question_id'] for _ in batches]
chunk_sizes = [len(_) for _ in input_tokens]
input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)]
max_lengths = max([len(_) for _ in input_tokens])
input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _
for _ in input_tokens]
input_tokens = torch.LongTensor(input_tokens)
attention_mask = 1 - input_tokens.eq(pad_token_id).float()
return input_tokens, attention_mask, target_lengths, answers, chunk_sizes, question_id
class MultipleChoiceDataste(torch.utils.data.Dataset):
def __init__(self, test, tokenizer):
self.datas = []
with open(test) as fin:
for line in tqdm(fin):
self.datas.append(json.loads(line.strip()))
self.tokenizer = tokenizer
def __len__(self):
return len(self.datas)
def __getitem__(self, idx):
data = self.datas[idx]
prompt = data['prompt']
prompt_tokens = self.tokenizer(prompt).input_ids
target_tokens = [
self.tokenizer(' ' + _).input_ids
for _ in ['A', 'B', 'C', 'D']
]
return {
'input_tokens': [prompt_tokens + _ for _ in target_tokens],
'target_lengths': [len(_) for _ in target_tokens],
'answer': data['answer'],
'question_id': data['question_id'],
}
class InferenceSampler(torch.utils.data.sampler.Sampler):
def __init__(self, size):
self._size = int(size)
assert size > 0
self._rank = torch.distributed.get_rank()
self._world_size = torch.distributed.get_world_size()
self._local_indices = self._get_local_indices(size, self._world_size,
self._rank)
@staticmethod
def _get_local_indices(total_size, world_size, rank):
shard_size = total_size // world_size
left = total_size % world_size
shard_sizes = [shard_size + int(r < left) for r in range(world_size)]
begin = sum(shard_sizes[:rank])
end = min(sum(shard_sizes[:rank + 1]), total_size)
return range(begin, end)
def __iter__(self):
yield from self._local_indices
def __len__(self):
return len(self._local_indices)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--checkpoint', type=str, default='')
parser.add_argument('--dataset', type=str, default='')
parser.add_argument('--batch-size', type=int, default=1)
parser.add_argument('--num-workers', type=int, default=1)
args = parser.parse_args()
torch.distributed.init_process_group(
backend='nccl',
world_size=int(os.getenv('WORLD_SIZE', '1')),
rank=int(os.getenv('RANK', '0')),
)
torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0)))
model = AutoModelForCausalLM.from_pretrained(
args.checkpoint, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(args.checkpoint,
trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(args.checkpoint, trust_remote_code=True)
model.generation_config.top_p = 0.01
dataset = MultipleChoiceDataste(test=args.dataset, tokenizer=tokenizer)
dataloader = torch.utils.data.DataLoader(
dataset=dataset,
# sampler=InferenceSampler(1000),
sampler=InferenceSampler(len(dataset)),
batch_size=args.batch_size,
num_workers=args.num_workers,
pin_memory=True,
drop_last=False,
collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id),
)
results = []
fout = open('result_{}.jsonl'.format(torch.distributed.get_rank()), 'w')
with torch.no_grad():
for _, (input_tokens, attention_mask, target_lengths, answers,
chunk_sizes, question_ids) in tqdm(enumerate(dataloader)):
outputs = model(
input_ids=input_tokens[:, :-1].cuda(),
attention_mask=attention_mask[:, :-1].cuda(),
return_dict=True,
)
losses = torch.nn.functional.cross_entropy(outputs.logits.permute(
0, 2, 1),
input_tokens[:,
1:].cuda(),
reduction='none')
losses = losses.split(chunk_sizes, dim=0)
for loss, target_length, answer, question_id in zip(losses, target_lengths,
answers, question_ids):
target_loss = loss.mean(-1)
for _ in range(len(target_length)):
target_loss[_] = loss[_, -target_length[_]:].mean()
pred = target_loss.argmin().item()
pred = chr(pred + 65)
if pred == answer:
results.append(1)
else:
results.append(0)
answer_record = {
'question_id': question_id,
'prediction': pred
}
print(json.dumps(answer_record), file=fout)
fout.close()
torch.distributed.barrier()
world_size = torch.distributed.get_world_size()
merged_results = [None for _ in range(world_size)]
torch.distributed.all_gather_object(merged_results, results)
merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)]
if torch.distributed.get_rank() == 0:
print(f"Evaluating {args.dataset} ...")
print(f'Acc@1: {sum(merged_results) / len(merged_results)}')
torch.distributed.barrier()
================================================
FILE: eval_mm/seed_bench/trans.py
================================================
import os
import av
import json
import torch
import numpy as np
from PIL import Image
from tqdm import tqdm
from decord import VideoReader, cpu
# path of SEED-Bench.json, download from https://huggingface.co/datasets/AILab-CVC/SEED-Bench/blob/main/SEED-Bench.json
seed_bench_input_path = 'SEED-Bench.json'
# root directory of evaluation dimension 1-9, following https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md
cc3m_dir = "/YOUR_PATH_TO/seed_bench_image"
# root directory of evaluation dimension 10
dimension10_dir = "/YOUR_PATH_TO/SSV2/videos"
# root directory of evaluation dimension 11
dimension11_dir = "/YOUR_PATH_TO/EPIC-KITCHENS/3h91syskeag572hl6tvuovwv4d/videos/test"
# root directory of evaluation dimension 12
dimension12_dir = "/YOUR_PATH_TO/BreakfastII_15fps_qvga_sync"
def is_integer_string(s):
try:
int(s)
return True
except ValueError:
return False
def filter_questions(data, task='all'):
if task == "image":
return [q for q in data if 1 <= q["question_type_id"] <= 9]
elif task == "video":
return [q for q in data if 10 <= q["question_type_id"] <= 12]
elif task == "all":
return data
elif is_integer_string(task):
return [q for q in data if q["question_type_id"] == int(task)]
else:
raise ValueError(f"Invalid task: {task}")
def get_index(num_frames, num_segments):
if num_segments > num_frames:
offsets = np.array([
idx for idx in range(num_frames)
])
else:
# uniform sampling
seg_size = float(num_frames - 1) / num_segments
start = int(seg_size / 2)
offsets = np.array([
start + int(np.round(seg_size * idx)) for idx in range(num_segments)
])
return offsets
with open(seed_bench_input_path) as fin:
qa_anno = json.load(fin)['questions']
fout = open('image_input.jsonl', 'w')
i_anno = filter_questions(qa_anno, 'image')
for qa_item in tqdm(i_anno):
data_path = cc3m_dir + qa_item['data_id']
choices = [qa_item['choice_a'], qa_item['choice_b'], qa_item['choice_c'], qa_item['choice_d']]
choice_list = []
for i, c in enumerate(choices):
choice_list.append('{}. {}'.format(chr(i + 65), c))
choice_txt = '\n'.join(choice_list)
prompt = '
{}\nQuestion: {}\nOptions: {}\nAnswer:'.format(
data_path, qa_item['question'], choice_txt)
print(json.dumps({
'question_id': qa_item['question_id'],
'prompt': prompt,
'answer': qa_item['answer'],
}), file=fout)
fout.close()
n_frames = 8
os.system('rm -rf video_input_' + str(n_frames))
os.makedirs('video_imgs_' + str(n_frames), exist_ok=True)
fout = open('video_input_{}.jsonl'.format(n_frames), 'w')
v_anno = filter_questions(qa_anno, 'video')
for qa_item in tqdm(v_anno):
if qa_item['question_type_id'] == 12:
data_path = dimension12_dir + qa_item['data_id']
elif qa_item['question_type_id'] == 11:
data_path = dimension11_dir + qa_item['data_id'].split('/')[-1]
elif qa_item['question_type_id'] == 10:
data_path = dimension10_dir + qa_item['data_id']
else:
assert False, str(qa_item)
print(data_path)
use_pyav = False
if 'segment' in qa_item.keys():
segment = qa_item['segment']
if isinstance(segment[0], int):
# using pyav for decoding videos in evaluation dimension 12
use_pyav = True
start, end = segment[0], segment[1]
else:
start = 0.0
end = 0.0
if use_pyav:
# using pyav for decoding videos in evaluation dimension 12
reader = av.open(data_path)
frames = [torch.from_numpy(f.to_rgb().to_ndarray()) for f in reader.decode(video=0)]
video_len = len(frames)
start_frame, end_frame = start, end
end_frame = min(end_frame, video_len)
offset = get_index(end_frame - start_frame, n_frames)
frame_indices = offset + start_frame
images = torch.stack([frames[idx] for idx in frame_indices]).numpy()
else:
# using decord for decoding videos in evaluation dimension 10-11
try:
vr = VideoReader(data_path, num_threads=1, ctx=cpu(0))
video_len = len(vr)
fps = vr.get_avg_fps()
if 'segment' in qa_item.keys():
# obtain start and end frame for the video segment in evaluation dimension 11
start_frame = int(min(max(start * fps, 0), video_len - 1))
end_frame = int(min(max(end * fps, 0), video_len - 1))
tot_frames = int(end_frame - start_frame)
offset = get_index(tot_frames, n_frames)
frame_indices = offset + start_frame
else:
# sample frames of the video in evaluation dimension 10
frame_indices = get_index(video_len - 1, n_frames)
vr.seek(0)
images = vr.get_batch(frame_indices).asnumpy()
except Exception as e:
print(json.dumps({
'question_id': qa_item['question_id'],
'prompt': "Error" + str(e),
'answer': qa_item['answer'],
}), file=fout)
continue
prompt = ''
for i in range(images.shape[0]):
data = Image.fromarray(images[i])
img_path = 'video_imgs_{}/{}_{}.jpg'.format(n_frames, qa_item['question_id'], i)
data.save(img_path)
prompt += '
' + img_path + '\n'
choices = [qa_item['choice_a'], qa_item['choice_b'], qa_item['choice_c'], qa_item['choice_d']]
choice_list = []
for i, c in enumerate(choices):
choice_list.append('{}. {}'.format(chr(i + 65), c))
choice_txt = '\n'.join(choice_list)
prompt += 'Question: {}\nOptions: {}\nAnswer:'.format(qa_item['question'], choice_txt)
print(json.dumps({
'question_id': qa_item['question_id'],
'prompt': prompt,
'answer': qa_item['answer'],
}), file=fout)
fout.close()
================================================
FILE: eval_mm/vqa.py
================================================
"""Copyright (c) 2022, salesforce.com, inc.
All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""
__author__ = 'aagrawal'
__version__ = '0.9'
# Interface for accessing the VQA dataset.
# This code is based on the code written by Tsung-Yi Lin for MSCOCO Python API available at the following link:
# (https://github.com/pdollar/coco/blob/master/PythonAPI/pycocotools/coco.py).
# The following functions are defined:
# VQA - VQA class that loads VQA annotation file and prepares data structures.
# getQuesIds - Get question ids that satisfy given filter conditions.
# getImgIds - Get image ids that satisfy given filter conditions.
# loadQA - Load questions and answers with the specified question ids.
# showQA - Display the specified questions and answers.
# loadRes - Load result file and create result object.
# Help on each function can be accessed by: "help(COCO.function)"
import copy
import datetime
import json
class VQA:
def __init__(self, annotation_file=None, question_file=None):
"""Constructor of VQA helper class for reading and visualizing
questions and answers.
:param annotation_file (str): location of VQA annotation file
:return:
"""
# load dataset
self.dataset = {}
self.questions = {}
self.qa = {}
self.qqa = {}
self.imgToQA = {}
if not annotation_file == None and not question_file == None:
print('loading VQA annotations and questions into memory...')
time_t = datetime.datetime.utcnow()
dataset = json.load(open(annotation_file, 'r'))
questions = json.load(open(question_file, 'r'))
self.dataset = dataset
self.questions = questions
self.createIndex()
def createIndex(self):
# create index
print('creating index...')
imgToQA = {ann['image_id']: [] for ann in self.dataset['annotations']}
qa = {ann['question_id']: [] for ann in self.dataset['annotations']}
qqa = {ann['question_id']: [] for ann in self.dataset['annotations']}
for ann in self.dataset['annotations']:
imgToQA[ann['image_id']] += [ann]
qa[ann['question_id']] = ann
for ques in self.questions['questions']:
qqa[ques['question_id']] = ques
print('index created!')
# create class members
self.qa = qa
self.qqa = qqa
self.imgToQA = imgToQA
def info(self):
"""Print information about the VQA annotation file.
:return:
"""
for key, value in self.datset['info'].items():
print('%s: %s' % (key, value))
def getQuesIds(self, imgIds=[], quesTypes=[], ansTypes=[]):
"""Get question ids that satisfy given filter conditions. default skips
that filter.
:param imgIds (int array) : get question ids for given imgs
quesTypes (str array) : get question ids for given question types
ansTypes (str array) : get question ids for given answer types
:return: ids (int array) : integer array of question ids
"""
imgIds = imgIds if type(imgIds) == list else [imgIds]
quesTypes = quesTypes if type(quesTypes) == list else [quesTypes]
ansTypes = ansTypes if type(ansTypes) == list else [ansTypes]
if len(imgIds) == len(quesTypes) == len(ansTypes) == 0:
anns = self.dataset['annotations']
else:
if not len(imgIds) == 0:
anns = sum(
[
self.imgToQA[imgId]
for imgId in imgIds if imgId in self.imgToQA
],
[],
)
else:
anns = self.dataset['annotations']
anns = (anns if len(quesTypes) == 0 else
[ann for ann in anns if ann['question_type'] in quesTypes])
anns = (anns if len(ansTypes) == 0 else
[ann for ann in anns if ann['answer_type'] in ansTypes])
ids = [ann['question_id'] for ann in anns]
return ids
def getImgIds(self, quesIds=[], quesTypes=[], ansTypes=[]):
"""Get image ids that satisfy given filter conditions. default skips
that filter.
:param quesIds (int array) : get image ids for given question ids
quesTypes (str array) : get image ids for given question types
ansTypes (str array) : get image ids for given answer types
:return: ids (int array) : integer array of image ids
"""
quesIds = quesIds if type(quesIds) == list else [quesIds]
quesTypes = quesTypes if type(quesTypes) == list else [quesTypes]
ansTypes = ansTypes if type(ansTypes) == list else [ansTypes]
if len(quesIds) == len(quesTypes) == len(ansTypes) == 0:
anns = self.dataset['annotations']
else:
if not len(quesIds) == 0:
anns = sum([
self.qa[quesId] for quesId in quesIds if quesId in self.qa
], [])
else:
anns = self.dataset['annotations']
anns = (anns if len(quesTypes) == 0 else
[ann for ann in anns if ann['question_type'] in quesTypes])
anns = (anns if len(ansTypes) == 0 else
[ann for ann in anns if ann['answer_type'] in ansTypes])
ids = [ann['image_id'] for ann in anns]
return ids
def loadQA(self, ids=[]):
"""Load questions and answers with the specified question ids.
:param ids (int array) : integer ids specifying question ids
:return: qa (object array) : loaded qa objects
"""
if type(ids) == list:
return [self.qa[id] for id in ids]
elif type(ids) == int:
return [self.qa[ids]]
def showQA(self, anns):
"""Display the specified annotations.
:param anns (array of object): annotations to display
:return: None
"""
if len(anns) == 0:
return 0
for ann in anns:
quesId = ann['question_id']
print('Question: %s' % (self.qqa[quesId]['question']))
for ans in ann['answers']:
print('Answer %d: %s' % (ans['answer_id'], ans['answer']))
def loadRes(self, resFile, quesFile):
"""Load result file and return a result object.
:param resFile (str) : file name of result file
:return: res (obj) : result api object
"""
res = VQA()
res.questions = json.load(open(quesFile))
res.dataset['info'] = copy.deepcopy(self.questions['info'])
res.dataset['task_type'] = copy.deepcopy(self.questions['task_type'])
res.dataset['data_type'] = copy.deepcopy(self.questions['data_type'])
res.dataset['data_subtype'] = copy.deepcopy(
self.questions['data_subtype'])
res.dataset['license'] = copy.deepcopy(self.questions['license'])
print('Loading and preparing results... ')
time_t = datetime.datetime.utcnow()
anns = json.load(open(resFile))
assert type(anns) == list, 'results is not an array of objects'
annsQuesIds = [ann['question_id'] for ann in anns]
assert set(annsQuesIds) == set(
self.getQuesIds()
), 'Results do not correspond to current VQA set. Either the results do not have predictions for all question ids in annotation file or there is atleast one question id that does not belong to the question ids in the annotation file.'
for ann in anns:
quesId = ann['question_id']
if res.dataset['task_type'] == 'Multiple Choice':
assert (
ann['answer'] in self.qqa[quesId]['multiple_choices']
), 'predicted answer is not one of the multiple choices'
qaAnn = self.qa[quesId]
ann['image_id'] = qaAnn['image_id']
ann['question_type'] = qaAnn['question_type']
ann['answer_type'] = qaAnn['answer_type']
print('DONE (t=%0.2fs)' %
((datetime.datetime.utcnow() - time_t).total_seconds()))
res.dataset['annotations'] = anns
res.createIndex()
return res
================================================
FILE: eval_mm/vqa_eval.py
================================================
"""Copyright (c) 2022, salesforce.com, inc.
All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""
# coding=utf-8
__author__ = 'aagrawal'
import re
# This code is based on the code written by Tsung-Yi Lin for MSCOCO Python API available at the following link:
# (https://github.com/tylin/coco-caption/blob/master/pycocoevalcap/eval.py).
import sys
class VQAEval:
def __init__(self, vqa=None, vqaRes=None, n=2):
self.n = n
self.accuracy = {}
self.evalQA = {}
self.evalQuesType = {}
self.evalAnsType = {}
self.vqa = vqa
self.vqaRes = vqaRes
if vqa is not None:
self.params = {'question_id': vqa.getQuesIds()}
self.contractions = {
'aint': "ain't",
'arent': "aren't",
'cant': "can't",
'couldve': "could've",
'couldnt': "couldn't",
"couldn'tve": "couldn't've",
"couldnt've": "couldn't've",
'didnt': "didn't",
'doesnt': "doesn't",
'dont': "don't",
'hadnt': "hadn't",
"hadnt've": "hadn't've",
"hadn'tve": "hadn't've",
'hasnt': "hasn't",
'havent': "haven't",
'hed': "he'd",
"hed've": "he'd've",
"he'dve": "he'd've",
'hes': "he's",
'howd': "how'd",
'howll': "how'll",
'hows': "how's",
"Id've": "I'd've",
"I'dve": "I'd've",
'Im': "I'm",
'Ive': "I've",
'isnt': "isn't",
'itd': "it'd",
"itd've": "it'd've",
"it'dve": "it'd've",
'itll': "it'll",
"let's": "let's",
'maam': "ma'am",
'mightnt': "mightn't",
"mightnt've": "mightn't've",
"mightn'tve": "mightn't've",
'mightve': "might've",
'mustnt': "mustn't",
'mustve': "must've",
'neednt': "needn't",
'notve': "not've",
'oclock': "o'clock",
'oughtnt': "oughtn't",
"ow's'at": "'ow's'at",
"'ows'at": "'ow's'at",
"'ow'sat": "'ow's'at",
'shant': "shan't",
"shed've": "she'd've",
"she'dve": "she'd've",
"she's": "she's",
'shouldve': "should've",
'shouldnt': "shouldn't",
"shouldnt've": "shouldn't've",
"shouldn'tve": "shouldn't've",
"somebody'd": 'somebodyd',
"somebodyd've": "somebody'd've",
"somebody'dve": "somebody'd've",
'somebodyll': "somebody'll",
'somebodys': "somebody's",
'someoned': "someone'd",
"someoned've": "someone'd've",
"someone'dve": "someone'd've",
'someonell': "someone'll",
'someones': "someone's",
'somethingd': "something'd",
"somethingd've": "something'd've",
"something'dve": "something'd've",
'somethingll': "something'll",
'thats': "that's",
'thered': "there'd",
"thered've": "there'd've",
"there'dve": "there'd've",
'therere': "there're",
'theres': "there's",
'theyd': "they'd",
"theyd've": "they'd've",
"they'dve": "they'd've",
'theyll': "they'll",
'theyre': "they're",
'theyve': "they've",
'twas': "'twas",
'wasnt': "wasn't",
"wed've": "we'd've",
"we'dve": "we'd've",
'weve': "we've",
'werent': "weren't",
'whatll': "what'll",
'whatre': "what're",
'whats': "what's",
'whatve': "what've",
'whens': "when's",
'whered': "where'd",
'wheres': "where's",
'whereve': "where've",
'whod': "who'd",
"whod've": "who'd've",
"who'dve": "who'd've",
'wholl': "who'll",
'whos': "who's",
'whove': "who've",
'whyll': "why'll",
'whyre': "why're",
'whys': "why's",
'wont': "won't",
'wouldve': "would've",
'wouldnt': "wouldn't",
"wouldnt've": "wouldn't've",
"wouldn'tve": "wouldn't've",
'yall': "y'all",
"yall'll": "y'all'll",
"y'allll": "y'all'll",
"yall'd've": "y'all'd've",
"y'alld've": "y'all'd've",
"y'all'dve": "y'all'd've",
'youd': "you'd",
"youd've": "you'd've",
"you'dve": "you'd've",
'youll': "you'll",
'youre': "you're",
'youve': "you've",
}
self.manualMap = {
'none': '0',
'zero': '0',
'one': '1',
'two': '2',
'three': '3',
'four': '4',
'five': '5',
'six': '6',
'seven': '7',
'eight': '8',
'nine': '9',
'ten': '10',
}
self.articles = ['a', 'an', 'the']
self.periodStrip = re.compile('(?!<=\d)(\.)(?!\d)')
self.commaStrip = re.compile('(\d)(,)(\d)')
self.punct = [
';',
r'/',
'[',
']',
'"',
'{',
'}',
'(',
')',
'=',
'+',
'\\',
'_',
'-',
'>',
'<',
'@',
'`',
',',
'?',
'!',
]
def evaluate(self, quesIds=None):
if quesIds == None:
quesIds = [quesId for quesId in self.params['question_id']]
gts = {}
res = {}
for quesId in quesIds:
gts[quesId] = self.vqa.qa[quesId]
res[quesId] = self.vqaRes.qa[quesId]
# =================================================
# Compute accuracy
# =================================================
accQA = []
accQuesType = {}
accAnsType = {}
print('computing accuracy')
step = 0
for quesId in quesIds:
resAns = res[quesId]['answer']
resAns = resAns.replace('\n', ' ')
resAns = resAns.replace('\t', ' ')
resAns = resAns.strip()
resAns = self.processPunctuation(resAns)
resAns = self.processDigitArticle(resAns)
gtAcc = []
gtAnswers = [ans['answer'] for ans in gts[quesId]['answers']]
if len(set(gtAnswers)) > 1:
for ansDic in gts[quesId]['answers']:
ansDic['answer'] = self.processPunctuation(
ansDic['answer'])
for gtAnsDatum in gts[quesId]['answers']:
otherGTAns = [
item for item in gts[quesId]['answers']
if item != gtAnsDatum
]
matchingAns = [
item for item in otherGTAns if item['answer'] == resAns
]
acc = min(1, float(len(matchingAns)) / 3)
gtAcc.append(acc)
quesType = gts[quesId]['question_type']
ansType = gts[quesId]['answer_type']
avgGTAcc = float(sum(gtAcc)) / len(gtAcc)
accQA.append(avgGTAcc)
if quesType not in accQuesType:
accQuesType[quesType] = []
accQuesType[quesType].append(avgGTAcc)
if ansType not in accAnsType:
accAnsType[ansType] = []
accAnsType[ansType].append(avgGTAcc)
self.setEvalQA(quesId, avgGTAcc)
self.setEvalQuesType(quesId, quesType, avgGTAcc)
self.setEvalAnsType(quesId, ansType, avgGTAcc)
if step % 100 == 0:
self.updateProgress(step / float(len(quesIds)))
step = step + 1
self.setAccuracy(accQA, accQuesType, accAnsType)
print('Done computing accuracy')
def processPunctuation(self, inText):
outText = inText
for p in self.punct:
if (p + ' ' in inText or ' ' + p
in inText) or (re.search(self.commaStrip, inText) != None):
outText = outText.replace(p, '')
else:
outText = outText.replace(p, ' ')
outText = self.periodStrip.sub('', outText, re.UNICODE)
return outText
def processDigitArticle(self, inText):
outText = []
tempText = inText.lower().split()
for word in tempText:
word = self.manualMap.setdefault(word, word)
if word not in self.articles:
outText.append(word)
else:
pass
for wordId, word in enumerate(outText):
if word in self.contractions:
outText[wordId] = self.contractions[word]
outText = ' '.join(outText)
return outText
def setAccuracy(self, accQA, accQuesType, accAnsType):
self.accuracy['overall'] = round(100 * float(sum(accQA)) / len(accQA),
self.n)
self.accuracy['perQuestionType'] = {
quesType: round(
100 * float(sum(accQuesType[quesType])) /
len(accQuesType[quesType]),
self.n,
)
for quesType in accQuesType
}
self.accuracy['perAnswerType'] = {
ansType: round(
100 * float(sum(accAnsType[ansType])) /
len(accAnsType[ansType]), self.n)
for ansType in accAnsType
}
def setEvalQA(self, quesId, acc):
self.evalQA[quesId] = round(100 * acc, self.n)
def setEvalQuesType(self, quesId, quesType, acc):
if quesType not in self.evalQuesType:
self.evalQuesType[quesType] = {}
self.evalQuesType[quesType][quesId] = round(100 * acc, self.n)
def setEvalAnsType(self, quesId, ansType, acc):
if ansType not in self.evalAnsType:
self.evalAnsType[ansType] = {}
self.evalAnsType[ansType][quesId] = round(100 * acc, self.n)
def updateProgress(self, progress):
barLength = 20
status = ''
if isinstance(progress, int):
progress = float(progress)
if not isinstance(progress, float):
progress = 0
status = 'error: progress var must be float\r\n'
if progress < 0:
progress = 0
status = 'Halt...\r\n'
if progress >= 1:
progress = 1
status = 'Done...\r\n'
block = int(round(barLength * progress))
text = '\rFinshed Percent: [{0}] {1}% {2}'.format(
'#' * block + '-' * (barLength - block), int(progress * 100),
status)
sys.stdout.write(text)
sys.stdout.flush()
================================================
FILE: finetune/ds_config_zero2.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
================================================
FILE: finetune/ds_config_zero3.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
================================================
FILE: finetune/finetune_ds.sh
================================================
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--fix_vit True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--deepspeed finetune/ds_config_zero3.json
================================================
FILE: finetune/finetune_lora_ds.sh
================================================
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--fix_vit True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--use_lora \
--gradient_checkpointing \
--deepspeed finetune/ds_config_zero2.json
================================================
FILE: finetune/finetune_lora_single_gpu.sh
================================================
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"
export CUDA_VISIBLE_DEVICES=0
python finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--fix_vit True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--gradient_checkpointing \
--use_lora
================================================
FILE: finetune/finetune_qlora_ds.sh
================================================
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
MODEL="Qwen/Qwen-VL-Chat-Int4" # Qwen/Qwen-VL-Chat-Int4 Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Remember to use --fp16 instead of --bf16 due to autogptq
torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--fp16 True \
--fix_vit True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--use_lora \
--q_lora \
--gradient_checkpointing \
--deepspeed finetune/ds_config_zero2.json
================================================
FILE: finetune/finetune_qlora_single_gpu.sh
================================================
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
MODEL="Qwen/Qwen-VL-Chat-Int4" # Qwen/Qwen-VL-Chat-Int4 Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"
export CUDA_VISIBLE_DEVICES=0
# Remember to use --fp16 instead of --bf16 due to autogptq
python finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--fp16 True \
--fix_vit True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--gradient_checkpointing \
--use_lora \
--q_lora \
--deepspeed finetune/ds_config_zero2.json
================================================
FILE: finetune.py
================================================
# This code is based on the revised code from fastchat based on tatsu-lab/stanford_alpaca.
from dataclasses import dataclass, field
import json
import math
import logging
import os
from typing import Dict, Optional, List
import torch
from torch.utils.data import Dataset
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
import transformers
from transformers import Trainer, GPTQConfig, deepspeed
from transformers.trainer_pt_utils import LabelSmoother
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from accelerate.utils import DistributedType
IGNORE_TOKEN_ID = LabelSmoother.ignore_index
@dataclass
class ModelArguments:
model_name_or_path: Optional[str] = field(default="Qwen/Qwen-7B")
@dataclass
class DataArguments:
data_path: str = field(
default=None, metadata={"help": "Path to the training data."}
)
eval_data_path: str = field(
default=None, metadata={"help": "Path to the evaluation data."}
)
lazy_preprocess: bool = False
@dataclass
class TrainingArguments(transformers.TrainingArguments):
cache_dir: Optional[str] = field(default=None)
optim: str = field(default="adamw_torch")
model_max_length: int = field(
default=8192,
metadata={
"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
},
)
use_lora: bool = False
fix_vit: bool = True
@dataclass
class LoraArguments:
lora_r: int = 64
lora_alpha: int = 16
lora_dropout: float = 0.05
lora_target_modules: List[str] = field(
default_factory=lambda: ["c_attn", "attn.c_proj", "w1", "w2"] ##["in_proj","out_proj","c_fc"]
)
lora_weight_path: str = ""
lora_bias: str = "none"
q_lora: bool = False
def maybe_zero_3(param):
if hasattr(param, "ds_id"):
assert param.ds_status == ZeroParamStatus.NOT_AVAILABLE
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(named_params, bias):
if bias == "none":
to_return = {k: t for k, t in named_params if "lora_" in k}
elif bias == "all":
to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
elif bias == "lora_only":
to_return = {}
maybe_lora_bias = {}
lora_bias_names = set()
for k, t in named_params:
if "lora_" in k:
to_return[k] = t
bias_name = k.split("lora_")[0] + "bias"
lora_bias_names.add(bias_name)
elif "bias" in k:
maybe_lora_bias[k] = t
for k, t in maybe_lora_bias:
if bias_name in lora_bias_names:
to_return[bias_name] = t
else:
raise NotImplementedError
to_return = {k: maybe_zero_3(v) for k, v in to_return.items()}
return to_return
local_rank = None
def rank0_print(*args):
if local_rank == 0:
print(*args)
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str, bias="none"):
"""Collects the state dict and dump to disk."""
# check if zero3 mode enabled
if deepspeed.is_deepspeed_zero3_enabled():
state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
else:
if trainer.args.use_lora:
state_dict = get_peft_state_maybe_zero_3(
trainer.model.named_parameters(), bias
)
else:
state_dict = trainer.model.state_dict()
if trainer.args.should_save and trainer.args.local_rank == 0:
trainer._save(output_dir, state_dict=state_dict)
def preprocess(
sources,
tokenizer: transformers.PreTrainedTokenizer,
max_len: int,
system_message: str = "You are a helpful assistant."
) -> Dict:
roles = {"user": "<|im_start|>user", "assistant": "<|im_start|>assistant"}
im_start = tokenizer.im_start_id
im_end = tokenizer.im_end_id
nl_tokens = tokenizer('\n').input_ids
_system = tokenizer('system').input_ids + nl_tokens
_user = tokenizer('user').input_ids + nl_tokens
_assistant = tokenizer('assistant').input_ids + nl_tokens
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["user"]:
source = source[1:]
input_id, target = [], []
system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
input_id += system
target += [im_start] + [IGNORE_TOKEN_ID] * (len(system)-3) + [im_end] + nl_tokens
assert len(input_id) == len(target)
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
_input_id = tokenizer(role).input_ids + nl_tokens + \
tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
input_id += _input_id
if role == '<|im_start|>user':
_target = [im_start] + [IGNORE_TOKEN_ID] * (len(_input_id)-3) + [im_end] + nl_tokens
elif role == '<|im_start|>assistant':
_target = [im_start] + [IGNORE_TOKEN_ID] * len(tokenizer(role).input_ids) + \
_input_id[len(tokenizer(role).input_ids)+1:-2] + [im_end] + nl_tokens
else:
raise NotImplementedError
target += _target
assert len(input_id) == len(target)
input_id += [tokenizer.pad_token_id] * (max_len - len(input_id))
target += [IGNORE_TOKEN_ID] * (max_len - len(target))
input_ids.append(input_id[:max_len])
targets.append(target[:max_len])
input_ids = torch.tensor(input_ids, dtype=torch.int)
targets = torch.tensor(targets, dtype=torch.int)
return dict(
input_ids=input_ids,
labels=targets,
attention_mask=input_ids.ne(tokenizer.pad_token_id),
)
class SupervisedDataset(Dataset):
"""Dataset for supervised fine-tuning."""
def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int):
super(SupervisedDataset, self).__init__()
rank0_print("Formatting inputs...")
sources = [example["conversations"] for example in raw_data]
data_dict = preprocess(sources, tokenizer, max_len)
self.input_ids = data_dict["input_ids"]
self.labels = data_dict["labels"]
self.attention_mask = data_dict["attention_mask"]
def __len__(self):
return len(self.input_ids)
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
return dict(
input_ids=self.input_ids[i],
labels=self.labels[i],
attention_mask=self.attention_mask[i],
)
class LazySupervisedDataset(Dataset):
"""Dataset for supervised fine-tuning."""
def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int):
super(LazySupervisedDataset, self).__init__()
self.tokenizer = tokenizer
self.max_len = max_len
rank0_print("Formatting inputs...Skip in lazy mode")
self.tokenizer = tokenizer
self.raw_data = raw_data
self.cached_data_dict = {}
def __len__(self):
return len(self.raw_data)
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
if i in self.cached_data_dict:
return self.cached_data_dict[i]
ret = preprocess([self.raw_data[i]["conversations"]], self.tokenizer, self.max_len)
ret = dict(
input_ids=ret["input_ids"][0],
labels=ret["labels"][0],
attention_mask=ret["attention_mask"][0],
)
self.cached_data_dict[i] = ret
return ret
def make_supervised_data_module(
tokenizer: transformers.PreTrainedTokenizer, data_args, max_len,
) -> Dict:
"""Make dataset and collator for supervised fine-tuning."""
dataset_cls = (
LazySupervisedDataset if data_args.lazy_preprocess else SupervisedDataset
)
rank0_print("Loading data...")
train_json = json.load(open(data_args.data_path, "r"))
train_dataset = dataset_cls(train_json, tokenizer=tokenizer, max_len=max_len)
if data_args.eval_data_path:
eval_json = json.load(open(data_args.eval_data_path, "r"))
eval_dataset = dataset_cls(eval_json, tokenizer=tokenizer, max_len=max_len)
else:
eval_dataset = None
return dict(train_dataset=train_dataset, eval_dataset=eval_dataset)
def train():
global local_rank
parser = transformers.HfArgumentParser(
(ModelArguments, DataArguments, TrainingArguments, LoraArguments)
)
(
model_args,
data_args,
training_args,
lora_args,
) = parser.parse_args_into_dataclasses()
if getattr(training_args, 'deepspeed', None) and getattr(lora_args, 'q_lora', False):
training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED
compute_dtype = (
torch.float16
if training_args.fp16
else (torch.bfloat16 if training_args.bf16 else torch.float32)
)
local_rank = training_args.local_rank
device_map = None
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if lora_args.q_lora:
device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
logging.warning(
"FSDP or ZeRO3 are not incompatible with QLoRA."
)
# Set RoPE scaling factor
config = transformers.AutoConfig.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
trust_remote_code=True,
)
config.use_cache = False
# Load model and tokenizer
model = transformers.AutoModelForCausalLM.from_pretrained(
model_args.model_name_or_path,
config=config,
cache_dir=training_args.cache_dir,
device_map=device_map,
trust_remote_code=True,
quantization_config=GPTQConfig(
bits=4, disable_exllama=True
)
if training_args.use_lora and lora_args.q_lora
else None,
)
if not training_args.use_lora:
if training_args.fix_vit and hasattr(model,'transformer') and hasattr(model.transformer,'visual'):
model.transformer.visual.requires_grad_(False)
if hasattr(model.transformer.visual,'attn_pool'):
model.transformer.visual.attn_pool.requires_grad_(True)
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
model_max_length=training_args.model_max_length,
padding_side="right",
use_fast=False,
trust_remote_code=True,
)
tokenizer.pad_token_id = tokenizer.eod_id
if training_args.use_lora:
if lora_args.q_lora or "chat" in model_args.model_name_or_path.lower():
modules_to_save = None
else:
modules_to_save = ["wte", "lm_head"]
lora_config = LoraConfig(
r=lora_args.lora_r,
lora_alpha=lora_args.lora_alpha,
target_modules=lora_args.lora_target_modules,
lora_dropout=lora_args.lora_dropout,
bias=lora_args.lora_bias,
task_type="CAUSAL_LM",
modules_to_save=modules_to_save # This argument serves for adding new tokens.
)
if lora_args.q_lora:
model = prepare_model_for_kbit_training(
model, use_gradient_checkpointing=training_args.gradient_checkpointing
)
model = get_peft_model(model, lora_config)
if training_args.gradient_checkpointing:
model.enable_input_require_grads()
# Load data
data_module = make_supervised_data_module(
tokenizer=tokenizer, data_args=data_args, max_len=training_args.model_max_length
)
# Start trainner
trainer = Trainer(
model=model, tokenizer=tokenizer, args=training_args, **data_module
)
trainer.train()
trainer.save_state()
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir, bias=lora_args.lora_bias)
if __name__ == "__main__":
train()
================================================
FILE: openai_api.py
================================================
# coding=utf-8
# Implements API for Qwen-7B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat)
# Usage: python openai_api.py
# Visit http://localhost:8000/docs for documents.
import re
import copy
import json
import time
from argparse import ArgumentParser
from contextlib import asynccontextmanager
from typing import Dict, List, Literal, Optional, Union
import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig
@asynccontextmanager
async def lifespan(app: FastAPI): # collects GPU memory
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class ChatMessage(BaseModel):
role: Literal["user", "assistant", "system", "function"]
content: Optional[str]
function_call: Optional[Dict] = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
functions: Optional[List[Dict]] = None
temperature: Optional[float] = None
top_p: Optional[float] = None
max_length: Optional[int] = None
stream: Optional[bool] = False
stop: Optional[List[str]] = None
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: Literal["stop", "length", "function_call"]
class ChatCompletionResponseStreamChoice(BaseModel):
index: int
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length"]]
class ChatCompletionResponse(BaseModel):
model: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[
Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]
]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
@app.get("/v1/models", response_model=ModelList)
async def list_models():
global model_args
model_card = ModelCard(id="gpt-3.5-turbo")
return ModelList(data=[model_card])
# To work around that unpleasant leading-\n tokenization issue!
def add_extra_stop_words(stop_words):
if stop_words:
_stop_words = []
_stop_words.extend(stop_words)
for x in stop_words:
s = x.lstrip("\n")
if s and (s not in _stop_words):
_stop_words.append(s)
return _stop_words
return stop_words
def trim_stop_words(response, stop_words):
if stop_words:
for stop in stop_words:
idx = response.find(stop)
if idx != -1:
response = response[:idx]
return response
TOOL_DESC = """{name_for_model}: Call this tool to interact with the {name_for_human} API. What is the {name_for_human} API useful for? {description_for_model} Parameters: {parameters}"""
REACT_INSTRUCTION = """Answer the following questions as best you can. You have access to the following APIs:
{tools_text}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tools_name_text}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!"""
_TEXT_COMPLETION_CMD = object()
#
# Temporarily, the system role does not work as expected.
# We advise that you write the setups for role-play in your query,
# i.e., use the user role instead of the system role.
#
# TODO: Use real system role when the model is ready.
#
def parse_messages(messages, functions):
if all(m.role != "user" for m in messages):
raise HTTPException(
status_code=400,
detail=f"Invalid request: Expecting at least one user message.",
)
messages = copy.deepcopy(messages)
default_system = "You are a helpful assistant."
system = ""
if messages[0].role == "system":
system = messages.pop(0).content.lstrip("\n").rstrip()
if system == default_system:
system = ""
if functions:
tools_text = []
tools_name_text = []
for func_info in functions:
name = func_info.get("name", "")
name_m = func_info.get("name_for_model", name)
name_h = func_info.get("name_for_human", name)
desc = func_info.get("description", "")
desc_m = func_info.get("description_for_model", desc)
tool = TOOL_DESC.format(
name_for_model=name_m,
name_for_human=name_h,
# Hint: You can add the following format requirements in description:
# "Format the arguments as a JSON object."
# "Enclose the code within triple backticks (`) at the beginning and end of the code."
description_for_model=desc_m,
parameters=json.dumps(func_info["parameters"], ensure_ascii=False),
)
tools_text.append(tool)
tools_name_text.append(name_m)
tools_text = "\n\n".join(tools_text)
tools_name_text = ", ".join(tools_name_text)
system += "\n\n" + REACT_INSTRUCTION.format(
tools_text=tools_text,
tools_name_text=tools_name_text,
)
system = system.lstrip("\n").rstrip()
dummy_thought = {
"en": "\nThought: I now know the final answer.\nFinal answer: ",
"zh": "\nThought: 我会作答了。\nFinal answer: ",
}
_messages = messages
messages = []
for m_idx, m in enumerate(_messages):
role, content, func_call = m.role, m.content, m.function_call
if content:
content = content.lstrip("\n").rstrip()
if role == "function":
if (len(messages) == 0) or (messages[-1].role != "assistant"):
raise HTTPException(
status_code=400,
detail=f"Invalid request: Expecting role assistant before role function.",
)
messages[-1].content += f"\nObservation: {content}"
if m_idx == len(_messages) - 1:
messages[-1].content += "\nThought:"
elif role == "assistant":
if len(messages) == 0:
raise HTTPException(
status_code=400,
detail=f"Invalid request: Expecting role user before role assistant.",
)
last_msg = messages[-1].content
last_msg_has_zh = len(re.findall(r"[\u4e00-\u9fff]+", last_msg)) > 0
if func_call is None:
if functions:
content = dummy_thought["zh" if last_msg_has_zh else "en"] + content
else:
f_name, f_args = func_call["name"], func_call["arguments"]
if not content:
if last_msg_has_zh:
content = f"Thought: 我可以使用 {f_name} API。"
else:
content = f"Thought: I can use {f_name}."
content = f"\n{content}\nAction: {f_name}\nAction Input: {f_args}"
if messages[-1].role == "user":
messages.append(
ChatMessage(role="assistant", content=content.lstrip("\n").rstrip())
)
else:
messages[-1].content += content
elif role == "user":
messages.append(
ChatMessage(role="user", content=content.lstrip("\n").rstrip())
)
else:
raise HTTPException(
status_code=400, detail=f"Invalid request: Incorrect role {role}."
)
query = _TEXT_COMPLETION_CMD
if messages[-1].role == "user":
query = messages[-1].content
messages = messages[:-1]
if len(messages) % 2 != 0:
raise HTTPException(status_code=400, detail="Invalid request")
history = [] # [(Q1, A1), (Q2, A2), ..., (Q_last_turn, A_last_turn)]
for i in range(0, len(messages), 2):
if messages[i].role == "user" and messages[i + 1].role == "assistant":
usr_msg = messages[i].content.lstrip("\n").rstrip()
bot_msg = messages[i + 1].content.lstrip("\n").rstrip()
if system and (i == len(messages) - 2):
usr_msg = f"{system}\n\nQuestion: {usr_msg}"
system = ""
for t in dummy_thought.values():
t = t.lstrip("\n")
if bot_msg.startswith(t) and ("\nAction: " in bot_msg):
bot_msg = bot_msg[len(t) :]
history.append([usr_msg, bot_msg])
else:
raise HTTPException(
status_code=400,
detail="Invalid request: Expecting exactly one user (or function) role before every assistant role.",
)
if system:
assert query is not _TEXT_COMPLETION_CMD
query = f"{system}\n\nQuestion: {query}"
return query, history
def parse_response(response):
func_name, func_args = "", ""
i = response.rfind("\nAction:")
j = response.rfind("\nAction Input:")
k = response.rfind("\nObservation:")
if 0 <= i < j: # If the text has `Action` and `Action input`,
if k < j: # but does not contain `Observation`,
# then it is likely that `Observation` is omitted by the LLM,
# because the output text may have discarded the stop word.
response = response.rstrip() + "\nObservation:" # Add it back.
k = response.rfind("\nObservation:")
func_name = response[i + len("\nAction:") : j].strip()
func_args = response[j + len("\nAction Input:") : k].strip()
if func_name:
choice_data = ChatCompletionResponseChoice(
index=0,
message=ChatMessage(
role="assistant",
content=response[:i],
function_call={"name": func_name, "arguments": func_args},
),
finish_reason="function_call",
)
return choice_data
z = response.rfind("\nFinal Answer: ")
if z >= 0:
response = response[z + len("\nFinal Answer: ") :]
choice_data = ChatCompletionResponseChoice(
index=0,
message=ChatMessage(role="assistant", content=response),
finish_reason="stop",
)
return choice_data
# completion mode, not chat mode
def text_complete_last_message(history, stop_words_ids):
im_start = "<|im_start|>"
im_end = "<|im_end|>"
prompt = f"{im_start}system\nYou are a helpful assistant.{im_end}"
for i, (query, response) in enumerate(history):
query = query.lstrip("\n").rstrip()
response = response.lstrip("\n").rstrip()
prompt += f"\n{im_start}user\n{query}{im_end}"
prompt += f"\n{im_start}assistant\n{response}{im_end}"
prompt = prompt[: -len(im_end)]
_stop_words_ids = [tokenizer.encode(im_end)]
if stop_words_ids:
for s in stop_words_ids:
_stop_words_ids.append(s)
stop_words_ids = _stop_words_ids
input_ids = torch.tensor([tokenizer.encode(prompt)]).to(model.device)
output = model.generate(input_ids, stop_words_ids=stop_words_ids).tolist()[0]
output = tokenizer.decode(output, errors="ignore")
assert output.startswith(prompt)
output = output[len(prompt) :]
output = trim_stop_words(output, ["<|endoftext|>", im_end])
print(f"
**TOUCHSTONE** is a comprehensive assessment of multimodal language models, encompassing not only basic recognition and comprehension but also extending to literary creation. By automating the evaluation process and converting multimodal information into text, our TouchStone allows for efficient and accurate assessment of dialogue quality, leveraging the power of advanced language models without the need for manual intervention.
## DATASET
To evaluate the abilities of LVLMs, we construct a diverse and comprehensive dataset that covers five key dimensions: basic descriptive ability, visual recognition ability, visual comprehension ability, visual storytelling ability, and multi-image analysis ability.
- **Basic Descriptive Ability** Image description involves the ability of a model to describe the information contained in an image, including simple and detailed descriptions. Simple descriptions are typically short phrases that describe the main subject and action of the image, while detailed descriptions provide more in-depth information about the image scene, their attributes, and relationships.
- **Visual Recognition Ability** Image recognition is the task of recognizing objects or scenes within an image and inferring relevant information. This area can be further divided into several sub-tasks, including attribute QA, movie/TV recognition, art recognition, landmark recognition, celebrity recognition, emotion recognition, text recognition, object recognition, and structure content recognition.
- **Visual Comprehension Ability** Image understanding involves the ability of a model to understand the meaning of an image and associated tasks. This area encompasses several sub-tasks, such as style appreciation, abstract image understanding, meme understanding, image analysis, chart analysis, general problem-solving, and reasoning QA.
- **Visual Storytelling Ability** The visual storytelling ability is the process of literary creation based on visual content, including writing emails, poetry, stories, ads/commodity recommendations, and brainstorming.
- **Multi-Image Analysis Ability** Multi-image analysis is the task of analyzing and comparing multiple images. This area includes tasks such as comparing two/multiple images, summarizing multiple image information, comparing commodities, and step-by-step analysis of images.
**TOUCHSTONE** 是一种针对多模态语言模型(LVLM)的自动化综合评估方法,评估不仅包括基本的认知和理解,还延伸到文学创作。通过人类注解将多模态信息转换为文本,我们的 TouchStone 可以利用SOTA的语言模型来自动化地完成对LVLMs的多模态对话质量评估。
## 数据集
为了评估 LVLMs 的能力,我们构建了一个多样化且全面的数据集,涵盖五个关键维度:基本描述能力、视觉识别能力、视觉理解能力、视觉叙事能力和多图分析能力。
- **基本描述能力** 图像描述考验模型总结图片信息的能力,包括简单描述和详细描述。 简单描述通常是描述图像的主要内容和关系的简短短语,而详细描述则提供有关图像场景、其属性和关系的更深入的信息。
- **视觉识别能力** 图像识别考察模型提取图像中内容的属性以及关联到知识库的能力。为了考察这方面能力,测试的问题包括属性QA、影视识别、艺术识别、地标识别、名人识别、情感识别、文本识别、物体识别和结构内容识别。
- **视觉理解能力** 图像理解需要模型理解图像内容并完成推理进行相关任务。 这方面包含了例如风格欣赏、抽象图像理解、模因理解、图像分析、图表分析、一般问题解决和推理问答等任务。
- **视觉叙事能力** 视觉叙事能力是基于视觉内容的文学创作能力,包括撰写电子邮件、诗歌、故事、广告/商品推荐、头脑风暴等。
- **多图分析能力** 多图分析是分析和比较多幅图像的任务。该领域包括比较两个/多个图像、总结多个图像信息、比较商品以及逐步分析图像等任务。
**TOUCHSTONE** は、マルチモーダル言語モデルの包括的な評価であり、基本的な認識や理解だけでなく、文学的な創作にまで及びます。評価プロセスを自動化し、マルチモーダル情報をテキストに変換することで、私達の TouchStone は、人手を介することなく高度な言語モデルの力を活用し、対話の質を効率的かつ正確に評価することができます。
## DATASET
LVLMの能力を評価するために、基本的な記述能力、視覚認識能力、視覚理解能力、視覚ストーリーテリング能力、複数画像解析能力の5つの主要な次元をカバーする多様で包括的なデータセットを構築する。
- **基本的描写力** 画像記述には、単純な記述と詳細な記述を含め、画像に含まれる情報を記述するモデルの能力が含まれる。単純な記述は、通常、画像の主な主題とアクションを記述する短いフレーズであり、詳細な記述は、画像のシーン、それらの属性、および関係についてのより詳細な情報を提供します。
- **視覚認識能力** 画像認識とは、画像内のオブジェクトやシーンを認識し、関連情報を推論するタスクである。この分野はさらに、属性QA、映画/テレビ認識、アート認識、ランドマーク認識、有名人認識、感情認識、テキスト認識、オブジェクト認識、構造コンテンツ認識など、いくつかのサブタスクに分けることができる。
- **視覚理解能力** 画像理解とは、モデルが画像の意味や関連するタスクを理解する能力のことである。この分野には、スタイル理解、抽象画像理解、ミーム理解、画像分析、チャート分析、一般的な問題解決、推論QAなど、いくつかのサブタスクが含まれる。
- **視覚的ストーリーテリング能力** ビジュアルストーリーテリング能力とは、メール、詩、物語、広告/商品推薦、ブレーンストーミングの執筆など、ビジュアルコンテンツに基づいた文学創作のプロセスである。
- **マルチ画像解析能力** 複数画像解析とは、複数の画像を解析・比較する作業である。この分野には、2つまたは複数の画像を比較する、複数の画像情報を要約する、商品を比較する、画像を段階的に分析するなどのタスクが含まれます。
**터치스톤, TOUCHSTONE**은 기본적인 인식과 이해력뿐만 아니라 문학 창작까지 아우르는 종합적인 멀티모달 언어 모델 평가입니다. 평가 프로세스를 자동화하고 멀티모달 정보를 텍스트로 변환하는 터치스톤은 수동 개입 없이도 고급 언어 모델의 성능을 활용하여 대화 품질을 효율적이고 정확하게 평가할 수 있도록 지원합니다.
## DATASET
머신러닝의 능력을 평가하기 위해 기본 설명 능력, 시각 인식 능력, 시각 이해 능력, 시각 스토리텔링 능력, 다중 이미지 분석 능력 등 5가지 주요 모달을 포괄하는 다양하고 광범위한 데이터 세트를 구축합니다.
- **기본 설명 능력, Basic Descriptive Ability** 이미지 설명에는 단순 설명과 상세 설명을 포함하여 이미지에 포함된 정보를 설명하는 모델의 능력이 포함됩니다. 단순 설명은 일반적으로 이미지의 주요 주제와 동작을 설명하는 짧은 문구로 상세 설명은 이미지 장면, 속성 및 관계에 대한 보다 심층적인 정보를 제공합니다.
- **시각적 인식 능력, Visual Recognition Ability** 이미지 인식은 이미지 내의 사물이나 장면을 인식하고 관련 정보를 추론하는 작업입니다. 이 영역은 속성 QA, 영화/TV 인식, 예술 인식, 랜드마크 인식, 유명인 인식, 감정 인식, 텍스트 인식, 사물 인식, 구조물 내용 인식 등 여러 하위 작업으로 세분화할 수 있습니다.
- **시각적 이해 능력, Visual Comprehension Ability** 이미지 이해에는 이미지의 의미와 관련 작업을 이해하는 모델의 능력이 포함됩니다. 이 영역에는 스타일 감상, 추상적 이미지 이해, 밈 이해, 이미지 분석, 차트 분석, 일반적인 문제 해결, 추론 QA와 같은 여러 하위 작업이 포함됩니다.
- **시각적 스토리텔링 능력, Visual Storytelling Ability** 시각적 스토리텔링 능력은 이메일, 시, 스토리, 광고/상품 추천, 브레인스토밍 등 시각적 콘텐츠를 기반으로 문학적 창작을 하는 과정입니다.
- **다중 이미지 분석 능력, Multi-Image Analysis Ability** 다중 이미지 분석은 여러 이미지를 분석하고 비교하는 작업입니다. 이 영역에는 두 개/여러 개의 이미지 비교, 여러 이미지 정보 요약, 상품 비교, 이미지의 단계별 분석 등의 작업이 포함됩니다.
"
else:
if i > 0:
if count % 2 == 1:
line = line.replace("`", r"\`")
line = line.replace("<", "<")
line = line.replace(">", ">")
line = line.replace(" ", " ")
line = line.replace("*", "*")
line = line.replace("_", "_")
line = line.replace("-", "-")
line = line.replace(".", ".")
line = line.replace("!", "!")
line = line.replace("(", "(")
line = line.replace(")", ")")
line = line.replace("$", "$")
lines[i] = "'
else:
lines[i] = f"
" + line
text = "".join(lines)
return text
def _remove_image_special(text):
text = text.replace('', '').replace('', '')
return re.sub(r'{q[0]}'
pre += q + '\n'
pic_idx += 1
else:
pre += q
history_filter.append((pre, a))
pre = ""
history, message = history_filter[:-1], history_filter[-1][0]
# response, history = model.chat(tokenizer, message, history=history)
for response in model.chat_stream(tokenizer, message, history=history):
_chatbot[-1] = (_parse_text(chat_query), _remove_image_special(_parse_text(response)))
yield _chatbot
full_response = _parse_text(response)
response = full_response
history.append((message, response))
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image is not None:
temp_dir = secrets.token_hex(20)
temp_dir = Path(uploaded_file_dir) / temp_dir
temp_dir.mkdir(exist_ok=True, parents=True)
name = f"tmp{secrets.token_hex(5)}.jpg"
filename = temp_dir / name
image.save(str(filename))
_chatbot.append((None, (str(filename),)))
else:
_chatbot[-1] = (_parse_text(chat_query), response)
# full_response = _parse_text(response)
task_history[-1] = (query, full_response)
print("Qwen-VL-Chat: " + _parse_text(full_response))
yield _chatbot
def regenerate(_chatbot, task_history):
if not task_history:
return _chatbot
item = task_history[-1]
if item[1] is None:
return _chatbot
task_history[-1] = (item[0], None)
chatbot_item = _chatbot.pop(-1)
if chatbot_item[0] is None:
_chatbot[-1] = (_chatbot[-1][0], None)
else:
_chatbot.append((chatbot_item[0], None))
return predict(_chatbot, task_history)
def add_text(history, task_history, text):
task_text = text
if len(text) >= 2 and text[-1] in PUNCTUATION and text[-2] not in PUNCTUATION:
task_text = text[:-1]
history = history + [(_parse_text(text), None)]
task_history = task_history + [(task_text, None)]
return history, task_history, ""
def add_file(history, task_history, file):
history = history + [((file.name,), None)]
task_history = task_history + [((file.name,), None)]
return history, task_history
def reset_user_input():
return gr.update(value="")
def reset_state(task_history):
task_history.clear()
return []
with gr.Blocks() as demo:
gr.Markdown("""\
