" labels: [] body: - type: checkboxes attributes: label: 是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this? description: | 请先搜索您遇到的错误是否在已有的issues或讨论中提到过。 Please search to see if an issue / discussion already exists for the bug you encountered. [Issues](https://github.com/QwenLM/Qwen-7B/issues) [Discussions](https://github.com/QwenLM/Qwen-7B/discussions) options: - label: 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions required: true - type: checkboxes attributes: label: 该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ? description: | 请先搜索您遇到的错误是否已在FAQ中有相关解答。 Please search to see if an answer already exists in FAQ for the bug you encountered. [FAQ-en](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ.md) [FAQ-zh](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ_zh.md) options: - label: 我已经搜索过FAQ | I have searched FAQ required: true - type: textarea attributes: label: 当前行为 | Current Behavior description: | 准确描述遇到的行为。 A concise description of what you're experiencing. validations: required: false - type: textarea attributes: label: 期望行为 | Expected Behavior description: | 准确描述预期的行为。 A concise description of what you expected to happen. validations: required: false - type: textarea attributes: label: 复现方法 | Steps To Reproduce description: | 复现当前行为的详细步骤。 Steps to reproduce the behavior. placeholder: | 1. In this environment... 2. With this config... 3. Run '...' 4. See error... validations: required: false - type: textarea attributes: label: 运行环境 | Environment description: | examples: - **OS**: Ubuntu 20.04 - **Python**: 3.8 - **Transformers**: 4.31.0 - **PyTorch**: 2.0.1 - **CUDA**: 11.4 value: | - OS: - Python: - Transformers: - PyTorch: - CUDA (`python -c 'import torch; print(torch.version.cuda)'`): render: Markdown validations: required: false - type: textarea attributes: label: 备注 | Anything else? description: | 您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。 Links? References? 
        Anything that will give us more context about the issue you are encountering!

        Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
    validations:
      required: false

================================================
FILE: .github/ISSUE_TEMPLATE/config.yaml
================================================
blank_issues_enabled: true

================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.yaml
================================================
name: "💡 Feature Request"
description: Create a new ticket for a new feature request
title: "💡 [REQUEST] - <title>"
labels: [
  "question"
]
body:
  - type: input
    id: start_date
    attributes:
      label: "Start Date"
      description: |
        Start of development
      placeholder: "month/day/year"
    validations:
      required: false
  - type: textarea
    id: implementation_pr
    attributes:
      label: "Implementation PR"
      description: |
        Pull request used
      placeholder: "#Pull Request ID"
    validations:
      required: false
  - type: textarea
    id: reference_issues
    attributes:
      label: "Reference Issues"
      description: |
        Common issues
      placeholder: "#Issues IDs"
    validations:
      required: false
  - type: textarea
    id: summary
    attributes:
      label: "Summary"
      description: |
        Provide a brief explanation of the feature
      placeholder: |
        Describe in a few lines your feature request
    validations:
      required: true
  - type: textarea
    id: basic_example
    attributes:
      label: "Basic Example"
      description: Indicate here some basic examples of your feature.
      placeholder: A few specific words about your feature request.
    validations:
      required: true
  - type: textarea
    id: drawbacks
    attributes:
      label: "Drawbacks"
      description: |
        What are the drawbacks/impacts of your feature request?
      placeholder: |
        Identify the drawbacks and impacts while being neutral on your feature request
    validations:
      required: true
  - type: textarea
    id: unresolved_question
    attributes:
      label: "Unresolved questions"
      description: |
        What questions still remain unresolved?
      placeholder: |
        Identify any unresolved issues.
    validations:
      required: false

================================================
FILE: .gitignore
================================================
__pycache__
*.so
build
.coverage_*
*.egg-info
*~
.vscode/
.idea/
.DS_Store
/private/
Qwen-VL-Chat/
Qwen-VL-Chat-Int4/
SimSun.ttf

================================================
FILE: BUILD.md
================================================
## qwen web demo

### build

```
docker build -t qwen-vl-chat:webdemo --platform linux/amd64 -f Dockerfile.qwendemo .
```

### run

```
docker run -it --gpus device=0 -d --restart always -v /var/run/docker.sock:/var/run/docker.sock --name qwen-vl-chat -p 8000:8000 --user=20001:20001 --platform linux/amd64 qwen-vl-chat:webdemo
```

## qwen openai api

### build

```
docker build -t qwen-vl-chat:openai --platform linux/amd64 -f Dockerfile.qwenopenai .
```

### run

```
docker run -it --gpus device=0 -d --restart always -v /var/run/docker.sock:/var/run/docker.sock --name qwen-vl-chat -p 8080:8080 --user=20001:20001 --platform linux/amd64 qwen-vl-chat:openai
```

## qwen-int4 openai api

### build

```
docker build -t qwen-vl-chat:int4-openai --platform linux/amd64 -f Dockerfile.qwenint4openai .
```

### run

```
docker run -it --gpus device=0 -d --restart always -v /var/run/docker.sock:/var/run/docker.sock --name qwen-vl-chat-int4 -p 8080:8080 --user=20001:20001 --platform linux/amd64 qwen-vl-chat:int4-openai
```

================================================
FILE: Dockerfile.qwendemo
================================================
# python 3.8 and above
# pytorch 1.12 and above, 2.0 and above are recommended
# CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)

# based on modelscope docker image
# registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0
# registry.cn-beijing.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0
FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0

ARG workdir=/var/app
RUN mkdir -p ${workdir}

RUN git lfs install

WORKDIR ${workdir}
COPY requirements.txt requirements_web_demo.txt ./

# Install Qwen dependencies
RUN pip install -r requirements.txt

# Install webUI dependencies
WORKDIR ${workdir}
RUN pip install -r requirements_web_demo.txt

# Offline mode, check https://huggingface.co/docs/transformers/v4.15.0/installation#offline-mode
ENV HF_DATASETS_OFFLINE=1
ENV TRANSFORMERS_OFFLINE=1

# set TZ, make logs dir, and expose port 8000
ENV TZ=Asia/Shanghai
RUN mkdir -p ${workdir}/logs && chmod 777 ${workdir}/logs
VOLUME /var/app/logs

# create user 20001
RUN useradd -r -m appuser -u 20001 -g 0

WORKDIR ${workdir}
# copy model
RUN git clone https://huggingface.co/Qwen/Qwen-VL-Chat
# COPY --chown=20001:20001 Qwen-VL-Chat ./Qwen-VL-Chat
# copy fonts
ADD --chown=20001:20001 https://github.com/StellarCN/scp_zh/raw/master/fonts/SimSun.ttf ./
# COPY --chown=20001:20001 SimSun.ttf ./
# copy main app
COPY --chown=20001:20001 web_demo_mm.py ./

EXPOSE 8000
CMD ["python3", "web_demo_mm.py", "-c", "./Qwen-VL-Chat", "--server-name", "0.0.0.0", "--server-port", "8000"]

================================================
FILE: Dockerfile.qwenint4openai
================================================
# python 3.8 and above
# pytorch 1.12 and above, 2.0 and above are recommended
# CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)

# based on modelscope docker image
# registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0
# registry.cn-beijing.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0
FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0

ARG workdir=/var/app
RUN mkdir -p ${workdir}

RUN git lfs install

WORKDIR ${workdir}
COPY requirements.txt requirements_web_demo.txt ./

# Install Qwen dependencies
RUN pip install -r requirements.txt

# Install webUI dependencies
WORKDIR ${workdir}
RUN pip install -r requirements_web_demo.txt

# Offline mode, check https://huggingface.co/docs/transformers/v4.15.0/installation#offline-mode
ENV HF_DATASETS_OFFLINE=1
ENV TRANSFORMERS_OFFLINE=1

# set TZ, make logs dir, and expose port 8080
ENV TZ=Asia/Shanghai
RUN mkdir -p ${workdir}/logs && chmod 777 ${workdir}/logs
VOLUME /var/app/logs

# create user 20001
RUN useradd -r -m appuser -u 20001 -g 0

WORKDIR ${workdir}
# copy model
RUN git clone https://huggingface.co/Qwen/Qwen-VL-Chat-Int4
# COPY --chown=20001:20001 Qwen-VL-Chat-Int4 ./Qwen-VL-Chat-Int4

# Install AutoGPTQ
RUN pip install optimum
# RUN git clone https://github.com/JustinLin610/AutoGPTQ.git && \
#     cd AutoGPTQ && \
#     pip install -v .
RUN pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/

# Install OpenAI API dependencies
WORKDIR ${workdir}
COPY requirements_openai_api.txt ./
RUN pip install -r requirements_openai_api.txt
# copy fonts
ADD --chown=20001:20001 https://github.com/StellarCN/scp_zh/raw/master/fonts/SimSun.ttf ./
# COPY --chown=20001:20001 SimSun.ttf ./
# copy main app
COPY --chown=20001:20001 openai_api.py ./

EXPOSE 8080
# CMD ["python3", "openai_api.py", "-c", "./Qwen-VL-Chat", "--server-name", "0.0.0.0", "--server-port", "8080"]
CMD ["python3", "openai_api.py", "-c", "./Qwen-VL-Chat-Int4", "--server-name", "0.0.0.0", "--server-port", "8080"]

================================================
FILE: Dockerfile.qwenopenai
================================================
# python 3.8 and above
# pytorch 1.12 and above, 2.0 and above are recommended
# CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)

# based on modelscope docker image
# registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0
# registry.cn-beijing.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0
FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0

ARG workdir=/var/app
RUN mkdir -p ${workdir}

RUN git lfs install

WORKDIR ${workdir}
COPY requirements.txt requirements_web_demo.txt ./

# Install Qwen dependencies
RUN pip install -r requirements.txt

# Install webUI dependencies
WORKDIR ${workdir}
RUN pip install -r requirements_web_demo.txt

# Offline mode, check https://huggingface.co/docs/transformers/v4.15.0/installation#offline-mode
ENV HF_DATASETS_OFFLINE=1
ENV TRANSFORMERS_OFFLINE=1

# set TZ, make logs dir, and expose port 8080
ENV TZ=Asia/Shanghai
RUN mkdir -p ${workdir}/logs && chmod 777 ${workdir}/logs
VOLUME /var/app/logs

# create user 20001
RUN useradd -r -m appuser -u 20001 -g 0

WORKDIR ${workdir}
# copy model
RUN git clone https://huggingface.co/Qwen/Qwen-VL-Chat
# COPY --chown=20001:20001 Qwen-VL-Chat ./Qwen-VL-Chat

# Install OpenAI API dependencies
WORKDIR ${workdir}
COPY requirements_openai_api.txt ./
RUN pip install -r requirements_openai_api.txt
# copy fonts
ADD --chown=20001:20001 https://github.com/StellarCN/scp_zh/raw/master/fonts/SimSun.ttf ./
# COPY --chown=20001:20001 SimSun.ttf ./
# copy main app
COPY --chown=20001:20001 openai_api.py ./

EXPOSE 8080
CMD ["python3", "openai_api.py", "-c", "./Qwen-VL-Chat", "--server-name", "0.0.0.0", "--server-port", "8080"]

================================================
FILE: FAQ.md
================================================
# FAQ

## Installation & Environment

#### Which version of transformers should I use?

4.31.0 is preferred.

#### I downloaded the code and checkpoints but I can't load the model locally. What should I do?

Please check that you have updated the code to the latest version and correctly downloaded all the sharded checkpoint files.

#### `qwen.tiktoken` is not found. What is it?

This is the merge file of the tokenizer. You have to download it. Note that if you just git clone the repo without [git-lfs](https://git-lfs.com), you cannot download this file.

#### transformers_stream_generator/tiktoken/accelerate not found

Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt).

## Demo & Inference

#### Is there any demo?

Yes, see `web_demo_mm.py` for the web demo. See the README for more information.

#### Does Qwen-VL support streaming?

No. We do not support streaming yet.

#### It seems that the generation is not related to the instruction...

Please check that you are loading Qwen-VL-Chat instead of Qwen-VL. Qwen-VL is the base model without alignment, which behaves differently from the SFT/Chat model.

#### Is quantization supported?

No.
We will support quantization as soon as possible.

#### Unsatisfactory performance in processing long sequences

Please ensure that NTK is applied. `use_dynamic_ntk` and `use_logn_attn` in `config.json` should be set to `true` (`true` by default).

## Tokenizer

#### bos_id/eos_id/pad_id not found

In our training, we only use `<|endoftext|>` as the separator and padding token. You can set bos_id, eos_id, and pad_id to tokenizer.eod_id. Learn more about our tokenizer from our documents about the tokenizer.

================================================
FILE: FAQ_ja.md
================================================
# FAQ

## Installation and Environment

#### Which version of transformers should I use?

4.31.0 is preferred.

#### I downloaded the code and checkpoints, but I cannot load the model locally. What should I do?

Please check that you have updated the code to the latest version and correctly downloaded all the sharded checkpoint files.

#### `qwen.tiktoken` is not found. What is it?

This is the merge file of the tokenizer. You have to download it. Note that if you only git clone the repository without [git-lfs](https://git-lfs.com), this file cannot be downloaded.

#### transformers_stream_generator/tiktoken/accelerate not found

Run the command `pip install -r requirements.txt`. The file is available at [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt).

## Demo and Inference

#### Is there a demo?

See `web_demo_mm.py` for the web demo. See the README for details.

#### Does Qwen-VL support streaming?

No, it is not supported yet.

#### The generated output seems unrelated to the instruction...

Please check that you are loading Qwen-VL-Chat rather than Qwen-VL. Qwen-VL is the base model without alignment, and it behaves differently from the SFT/Chat model.

#### Is quantization supported?

No. We intend to support quantization soon.

#### Unsatisfactory performance when processing long sequences

Please make sure that NTK is applied. `use_dynamic_ntk` and `use_logn_attn` in `config.json` must be set to `true` (`true` by default).

## Tokenizer

#### bos_id/eos_id/pad_id not found

In our training, we use only `<|endoftext|>` as the separator and padding token. You can set bos_id, eos_id, and pad_id to tokenizer.eod_id. For more about our tokenizer, see our tokenizer documentation.

================================================
FILE: FAQ_ko.md
================================================
# FAQ

## Installation and Environment

#### Which version of transformers should I use?

Version 4.31.0 is preferred.

#### I downloaded the code and checkpoints but cannot load the model locally. What should I do?

Please check that you have updated the code to the latest version and correctly downloaded all the sharded checkpoint files.

#### `qwen.tiktoken` cannot be found. What is it?

This is the merge file of the tokenizer. You have to download it. If you simply cloned the git repository without [git-lfs](https://git-lfs.com), this file cannot be downloaded.

#### transformers_stream_generator/tiktoken/accelerate not found error

Run the command `pip install -r requirements.txt`. The file can be found at [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt).

## Demo & Inference

#### Is there a demo?

Yes, see `web_demo_mm.py` for the web demo. More information is available in the README.

#### Does Qwen-VL support streaming?

No. Streaming is not supported yet.

#### The generated content seems unrelated to the instruction.

Please check that you are loading Qwen-VL-Chat rather than Qwen-VL. Unlike the SFT/Chat model, Qwen-VL is a base model without alignment, so it behaves differently.

#### Is quantization supported?

No. We plan to support quantization as soon as possible.

#### Unsatisfactory performance when processing long sequences

Please check that NTK is applied. `use_dynamic_ntk` and `use_logn_attn` in `config.json` should be set to `true` (`true` is the default).

## Tokenizer

#### bos_id/eos_id/pad_id not found error

In our training, we use only `<|endoftext|>` as the separator and padding token. You can set bos_id, eos_id, and pad_id to tokenizer.eod_id. Learn more about the tokenizer in our tokenizer documentation.
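
The long-sequence and tokenizer answers above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's own code: the helper names `long_context_flags_enabled` and `special_token_ids`, the dictionary keys they return, and the stand-in tokenizer object are all hypothetical, and the config key is assumed to be spelled `use_dynamic_ntk`. The id value used below is a placeholder, not the real id of `<|endoftext|>`.

```python
import json
from types import SimpleNamespace


def long_context_flags_enabled(config_text: str) -> bool:
    """Return True only if both long-sequence flags are true in a config.json string."""
    config = json.loads(config_text)
    return bool(config.get("use_dynamic_ntk")) and bool(config.get("use_logn_attn"))


def special_token_ids(tokenizer) -> dict:
    """Point bos/eos/pad ids at the tokenizer's eod_id, as the FAQ suggests."""
    eod = tokenizer.eod_id
    return {"bos_token_id": eod, "eos_token_id": eod, "pad_token_id": eod}


# Stand-in objects so the sketch runs without downloading a checkpoint.
# The eod_id value 0 is a placeholder, not the real id of <|endoftext|>.
fake_tokenizer = SimpleNamespace(eod_id=0)
sample_config = '{"use_dynamic_ntk": true, "use_logn_attn": true}'

ids = special_token_ids(fake_tokenizer)
flags_ok = long_context_flags_enabled(sample_config)
print(ids, flags_ok)
```

With a real checkpoint, the same two helpers would presumably be applied to the loaded tokenizer object and to the contents of the checkpoint's `config.json`; whether the resulting ids are passed to generation or stored on the tokenizer is left to the caller.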
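================================================
FILE: FAQ_zh.md
================================================
# FAQ

## Installation & Environment

#### Which version of transformers should I use?

4.31.0 is recommended.

#### I downloaded the model and code locally, but I cannot get them to work by following the tutorial. What should I do?

Don't worry. First check whether your code is updated to the latest version, then confirm that you have downloaded the complete model checkpoint.

#### The file `qwen.tiktoken` cannot be found. What should I do?

This is the merge file of our tokenizer. You must download it to use our tokenizer. Note that if you use git clone without git-lfs, this file will not be downloaded. If you are not familiar with git-lfs, see the [official website](https://git-lfs.com/).

#### The libraries transformers_stream_generator/tiktoken/accelerate are reported as not found. What should I do?

Run the following command: `pip install -r requirements.txt`. The dependencies are listed at [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt).

## Demo & Inference

#### Is a demo provided?

`web_demo_mm.py` provides a Web UI. See the README for more information.

#### Does Qwen-VL support streaming inference?

Qwen-VL does not currently support streaming inference.

#### The model's output seems unrelated to the input, does not follow the instruction, or seems dull

Please check that you are loading the Qwen-VL-Chat model for inference. Qwen-VL is the pretrained base model without alignment and is not expected to follow user instructions. In the latest version of the model, we added a check inside the `chat` interface to prevent the pretrained model from being used as an SFT/Chat model by mistake.

#### Is there a quantized version of the model?

Qwen-VL does not currently support quantization. We will support an efficient quantized inference implementation later.

#### Results are poor when processing long sequences

Please confirm that NTK is enabled. To enable these techniques, set `use_dynamic_ntk` and `use_logn_attn` in `config.json` to `true`. The latest code defaults to `true`.

## Tokenizer

#### The token ids bos_id/eos_id/pad_id do not exist. Why?

During training, we use only the `<|endoftext|>` token as the separator between samples/documents and as the padding placeholder. You can point bos_id, eos_id, and pad_id to tokenizer.eod_id. Please read our tokenizer documentation to learn how to set these ids.

================================================
FILE: LICENSE
================================================
Tongyi Qianwen LICENSE AGREEMENT Tongyi Qianwen Release Date: August 23, 2023 By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately. 1. Definitions a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement. b. "We"(or "Us") shall mean Alibaba Cloud. c. 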
"You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. e. "Tongyi Qianwen" shall mean the large language models (including Qwen-VL model and Qwen-VL-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. 2. Grant of Rights You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials. 3. Redistribution You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement; b. You shall cause any modified files to carry prominent notices stating that You changed the files; c. 
You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement. 4. Restrictions If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization. 5. Rules of use a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials. b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof). 6. Intellectual Property a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications. b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials. c. 
If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought. 7. Disclaimer of Warranty and Limitation of Liability a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto. b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM. c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED. d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials. 8. Survival and Termination. a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement. 9. Governing Law and Jurisdiction. a. 
This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement. ================================================ FILE: NOTICE ================================================ ------------- LICENSE FOR NVIDIA Megatron-LM code -------------- Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ------------- LICENSE FOR OpenAI tiktoken code -------------- MIT License Copyright (c) 2022 OpenAI, Shantanu Jain Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
================================================
FILE: README.md
================================================
<a href="README_CN.md">中文</a> ｜ English ｜ <a href="README_JA.md">日本語</a> ｜ <a href="README_KO.md">한국어</a>

<img src="assets/logo.jpg" width="400"/>

Qwen-VL <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖</a> ｜ Qwen-VL-Chat <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖</a> (Int4: <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary">🤖</a> ) ｜ Qwen-VL-Plus <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">🤖</a> ｜ Qwen-VL-Max <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>

<a href="https://tongyi.aliyun.com/qianwen">Web</a> | <a href="http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg">APP</a> | <a href="https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start">API</a> | <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a href="https://discord.gg/CV4E9rpNSD">Discord</a> | <a href="https://arxiv.org/abs/2308.12966">Paper</a> | <a href="TUTORIAL.md">Tutorial</a>

---

## Qwen-VL-Plus & Qwen-VL-Max

Qwen-VL-Plus and Qwen-VL-Max are the upgraded and latest versions of the Qwen-VL model family. They are currently available for free through <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a>, <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>, [Web pages](https://qianwen.aliyun.com), [APP](http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg) and [APIs](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start/).

| Model name | Model description |
| --- | --- |
| Qwen-VL-Plus | Qwen's **Enhanced Large Visual Language Model**. Its detail recognition and text recognition capabilities are significantly upgraded, and it supports image inputs with resolutions of millions of pixels and extreme aspect ratios. It delivers **strong** performance across a broad range of visual tasks. |
| Qwen-VL-Max | Qwen's **Most Capable Large Visual Language Model**. Compared to the enhanced version, it further improves visual reasoning and instruction following, offering a higher level of visual perception and cognitive understanding. It delivers **optimal** performance on an even broader range of complex tasks. |

The key technical advancements in these versions include:

- A substantial boost in image-related **reasoning capabilities**;
- Considerable enhancement in recognizing, extracting, and analyzing **details of images**, especially for text-oriented tasks;
- Support for **high-definition images** with resolutions above one million pixels and extreme aspect ratios.

These two models not only significantly surpass all previous best results from open-source LVLMs, but also perform on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks. Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google on Chinese question answering and Chinese text comprehension tasks. This breakthrough underscores the model's advanced capabilities and its potential to set new standards in the field of multimodal AI research and application.

<table>
<thead>
  <tr>
    <th>Model</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>TextVQA</th> <th>MMMU</th> <th>MathVista</th> <th>MM-Bench-CN</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td>Other Best Open-source LVLM</td> <td>81.6% (CogAgent)</td> <td>68.4% (CogAgent)</td> <td>73.7% (Fuyu-Medium)</td> <td>76.1% (CogAgent)</td> <td>45.9% (Yi-VL-34B)</td> <td>36.7% (SPHINX-V2)</td> <td>72.4% (InternLM-XComposer-VL)</td>
  </tr>
  <tr>
    <td>Gemini Pro</td> <td>88.1%</td> <td>74.1%</td> <td>73.9%</td> <td>74.6%</td> <td>47.9%</td> <td>45.2%</td> <td>74.3%</td>
  </tr>
  <tr>
    <td>Gemini Ultra</td> <td>90.9%</td> <td>80.8% <sup>1</sup></td> <td>79.5% <sup>1</sup></td> <td>82.3% <sup>1</sup></td> <td>59.4% <sup>1</sup></td> <td>53.0% <sup>1</sup></td> <td>-</td>
  </tr>
  <tr>
    <td>GPT-4V</td> <td>88.4%</td> <td>78.5%</td> <td>78.2%</td> <td>78.0%</td> <td>56.8%</td> <td>49.9%</td> <td>73.9%</td>
  </tr>
  <tr>
    <td>Qwen-VL-Plus</td> <td>91.4%</td> <td>78.1%</td> <td>75.9%</td> <td>78.9%</td> <td>45.2%</td> <td>43.3%</td> <td>68.0%</td>
  </tr>
  <tr>
    <td>Qwen-VL-Max</td> <td>93.1% <sup>1</sup></td> <td>79.8% <sup>2</sup></td> <td>79.3% <sup>2</sup></td> <td>79.5% <sup>2</sup></td> <td>51.4% <sup>3</sup></td> <td>51.0% <sup>2</sup></td> <td>75.1% <sup>1</sup></td>
  </tr>
</tbody>
</table>

All numbers are obtained without any use of external OCR tools ('pixel only').

---

## News and Updates

* ```2024.01.18``` 💥💥💥 We introduce Qwen-VL-Max, our most capable model that significantly surpasses all previous open-source LVLM models, and it performs on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks. You can enjoy the new model by directly visiting our [web pages](https://qianwen.aliyun.com), <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a> and <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>.
* ```2023.11.28``` 🏆🏆🏆 Qwen-VL-Plus achieved the best performance in [DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) by using a single model, surpassing GPT4V and PALI-X, without using model ensemble or OCR-pipeline.
Meanwhile, it is also a general model that can help you analyze and understand various tasks by directly inputting images.
* ```2023.9.25``` 🚀🚀🚀 We update Qwen-VL-Chat with more robust Chinese instruction-following ability, improved understanding of web pages and table images, and better dialogue performance (Touchstone: CN: 401.2->481.7, EN: 645.2->711.6).
* ```2023.9.12``` 😃😃😃 We now support finetuning on the Qwen-VL models, including full-parameter finetuning, LoRA and Q-LoRA.
* ```2023.9.8``` 👍👍👍 Thanks to [camenduru](https://github.com/camenduru) for contributing the wonderful [Colab](https://github.com/camenduru/Qwen-VL-Chat-colab). Everyone can use it as a local or online Qwen-VL-Chat-Int4 Demo tutorial on one 12G GPU.
* ```2023.9.5``` 👏👏👏 Qwen-VL-Chat achieves SOTAs on [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks.
* ```2023.9.4``` ⭐⭐⭐ Qwen-VL series achieve SOTAs on [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs including both image and video understanding.
* ```2023.9.1``` 🔥🔥🔥 We release the [TouchStone](https://github.com/OFA-Sys/TouchStone) Evaluation, which is a comprehensive assessment of multimodal language models, encompassing not only basic recognition and comprehension but also extending to literary creation. It uses strong LLMs as judges and converts multimodal information into text.
* ```2023.8.31``` 🌟🌟🌟 We release the Int4 quantized model for Qwen-VL-Chat, **Qwen-VL-Chat-Int4**, which requires less memory and achieves faster inference, with no significant performance degradation on the benchmark evaluation.
* ```2023.8.22``` 🎉🎉🎉 We release both **Qwen-VL** and **Qwen-VL-Chat** on ModelScope and Hugging Face. We also provide a [paper](https://arxiv.org/abs/2308.12966) for more details about the model, including training details and model performance.

---

## Qwen-VL

**Qwen-VL** (Qwen Large Vision Language Model) is the multimodal version of the large model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and outputs text and bounding boxes. The features of Qwen-VL include:

- **Strong performance**: It significantly surpasses existing open-sourced Large Vision Language Models (LVLM) under a similar model scale on multiple English evaluation benchmarks (including Zero-shot Captioning, VQA, DocVQA, and Grounding).
- **Multi-lingual LVLM supporting text recognition**: Qwen-VL naturally supports English, Chinese, and multi-lingual conversation, and it enables end-to-end recognition of Chinese and English bi-lingual text in images.
- **Multi-image interleaved conversations**: This feature allows for the input and comparison of multiple images, as well as the ability to specify questions related to the images and engage in multi-image storytelling.
- **First generalist model supporting grounding in Chinese**: Detecting bounding boxes through open-domain language expression in both Chinese and English.
- **Fine-grained recognition and understanding**: Compared to the 224\*224 resolution currently used by other open-sourced LVLM, the 448\*448 resolution promotes fine-grained text recognition, document QA, and bounding box annotation.

<img src="assets/demo_vl.gif" width="400"/>

We release two models of the Qwen-VL series:

- Qwen-VL: The pre-trained LVLM model uses Qwen-7B as the initialization of the LLM and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) as the initialization of the visual encoder, and connects them with a randomly initialized cross-attention layer.
- Qwen-VL-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities.

## Evaluation

We evaluated the model's abilities from three perspectives:

1. **Standard Benchmarks**: We evaluate the model's basic task capabilities on four major categories of multimodal tasks:
   - Zero-shot Captioning: Evaluate the model's zero-shot image captioning ability on unseen datasets;
   - General VQA: Evaluate general question answering about pictures, such as judgment, color, number, and category;
   - Text-based VQA: Evaluate the model's ability to recognize text in pictures, such as document QA and chart QA;
   - Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.
2. **TouchStone**: To evaluate the overall text-image dialogue capability and alignment level with humans, we have constructed a benchmark called [TouchStone](https://github.com/OFA-Sys/TouchStone), which is based on scoring with GPT4 to evaluate the LVLM model.
   - The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and math problem solving;
   - In order to break the current limitation of GPT4 in terms of direct image input, TouchStone provides fine-grained image annotations by human labeling. These detailed annotations, along with the questions and the model's output, are then presented to GPT4 for scoring.
   - The benchmark includes both English and Chinese versions.
3. **Other Multimodal Benchmarks**: We also evaluated our model's capabilities in other multimodal benchmarks:
   - [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), a comprehensive evaluation benchmark for multimodal large language models. Qwen-VL-Chat achieves SOTAs on both perception and cognition tracks.
   - [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs. Qwen series achieves SOTAs on this benchmark.

The results of the evaluation are as follows:

Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and has a more comprehensive coverage in terms of capability range.

<img src="assets/radar.png" width="600"/>

### Zero-shot Captioning & General VQA

<table>
<thead>
<tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr>
<tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2<sup>dev</sup></th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr>
<tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td> <td>31.6</td> </tr>
<tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr>
<tr> <td>Kosmos-2</td> <td>-</td> <td>80.5</td> <td>51.1</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr>
<tr> <td>InstructBLIP
(Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr>
<tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>
<tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td> <td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>
<tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP-FlanT5-XL)</td> <td>86.1 (PALI-X-55B)</td> <td>66.1 (PALI-X-55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+GPT-4)</td> <td>70.9 (PALI-X-55B)</td> </tr>
</tbody>
</table>

- For zero-shot image captioning, Qwen-VL achieves the **SOTA** on Flickr30K and results on NoCaps that are competitive with InstructBLIP.
- For general VQA, Qwen-VL achieves the **SOTA** under the same generalist LVLM scale settings.
### Text-oriented VQA (Focused on text understanding capabilities in images)

<table>
<thead>
<tr> <th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="5">Generalist Models</td> <td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td> <td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td> </tr>
<tr> <td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td> </tr>
<tr> <td>Specialist SOTAs (Specialist/Finetuned)</td> <td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td> </tr>
</tbody>
</table>

- In text-related recognition/QA evaluation, Qwen-VL achieves the SOTA under the generalist LVLM scale settings.
- Resolution is important for several of the above evaluations. While most open-sourced LVLM models with 224 resolution are incapable of these evaluations or can only solve them by cutting images, Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. Qwen-VL even outperforms Pix2Struct-Large models of 1024 resolution on some tasks.
### Referring Expression Comprehension

<table>
<thead>
<tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr>
<tr> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr>
<tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr>
<tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr>
<tr> <td>VisionLLM-H</td> <td></td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24</td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr>
<tr> <td>Shikra-13B</td> <td>87.83</td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td> <td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td> </tr>
<tr> <td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td> </tr>
<tr> <td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td> </tr>
<tr> <td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td> <td>G-DINO-L</td> <td>90.56</td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td> </tr>
<tr> <td>UNINEXT-H</td> <td>92.64</td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td>
<td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td> </tr>
<tr> <td>ONE-PEACE</td> <td>92.58</td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td> </tr>
</tbody>
</table>

- Among the generalist models, Qwen-VL achieves the **SOTA** on all of the above RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
- Qwen-VL has not been trained on any Chinese grounding data, but it can still generalize to Chinese grounding tasks in a zero-shot way by training on Chinese caption data and English grounding data.

We provide all of the above evaluation scripts for reproducing our experimental results. Please read [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) for more information.

### Chat evaluation

TouchStone is a benchmark based on scoring with GPT4 to evaluate the abilities of the LVLM model on text-image dialogue and alignment levels with humans. It covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, math problem solving, etc. Please read [touchstone/README.md](touchstone/README.md) for more information.

#### English evaluation

| Model            | Score |
| ---------------- | ----- |
| PandaGPT         | 488.5 |
| MiniGPT4         | 531.7 |
| InstructBLIP     | 552.4 |
| LLaMA-AdapterV2  | 590.1 |
| LLaVA            | 602.7 |
| mPLUG-Owl        | 605.4 |
| Qwen-VL-Chat     | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |

#### Chinese evaluation

| Model            | Score |
| ---------------- | ----- |
| VisualGLM        | 247.1 |
| Qwen-VL-Chat     | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |

Qwen-VL-Chat has achieved the best results in both Chinese and English alignment evaluation.

### Other Benchmarks

#### MME Benchmark

[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models.
It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Qwen-VL-Chat achieves SOTAs on both perception and cognition evaluation. See more details [HERE](eval_mm/mme/EVAL_MME.md).

<img src="eval_mm/mme/perception.jpg" width="600"/>
<img src="eval_mm/mme/cognition.jpg" width="600"/>

#### SEED-Bench

[SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both **image** and **video** understanding. See more details [HERE](eval_mm/seed_bench/EVAL_SEED.md).

Qwen-VL and Qwen-VL-Chat achieve SOTAs on this benchmark.

<img src="eval_mm/seed_bench/leaderboard.jpg"/>

## Requirements

* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above are recommended
* CUDA 11.4 and above are recommended (this is for GPU users)

## Quickstart

Below, we provide simple examples to show how to use Qwen-VL and Qwen-VL-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment, meet the above requirements, and have installed the dependent libraries:

```bash
pip install -r requirements.txt
```

Now you can start with ModelScope or Transformers. For more on using the vision encoder, please refer to the [tutorial](TUTORIAL.md).

#### 🤗 Transformers

To use Qwen-VL-Chat for inference, all you need to do is enter a few lines of code, as demonstrated below.
However, **please make sure that you are using the latest code.**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
    {'text': '这是什么?'},  # "What is this?"
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名女子在沙滩上和狗玩耍，旁边是一只拉布拉多犬，它们处于沙滩上。
# (The picture shows a woman playing with a dog on the beach; next to her is a Labrador, and they are on the beach.)

# 2nd dialogue turn
# The prompt means "Draw a box around the high-five in the image"
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# <ref>击掌</ref><box>(536,509),(588,602)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('1.jpg')
else:
    print("no box")
```

<img src="assets/demo_highfive.jpg" width="500"/>

<details>
<summary>Running Qwen-VL</summary>

Running the Qwen-VL pretrained base model is also simple.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
    {'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)
# <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|>
image = tokenizer.draw_bbox_on_latest_picture(response)
if image:
    image.save('2.jpg')
else:
    print("no box")
```

<img src="assets/demo_spotting_caption.jpg" width="500"/>

</details>

If a network issue prevents you from downloading model checkpoints and code from HuggingFace, an alternative approach is to first fetch the
checkpoint from ModelScope and then load it from the local directory as outlined below:

```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-VL')
model_dir = snapshot_download('qwen/Qwen-VL-Chat')

# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
```

#### 🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:

```python
from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch
model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir
# Load the model once; uncomment exactly one of the alternatives below if needed.
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use auto
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>=4.32.0)
# model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)
# 1st dialogue turn
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
# The prompt after the image tag means "What is this?"
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，与人互动。
# (The picture shows a young woman playing with her dog on the beach; the dog is a Labrador. They are sitting on the beach, and the dog raises its front legs to interact with her.)

# 2nd dialogue turn
# The prompt means "Output the bounding box of the high-five"
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('output_chat.jpg')
else:
    print("no box")
```

<img src="assets/demo_highfive.jpg" width="500"/>

## Quantization

### Usage

We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-VL-Chat, [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4), which is nearly lossless in model quality while reducing memory costs and improving inference speed.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

Then you can load the quantized model easily and run inference the same way as usual:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
```

### Performance

We illustrate the model performance of both BF16 and Int4 models on the benchmark **[TouchStone](https://github.com/OFA-Sys/TouchStone)**, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:

| Quantization | ZH         | EN            |
| ------------ | :--------: | :-----------: |
| BF16         | 401.2      | 645.2         |
| Int4         | 386.6      | 651.4         |

### Inference Speed

We measured the average inference speed (tokens/s) of generating 1792 (2048-258) and 7934 (8192-258) tokens with the context of an image (which takes 258 tokens) under BF16 precision and Int4 quantization, respectively.

| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16         | 28.87               | 24.32               |
| Int4         | 37.79               | 34.34               |

The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.

### GPU Memory Usage

We also profile the peak GPU memory usage for encoding 1792 (2048-258) tokens (including an image) as context (and generating a single token) and generating 7934 (8192-258) tokens (with an image as context) under BF16 or Int4 quantization level, respectively. The results are shown below.

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16         | 22.60GB                             | 28.01GB                               |
| Int4         | 11.82GB                             | 17.23GB                               |

The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py).

## Finetuning

Now we provide the official training script, `finetune.py`, for users to finetune the pretrained model for downstream applications in a simple fashion.
Additionally, we provide shell scripts to launch finetuning with no worries. The training script supports training with DeepSpeed and FSDP. The shell scripts that we provide use DeepSpeed, and thus we advise you to install DeepSpeed before you start:

```bash
pip install deepspeed
```

### Data preparation

To prepare your training data, you need to put all the samples into a list and save it to a JSON file. Each sample is a dictionary consisting of an id and a list for conversation. Below is a simple example list with 3 samples:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是Qwen-VL,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  {
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
```

For the VL tasks, there are special tokens that are used, including `<img> </img> <ref> </ref> <box> </box>`.

The picture is represented as `Picture id: <img>img_path</img>\n{your prompt}`, where `id` indicates the position of the image in the conversation, starting from 1. The `img_path` can be a local file path or a web link.

The coordinate box is expressed as `<box>(x1,y1),(x2,y2)</box>`, where `(x1, y1)` and `(x2, y2)` are normalized values in the range `[0, 1000)`. Its corresponding text description can be identified by `<ref>text_caption</ref>`.

After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.
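
As a rough illustration of the data format described above, the sketch below builds one grounding sample and serializes it to JSON. The helper names (`to_box_tag`, `make_sample`), the image path `assets/example.jpeg`, the image size, and the scaling rule (pixel coordinate × 1000 / image size, floored and clamped to 999) are assumptions made for this example; the README only states that box coordinates are normalized to `[0, 1000)`.

```python
import json


def to_box_tag(caption, box, width, height):
    """Format a pixel-space box as <ref>caption</ref><box>(x1,y1),(x2,y2)</box>.

    Assumption: each coordinate is scaled by the image size to an integer in [0, 1000).
    """
    x1, y1, x2, y2 = box

    def scale(value, size):
        # Floor after scaling, then clamp so the result stays below 1000.
        return min(int(value * 1000 / size), 999)

    return (
        f"<ref>{caption}</ref>"
        f"<box>({scale(x1, width)},{scale(y1, height)}),"
        f"({scale(x2, width)},{scale(y2, height)})</box>"
    )


def make_sample(sample_id, image_path, question, answer):
    """Build one single-image, single-turn sample in the conversation format."""
    return {
        "id": sample_id,
        "conversations": [
            {"from": "user", "value": f"Picture 1: <img>{image_path}</img>\n{question}"},
            {"from": "assistant", "value": answer},
        ],
    }


# Hypothetical 1280x720 image with a box given in pixels.
answer = to_box_tag("plaid shirt", (752, 359, 928, 568), width=1280, height=720)
samples = [
    make_sample("identity_0", "assets/example.jpeg", "Box the plaid shirt in the image", answer)
]

# ensure_ascii=False keeps any Chinese text readable in the output.
payload = json.dumps(samples, ensure_ascii=False, indent=2)
print(payload)
```

Writing `payload` to a file and pointing `$DATA` at that file would then follow the workflow described in this section.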
The finetuning scripts allow you to perform:
- Full-parameter finetuning
- LoRA
- Q-LoRA

### Full-parameter finetuning

Full-parameter finetuning requires updating all parameters of the LLM throughout the training process. In our experiments, freezing the parameters of ViT during the fine-tuning phase achieves better performance. To launch your training, run the following script:

```bash
sh finetune/finetune_ds.sh
```

Remember to specify the correct model name or path, the data path, as well as the output directory in the shell scripts. If you want to make changes, just remove the argument `--deepspeed` or make changes in the DeepSpeed configuration json file based on your requirements. Additionally, this script supports mixed-precision training, and thus you can use `--bf16 True` or `--fp16 True`. Empirically, we advise you to use bf16 to make your training consistent with our pretraining and alignment if your machine supports bf16, and thus we use it by default.

### LoRA

Similarly, to run LoRA, use another script as shown below. Before you start, make sure that you have installed `peft`. Also, you need to specify your paths to your model, data, and output. We advise you to use an absolute path for your pretrained model. This is because LoRA only saves the adapter, and the absolute path in the adapter configuration json file is used to find the pretrained model to load.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of adapter layers but keeps the original large language model layers frozen. This significantly reduces memory costs and thus computation costs.
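
To make the saving concrete, here is a rough sketch of the parameter arithmetic behind LoRA: a frozen dense weight of shape `d_out × d_in` receives a trainable low-rank update built from two small matrices, which together hold `rank × (d_in + d_out)` parameters. The layer size (4096 × 4096) and the rank (64) are illustrative assumptions, not values read from the finetuning scripts.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * d_in + d_out * rank


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters of the dense weight that LoRA keeps frozen."""
    return d_in * d_out


# Illustrative sizes only (assumed, not taken from the model configuration).
d_in, d_out, rank = 4096, 4096, 64
adapter = lora_trainable_params(d_in, d_out, rank)
dense = dense_params(d_in, d_out)
print(f"adapter: {adapter:,} trainable params vs dense: {dense:,} frozen params")
print(f"trainable fraction for this layer: {adapter / dense:.4f}")
```

Only the adapter parameters need gradients and optimizer states, which is where most of the memory reduction comes from.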
Note that if you use LoRA to finetune the base language model, e.g., Qwen-VL, instead of chat models, e.g., Qwen-VL-Chat, the script automatically switches the embedding and output layer to trainable parameters. This is because the base language model has no knowledge of the special tokens brought by the ChatML format. Thus these layers should be updated for the model to understand and predict the tokens. In other words, if your training brings in special tokens in LoRA, you should set the layers to trainable parameters by setting `modules_to_save` inside the code. Additionally, we find that there is a significant gap between the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA finetune the chat models. Check the profile below for more information.

### Q-LoRA

However, if you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses the quantized large language model and other techniques such as paged optimizers to reduce memory costs even further. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-VL-Chat-Int4. You **SHOULD NOT** use the bf16 models. Different from full-parameter finetuning and LoRA, only fp16 is supported for Q-LoRA. Besides, for Q-LoRA, the troubles with the special tokens in LoRA still exist. However, as we only provide the Int4 models for chat models, which means the language model has learned the special tokens of the ChatML format, you do not need to worry about these layers. Note that the layers of the Int4 model should not be trainable, and thus if you introduce special tokens in your training, Q-LoRA might not work.
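
The rule about trainable embedding and output layers can be sketched as a small helper that assembles keyword arguments intended for `peft.LoraConfig`. The keys (`r`, `lora_alpha`, `lora_dropout`, `target_modules`, `modules_to_save`, `task_type`) are `LoraConfig` fields, but the helper name `build_lora_kwargs`, the hyperparameter values, and the module names (`c_attn`, `attn.c_proj`, `w1`, `w2`, `wte`, `lm_head`) are assumptions for illustration; check `finetune.py` for what the script actually uses.

```python
def build_lora_kwargs(is_chat_model: bool, adds_special_tokens: bool = False) -> dict:
    """Assemble keyword arguments intended for peft.LoraConfig(**kwargs).

    Assumption: "wte" and "lm_head" name the embedding and output layers.
    """
    # A base model has not seen the ChatML special tokens, so its embedding and
    # output layers must be trained; the same applies when new tokens are added.
    needs_full_layers = (not is_chat_model) or adds_special_tokens
    modules_to_save = ["wte", "lm_head"] if needs_full_layers else None
    return {
        "r": 64,                # illustrative rank
        "lora_alpha": 16,       # illustrative scaling factor
        "lora_dropout": 0.05,   # illustrative dropout
        "target_modules": ["c_attn", "attn.c_proj", "w1", "w2"],  # assumed names
        "modules_to_save": modules_to_save,
        "task_type": "CAUSAL_LM",
    }


chat_kwargs = build_lora_kwargs(is_chat_model=True)
base_kwargs = build_lora_kwargs(is_chat_model=False)
print("chat model:", chat_kwargs["modules_to_save"])
print("base model:", base_kwargs["modules_to_save"])
```

With `peft` installed, the result could be passed as `LoraConfig(**build_lora_kwargs(is_chat_model=True))`. For Q-LoRA on the Int4 model, `modules_to_save` would have to stay `None`, since those layers should not be trainable.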
Different from full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. You can load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

If you want to merge the adapters and save the finetuned model as a standalone model (you can only do this with LoRA, and you CANNOT merge the parameters from Q-LoRA), you can run the following code:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe_serialization are optional.
# They control checkpoint sharding and saving the model in safetensors format, respectively.
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```

Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. Besides, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed.

### Profiling of Memory and Speed

We profile the GPU memory and training speed of both LoRA and Q-LoRA in the setup of single-GPU training. Here, LoRA (Base) refers to training the embedding and output layers, while LoRA (Chat) has no trainable embedding and output layers. In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and PyTorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8. Each sample contains an image. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 384, 512, 1024, and 2048.
The statistics are listed below:

<table>
    <tr>
      <th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th>
    </tr>
    <tr>
        <th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th>
    </tr>
    <tr>
      <td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td>
    </tr>
    <tr>
      <td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G / 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td>
    </tr>
    <tr>
        <td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td>
    </tr>
</table>

## Demo

### Web UI

We provide code for users to build a web UI demo. Before you start, make sure you install the following packages:

```
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```
python web_demo_mm.py
```

## FAQ

If you run into problems, please check the [FAQ](FAQ.md) and the existing issues for a solution before opening a new issue.

## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen-VL and Qwen-VL-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details.
## Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

## Contact Us

If you would like to leave a message for either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.

================================================ FILE: README_CN.md ================================================
中文</a> ｜ <a href="README.md">English</a> ｜ <a href="README_JA.md">日本語</a> ｜ <a href="README_KO.md">한국어</a> <img src="assets/logo.jpg" width="400"/> Qwen-VL <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖</a> ｜ Qwen-VL-Chat <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖</a> (Int4: <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary">🤖</a> ) ｜ Qwen-VL-Plus <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">🤖</a> ｜ Qwen-VL-Max <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a> <a href="https://tongyi.aliyun.com/qianwen">Web</a> | <a href="http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg">APP</a> | <a href="https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start">API</a> | <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a
href="https://discord.gg/z3GAxXZ9Ce">Discord</a> | <a href="https://arxiv.org/abs/2308.12966">Paper</a> | <a href="TUTORIAL.md">Tutorial</a> --- ## Qwen-VL-Plus & Qwen-VL-Max Qwen-VL 系列再次迎来重磅升级，我们推出 Qwen-VL-Plus 和 Qwen-VL-Max 两个升级版的模型。目前支持通过<a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a>、<a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>、[网页端](https://qianwen.aliyun.com)、[APP](http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg) 和 [API](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start)免费访问。 | 模型名 | 模型简介 | | --- | --- | | Qwen-VL-Plus | 通义千问大规模视觉语言模型增强版。大幅提升细节识别能力和文字识别能力，支持超百万像素分辨率和任意长宽比规格的图像。在广泛的视觉任务上提供**卓越**的性能。 | | Qwen-VL-Max | 通义千问超大规模视觉语言模型。相比增强版，再次提升视觉推理能力和指令遵循能力，提供更高的视觉感知和认知水平。在更多复杂任务上提供**最佳**的性能。 | 这两个版本的主要技术升级在于： - 大幅提升图像相关的推理能力； - 大幅提升对图中细节和文字的识别、提取和分析能力； - 支持百万像素以上的高清分辨率图，支持各种长宽比的图像；这两个模型不仅大幅超越此前所有开源 LVLM 模型的最佳水平，并且在多项图文多模态标准测试中获得了堪比 Gemini Ultra 和 GPT4-v 的水准。甚至，Qwen-VL-Max 在中文问答、中文文字理解相关的任务上超越了 OpenAI的 GPT4-v 和 Google 的 Gemini-Pro。 <table> <thead> <tr> <th>Model</th> <th>DocVQA (文档理解)</th> <th>ChartQA (图表理解)</th> <th>AI2D (科学图例)</th> <th>TextVQA (文字阅读)</th> <th>MMMU (多学科问题)</th> <th>MathVista (数学推理)</th> <th>MM-Bench-CN (中文问答)</th> </tr> </thead> <tbody align="center"> <tr> <td>Other Best Open-source LVLM</td> <td>81.6% (CogAgent)</td> <td>68.4% (CogAgent)</td> <td>73.7% (Fuyu-Medium)</td> <td>76.1% (CogAgent)</td> <td>45.9% (Yi-VL-34B)</td> <td>36.7% (SPHINX-V2)</td> <td>72.4% (InternLM-XComposer-VL)</td> </tr> <tr> <td>Gemini Pro</td> <td>88.1%</td> <td>74.1%</td> <td>73.9%</td> <td>74.6%</td> <td>47.9%</td> <td>45.2%</td> <td>74.3%</td> </tr> <tr> <td>Gemini Ultra</td> <td>90.9%</td> <td>80.8% 1</td> <td>79.5% 1</td> <td>82.3% 1</td> <td>59.4% 1</td> <td>53.0% 1</td> <td>-</td> </tr> <tr> <td>GPT-4V</td> <td>88.4%</td> <td>78.5%</td> <td>78.2%</td> <td>78.0%</td> <td>56.8%</td> <td>49.9%</td> <td>73.9%</td> </tr> <tr> <td>Qwen-VL-Plus</td> <td>91.4%</td> 
<td>78.1%</td> <td>75.9%</td> <td>78.9%</td> <td>44.0%</td> <td>43.3%</td> <td>68.0%</td> </tr> <tr> <td>Qwen-VL-Max</td> <td>92.5% 1</td> <td>79.8% 2</td> <td>79.3% 2</td> <td>79.5% 2</td> <td>51.4% 3</td> <td>51.0% 2</td> <td>75.1% 1</td> </tr> </tbody> </table> 所有评测都是在不使用任何外部OCR工具(“only pixel”)的情况下获得的。 --- ## 新闻 * 2024年01月18日我们推出 Qwen-Vl-Max，大幅超越此前所有开源 LVLM 模型的最佳水平，并且在多项图文多模态标准测试中获得了堪比 Gemini Ultra 和 GPT4-v 的水准。直接访问[通义千问网页端或APP](https://qianwen.aliyun.com)就能体验新模型。 * 2023年11月28日 Qwen-VL单模型在[DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1)达到了最强水平，超越了GPT4V,PALI-X，与此同时它还是一个通用模型，直接输入图片就能帮你分析理解各种任务。 * 2023年9月12日更新Qwen-VL-Chat模型，该模型有更鲁棒的中文指令跟随，更好的网页和表格图片理解和问答能力以及更好的对话表现(Touchstone: 中文: 401.2->481.7, 英文: 645.2->711.6)。 * 2023年9月12日支持Qwen-VL和Qwen-VL-Chat的微调，其中包括全参数微调、LoRA以及Q-LoRA * 2023年9月8日感谢[camenduru](https://github.com/camenduru)贡献了[Colab](https://github.com/camenduru/Qwen-VL-Chat-colab)示例，每个人都可以以此为教程，在12G的GPU上做本地或在线的Demo。 * 2023年9月5日在社区多模态通用模型榜单 [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) 上取得了感知和认知双赛道的当前最好结果。 * 2023年9月4日在社区多模态通用模型榜单 [SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) 上取得了图像理解和视频理解的当前最好结果。 * 2023年9月1日发布[TouchStone](https://github.com/OFA-Sys/TouchStone) 测评, 这是一个综合评估LVLM能力的测评,它不仅考察模型的视觉描述和推理能力，还包括根据视觉内容的文学创作能力。同时它是将多模态信息用文本表述并用LLMs进行评估的方法。 * 2023年8月31日发布Qwen-VL-Chat量化模型，**Qwen-VL-Chat-Int4**,该模型显存占用低，推理速度相比半精度模型显著提升，在基准评测上效果损失较小。 * 2023年8月22日在魔搭社区（ModelScope）和Hugging Face同步推出Qwen-VL和Qwen-VL-Chat模型。同时，我们提供一个[论文](https://arxiv.org/abs/2308.12966)介绍了相关的模型结构、训练细节和模型表现。 --- ## Qwen-VL **Qwen-VL** 是阿里云研发的大规模视觉语言模型（Large Vision Language Model, LVLM）。Qwen-VL 可以以图像、文本、检测框作为输入，并以文本和检测框作为输出。Qwen-VL 系列模型的特点包括： - **强大的性能**：在四大类多模态任务的标准英文测评中（Zero-shot Captioning/VQA/DocVQA/Grounding）上，均取得同等通用模型大小下最好效果； - **多语言对话模型**：天然支持英文、中文等多语言对话，端到端支持图片里中英双语的长文本识别； - **多图交错对话**：支持多图输入和比较，指定图片问答，多图文学创作等； - **首个支持中文开放域定位的通用模型**：通过中文开放域语言表达进行检测框标注； - 
**细粒度识别和理解**：相比于目前其它开源LVLM使用的224分辨率，Qwen-VL是首个开源的448分辨率的LVLM模型。更高分辨率可以提升细粒度的文字识别、文档问答和检测框标注。

<img src="assets/demo_vl.gif" width="400"/>

目前，我们提供了 Qwen-VL 系列的两个模型：

- Qwen-VL: Qwen-VL 以 Qwen-7B 的预训练模型作为语言模型的初始化，并以 [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) 作为视觉编码器的初始化，中间加入单层随机初始化的 cross-attention，经过约1.5B的图文数据训练得到。最终图像输入分辨率为448。
- Qwen-VL-Chat: 在 Qwen-VL 的基础上，我们使用对齐机制打造了基于大语言模型的视觉AI助手Qwen-VL-Chat，它支持更灵活的交互方式，包括多图、多轮问答、创作等能力。

## 评测

我们从三个角度评测了模型的能力：

1. 在**英文标准 Benchmark** 上评测模型的基础任务能力。目前评测了四大类多模态任务：
   - Zero-shot Captioning: 评测模型在未见过数据集上的零样本图片描述能力；
   - General VQA: 评测模型的通用问答能力，例如判断题、颜色、个数、类目等问答能力；
   - Text-based VQA：评测模型对于图片中文字相关的识别/问答能力，例如文档问答、图表问答、文字问答等；
   - Referring Expression Comprehension：评测模型给定物体描述画检测框的能力；
2. **试金石 (TouchStone)**：为了评测模型整体的图文对话能力和人类对齐水平，我们为此构建了一个基于 GPT4 打分来评测 LVLM 模型的 Benchmark：TouchStone。在 TouchStone-v0.1 中：
   - 评测基准总计涵盖 300+张图片、800+道题目、27个类别。包括基础属性问答、人物地标问答、影视作品问答、视觉推理、反事实推理、诗歌创作、故事写作、商品比较、图片解题等**尽可能广泛的类别**。
   - 为了弥补目前 GPT4 无法直接读取图片的缺陷，我们给所有的待评测图片提供了**人工标注的充分详细描述**，并且将图片的详细描述、问题和模型的输出结果一起交给 GPT4 打分。
   - 评测同时包含英文版本和中文版本。
3.
**其它多模态通用模型榜单**：我们也在其它多模态通用模型榜单中评测了模型的能力： - MME Benchmark: 是一个多模态大型语言模型的综合评价基准。它在总共14个子任务上评测**感知和认知**能力，Qwen-VL-Chat在这两个总维度上都实现了当前最好结果。 - SEED-Bench: 是一个包含1.9万选择题的多模态基准测评，通过人工注释的结果评估多模态大模型，涵盖12个评估维度，包括**图像和视频理解**，Qwen-VL和Qwen-VL-chat在这个基准上实现了当前最好结果。评测结果如下： Qwen-VL在多个VL任务上相比目前SOTA的Generalist Models都有明显优势，并且在能力范围也覆盖更加全面。 <img src="assets/radar.png" width="600"/> ### 零样本图像描述生成（Zero-shot Image Caption）及通用视觉问答（General VQA） <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr> <tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2dev</th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr> <tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td> <td>31.6</td> </tr> <tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr> <tr> <td>Kosmos-2</td> <td>-</td> <td>66.7</td> <td>45.6</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr> <tr> <td>InstructBLIP (Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr> <tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>  <tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td> 
<td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>  <tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP -FlanT5-XL)</td> <td>86.1 (PALI-X -55B)</td> <td>66.1 (PALI-X -55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+ GPT-4)</td> <td>70.9 (PALI-X -55B)</td> </tr> </tbody> </table> - 在 Zero-shot Captioning 中，Qwen-VL 在 Flickr30K 数据集上取得了 **SOTA** 的结果，并在 Nocaps 数据集上取得了和 InstructBlip 可竞争的结果。 - 在 General VQA 中，Qwen-VL 取得了 LVLM 模型同等量级和设定下 **SOTA** 的结果。 ### 文本导向的视觉问答（Text-oriented VQA） <table> <thead> <tr> <th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="5">Generalist Models</td> <td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td> <td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td> </tr> <tr> <td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td> </tr> <tr> <td>Specialist SOTAs (Specialist/Finetuned)</td> <td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td> </tr> </tbody> </table> - 在文字相关的识别/问答评测上，取得了当前规模下通用 LVLM 达到的最好结果。 - 分辨率对上述某几个评测非常重要，大部分 224 分辨率的开源 LVLM 模型无法完成以上评测，或只能通过切图的方式解决。Qwen-VL 将分辨率提升到 448，可以直接以端到端的方式进行以上评测。Qwen-VL 在很多任务上甚至超过了 1024 分辨率的 Pix2Struct-Large 模型。 ### 细粒度视觉定位（Referring Expression Comprehension） <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr> <tr> <th>val</th> <th>test-A</th> <th>test-B</th> 
<th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr> <tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr> <tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr> <tr> <td>VisionLLM-H</td> <td></td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24 </td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr> <tr> <td>Shikra-13B</td> <td>87.83 </td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td> <td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td> </tr> <tr> <td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td> </tr> <tr> <td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td> </tr> <tr> <td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td> <td>G-DINO-L</td> <td>90.56 </td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td> </tr> <tr> <td>UNINEXT-H</td> <td>92.64 </td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td> <td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td> </tr> <tr> <td>ONE-PEACE</td> <td>92.58 </td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td> </tr> </tbody> </table> - 在定位任务上，Qwen-VL 全面超过 Shikra-13B，取得了目前 
Generalist LVLM 模型上在 Refcoco 上的 **SOTA**。 - Qwen-VL 并没有在任何中文定位数据上训练过，但通过中文 Caption 数据和英文 Grounding 数据的训练，可以 Zero-shot 泛化出中文 Grounding 能力。我们提供了以上**所有**评测脚本以供复现我们的实验结果。请阅读 [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) 了解更多信息。 ### 对话能力测评 TouchStone 是一个基于 GPT4 打分来评测 LVLM 模型的图文对话能力和人类对齐水平的基准。它涵盖了 300+张图片、800+道题目、27个类别，包括基础属性、人物地标、视觉推理、诗歌创作、故事写作、商品比较、图片解题等**尽可能广泛的类别**。关于 TouchStone 的详细介绍，请参考[touchstone/README_CN.md](touchstone/README_CN.md)了解更多信息。 #### 英语 | Model | Score | | ---------------- | ----- | | PandaGPT | 488.5 | | MiniGPT4 | 531.7 | | InstructBLIP | 552.4 | | LLaMA-AdapterV2 | 590.1 | | LLaVA | 602.7 | | mPLUG-Owl | 605.4 | | Qwen-VL-Chat | 645.2 | | Qwen-VL-Chat-1.1 | 711.6 | #### 中文 | Model | Score | | ---------------- | ----- | | VisualGLM | 247.1 | | Qwen-VL-Chat | 401.2 | | Qwen-VL-Chat-1.1 | 481.7 | Qwen-VL-Chat 模型在中英文的对齐评测中均取得当前 LVLM 模型下的最好结果。 ### 其它榜单测评 #### MME Benchmark MME是多模态大型语言模型的综合评价基准。它在总共14个子任务上评测**感知和认知**能力。Qwen-VL-Chat在这个基准上实现了SOTAs。完整复现[见此](eval_mm/mme/EVAL_MME.md). 
<img src="eval_mm/mme/perception.jpg" width="600"/> <img src="eval_mm/mme/cognition.jpg" width="600"/> #### SEED-Bench SEED-Bench是一个包含1.9万选择题的多模态基准测评，通过人工注释的结果评估多模态大模型，涵盖12个评估维度，包括**图像和视频理解**。Qwen-VL和Qwen-VL-chat在这个基准上实现了SOTAs。完整复现[见此](eval_mm/seed_bench/EVAL_SEED.md)。 <img src="eval_mm/seed_bench/leaderboard.jpg"/> ## 部署要求 * python 3.8及以上版本 * pytorch 1.12及以上版本，推荐2.0及以上版本 * 建议使用CUDA 11.4及以上（GPU用户需考虑此选项） ## 快速使用我们提供简单的示例来说明如何利用 🤖 ModelScope 和 🤗 Transformers 快速使用 Qwen-VL 和 Qwen-VL-Chat。在开始前，请确保你已经配置好环境并安装好相关的代码包。最重要的是，确保你满足上述要求，然后安装相关的依赖库。 ```bash pip install -r requirements.txt ``` 接下来你可以开始使用Transformers或者ModelScope来使用我们的模型。关于视觉模块的更多用法，请参考[教程](TUTORIAL_zh.md)。 #### 🤗 Transformers 如希望使用 Qwen-VL-chat 进行推理，所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码。** ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) # 请注意：分词器默认行为已更改为默认关闭特殊token攻击防护。 tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # 打开bf16精度，A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() # 打开fp16精度，V100、P100、T4等显卡建议启用以节省显存 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() # 使用CPU进行推理，需要约32GB内存 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval() # 默认gpu进行推理，需要约24GB显存 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() # 可指定不同的生成长度、top_p等相关超参（transformers 4.32.0及以上无需执行此操作） # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # 第一轮对话 query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': '这是什么?'}, ]) 
response, history = model.chat(tokenizer, query=query, history=None) print(response) # 图中是一名女子在沙滩上和狗玩耍，旁边是一只拉布拉多犬，它们处于沙滩上。 # 第二轮对话 response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history) print(response) # <ref>击掌</ref><box>(536,509),(588,602)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('1.jpg') else: print("no box") ``` <img src="assets/demo_highfive.jpg" width="500"/> 运行Qwen-VL同样非常简单。 <summary>运行Qwen-VL</summary> ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) # 打开bf16精度，A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval() # 打开fp16精度，V100、P100、T4等显卡建议启用以节省显存 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval() # 使用CPU进行推理，需要约32GB内存 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval() # 默认gpu进行推理，需要约24GB显存 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval() # 可指定不同的生成长度、top_p等相关超参（transformers 4.32.0及以上无需执行此操作） # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': 'Generate the caption in English with grounding:'}, ]) inputs = tokenizer(query, return_tensors='pt') inputs = inputs.to(model.device) pred = model.generate(**inputs) response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False) print(response) # <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the 
caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|> image = tokenizer.draw_bbox_on_latest_picture(response) if image: image.save('2.jpg') else: print("no box") ``` <img src="assets/demo_spotting_caption.jpg" width="500"/> 若在使用上述代码时由于各种原因无法从 HuggingFace 拉取模型和代码，可以先从 ModelScope 下载模型及代码至本地，再从本地加载模型： ```python from modelscope import snapshot_download from transformers import AutoModelForCausalLM, AutoTokenizer # Downloading model checkpoint to a local dir model_dir # model_dir = snapshot_download('qwen/Qwen-VL') model_dir = snapshot_download('qwen/Qwen-VL-Chat') # Loading local checkpoints # trust_remote_code is still set as True since we still load codes from local dir instead of transformers tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( model_dir, device_map="cuda", trust_remote_code=True ).eval() ``` #### 🤖 ModelScope 魔搭（ModelScope）是开源的模型即服务共享平台，为泛AI开发者提供灵活、易用、低成本的一站式模型服务产品。使用ModelScope同样非常简单，代码如下所示： ```python from modelscope import ( snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig ) import torch model_id = 'qwen/Qwen-VL-Chat' revision = 'v1.0.0' model_dir = snapshot_download(model_id, revision=revision) torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) if not hasattr(tokenizer, 'model_dir'): tokenizer.model_dir = model_dir # 打开bf16精度，A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存 # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval() # 打开fp16精度，V100、P100、T4等显卡建议启用以节省显存 model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval() # 使用CPU进行推理，需要约32GB内存 # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval() # 默认gpu进行推理，需要约24GB显存 model = 
AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# 指定生成超参数（transformers 4.32.0及以上无需执行此操作）
# model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)

# 第一轮对话
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，与人互动。

# 第二轮对话
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
  image.save('output_chat.jpg')
else:
  print("no box")
```

## 量化

### 用法

当前我们提供了基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化方案，并提供了Qwen-VL-Chat的Int4量化版本Qwen-VL-Chat-Int4 [点击此处](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)。该模型在效果评测上几乎无损，并在显存占用和推理速度上具有明显优势。

下文说明如何使用该量化模型。开始之前，请确保你满足要求（如torch2.0及以上、transformers 4.32.0及以上，等）并安装所需的代码库：

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

如遇到安装 `auto-gptq` 的问题，建议您前往官方[repo](https://github.com/PanQiWei/AutoGPTQ) 寻找合适的wheel。

随后你便可以按照上述用法，轻松调用量化模型：

```python
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg' response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None) print(response) ``` ### 效果评测我们列出不同精度下模型在评测基准 **[TouchStone](https://github.com/OFA-Sys/TouchStone)** 上的表现，并发现量化模型并没有显著性能损失。结果如下所示： | Quantization | ZH | EN | | ------------ | :--------: | :-----------: | | BF16 | 401.2 | 645.2 | | Int4 | 386.6 | 651.4 | ### 推理速度我们测算了在输入一张图片（即258个token）的条件下BF16和Int4的模型生成1792 (2048-258) 和 7934 (8192-258) 个token的平均速度。 | Quantization | Speed (2048 tokens) | Speed (8192 tokens) | | ------------ | :-----------------: | :-----------------: | | BF16 | 28.87 | 24.32 | | Int4 | 37.79 | 34.34 | 推理速度测算是在单卡 A100-SXM4-80G GPU上运行，使用PyTorch 2.0.1及CUDA 11.4。 ### GPU显存占用我们还测算了在一张图片输入的条件下BF16和Int4模型生成1792 (2048-258) 和 7934 (8192-258) 个token所需显存。结果如下所示： | Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | | ------------ | :---------------------------------: | :-----------------------------------: | | BF16 | 22.60GB | 28.01GB | | Int4 | 11.82GB | 17.23GB | 上述速度和显存测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py)完成。 ## 微调我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能，以接入下游任务。此外，我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed，因此建议您确保已经安装DeepSpeed。首先，你需要准备你的训练数据。你需要将所有样本放到一个列表中并存入json文件中。每个样本对应一个字典，包含id和conversation，其中后者为一个列表。示例如下所示： ```json [ { "id": "identity_0", "conversations": [ { "from": "user", "value": "你好" }, { "from": "assistant", "value": "我是Qwen-VL,一个支持视觉输入的大模型。" } ] }, { "id": "identity_1", "conversations": [ { "from": "user", "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？" }, { "from": "assistant", "value": "图中是一只拉布拉多犬。" }, { "from": "user", "value": "框出图中的格子衬衫" }, { "from": "assistant", "value": 
"<ref>格子衬衫</ref><box>(588,499),(725,789)</box>" } ] }, { "id": "identity_2", "conversations": [ { "from": "user", "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪" }, { "from": "assistant", "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。" } ] } ] ``` 为针对多样的VL任务，我们增加了一下的特殊tokens： `<img> </img> <ref> </ref> <box> </box>`. 对于带图像输入的内容可表示为 `Picture id: <img>img_path</img>\n{your prompt}`，其中`id`表示对话中的第几张图片。"img_path"可以是本地的图片或网络地址。对话中的检测框可以表示为`<box>(x1,y1),(x2,y2)</box>`，其中 `(x1, y1)` 和`(x2, y2)`分别对应左上角和右下角的坐标，并且被归一化到`[0, 1000)`的范围内. 检测框对应的文本描述也可以通过`<ref>text_caption</ref>`表示。准备好数据后，你可以使用我们提供的shell脚本实现微调。注意，你需要在脚本中指定你的数据的路径。微调脚本能够帮你实现： - 全参数微调 - LoRA - Q-LoRA ### 全参数微调默认下全参数微调在训练过程中更新LLM所有参数。我们的实验中，在微调阶段不更新ViT的参数会取得更好的表现。你可以运行这个脚本开始训练： ```bash # 分布式训练。由于显存限制将导致单卡训练失败，我们不提供单卡训练脚本。 sh finetune/finetune_ds.sh ``` 尤其注意，你需要在脚本中指定正确的模型名称或路径、数据路径、以及模型输出的文件夹路径。如果你想修改deepspeed配置，可以删除掉`--deepspeed`这个输入或者自行根据需求修改DeepSpeed配置json文件。此外，我们支持混合精度训练，因此你可以设置`--bf16 True`或者`--fp16 True`。经验上，如果你的机器支持bf16，我们建议使用bf16，这样可以和我们的预训练和对齐训练保持一致，这也是为什么我们把默认配置设为它的原因。 ### LoRA 运行LoRA的方法类似全参数微调。但在开始前，请确保已经安装`peft`代码库。另外，记住要设置正确的模型、数据和输出路径。我们建议你为模型路径使用绝对路径。这是因为LoRA仅存储adapter部分参数，而adapter配置json文件记录了预训练模型的路径，用于读取预训练模型权重。同样，你可以设置bf16或者fp16。 ```bash # 单卡训练 sh finetune/finetune_lora_single_gpu.sh # 分布式训练 sh finetune/finetune_lora_ds.sh ``` 与全参数微调不同，LoRA ([论文](https://arxiv.org/abs/2106.09685)) 只更新adapter层的参数而无需更新原有语言模型的参数。这种方法允许用户用更低的显存开销来训练模型，也意味着更小的计算开销。注意，如果你使用预训练模型进行LoRA微调，而非chat模型，模型的embedding和输出层的参数将被设为可训练的参数。这是因为预训练模型没有学习过ChatML格式中的特殊token，因此需要将这部分参数设为可训练才能让模型学会理解和预测这些token。这也意味着，假如你的训练引入新的特殊token，你需要通过代码中的`modules_to_save`将这些参数设为可训练的参数。如果你想节省显存占用，可以考虑使用chat模型进行LoRA微调，显存占用将大幅度降低。下文的显存占用和训练速度的记录将详细介绍这部分细节。 ### Q-LoRA 如果你依然遇到显存不足的问题，可以考虑使用Q-LoRA ([论文](https://arxiv.org/abs/2305.14314))。该方法使用4比特量化模型以及paged attention等技术实现更小的显存开销。运行Q-LoRA你只需运行如下脚本： ```bash # 单卡训练 sh finetune/finetune_qlora_single_gpu.sh # 分布式训练 sh 
finetune/finetune_qlora_ds.sh ``` 我们建议你使用我们提供的Int4量化模型进行训练，即Qwen-VL-Chat-Int4。请**不要使用**非量化模型！与全参数微调以及LoRA不同，Q-LoRA仅支持fp16。此外，上述LoRA关于特殊token的问题在Q-LoRA依然存在。并且，Int4模型的参数无法被设为可训练的参数。所幸的是，我们只提供了Chat模型的Int4模型，因此你不用担心这个问题。但是，如果你执意要在Q-LoRA中引入新的特殊token，很抱歉，我们无法保证你能成功训练。与全参数微调不同，LoRA和Q-LoRA的训练只需存储adapter部分的参数。假如你需要使用LoRA训练后的模型，你需要使用如下方法。你可以用如下代码读取模型： ```python from peft import AutoPeftModelForCausalLM model = AutoPeftModelForCausalLM.from_pretrained( path_to_adapter, # path to the output directory device_map="auto", trust_remote_code=True ).eval() ``` 如果你觉得这样一步到位的方式让你很不安心或者影响你接入下游应用，你可以选择先合并并存储模型（LoRA支持合并，Q-LoRA不支持），再用常规方式读取你的新模型，示例如下： ```python from peft import AutoPeftModelForCausalLM model = AutoPeftModelForCausalLM.from_pretrained( path_to_adapter, # path to the output directory device_map="auto", trust_remote_code=True ).eval() merged_model = model.merge_and_unload() # max_shard_size and safe serialization are not necessary. # They respectively work for sharding checkpoint and save the model to safetensors merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True) ``` 注意：分布式训练需要根据你的需求和机器指定正确的分布式训练超参数。此外，你需要根据你的数据、显存情况和训练速度预期，使用`--model_max_length`设定你的数据长度。 ### 显存占用及训练速度下面记录Qwen_VL模型在单GPU使用LoRA（LoRA (Base)指的是embedding和输出层参与训练，而LoRA (Chat)则不优化这部分参数）和QLoRA时处理不同长度输入的显存占用和训练速度的情况。本次评测运行于单张A100-SXM4-80G GPU，使用CUDA 11.8和Pytorch 2.0。我们统一使用batch size为1，gradient accumulation为8的训练配置，每个样本包含一张图，分别记录输入长度分别为384、512、1024和2048的显存占用（GB）和训练速度（s/iter）。具体数值如下所示： <table> <tr> <th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th> </tr> <tr> <th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th> </tr> <tr> <td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td> </tr> <tr> <td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G 
/ 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td> </tr> <tr> <td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td> </tr> </table> ## Demo ### Web UI 我们提供了Web UI的demo供用户使用。在开始前，确保已经安装如下代码库： ``` pip install -r requirements_web_demo.txt ``` 随后运行如下命令，并点击生成链接： ``` python web_demo_mm.py ``` ## FAQ 如遇到问题，敬请查阅 [FAQ](FAQ_zh.md)以及issue区，如仍无法解决再提交issue。 ## 使用协议研究人员与开发者可使用Qwen-VL和Qwen-VL-Chat或进行二次开发。我们同样允许商业使用，具体细节请查看[LICENSE](LICENSE)。如需商用，请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。 ## 引用如果你觉得我们的论文和代码对你的研究有帮助，请考虑:star: 和引用 :pencil: :) ```BibTeX @article{Qwen-VL, title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond}, author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren}, journal={arXiv preprint arXiv:2308.12966}, year={2023} } ``` ## 联系我们如果你想给我们的研发团队和产品团队留言，请通过邮件（qianwen_opensource@alibabacloud.com）联系我们。 ================================================ FILE: README_JA.md ================================================ <a href="README_CN.md">中文</a> ｜ <a href="README.md">English</a> ｜日本語 <img src="assets/logo.jpg" width="400"/> Qwen-VL <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> ｜ Qwen-VL-Chat <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> ｜ Qwen-VL-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a> <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a> | <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">Demo</a> ｜ <a href="https://arxiv.org/abs/2308.12966">Paper</a> | 
<a href="https://github.com/camenduru/Qwen-VL-Chat-colab">Colab</a> | <a href="TUTORIAL_ja.md">Tutorial</a> 日本語ドキュメントメンテナー: <a href="https://github.com/eltociear">Ikko Eltociear Ashimine</a> **Qwen-VL** （Qwen Large Vision Language Model）は、アリババクラウドが提唱するラージモデルシリーズ Qwen（略称: Tongyi Qianwen）のマルチモーダル版です。Qwen-VL は、画像、テキスト、バウンディングボックスを入力として受け付け、テキストとバウンディングボックスを出力します。Qwen-VL の特徴は以下の通りです: - **好調なパフォーマンス**: 複数の英語評価ベンチマーク（Zero-shot Captioning、VQA、DocVQA、Grounding を含む）において、同様のモデル規模でオープンソース化された既存の大規模ビジョン言語モデル（LVLM）を大幅に上回ります。 - **テキスト認識をサポートする多言語 LVLM**: Qwen-VL は、英語、中国語、多言語の会話を自然にサポートし、画像内の中国語と英語の二言語テキストのエンドツーエンドの認識を促進します。 - **複数画像のインターリーブ会話**: この機能により、複数の画像を入力し、比較することができる。また、画像に関連する質問を指定し、複数の画像によるストーリーテリングを行うこともできます。 - **中国語のグラウンディングを支える初のジェネラリストモデル**: 中国語と英語のオープンドメイン言語表現によるバウンディングボックスの検出。 - **きめ細やかな認識と理解**: 現在他のオープンソース LVLM で使用されている 224\*224 の解像度と比較して、448\*448 の解像度は、きめ細かいテキスト認識、文書 QA、バウンディングボックス注釈を促進する。 <img src="assets/demo_vl.gif" width="400"/> Qwen-VL シリーズの 2 つのモデルを公開します: - Qwen-VL: LLM の初期化に Qwen-7B を、視覚エンコーダの初期化に [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) を用いた学習済み LVLM モデル。そして、それらをランダムに初期化されたクロスアテンションレイヤーで接続する。 - Qwen-VL-Chat: マルチモーダルな LLM ベースの AI アシスタント。Qwen-VL-Chat は、複数の画像入力、複数ラウンドの質問応答、クリエイティブな機能など、より柔軟なインタラクションをサポートします。 ## ニュースとアップデート * 2023.11.28 Qwen-VL は、GPT4V、PALI-X を凌駕する最高レベルの [DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) をシングルモデルで達成し、直接画像を入力するだけで様々なタスクを分析理解できる汎用モデルであり。 https://qianwen.aliyun.com のマルチモーダルタブで直接新しいモデルを体験できます。 * 2023.9.25 Qwen-VL-Chat モデルが更新され、中国語コマンドのフォローがより堅牢になり、Web ページと表の画像の理解と質問と回答の機能が向上し、対話のパフォーマンスが向上しました (タッチストーン: 中国語: 401.2->481.7、英語: 645.2->711.6)。 * 2023.9.12 フルパラメータ微調整、LoRA、Q-LoRA を含む、Qwen-VL モデルの微調整をサポートするようになりました。 * 2023.9.8 [Colab](https://github.com/camenduru/Qwen-VL-Chat-colab) のサンプルを提供してくれた [camenduru](https://github.com/camenduru) に感謝します。これをチュートリアルとして使用して、12G GPU でローカルまたはオンラインのデモを行うことができます。 * 2023.9.4 Qwen-VL シリーズは、画像とビデオの両方の理解を含むマルチモーダル LLM を評価するための、正確な人による注釈を備えた 19,000 
個の多肢選択質問のマルチモーダルベンチマークである [Seed-Bench](eval_mm/seed_bench/EVAL_SEED.md) で SOTA を達成します。 * 2023.9.1 基本的な認識と理解だけでなく、文学創作までを含むマルチモーダル言語モデルの包括的な評価である [TouchStone](https://github.com/OFA-Sys/TouchStone) 評価をリリースします。強力な LLM を判定者として使用し、マルチモーダルな情報をテキストに変換します。 * 2023.8.31 低メモリコストでありながら推論速度の向上を実現する Qwen-VL-Chat 用の Int4 量子化モデル **Qwen-VL-Chat-Int4** をリリースしました。また、ベンチマーク評価においても大きなパフォーマンスの低下はありません。 * 2023.8.22 ModelScope と Hugging Face で **Qwen-VL** と **Qwen-VL-Chat** をリリースしました。また、トレーニングの詳細やモデルのパフォーマンスなど、モデルの詳細については [論文](https://arxiv.org/abs/2308.12966) も提供しています。 ## 評価モデルの能力を2つの観点から評価しました: 1. **標準ベンチマーク**: マルチモーダルなタスクの 4 つの主要カテゴリーについて、モデルの基本的なタスク能力を評価する: - ゼロショットキャプション: 未見のデータセットに対して、モデルのゼロショット画像キャプション能力を評価する; - 一般的な VQA: 判定、色、数、カテゴリなど、画像の一般的な質問応答能力を評価する; - テキストベース VQA: 文書 QA、図表 QAなど、写真内のテキストを認識するモデルの能力を評価する; - 参照表現理解: 参照表現理解: 参照表現で記述された画像内の対象物を特定する能力を評価する。 2. **TouchStone**: 総合的なテキスト画像対話能力と人間とのアライメントレベルを評価するために、GPT4 によるスコアリングに基づく TouchStone と呼ばれるベンチマークを構築し、LVLM モデルを評価しました。 - TouchStone ベンチマークは、合計 300 以上の画像、800 以上の質問、27 のカテゴリをカバーしています。例えば、属性ベースの Q&A、有名人の認識、詩の作文、複数の画像の要約、商品比較、数学の問題解決などです; - 画像の直接入力という GPT4 の現在の制限を打ち破るため、TouchStone は人間のラベル付けによるきめ細かい画像注釈を提供します。これらの詳細な注釈は、質問とモデルの出力と共に、採点のために GPT4 に提示されます。 - ベンチマークには英語版と中国語版があります。評価結果は以下の通りです: Qwen-VL は、複数の VL タスクにおいて、現行の SOTA ジェネラリストモデルを上回り、また、能力範囲の点でより包括的なカバレッジを持ちます。 <img src="assets/radar.png" width="600"/> ### ゼロショットキャプションと一般的な VQA <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr> <tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2dev</th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr> <tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td> 
<td>31.6</td> </tr> <tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr> <tr> <td>Kosmos-2</td> <td>-</td> <td>80.5</td> <td>51.1</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr> <tr> <td>InstructBLIP (Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr> <tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>  <tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td> <td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>  <tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP -FlanT5-XL)</td> <td>86.1 (PALI-X -55B)</td> <td>66.1 (PALI-X -55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+ GPT-4)</td> <td>70.9 (PALI-X -55B)</td> </tr> </tbody> </table> - ゼロショット画像のキャプション付けでは、Qwen-VL は Flickr30K で **SOTA** を達成し、InstructBlip を使用した Nocaps でも競争力のある結果を得ています。 - 一般的な VQA では、Qwen-VL は同じ一般的な LVLM スケール設定で **SOTA** を達成しています。 ### テキスト指向VQA（画像中のテキスト理解能力に重点を置く） <table> <thead> <tr> <th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="5">Generalist Models</td> <td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td> 
<td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td> </tr> <tr> <td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td> </tr> <tr> <td>Specialist SOTAs (Specialist/Finetuned)</td> <td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td> </tr> </tbody> </table> - テキスト関連の認識/QA 評価において、Qwen-VL は汎用の LVLM スケール設定で SOTA を達成しています。 - 解像度は上記のいくつかの評価において重要である。解像度が 224 のオープンソースの LVLM モデルの多くは、これらの評価ができないか、画像をカットすることでしか解決できないが、Qwen-VL は解像度を 448 にスケーリングし、エンドツーエンドで評価できるようにしました。Qwen-VL は、一部のタスクにおいて、解像度 1024 の Pix2Struct-Large モデルをも凌駕しています。 ### 表現理解の参照 <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr> <tr> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr> <tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr> <tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr> <tr> <td>VisionLLM-H</td> <td></td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24 </td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr> <tr> <td>Shikra-13B</td> <td>87.83 </td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td> 
<td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td>
</tr>
<tr>
<td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td>
</tr>
<tr>
<td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td>
</tr>
<tr>
<td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td>
<td>G-DINO-L</td> <td>90.56</td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td>
</tr>
<tr>
<td>UNINEXT-H</td> <td>92.64</td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td> <td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td>
</tr>
<tr>
<td>ONE-PEACE</td> <td>92.58</td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td>
</tr>
</tbody>
</table>

- Qwen-VL achieves **SOTA** on all of the referring expression comprehension benchmarks above.
- Qwen-VL has not been trained on any Chinese grounding data, but by training on Chinese caption data and English grounding data it still generalizes to Chinese grounding tasks in a zero-shot way.

We provide all of the above evaluation scripts for reproducing our experimental results. Please read [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) for more information.

### Chat evaluation

TouchStone is a benchmark scored by GPT4 that evaluates the abilities of LVLM models in text-image dialogue and their level of alignment with humans. It covers more than 300 images, more than 800 questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and math problem solving. Please read [touchstone/README_JA.md](touchstone/README_JA.md) for more information.

#### English evaluation

| Model            | Score |
| ---------------- | ----- |
| PandaGPT         | 488.5 |
| MiniGPT4         | 531.7 |
| InstructBLIP     | 552.4 |
| LLaMA-AdapterV2  | 590.1 |
| LLaVA            | 602.7 |
| mPLUG-Owl        | 605.4 |
| Qwen-VL-Chat     | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |

#### Chinese evaluation

| Model            | Score |
| ---------------- | ----- |
| VisualGLM        | 247.1 |
| Qwen-VL-Chat     | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |

Qwen-VL-Chat achieves the best results in both the Chinese and the English alignment evaluation.

### Other Benchmarks

#### SEED-Bench

SEED-Bench is a multimodal benchmark of 19,000 multiple-choice questions with accurate human annotations for evaluating multimodal LLMs. It covers 12 evaluation dimensions, including both **image** and **video** understanding. See [here](eval_mm/seed_bench/EVAL_SEED.md) for more details.

Qwen-VL and Qwen-VL-Chat achieve SOTA on this benchmark.

<img src="eval_mm/seed_bench/leaderboard.jpg"/>

## Requirements

* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above are recommended
* CUDA 11.4 and above are recommended (this is for GPU users)

## Quickstart

Below, we provide simple examples that show how to use Qwen-VL and Qwen-VL-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment and installed the required packages. Check that you meet the requirements above, and then install the dependent libraries.

```bash
pip install -r requirements.txt
```

Now you can start with ModelScope or Transformers. For more detailed usage of the vision encoder, see the [tutorial](TUTORIAL_ja.md).

#### 🤗 Transformers

To use Qwen-VL-Chat for inference, all you need is the few lines of code shown below. However, **please make sure that you are using the latest code.**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},  # Either a local path or a url
    {'text': '这是什么?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# Example response (translated): The picture shows a woman on a beach playing with her dog, a Labrador; they are on the sand.

# 2nd dialogue turn
response, \
history = model.chat(tokenizer, '框出图中击掌的位置', history=history)
print(response)
# <ref>击掌</ref><box>(536,509),(588,602)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('1.jpg')
else:
    print("no box")
```

<img src="assets/demo_highfive.jpg" width="500"/>

<details>
<summary>Running Qwen-VL</summary>

Running the Qwen-VL pretrained base model is also simple.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},  # Either a local path or a url
    {'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(response)
# <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|>
image = \
tokenizer.draw_bbox_on_latest_picture(response)
if image:
    image.save('2.jpg')
else:
    print("no box")
```

<img src="assets/demo_spotting_caption.jpg" width="500"/>

</details>

If you run into network problems when downloading model checkpoints and code from Hugging Face, you can download the checkpoint from ModelScope instead:

```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-VL')
model_dir = snapshot_download('qwen/Qwen-VL-Chat')

# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
```

#### 🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS) that provides flexible and cost-effective model services to AI developers. You can run the models with ModelScope in the same way:

```python
from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch

model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir

# Pick exactly one of the loading options below.
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use auto
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = \
GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)

# 1st dialogue turn
# Either a local path or a url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# Example response (translated): The picture shows a young woman playing with her Labrador on the beach. They are sitting on the sand, and the dog raises a front paw to touch her hand.

# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('output_chat.jpg')
else:
    print("no box")
```

<img src="assets/demo_highfive.jpg" width="500"/>

## Quantization

### Usage

We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release an Int4 quantized model for Qwen-VL-Chat, [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4). It achieves nearly lossless model quality while improving both memory cost and inference speed.

Here we show how to use the quantized model for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above) and install the required packages:

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

If you have problems installing `auto-gptq`, we advise you to check the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

Then you can load the quantized model easily and run inference as usual:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is loaded from the same checkpoint as the quantized model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
# Either a local path or a url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
```

### Performance

We report the performance of both the BF16 and the Int4 models on the **[TouchStone](https://github.com/OFA-Sys/TouchStone)** benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:

| Quantization | ZH         | EN            |
| ------------ | :--------: | :-----------: |
| BF16         | 401.2      | 645.2         |
| Int4         | 386.6      | 651.4         |

### Inference Speed

We measured the average inference speed (tokens/s) of generating 1792 (2048-258) and 7934 (8192-258) tokens with the context of an image (which takes 258 tokens) under BF16 precision and Int4 quantization, respectively.

| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16         | 28.87               | 24.32               |
| Int4         | 37.79               | 34.34               |

The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.

### GPU Memory Usage

We also profile the peak GPU memory usage for encoding 1792 (2048-258) tokens (including an image) as context (and generating a single token) and for generating 7934 (8192-258) tokens (with an image as context), under BF16 or Int4 quantization level, respectively. The results are shown below.

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16         | 22.60GB                             | 28.01GB                               |
| Int4         | 11.82GB                             | 17.23GB                               |

The speed and memory profiling above are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py).

## Finetuning

We now provide the official training script, `finetune.py`. We also provide shell scripts that launch `finetune.py`, so that you can start finetuning without worrying about the details. The script supports training with [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts we provide use DeepSpeed, so we advise you to install DeepSpeed before you start.

To prepare your training data, put all samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list for the conversation. Below is a simple example list with three samples:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是Qwen-VL,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  {
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
```

For VL tasks, special tokens are used, including `<img> </img> <ref> </ref> <box> </box>`.

An image is represented as `Picture id: <img>img_path</img>\n{your prompt}`, where `id` indicates the position of the image in the conversation, starting from 1. The `img_path` can be a local file path or a web link.

A coordinate box is represented as `<box>(x1,y1),(x2,y2)</box>`, where `(x1, y1)` and `(x2, y2)` are normalized values in the range `[0, 1000)`. Its corresponding text description can be identified by `<ref>text_caption</ref>`.

After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.

The finetuning scripts allow you to perform:

- Full-parameter finetuning
- LoRA
- Q-LoRA

### Full-parameter finetuning

Full-parameter finetuning updates all parameters throughout the whole training process. To launch your training, run the following script:

```bash
# Distributed training. We do not provide a single-GPU training script, as insufficient GPU memory will break down the training.
sh finetune/finetune_ds.sh
```

Remember to specify the correct model name or path, the data path, and the output directory in the shell script. If you want to make changes, just remove the argument `--deepspeed` or modify the DeepSpeed configuration json file based on your requirements. Additionally, this script supports mixed-precision training, so you can use `--bf16 True` or `--fp16 True`. Empirically, if your machine supports bf16, we advise you to use bf16 to keep your training consistent with our pretraining and alignment.

### LoRA

Similarly, to run LoRA, use another script as shown below. Before you start, make sure that you have installed `peft`. You also need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pretrained model, because LoRA only saves the adapter, and the absolute path in the adapter configuration json file is used to find the pretrained model to load. This script also supports both bf16 and fp16.

```bash
# Single-GPU training
sh \
finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

Compared with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This significantly reduces the memory cost and also the computation cost.

Note that if you use LoRA to finetune the base language model (e.g., Qwen-VL) instead of a chat model (e.g., Qwen-VL-Chat), the script automatically switches the embedding and output layers to trainable parameters. This is because the base language model has no knowledge of the special tokens introduced by the ChatML format, so these layers must be updated for the model to understand and predict those tokens. In other words, if your training introduces new special tokens with LoRA, you should make these layers trainable by setting `modules_to_save` inside the code. Additionally, the memory footprint of LoRA differs significantly with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA-finetune the chat models. Check the profile below for more information.

### Q-LoRA

If you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model and other techniques such as paged optimizers to reduce the memory cost even further. To run Q-LoRA, directly run the following script:

```bash
# Single-GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-VL-Chat-Int4. Note that, unlike full-parameter finetuning and LoRA, Q-LoRA supports only fp16.

Unlike full-parameter finetuning, LoRA and Q-LoRA training saves only the adapter parameters. Suppose your training starts from a Qwen-VL checkpoint; you can then load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

If you want to merge the adapters and save the finetuned model as a standalone model (you can only do this with LoRA; you CANNOT merge the parameters from Q-LoRA), run the following code:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe serialization are not necessary.
# They respectively shard the checkpoint and save the model in safetensors format.
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```

Note: for multi-GPU training, you need to specify proper hyperparameters for distributed training based on your machine. We also advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed.

### Profiling of Memory and Speed

We profile the GPU memory and training speed of both LoRA (LoRA (Base) trains the embedding and output layers, while LoRA (Chat) does not) and Q-LoRA in the single-GPU training setup. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. Each sample contains an image. We profile the memory (GB) and speed (s/iter) for inputs of different lengths, namely 384, 512, 1024, and 2048. The statistics are listed below:

<table>
<tr>
<th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th>
</tr>
<tr>
<td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td>
</tr>
<tr>
<td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G / 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td>
</tr>
<tr>
<td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td>
</tr>
</table>

The shell scripts use `torchrun` to run single-GPU or multi-GPU training. Therefore, you need to specify proper hyperparameters for distributed training based on your machine.

## Demo

### Web UI

We provide code for building a Web UI demo. Before you start, make sure the following packages are installed:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click the generated link:

```bash
python web_demo_mm.py
```

## FAQ

If you run into problems, please refer to the [FAQ](FAQ_ja.md) and the existing issues to search for a solution before opening a new issue.

## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen-VL and Qwen-VL-Chat. Commercial use is also allowed. See [LICENSE](LICENSE) for details.

## Citation

If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil: :)
```BibTeX
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

## Contact Us

If you would like to leave a message for our research team or product team, feel free to email qianwen_opensource@alibabacloud.com.

================================================
FILE: README_KO.md
================================================
<a href="README_CN.md">中文</a> ｜ <a href="README.md">English</a> ｜ <a href="README_JA.md">日本語</a> ｜ 한국어

<img src="assets/logo.jpg" width="400"/>

Qwen-VL <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> ｜ Qwen-VL-Chat <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> ｜ Qwen-VL-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a>

<a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a> | <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">Demo</a> ｜ <a href="https://arxiv.org/abs/2308.12966">Paper</a> | <a href="https://github.com/camenduru/Qwen-VL-Chat-colab">Colab</a> | <a href="TUTORIAL.md">Tutorial</a>

---

**Qwen-VL** (Qwen Large Vision Language Model) is the multimodal version of Qwen (abbr. Tongyi Qianwen), the large model series proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as input, and outputs text and bounding boxes. The features of Qwen-VL include:

- **Strong performance**: It significantly surpasses existing open-source Large Vision Language Models (LVLMs) of similar model scale on multiple English evaluation benchmarks (including Zero-shot Captioning, VQA, DocVQA, and Grounding).
- **Multilingual LVLM supporting text recognition**: Qwen-VL naturally supports English, Chinese, and multilingual conversation, and it improves end-to-end recognition of Chinese and English bilingual text in images.
- **Multi-image interleaved conversations**: This feature allows the input and comparison of multiple images, as well as the ability to specify questions related to the images and to engage in multi-image storytelling.
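- **First generalist model supporting grounding in Chinese**: It detects bounding boxes through open-domain language expressions in both Chinese and English.
- **Fine-grained recognition and understanding**: Compared with the 224\*224 resolution currently used by other open-source LVLMs, the 448\*448 resolution improves fine-grained text recognition, document QA, and bounding box annotation.

<img src="assets/demo_vl.gif" width="400"/>

We release two models of the Qwen-VL series:

- Qwen-VL: The pretrained LVLM model. It uses Qwen-7B to initialize the LLM and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) to initialize the visual encoder, and connects them with a randomly initialized cross-attention layer.
- Qwen-VL-Chat: A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities.

## News and Updates

* ```2023.9.25``` 🚀🚀🚀 We update Qwen-VL-Chat with more robust Chinese instruction-following ability, improved understanding of web pages and table images, and better dialogue performance (TouchStone: CN: 401.2->481.7, EN: 645.2->711.6).
* ```2023.9.12``` 😃😃😃 We now support finetuning on the Qwen-VL models, including full-parameter finetuning, LoRA, and Q-LoRA.
* ```2023.9.8``` 👍👍👍 Thanks to camenduru for contributing the wonderful Colab. Everyone can use it as a tutorial for a local or online Qwen-VL-Chat-Int4 demo on a 12G GPU.
* ```2023.9.5``` 👏👏👏 Qwen-VL-Chat achieves SOTAs on the MME Benchmark, a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks.
* ```2023.9.4``` ⭐⭐⭐ The Qwen-VL series achieves SOTAs on Seed-Bench, a multimodal benchmark of 19K multiple-choice questions with accurate human annotations that evaluates both image and video understanding.
* ```2023.9.1``` 🔥🔥🔥 We release the [TouchStone](https://github.com/OFA-Sys/TouchStone) evaluation, a comprehensive assessment of multimodal language models that covers not only basic recognition and comprehension but also literary creation. It uses strong LLMs as judges and converts multimodal information into text for evaluation.
* ```2023.8.31``` 🌟🌟🌟 We release the Int4 quantized model for Qwen-VL-Chat, **Qwen-VL-Chat-Int4**, which lowers memory cost and improves inference speed. There is also no significant performance degradation on the benchmark evaluation.
* ```2023.8.22``` 🎉🎉🎉 We release both **Qwen-VL** and **Qwen-VL-Chat** on ModelScope and Hugging Face. We also provide a [paper](https://arxiv.org/abs/2308.12966) with more details about the model, including training details and model performance.

## Evaluation

We evaluated the model's abilities from three perspectives:

1. **Standard Benchmarks**: We evaluate the model's basic task capabilities on four major categories of multimodal tasks:
   - Zero-shot Captioning: Evaluate the model's zero-shot image captioning ability on unseen datasets.
   - General VQA: Evaluate the ability to answer general questions about pictures, such as judgment, color, number, and category.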
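   - Text-based VQA: Evaluate the model's ability to recognize text in pictures, such as document QA and chart QA.
   - Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.

2. **TouchStone**: To evaluate the overall text-image dialogue capability and the level of alignment with humans, we have constructed a benchmark called [TouchStone](https://github.com/OFA-Sys/TouchStone), which evaluates LVLM models through scoring with GPT4.
   - The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and math problem solving.
   - To overcome the current limitation of GPT4 regarding direct image input, TouchStone provides fine-grained image annotations labeled by humans. These detailed annotations, along with the questions and the model's output, are then presented to GPT4 for scoring.
   - The benchmark includes both English and Chinese versions.

3. **Other Multimodal Benchmarks**: We also evaluated our model's capabilities on other multimodal benchmarks:
   - [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), a comprehensive evaluation benchmark for multimodal large language models. Qwen-VL-Chat achieves SOTA on both the perception and the cognition tracks.
   - [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs. The Qwen series achieves SOTA on this benchmark.

The results of the evaluation are as follows.

Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and covers a more comprehensive range of capabilities.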
<img src="assets/radar.png" width="600"/> ### Zero-shot Captioning & General VQA <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr> <tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2dev</th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr> <tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td> <td>31.6</td> </tr> <tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr> <tr> <td>Kosmos-2</td> <td>-</td> <td>80.5</td> <td>51.1</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr> <tr> <td>InstructBLIP (Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr> <tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>  <tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td> <td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>  <tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP -FlanT5-XL)</td> <td>86.1 (PALI-X -55B)</td> <td>66.1 (PALI-X -55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+ GPT-4)</td> <td>70.9 (PALI-X -55B)</td> </tr> 
</tbody>
</table>

- For zero-shot image captioning, Qwen-VL achieves **SOTA** on Flickr30K and results competitive with InstructBLIP on NoCaps.
- For general VQA, Qwen-VL achieves **SOTA** under the same generalist LVLM scale settings.

### Text-oriented VQA (Focused on text understanding capabilities in images)

<table>
<thead>
<tr>
<th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td rowspan="5">Generalist Models</td>
<td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td>
</tr>
<tr>
<td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td>
</tr>
<tr>
<td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td> <td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td>
</tr>
<tr>
<td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td>
</tr>
<tr>
<td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td>
</tr>
<tr>
<td>Specialist SOTAs (Specialist/Finetuned)</td>
<td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td>
</tr>
</tbody>
</table>

- In text-related recognition/QA evaluation, Qwen-VL achieves SOTA under the generalist LVLM scale settings.
- Resolution is important for several of the evaluations above. Most open-source LVLM models with 224 resolution cannot perform these evaluations, or can only solve them by cutting the images, whereas Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. On some tasks, Qwen-VL even outperforms the Pix2Struct-Large model, which uses 1024 resolution.
### Referring Expression Comprehension <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr> <tr> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr> <tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr> <tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr> <tr> <td>VisionLLM-H</td> <td></td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24 </td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr> <tr> <td>Shikra-13B</td> <td>87.83 </td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td> <td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td> </tr> <tr> <td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td> </tr> <tr> <td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td> <tr> <td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td> <td>G-DINO-L</td> <td>90.56</td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td> </tr> <tr> <td>UNINEXT-H</td> <td>92.64 </td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td> 
<td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td> </tr>
<tr> <td>ONE-PEACE</td> <td>92.58</td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td> </tr>
</tbody>
</table>

- Qwen-VL achieves **SOTA** among generalist models on all of the referring expression comprehension benchmarks above.
- Qwen-VL has not been trained on any Chinese grounding data, but it can still generalize to Chinese grounding tasks in a zero-shot way by training on Chinese caption data and English grounding data.

We provide all of the above evaluation scripts for reproducing our experimental results. Please read [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) for more information.

### Chat evaluation

TouchStone is a benchmark scored by GPT4 that evaluates the abilities of LVLM models in text-image dialogue and their level of alignment with humans. It covers more than 300 images, more than 800 questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, poetry writing, summarizing multiple images, product comparison, and math problem solving. Please read [touchstone/README.md](touchstone/README.md) for more information.

#### English evaluation

| Model            | Score |
| ---------------- | ----- |
| PandaGPT         | 488.5 |
| MiniGPT4         | 531.7 |
| InstructBLIP     | 552.4 |
| LLaMA-AdapterV2  | 590.1 |
| LLaVA            | 602.7 |
| mPLUG-Owl        | 605.4 |
| Qwen-VL-Chat     | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |

#### Chinese evaluation

| Model            | Score |
| ---------------- | ----- |
| VisualGLM        | 247.1 |
| Qwen-VL-Chat     | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |

Qwen-VL-Chat achieves the best results in both the Chinese and English alignment evaluations.

### Other Benchmarks

#### MME Benchmark

[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Qwen-VL-Chat achieves SOTA on both the perception and cognition evaluations. See more details [here](eval_mm/mme/EVAL_MME.md).

<img src="eval_mm/mme/perception.jpg" width="600"/>
<img src="eval_mm/mme/cognition.jpg" width="600"/>

#### SEED-Bench

[SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs. It covers 12 evaluation dimensions, including both **image** and **video** understanding. See more details [here](eval_mm/seed_bench/EVAL_SEED.md).

Qwen-VL and Qwen-VL-Chat achieve SOTA on this benchmark.
<img src="eval_mm/seed_bench/leaderboard.jpg"/> ## Requirements * python 3.8 and above * pytorch 1.12 and above, 2.0 and above are recommended * CUDA 11.4 and above are recommended (this is for GPU users) ## Quickstart 아래에서는 🤖 모델스코프 및 🤗 트랜스포머와 함께 Qwen-VL 및 Qwen-VL-Chat을 사용하는 방법을 보여주는 간단한 예제를 제공합니다. 코드를 실행하기 전에 환경을 설정하고 필요한 패키지를 설치했는지 확인하세요. 위의 요구 사항을 충족하는지 확인한 다음 종속 라이브러리를 설치하세요. ```bash pip install -r requirements.txt ``` 이제 모델스코프 또는 트랜스포머로 시작할 수 있습니다. 비전 인코더에 대한 자세한 사용법은 [튜토리얼](TUTORIAL.md)을 참조하세요. #### 🤗 Transformers 추론에 Qwen-VL-Chat을 사용하려면 아래에 설명된 대로 몇 줄의 코드를 입력하기만 하면 됩니다. 단, **최신 코드를 사용하고 있는지 확인하세요**. ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) # Note: The default behavior now has injection attack prevention off. tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # 1st dialogue turn query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': '这是什么?'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) # 图中是一名女子在沙滩上和狗玩耍，旁边是一只拉布拉多犬，它们处于沙滩上。 # 2nd dialogue turn response, history = 
model.chat(tokenizer, '框出图中击掌的位置', history=history) print(response) # <ref>击掌</ref><box>(536,509),(588,602)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('1.jpg') else: print("no box") ``` <img src="assets/demo_highfive.jpg" width="500"/> <details> <summary>Running Qwen-VL</summary> Qwen-VL pretrained base model을 실행하는 것도 매우 간단합니다. ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0) # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': 'Generate the caption in English with grounding:'}, ]) inputs = tokenizer(query, return_tensors='pt') inputs = inputs.to(model.device) pred = model.generate(**inputs) response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False) print(response) # <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> 
playing on the beach<|endoftext|> image = tokenizer.draw_bbox_on_latest_picture(response) if image: image.save('2.jpg') else: print("no box") ``` <img src="assets/demo_spotting_caption.jpg" width="500"/> </details> HuggingFace에서 모델 체크포인트와 코드를 다운로드하는 동안 네트워크 문제가 발생하는 경우, 아래에 설명된 대로 모델스코프에서 체크포인트를 먼저 가져온 다음 로컬 디렉터리에서 로드하는 방법을 사용할 수 있습니다. ```python from modelscope import snapshot_download from transformers import AutoModelForCausalLM, AutoTokenizer # Downloading model checkpoint to a local dir model_dir # model_dir = snapshot_download('qwen/Qwen-VL') model_dir = snapshot_download('qwen/Qwen-VL-Chat') # Loading local checkpoints # trust_remote_code is still set as True since we still load codes from local dir instead of transformers tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( model_dir, device_map="cuda", trust_remote_code=True ).eval() ``` #### 🤖 ModelScope ModelScope는 서비스형 모델(MaaS)을 위한 오픈소스 플랫폼으로, AI 개발자에게 유연하고 비용 효율적인 모델 서비스를 제공합니다. 마찬가지로 아래와 같이 ModelScope로 모델을 실행할 수 있습니다. 
```python
from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch

model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir

# Load the model exactly once; uncomment the variant you need and comment out the others.
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use auto
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>=4.32.0)
# model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)

# 1st dialogue turn
# Either a local path or a URL between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，与人互动。

# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('output_chat.jpg')
else:
    print("no box")
```

<img src="assets/demo_highfive.jpg" width="500"/>

## Quantization

### Usage

We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and have released an Int4 quantized model for Qwen-VL-Chat, [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4), which achieves nearly lossless model quality while improving both memory cost and inference speed.

Here we demonstrate how to use the provided quantized model for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above) and have installed the required packages:

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

If you have problems installing `auto-gptq`, we advise you to look for a wheel in the official [repo](https://github.com/PanQiWei/AutoGPTQ).

Then you can load the quantized model easily and run inference as usual:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is loaded the same way as for the non-quantized chat model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()

# Either a local path or a URL between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
```

### Performance

We compared the BF16 and Int4 models on the [TouchStone](https://github.com/OFA-Sys/TouchStone) benchmark and found that the quantized model does not suffer significant performance degradation. Results are shown below:
| Quantization |     ZH     |      EN       |
| ------------ | :--------: | :-----------: |
| BF16         |   401.2    |     645.2     |
| Int4         |   386.6    |     651.4     |

### Inference Speed

We measured the average inference speed (tokens/s) of generating 1792 (2048-258) and 7934 (8192-258) tokens with an image as context (which takes 258 tokens), under BF16 precision and Int4 quantization respectively.

| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16         |        28.87        |        24.32        |
| Int4         |        37.79        |        34.34        |

The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.

### GPU Memory Usage

We also profiled the peak GPU memory usage for encoding 1792 (2048-258) tokens (including an image) as context and generating a single token, and for generating 7934 (8192-258) tokens (with an image as context), under BF16 and Int4 quantization respectively. The results are shown below.

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16         |               22.60GB               |                28.01GB                |
| Int4         |               11.82GB               |                17.23GB                |

The speed and memory profiling above was conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py).

## Finetuning

We provide the official training script, `finetune.py`, so that users can finetune the pretrained model for downstream applications in a simple fashion. We also provide shell scripts for launching finetuning without extra setup. The training script supports DeepSpeed and FSDP. The provided shell scripts use DeepSpeed, so we advise you to install DeepSpeed before you start:

```bash
pip install deepspeed
```

### Data preparation

To prepare your training data, put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list of conversation turns. Below is a simple example list with 3 samples:
```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是Qwen-VL,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  {
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
```

VL tasks use special tokens such as `<img> </img> <ref> </ref> <box> </box>`.

An image is represented as `Picture id: <img>img_path</img>\n{your prompt}`, where `id` indicates the position of the image in the conversation, starting from 1. `img_path` can be a local file path or a web link.

A box is expressed as `<box>(x1,y1),(x2,y2)</box>`, where the coordinates `(x1, y1)` and `(x2, y2)` are normalized to the range `[0, 1000)`. The corresponding text description is marked with `<ref>text_caption</ref>`.

After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.

The finetuning scripts allow you to perform:

- Full-parameter finetuning
- LoRA
- Q-LoRA

### Full-parameter finetuning

Full-parameter finetuning updates all parameters of the LLM throughout training. In our experiments, **freezing the parameters of the ViT during the finetuning stage gave better performance.** To launch training, run the following script:

```bash
sh finetune/finetune_ds.sh
```

Remember to specify the correct model name or path, the data path, and the output directory in the shell script. If you want to change the DeepSpeed setup, remove the `--deepspeed` argument or edit the DeepSpeed configuration json file according to your requirements. The script also supports mixed-precision training, so you can use `--bf16 True` or `--fp16 True`. Empirically, if your machine supports bf16, we advise you to use it so that training is consistent with our pretraining and alignment; it is therefore the default.

### LoRA

Similarly, to run LoRA, use another script as shown below. Before you start, make sure that you have installed `peft`. You also need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pretrained model.
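This is because LoRA only saves the adapter, and the absolute path in the adapter configuration json file is used to find the pretrained model to load.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

Compared with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This results in much lower memory cost and lower computation cost.

Note that if you use LoRA to finetune the base language model (e.g., Qwen-VL) instead of the chat model, the script automatically switches the embedding and output layers to trainable parameters. This is because the base language model has no knowledge of the special tokens introduced by the ChatML format, so these layers must be updated for the model to understand and predict those tokens. In other words, if your LoRA training introduces special tokens, you should make these layers trainable by setting `modules_to_save` in the code. We also found a significant gap in memory footprint between LoRA with and without these trainable parameters. Therefore, if you are short on memory, we advise you to finetune the chat model with LoRA. Check the profile below for more information.

### Q-LoRA

If you still run out of memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model and other techniques such as paged attention to reduce memory cost even further. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load the quantized model we provide, e.g., Qwen-VL-Chat-Int4. You should not use the bf16 models. Unlike full-parameter finetuning and LoRA, Q-LoRA supports only fp16. The issue with special tokens described for LoRA also applies to Q-LoRA. However, since we only provide Int4 models for the chat model, whose language model has already learned the special tokens of the ChatML format, you do not need to worry about these layers. Note that the layers of the Int4 model should not be trainable, so if you introduce special tokens in your training, Q-LoRA might not work.

Unlike full-parameter finetuning, training with LoRA and Q-LoRA saves only the adapter parameters. You can load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

If you want to merge the adapters and save the finetuned model as a standalone model (this is possible only with LoRA; the parameters cannot be merged from Q-LoRA), you can run the following code: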
```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe_serialization are not required.
# They control checkpoint sharding and saving the model in safetensors format, respectively.
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```

Note: For multi-GPU training, you need to specify hyperparameters for distributed training that suit your machine. We also advise you to set the maximum sequence length with the `--model_max_length` argument, taking your data, memory footprint, and training speed into account.

### Profiling of Memory and Speed

We profile the GPU memory and training speed of LoRA (Base), where the embedding and output layers are trained, and LoRA (Chat), where the embedding and output layers are not trainable, in a single-GPU training setup. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8. Each sample contains an image. We profile the memory (GB) and speed (s/iter) for inputs of different lengths, namely 384, 512, 1024, and 2048. The statistics are listed below:

<table>
<tr>
  <th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th>
</tr>
<tr>
  <th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th>
</tr>
<tr>
  <td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td>
</tr>
<tr>
  <td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G / 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td>
</tr>
<tr>
  <td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td>
</tr>
</table>

## Demo

### Web UI

We provide code for users to build a web UI demo. Before you start, make sure you install the following packages:
```
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```
python web_demo_mm.py
```

## FAQ

If you run into problems, please check the [FAQ](FAQ.md) and the existing issues for a solution before opening a new issue.

## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen-VL and Qwen-VL-Chat. Commercial use is also allowed. Check the license at [LICENSE](LICENSE) for more details.

## Citation

If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

## Contact Us

If you would like to leave a message for our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.

================================================
FILE: TUTORIAL.md
================================================

# Qwen-VL-Chat Tutorial

Qwen-VL-Chat is a generalist multimodal large-scale language model, and it can perform a wide range of vision-language tasks. In this tutorial, we will give some concise examples to demonstrate the capabilities of Qwen-VL-Chat in **Visual Question Answering, Text Understanding, Mathematical Reasoning with Diagrams, Multi-Figure Reasoning, and Grounding**. Please note that the examples shown are far from the limit of Qwen-VL-Chat's capabilities; **you can further explore Qwen-VL-Chat's capabilities by changing the input images and prompts!**

## Initializing the Qwen-VL-Chat model

Before you can use Qwen-VL-Chat, you first need to initialize Qwen-VL-Chat's tokenizer and Qwen-VL-Chat's model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# If you expect the results to be reproducible, set a random seed.
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```

After executing the above code, ```tokenizer``` will correspond to the tokenizer used by Qwen-VL-Chat, while ```model``` will correspond to the model of Qwen-VL-Chat. The ```tokenizer``` is used for preprocessing the interleaved multimodal inputs, while the ```model``` is the Qwen-VL-Chat model itself.

## Using Qwen-VL-Chat

### **Multi-round visual question answering**

#### **The first question**

Let's get started with a simple example. As shown below, the file ```assets/mm_tutorial/Rebecca_(1939_poster).jpeg``` is a poster for the 1940 film Rebecca.

![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg)

Let's ask Qwen-VL-Chat for the name of the film on the poster. First of all, we use ```tokenizer.from_list_format```, which can preprocess and tokenize the input:

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'},
    {'text': 'What is the name of the movie in the poster?'},
])
```

Next, we can use ```model.chat``` to ask the Qwen-VL-Chat model a question and get its response. Note that for the first question, the dialogue history is empty, so we use ```history=None```.

```python
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You are expected to get an output similar to the following:

> The name of the movie in the poster is "Rebecca."

This shows that the model correctly answered the given question! According to the poster, the title of the film is indeed **Rebecca**.

#### **Multi-round question answering**

We can also continue to ask the model other questions, such as who is the director of the film.
The dialogue history is not empty for subsequent questions, so we use ```history=history``` to pass the history of previous conversations to ```model.chat```:

```python
query = tokenizer.from_list_format([
    {'text': 'Who directed this movie?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You are expected to get an output similar to the following:

> The movie "Rebecca" was directed by Alfred Hitchcock.

Again, the model answered the given question correctly! According to the poster, the director of the film is Alfred Hitchcock.

### **Text Understanding**

Qwen-VL-Chat also has the ability to understand images containing dense text. As shown below, the file ```assets/mm_tutorial/Hospital.jpg``` is a photo of hospital signage containing dense text.

![](assets/mm_tutorial/Hospital_Small.jpg)

We can ask questions about the location of different departments in the hospital. Since the dialogue history is empty, we use ```history=None```.

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Hospital.jpg'},
    {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You are expected to get an output similar to the following:

> The Department of Otorhinolaryngology is located on the 4th floor.

You can also ask further questions. In this case you need to use ```history=history``` to pass the history of previous conversations to ```model.chat```.

```python
query = tokenizer.from_list_format([
    {'text': 'Based on the photo, which floor is the Department of Surgery on?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You are expected to get an output similar to the following:

> The Department of Surgery is located on the 3rd floor.
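
The pattern used in both examples above, passing the returned `history` back into the next `model.chat` call, can be wrapped in a small helper. The sketch below is illustrative only: `ask_in_sequence` and `EchoModel` are hypothetical names that are not part of Qwen-VL-Chat, and the stand-in model merely imitates the `(response, history)` return shape shown above so that the snippet runs without downloading any weights.

```python
from typing import Any, List, Optional, Tuple


def ask_in_sequence(model: Any, tokenizer: Any, queries: List[str]) -> Tuple[List[str], Any]:
    """Send queries one by one, feeding each returned history into the next call."""
    history: Optional[Any] = None  # the first turn starts with an empty history
    responses: List[str] = []
    for query in queries:
        response, history = model.chat(tokenizer, query=query, history=history)
        responses.append(response)
    return responses, history


class EchoModel:
    """Hypothetical stand-in that only imitates the chat() return shape."""

    def chat(self, tokenizer, query, history=None):
        history = list(history or [])
        response = f"answer {len(history) + 1}"
        history.append((query, response))
        return response, history


if __name__ == "__main__":
    answers, final_history = ask_in_sequence(EchoModel(), None, ["first", "second"])
    print(answers)
    print(len(final_history))
```

With the real model, you would pass the `model` and `tokenizer` initialized earlier and a list of queries built with ```tokenizer.from_list_format```; whether that is more convenient than explicit calls depends on how interactive your session is.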
### **Mathematical Reasoning with Diagram** Using the model's diagram comprehension and mathematical reasoning capabilities, Qwen-VL-Chat can also perform some more complex tasks! As shown below, the file ```assets/mm_tutorial/Menu.jpeg``` is the menu of a restaurant. Now we want to know how much it would cost to purchase two Salmon Burgers and three Meat Lover's Pizzas. ![](assets/mm_tutorial/Menu.jpeg) ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Menu.jpeg'}, {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? Think carefully step by step.'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` ```Think carefully step by step.``` is a common prompt that guides the model through complex tasks step by step. So if you have a complex task to complete, try using it to improve the accuracy of the model. You are expected to get an output similar to the following: > To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following: > > 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20 > 2. For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36 > > Therefore, the total cost would be $56. ### **Multi-Figure Reasoning and Chinese Input** In the previous examples, we have demonstrated Qwen-VL-Chat's question-answering capability for a single image and English questions. However, Qwen-VL-Chat is actually a multilingual model that supports Chinese input and multiple images! 
In the following example, we let Qwen-VL-Chat compare the photos of two cities (Chongqing and Beijing) for us (```assets/mm_tutorial/Chongqing.jpeg``` and ```assets/mm_tutorial/Beijing.jpeg```) in Chinese: ![](assets/mm_tutorial/Chongqing_Small.jpeg) ![](assets/mm_tutorial/Beijing_Small.jpeg) ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Chongqing.jpeg'}, {'image': 'assets/mm_tutorial/Beijing.jpeg'}, {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'}, ]) torch.manual_seed(5678) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` You are expected to get an output similar to the following: > 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。 **Please note that comparing cities is a fairly subjective question, so the responses generated by the model may be subject to a high degree of randomness. If you do not set the random seed using ```torch.manual_seed(5678)```, the output will be different each time. Even if you set the random seed, the results obtained may still differ from this tutorial due to differences in hardware and software environments.** ### **Grounding Capability** In the last section of the tutorial, we demonstrate the ability of the Qwen-VL-Chat model to produce a bounding box. Qwen-VL-Chat can frame a specified area of an image with a rectangular box according to your language description. This may be a bit abstract, so let's look at the following example. As shown below, the file ```assets/mm_tutorial/Shanghai.jpg``` is a photo of Shanghai, and we'll start by asking the model to describe the image with a regular prompt. 
![](assets/mm_tutorial/Shanghai_Small.jpeg)

```python
torch.manual_seed(1234)
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Shanghai.jpg'},
    {'text': '图里有啥'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You are expected to get an output similar to the following:

> 图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。

Next, let's talk to the model by using the prompt ```请给我框出图中上海环球金融中心和东方明珠``` and see what happens. Note that at this point you need to pass the history of previous conversations to ```model.chat``` using ```history=history```.

```python
query = tokenizer.from_list_format([
    {'text': '请给我框出图中上海环球金融中心和东方明珠'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You are expected to get an output similar to the following:

```xml
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box>
```

The Qwen-VL-Chat model cannot draw on the image itself, but it doesn't reject your request either. Instead, it outputs something that looks "strange". In fact, the output gives the locations of the 上海环球金融中心 (Shanghai World Financial Centre) and the 东方明珠 (Oriental Pearl Tower) in a markup format.
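
The markup can also be parsed directly if you need the boxes as numbers. The following is a minimal sketch and not part of the Qwen-VL-Chat API: `parse_boxes` and `to_pixels` are hypothetical helpers, the regular expression assumes the `<ref>…</ref><box>(x1,y1),(x2,y2)</box>` layout seen in the output above with exactly one box per reference, and the pixel conversion assumes that coordinates are normalized to the range `[0, 1000)`, as the repository's README describes for its data format. The 2000x1000 image size is made up for illustration.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# One <ref>label</ref> immediately followed by a single <box>(x1,y1),(x2,y2)</box>.
_PATTERN = re.compile(r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")


def parse_boxes(response: str) -> List[Tuple[str, Box]]:
    """Return (label, (x1, y1, x2, y2)) pairs found in a grounding response."""
    return [
        (label.strip(), (int(x1), int(y1), int(x2), int(y2)))
        for label, x1, y1, x2, y2 in _PATTERN.findall(response)
    ]


def to_pixels(box: Box, width: int, height: int) -> Box:
    """Scale a box from the assumed 0-1000 normalized grid to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width // 1000, y1 * height // 1000,
            x2 * width // 1000, y2 * height // 1000)


if __name__ == "__main__":
    response = ("<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和"
                "<ref>东方明珠</ref><box>(506,75),(582,946)</box>")
    for label, box in parse_boxes(response):
        # The image size here is hypothetical; use the real size of your image.
        print(label, box, to_pixels(box, width=2000, height=1000))
```

A response that attaches several boxes to one reference would need a more permissive pattern; this sketch deliberately keeps to the simplest case.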
You can visualise it using the following code: ```python image = tokenizer.draw_bbox_on_latest_picture(response, history) image.save('Shanghai_Output.jpg') ``` The saved ```Shanghai_Output.jpg``` will look similar to the screenshot below: ![](assets/mm_tutorial/Shanghai_Output_Small.jpeg) After that, you can still chat with Qwen-VL-Chat as before: ```python query = tokenizer.from_list_format([ {'text': '帮我写个这座城市的旅游计划'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` You are expected to get an output similar to the following: > 好的，以下是一个简单的上海旅游计划： > > 第一天： > 上午：抵达上海，前往酒店办理入住手续。 > 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。 > 下午：游览南京路步行街，购买特色礼品或品尝当地美食。 > 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。 > > 第二天： > 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。 > 中午：在科技馆附近的餐厅享用午餐。 > 下午：游览世纪公园，欣赏美景并放松身心。 > 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。 > > 第三天： > 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。 > 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。 > 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。 > 晚上：在酒店附近享用晚餐，然后离开上海。 > > 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。 **Please note that travel planning is a fairly subjective question, so the responses generated by the model may be subject to a high degree of randomness. If you do not set the random seed using ```torch.manual_seed(1234)```, the output will be different each time. Even if you set the random seed, the results obtained may still differ from this tutorial due to differences in hardware and software environments.** ### Grounded Captioning Qwen-VL can output the bounding box information of the subject while captioning the image. 
For example: ``` img_url = 'assets/apple.jpeg' query = tokenizer.from_list_format([ {'image': img_url}, {'text': 'Generate the caption in English with grounding:'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) image = tokenizer.draw_bbox_on_latest_picture(response, history) if image is not None: image.save('apple.jpg') ``` The saved ```apple.jpg``` will look similar to the screenshot below: <img src="assets/apple_r.jpeg" width="600"/> #### How to get the caption without any box-like annotations Sometimes you may expect no box-like annotations in the response. In the case, you can stably get the cleaned text by the following post-processing. ``` # response = '<ref> Two apples</ref><box>(302,257),(582,671)</box><box>(603,252),(878,642)</box> and<ref> a bowl</ref><box>(2,269),(304,674)</box>' import re clean_response = re.sub(r'<ref>(.*?)</ref>(?:<box>.*?</box>)*(?:<quad>.*?</quad>)*', r'\1', response).strip() print(clean_response) # clean_response = 'Two apples and a bowl' ``` ================================================ FILE: TUTORIAL_ja.md ================================================ # Qwen-VL-Chat チュートリアル Qwen-VL-Chat は汎用のマルチモーダル大規模言語モデルであり、幅広い視覚言語タスクを実行できます。このチュートリアルでは、Qwen-VL-Chat の**視覚的質問応答、テキスト理解、図を用いた数学的推論、多視点推論、およびグラウンディング**の機能について、いくつかの簡潔な例を挙げて説明します。Qwen-VL-Chat は、入力画像やプロンプトを変更することで、Qwen-VL-Chat の能力をさらに引き出すことができます。 ## Qwen-VL-Chat モデルの初期化 Qwen-VL-Chat を使用する前に、まず Qwen-VL-Chat のトークナイザと Qwen-VL-Chat のモデルを初期化する必要があります: ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig # 結果の再現性を期待する場合は、ランダムシードを設定する。 # torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) 
``` 上記のコードを実行すると、```tokenizer``` は Qwen-VL-Chat で使用される分類器に対応し、```model``` は Qwen-VL-Chat のモデルに対応します。```tokenizer``` はインターリーブされたマルチモーダル入力の前処理に使用され、```model``` は Qwen-VL-Chat のモデルそのものです。 ## Qwen-VL-Chat を使う ### **複数ラウンドのビジュアル質問回答** #### **最初の質問** 簡単な例から始めましょう。以下に示すように、```assets/mm_tutorial/Rebecca_(1939_poster).jpeg``` は 1940 年の映画レベッカのポスターです。 ![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg) Qwen-VL-Chat のポスターに描かれている映画の名前を聞いてみよう。まず初めに、入力を前処理してトークン化する ```tokenizer.from_list_format``` を使用します: ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'}, {'text': 'What is the name of the movie in the poster?'}, ]) ``` 次に、```model.chat``` を使って Qwen-VL-Chat モデルに質問をし、その回答を得ることができます。最初の質問では、ダイアログの履歴は空なので、```history=None``` を使用することに注意してください。 ```python response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 以下のような出力が期待されます: > The name of the movie in the poster is "Rebecca." これは、モデルが与えられた問題に正しく答えたことを示しています！ポスターをみると、映画のタイトルは確かに**レベッカ**です。 #### **複数ラウンドの質問回答** また、映画の監督は誰かなど、他の質問をモデルに続けることもできます。そのため、```history=history``` を使って、以前の会話の履歴を ``model.chat`` に渡します: ```python query = tokenizer.from_list_format([ {'text': 'Who directed this movie?'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 以下のような出力が期待されます: > The movie "Rebecca" was directed by Alfred Hitchcock. 
Once again, the model answered the question correctly! According to the poster, the director of the movie is Alfred Hitchcock.

### **Text Understanding**

Qwen-VL-Chat can also understand images that contain dense text. As shown below, the file ```assets/mm_tutorial/Hospital.jpg``` is a hospital sign with dense text.

![](assets/mm_tutorial/Hospital_Small.jpg)

We can ask where the various departments in the hospital are located. The dialogue history is empty, so we use ```history=None```.

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Hospital.jpg'},
    {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following:

> The Department of Otorhinolaryngology is located on the 4th floor.

You can also ask follow-up questions. In that case, pass ```history=history``` so that ```model.chat``` receives the previous conversation.

```python
query = tokenizer.from_list_format([
    {'text': 'Based on the photo, which floor is the Department of Surgery on?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following:

> The Department of Surgery is located on the 3rd floor.

### **Mathematical Reasoning with Diagram**

Using its diagram comprehension and mathematical reasoning capabilities, Qwen-VL-Chat can also handle more complex tasks! As shown below, the file ```assets/mm_tutorial/Menu.jpeg``` is a restaurant menu. Suppose we want to know how much two Salmon Burgers and three Meat Lover's Pizzas would cost.

![](assets/mm_tutorial/Menu.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Menu.jpeg'},
    {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? Think carefully step by step.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

```Think carefully step by step.``` is a common prompt that guides the model through a complex task one step at a time. If you have a complex task to complete, try using it to improve the model's accuracy. You should see output similar to the following:

> To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following:
>
> 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20
> 2.
For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36
>
> Therefore, the total cost would be $56.

### **Multi-Figure Reasoning and Chinese Input**

The previous examples demonstrated Qwen-VL-Chat's question answering for a single image and English questions. In fact, Qwen-VL-Chat is a multilingual model that supports Chinese input and multiple images! In the following example, we ask Qwen-VL-Chat in Chinese to compare photos of two cities, Chongqing and Beijing (```assets/mm_tutorial/Chongqing.jpeg``` and ```assets/mm_tutorial/Beijing.jpeg```):

![](assets/mm_tutorial/Chongqing_Small.jpeg)

![](assets/mm_tutorial/Beijing_Small.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Chongqing.jpeg'},
    {'image': 'assets/mm_tutorial/Beijing.jpeg'},
    # "Which two cities are shown in the two pictures above? Please compare them."
    {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'},
])
torch.manual_seed(5678)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following:

> 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。

**Note that comparing cities is a fairly subjective question, so the responses generated by the model may be highly random. If you do not set the random seed with ```torch.manual_seed(5678)```, the output will differ each time. Even with the random seed set, the results may differ from this tutorial because of differences in hardware and software environments.**

### **Grounding Capability**

The last section of the tutorial demonstrates the Qwen-VL-Chat model's ability to produce bounding boxes. Qwen-VL-Chat can frame a specified region of an image with a rectangular box according to your language description. That is a bit abstract, so let's look at an example. As shown below, the file ```assets/mm_tutorial/Shanghai.jpg``` is a photo of Shanghai. We start by asking the model to describe the image with an ordinary prompt.

![](assets/mm_tutorial/Shanghai_Small.jpeg)

```python
torch.manual_seed(1234)
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Shanghai.jpg'},
    {'text': '图里有啥'},  # "What is in the picture?"
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following:

> 图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。

Next, let's talk to the model with the prompt ```请给我框出图中上海环球金融中心和东方明珠``` ("Please frame the Shanghai World Financial Center and the Oriental Pearl Tower in the image"). Note that at this point you need to pass ```history=history``` so that ```model.chat``` receives the previous conversation.

```python
query = tokenizer.from_list_format([
    {'text': '请给我框出图中上海环球金融中心和东方明珠'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following:

```xml
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box>
```

The Qwen-VL-Chat model has no hands, but it does not refuse your request either. Instead, it outputs something that looks "strange". In fact, the output gives the locations of 上海环球金融中心 (Shanghai World Financial Center) and 东方明珠 (Oriental Pearl Tower) in a markup format. You can visualize it with the following code:

```python
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('Shanghai_Output.jpg')
```

The saved ```Shanghai_Output.jpg``` will look similar to the screenshot below:

![](assets/mm_tutorial/Shanghai_Output_Small.jpeg)

After that, you can keep chatting with Qwen-VL-Chat as before:

```python
query = tokenizer.from_list_format([
    {'text': '帮我写个这座城市的旅游计划'},  # "Write me a travel plan for this city."
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following:

> 好的，以下是一个简单的上海旅游计划：
>
> 第一天：
> 上午：抵达上海，前往酒店办理入住手续。
> 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。
> 下午：游览南京路步行街，购买特色礼品或品尝当地美食。
> 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。
>
> 第二天：
> 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。
> 中午：在科技馆附近的餐厅享用午餐。
> 下午：游览世纪公园，欣赏美景并放松身心。
> 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。
>
> 第三天：
> 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。
> 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。
> 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。
> 晚上：在酒店附近享用晚餐，然后离开上海。
>
> 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。

**Note that a travel plan is a fairly subjective question, so the responses generated by the model may be highly random. If you do not set the random seed with ```torch.manual_seed(1234)```, the output will differ each time. Even with the random seed set, the results may differ from this tutorial because of differences in hardware and software environments.**

================================================
FILE: TUTORIAL_ko.md
================================================
# Qwen-VL-Chat Tutorial

Qwen-VL-Chat is a general-purpose multimodal large language model that can perform a wide range of vision-language tasks. This tutorial presents a few concise examples that show Qwen-VL-Chat's capabilities in **visual question answering, text understanding, mathematical reasoning with diagrams, multi-figure reasoning, and grounding**. These examples are not the limit of what Qwen-VL-Chat can do; **you can explore its capabilities further by changing the input images and prompts**.

## Initializing the Qwen-VL-Chat model

Before using Qwen-VL-Chat, you first need to initialize its tokenizer and its model.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# If you expect the results to be reproducible, set a random seed.
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```

After running the code above, the ```tokenizer``` variable holds the tokenizer used by Qwen-VL-Chat and the ```model``` variable holds the Qwen-VL-Chat model. ```tokenizer``` preprocesses interleaved multimodal inputs, while ```model``` is the Qwen-VL-Chat model itself.

## Using Qwen-VL-Chat

### **Multi-round visual question answering**

#### **Asking the first question**

Let's look at a simple example. As shown below, the file ```assets/mm_tutorial/Rebecca_(1939_poster).jpeg``` is a poster for the 1940 film Rebecca.

![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg)

Let's ask Qwen-VL-Chat for the name of the movie on the poster. First, use ```tokenizer.from_list_format``` to preprocess and tokenize the input.

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'},
    {'text': 'What is the name of the movie in the poster?'},
])
```

Next, use ```model.chat``` to ask the Qwen-VL-Chat model the question and get its response. For the first question the dialogue history is empty, so we use ```history=None```.

```python
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following.

> The name of the movie in the poster is "Rebecca."

The model answered the question correctly. According to the poster, the title of the movie is indeed **Rebecca**.

#### **Multi-round question answering**

You can also keep asking the model other questions, such as who directed the movie.
For follow-up questions the dialogue history is no longer empty, so pass ```history=history``` to hand the previous conversation to ```model.chat```:

```python
query = tokenizer.from_list_format([
    {'text': 'Who directed this movie?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following.

> The movie "Rebecca" was directed by Alfred Hitchcock.

Once again, the model answered the question correctly. According to the poster, the director of the movie is Alfred Hitchcock.

### **Text Understanding**

Qwen-VL-Chat can also understand images that contain dense text. As shown below, the file ```assets/mm_tutorial/Hospital.jpg``` is a hospital sign with dense text.

![](assets/mm_tutorial/Hospital_Small.jpg)

We can ask where the various departments in the hospital are located. This is the first question, so there is no previous dialogue history and we use ```history=None```.

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Hospital.jpg'},
    {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following.

> The Department of Otorhinolaryngology is located on the 4th floor.

You can also ask follow-up questions. In that case, pass ```history=history``` so that ```model.chat``` receives the previous conversation.

```python
query = tokenizer.from_list_format([
    {'text': 'Based on the photo, which floor is the Department of Surgery on?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following.

> The Department of Surgery is located on the 3rd floor.

### **Mathematical Reasoning with Diagram**

Using its diagram comprehension and mathematical reasoning capabilities, Qwen-VL-Chat can also handle more complex tasks. As shown below, the file ```assets/mm_tutorial/Menu.jpeg``` is an image of a restaurant menu. Let's find out how much two Salmon Burgers and three Meat Lover's Pizzas would cost.

![](assets/mm_tutorial/Menu.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Menu.jpeg'},
    {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? Think carefully step by step.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

```Think carefully step by step.``` is a common prompt that guides the model through a complex task one step at a time. If you have a complex task to complete, try using it to improve the model's accuracy. You should see output similar to the following.

> To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following:
>
> 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20
> 2. For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36
>
> Therefore, the total cost would be $56.

### **Multi-Figure Reasoning and Chinese Input**

The previous examples demonstrated Qwen-VL-Chat's question answering for a single image and English questions. In fact, Qwen-VL-Chat is a multilingual model that supports Chinese input and multiple images. In the following example, we ask Qwen-VL-Chat in Chinese to compare photos of two cities, Chongqing and Beijing (```assets/mm_tutorial/Chongqing.jpeg``` and ```assets/mm_tutorial/Beijing.jpeg```).

![](assets/mm_tutorial/Chongqing_Small.jpeg)

![](assets/mm_tutorial/Beijing_Small.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Chongqing.jpeg'},
    {'image': 'assets/mm_tutorial/Beijing.jpeg'},
    # "Which two cities are shown in the two pictures above? Please compare them."
    {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'},
])
torch.manual_seed(5678)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following.

> 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。

**Note that comparing cities is a fairly subjective question, so the responses generated by the model may be highly random. If you do not set the random seed with ```torch.manual_seed(5678)```, the output will differ each time. Even with the random seed set, the results may differ from this tutorial because of differences in hardware and software environments.**

### **Grounding Capability**

The last section of the tutorial demonstrates the Qwen-VL-Chat model's ability to produce bounding boxes. Qwen-VL-Chat can frame a specified region of an image with a rectangular box according to your language description. That may be a bit abstract, so let's look at an example. As shown below, the file ```assets/mm_tutorial/Shanghai.jpg``` is a photo of Shanghai. We start by asking the model to describe the image with an ordinary prompt.
![](assets/mm_tutorial/Shanghai_Small.jpeg)

```python
torch.manual_seed(1234)
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Shanghai.jpg'},
    {'text': '图里有啥'},  # "What is in the picture?"
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following.

> 图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。

Next, let's talk to the model with the prompt ```请给我框出图中上海环球金融中心和东方明珠``` ("Please frame the Shanghai World Financial Center and the Oriental Pearl Tower in the image") and see what happens. At this point you need to pass ```history=history``` so that ```model.chat``` receives the previous conversation.

```python
query = tokenizer.from_list_format([
    {'text': '请给我框出图中上海环球金融中心和东方明珠'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following.

```xml
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box>
```

The Qwen-VL-Chat model has no hands, but it does not refuse your request either. Instead, it outputs something that looks "strange". In fact, the output gives the locations of 上海环球金融中心 (Shanghai World Financial Center) and 东方明珠 (Oriental Pearl Tower) in a markup format. You can visualize it with the following code.

```python
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('Shanghai_Output.jpg')
```

The saved ```Shanghai_Output.jpg``` will look similar to the screenshot below:

![](assets/mm_tutorial/Shanghai_Output_Small.jpeg)

After that, you can keep chatting with Qwen-VL-Chat as before.

```python
query = tokenizer.from_list_format([
    {'text': '帮我写个这座城市的旅游计划'},  # "Write me a travel plan for this city."
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following.
> 好的，以下是一个简单的上海旅游计划：
>
> 第一天：
> 上午：抵达上海，前往酒店办理入住手续。
> 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。
> 下午：游览南京路步行街，购买特色礼品或品尝当地美食。
> 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。
>
> 第二天：
> 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。
> 中午：在科技馆附近的餐厅享用午餐。
> 下午：游览世纪公园，欣赏美景并放松身心。
> 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。
>
> 第三天：
> 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。
> 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。
> 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。
> 晚上：在酒店附近享用晚餐，然后离开上海。
>
> 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。

**Note that a travel plan is a fairly subjective question, so the responses generated by the model may be highly random. If you do not set the random seed with ```torch.manual_seed(1234)```, the output will differ each time. Even with a fixed random seed, the results may differ from this tutorial because of differences in hardware and software environments.**

### Grounded Captioning

Qwen-VL can output the bounding box information of the subjects while captioning an image, as follows.

```python
img_url = 'assets/apple.jpeg'
query = tokenizer.from_list_format([
    {'image': img_url},
    {'text': 'Generate the caption in English with grounding:'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image is not None:
    image.save('apple.jpg')
```

The saved ```apple.jpg``` will look similar to the screenshot below.

<img src="assets/apple_r.jpeg" width="600"/>

#### How to get the caption without any box-like annotations

Sometimes you may want a response without any box-like annotations. In that case, you can reliably recover the clean text with the following post-processing.
```python
# response = '<ref> Two apples</ref><box>(302,257),(582,671)</box><box>(603,252),(878,642)</box> and<ref> a bowl</ref><box>(2,269),(304,674)</box>'
import re

# Keep the text inside <ref>...</ref> and drop any <box>/<quad> annotations that follow it.
clean_response = re.sub(r'<ref>(.*?)</ref>(?:<box>.*?</box>)*(?:<quad>.*?</quad>)*', r'\1', response).strip()
print(clean_response)
# clean_response = 'Two apples and a bowl'
```

================================================
FILE: TUTORIAL_zh.md
================================================
# Qwen-VL-Chat Tutorial

Qwen-VL-Chat is a general-purpose multimodal large language model, so it can handle a wide range of vision-language tasks. This tutorial gives a few concise examples of Qwen-VL-Chat's capabilities in **visual question answering, text understanding, mathematical reasoning with diagrams, multi-figure reasoning, and grounding** (marking the bounding box of a specified region of an image according to an instruction). Note that these examples are far from the limit of what Qwen-VL-Chat can do; **you can explore its capabilities further by changing the input images and prompts!**

## Initializing the Qwen-VL-Chat model

Before using Qwen-VL-Chat, you first need to initialize its tokenizer and its model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# If you expect the results to be reproducible, set a random seed.
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```

After running the code above, ```tokenizer``` holds the tokenizer used by Qwen-VL-Chat and ```model``` holds the Qwen-VL-Chat model. ```tokenizer``` tokenizes and preprocesses interleaved image-text inputs, while ```model``` is the Qwen-VL-Chat model itself.

## Using Qwen-VL-Chat

### **Multi-round visual question answering**

#### **The first question**

Let's start with the simplest example. As shown below, the file ```assets/mm_tutorial/Rebecca_(1939_poster).jpeg``` is a poster, released in 1939, for the 1940 film Rebecca.

![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg)

Let's ask Qwen-VL-Chat for the name of the movie on the poster. First, use ```tokenizer.from_list_format``` to tokenize and preprocess the interleaved image-text input:

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'},
    {'text': 'What is the name of the movie in the poster?'},
])
```

Next, use ```model.chat``` to ask the Qwen-VL-Chat model the question and get its response. Note that for the first question the dialogue history is empty, so we use ```history=None```.

```python
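response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following:

> The name of the movie in the poster is "Rebecca."

This shows that the model answered the question correctly! According to the poster, the title of the movie is indeed **Rebecca**.

#### **Multi-round question answering**

You can also keep asking the model questions, such as who directed the movie. For follow-up questions the dialogue history is no longer empty, so pass ```history=history``` to hand the previous conversation to ```model.chat```:

```python
query = tokenizer.from_list_format([
    {'text': 'Who directed this movie?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following:

> The movie "Rebecca" was directed by Alfred Hitchcock.

Once again, the model answered the question correctly! According to the poster, the director of the movie is Alfred Hitchcock.

### **Text Understanding**

Qwen-VL-Chat has some ability to understand images that contain dense text. As shown below, the file ```assets/mm_tutorial/Hospital.jpg``` is a hospital sign with dense text.

![](assets/mm_tutorial/Hospital_Small.jpg)

As before, we can ask the model where the various departments in the hospital are located. The dialogue history is empty, so we use ```history=None```.

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Hospital.jpg'},
    {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following:

> The Department of Otorhinolaryngology is located on the 4th floor.

You can also ask follow-up questions. In that case, pass ```history=history``` so that ```model.chat``` receives the previous conversation.

```python
query = tokenizer.from_list_format([
    {'text': 'Based on the photo, which floor is the Department of Surgery on?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following:

> The Department of Surgery is located on the 3rd floor.

### **Mathematical Reasoning with Diagram**

Using its diagram comprehension and mathematical reasoning capabilities, Qwen-VL-Chat can also handle more complex tasks! As shown below, the file ```assets/mm_tutorial/Menu.jpeg``` shows a restaurant menu. Suppose we want to know how much two Salmon Burgers and three Meat Lover's Pizzas would cost.

![](assets/mm_tutorial/Menu.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Menu.jpeg'},
    {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? Think carefully step by step.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

```Think carefully step by step.``` is a common prompt that guides the model through a complex task one step at a time. If your task is fairly complex, try using it to improve accuracy. You should see output similar to the following:

> To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following:
>
> 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20
> 2. For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36
>
> Therefore, the total cost would be $56.

### **Multi-Figure Reasoning and Chinese Input**

The previous examples mainly demonstrated Qwen-VL-Chat's question answering for a single image and English questions. In fact, Qwen-VL-Chat is a multilingual model that supports Chinese input, and it also accepts multiple images! In the following example, we ask Qwen-VL-Chat in Chinese to compare photos of two cities, Chongqing and Beijing (```assets/mm_tutorial/Chongqing.jpeg``` and ```assets/mm_tutorial/Beijing.jpeg```):

![](assets/mm_tutorial/Chongqing_Small.jpeg)

![](assets/mm_tutorial/Beijing_Small.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Chongqing.jpeg'},
    {'image': 'assets/mm_tutorial/Beijing.jpeg'},
    # "Which two cities are shown in the two pictures above? Please compare them."
    {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'},
])
torch.manual_seed(5678)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following:

> 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。

**Note that comparing cities is a fairly subjective question, so the responses generated by the model may be highly random. If you do not set the random seed with ```torch.manual_seed(5678)```, the output will differ each time. Even with the random seed set, the results may differ from this document because of differences in hardware and software environments.**

### **Grounding Capability**

Finally, we demonstrate the Qwen-VL-Chat model's ability to produce bounding boxes. Qwen-VL-Chat can frame a specified region of an image with a rectangular box according to your language description. That may sound a bit abstract, so let's look at an example. As shown below, the file ```assets/mm_tutorial/Shanghai.jpg``` is a photo of Shanghai. We start with an ordinary prompt and ask the model what is in the picture.

![](assets/mm_tutorial/Shanghai_Small.jpeg)

```python
torch.manual_seed(1234)
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Shanghai.jpg'},
    {'text': '图里有啥'},  # "What is in the picture?"
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should see output similar to the following:

>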
图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。

Next, let's talk to the model with the prompt ```请给我框出图中上海环球金融中心和东方明珠``` ("Please frame the Shanghai World Financial Center and the Oriental Pearl Tower in the image") and see what happens. Note that at this point you need to pass ```history=history``` so that ```model.chat``` receives the previous conversation.

```python
query = tokenizer.from_list_format([
    {'text': '请给我框出图中上海环球金融中心和东方明珠'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following:

```xml
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box>
```

The Qwen-VL-Chat model has no hands, but it did not refuse your request either; instead, it output something that looks "strange". In fact, it is not strange at all: the output gives the exact locations of 上海环球金融中心 (Shanghai World Financial Center) and 东方明珠 (Oriental Pearl Tower) in the image in a markup format. You can visualize it with the following code:

```python
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('Shanghai_Output.jpg')
```

The saved ```Shanghai_Output.jpg``` will look similar to the screenshot below:

![](assets/mm_tutorial/Shanghai_Output_Small.jpeg)

After that, you can keep chatting with Qwen-VL-Chat as usual:

```python
query = tokenizer.from_list_format([
    {'text': '帮我写个这座城市的旅游计划'},  # "Write me a travel plan for this city."
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should see output similar to the following:

> 好的，以下是一个简单的上海旅游计划：
>
> 第一天：
> 上午：抵达上海，前往酒店办理入住手续。
> 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。
> 下午：游览南京路步行街，购买特色礼品或品尝当地美食。
> 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。
>
> 第二天：
> 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。
> 中午：在科技馆附近的餐厅享用午餐。
> 下午：游览世纪公园，欣赏美景并放松身心。
> 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。
>
> 第三天：
> 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。
> 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。
> 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。
> 晚上：在酒店附近享用晚餐，然后离开上海。
>
> 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。

**Note that a travel plan is a fairly subjective question, so the responses generated by the model may be highly random. If you do not set the random seed with ```torch.manual_seed(1234)```, the output will differ each time. Even with the random seed set, the results may differ from this document because of differences in hardware and software environments.**

================================================
FILE: assets/mm_tutorial/TUTORIAL.ipynb
================================================

================================================
FILE: eval_mm/EVALUATION.md
================================================
# Evaluation

## Dependencies

```bash
pip install pycocoevalcap tqdm
```

## Image Caption

### 
[Flickr30K](https://bryanplummer.com/Flickr30kEntities/) <details> <summary>Data Preparation</summary> ```bash mkdir -p data/flickr && cd data/flickr # download images from https://bryanplummer.com/Flickr30kEntities/ # karpathy split annotations can be downloaded from https://cs.stanford.edu/people/karpathy/deepimagesent/ # download converted files wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/flickr30k/flickr30k_karpathy_test.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/flickr30k/flickr30k_karpathy_train.json cd ../.. ``` </details> <details> <summary>Evaluate</summary> ```bash ds="flickr" checkpoint=/PATH/TO/CHECKPOINT python -m torch.distributed.launch --use-env \ --nproc_per_node ${NPROC_PER_NODE:-8} \ --nnodes ${WORLD_SIZE:-1} \ --node_rank ${RANK:-0} \ --master_addr ${MASTER_ADDR:-127.0.0.1} \ --master_port ${MASTER_PORT:-12345} \ evaluate_caption.py \ --checkpoint $checkpoint \ --dataset $ds \ --batch-size 8 \ --num-workers 2 ``` </details> ### [Nocaps](https://nocaps.org/) <details> <summary>Data Preparation</summary> ```bash mkdir -p data/nocaps && cd data/nocaps # download images from https://nocaps.org/download # original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json # download converted files wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/nocaps/nocaps_val.json cd ../.. 
```

</details>

<details>
<summary>Evaluate</summary>

```bash
ds="nocaps"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_caption.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

## [COCO](https://cocodataset.org/)

> COCO images are used in VQAv2/OK-VQA/RefCOCO/RefCOCO+/RefCOCOg. Make sure you have downloaded the COCO images before evaluating on these benchmarks.

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/coco && cd data/coco

# download COCO images (train2014, val2014, test2015)
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip

cd ../..
```

</details>

## General VQA

### [VQAv2](https://visualqa.org/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/vqav2 && cd data/vqav2

# make sure you have downloaded COCO images

# download questions and annotations
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip && unzip v2_Annotations_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip && unzip v2_Questions_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip && unzip v2_Annotations_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip && unzip v2_Questions_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip && unzip v2_Questions_Test_mscoco.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_testdev.jsonl

# return to the starting directory, as the other data preparation steps do
cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
# evaluate on each split in turn
for ds in "vqav2_val" "vqav2_testdev"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_vqa.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### [OKVQA](https://okvqa.allenai.org/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/okvqa && cd data/okvqa

# download annotations and questions
wget https://okvqa.allenai.org/static/data/mscoco_train2014_annotations.json.zip && unzip mscoco_train2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip && unzip OpenEnded_mscoco_train2014_questions.json.zip
wget https://okvqa.allenai.org/static/data/mscoco_val2014_annotations.json.zip && unzip mscoco_val2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_val2014_questions.json.zip && unzip OpenEnded_mscoco_val2014_questions.json.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_val.jsonl

cd ../..
``` </details> <details> <summary>Evaluate</summary> ```bash ds="okvqa_val" checkpoint=/PATH/TO/CHECKPOINT python -m torch.distributed.launch --use-env \ --nproc_per_node ${NPROC_PER_NODE:-8} \ --nnodes ${WORLD_SIZE:-1} \ --node_rank ${RANK:-0} \ --master_addr ${MASTER_ADDR:-127.0.0.1} \ --master_port ${MASTER_PORT:-12345} \ evaluate_vqa.py \ --checkpoint $checkpoint \ --dataset $ds \ --batch-size 8 \ --num-workers 2 ``` </details> ### [TextVQA](https://textvqa.org/) <details> <summary>Data Preparation</summary> ```bash mkdir -p data/textvqa && cd data/textvqa # download images wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip train_val_images.zip # download annotations and questions wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json # download converted files wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_annotations.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_questions.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train.jsonl wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_annotations.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_questions.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val.jsonl cd ../.. 
``` </details> <details> <summary>Evaluate</summary> ```bash ds="textvqa_val" checkpoint=/PATH/TO/CHECKPOINT python -m torch.distributed.launch --use-env \ --nproc_per_node ${NPROC_PER_NODE:-8} \ --nnodes ${WORLD_SIZE:-1} \ --node_rank ${RANK:-0} \ --master_addr ${MASTER_ADDR:-127.0.0.1} \ --master_port ${MASTER_PORT:-12345} \ evaluate_vqa.py \ --checkpoint $checkpoint \ --dataset $ds \ --batch-size 8 \ --num-workers 2 ``` </details> ### [VizWiz](https://vizwiz.org/tasks-and-datasets/vqa/) <details> <summary>Data Preparation</summary> ```bash mkdir -p data/vizwiz && cd data/vizwiz # download images wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/train.zip && unzip train.zip wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/val.zip && unzip val.zip wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip && unzip test.zip # download annotations wget https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip && unzip Annotations.zip # download converted files # train wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_annotations.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_questions.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train.jsonl # val wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_annotations.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_questions.json wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val.jsonl # test wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_test.jsonl cd ../.. 
``` </details> <details> <summary>Evaluation</summary> ```bash # evaluate vqa score on vizwiz val split ds="vizwiz_val" checkpoint=/PATH/TO/CHECKPOINT python -m torch.distributed.launch --use-env \ --nproc_per_node ${NPROC_PER_NODE:-8} \ --nnodes ${WORLD_SIZE:-1} \ --node_rank ${RANK:-0} \ --master_addr ${MASTER_ADDR:-127.0.0.1} \ --master_port ${MASTER_PORT:-12345} \ evaluate_vqa.py \ --checkpoint $checkpoint \ --dataset $ds \ --batch-size 8 \ --num-workers 2 ``` </details> ### [DocVQA](https://www.docvqa.org/datasets) <details> <summary>Data Preparation</summary> ```bash mkdir -p data/docvqa && cd data/docvqa # download images and annotations from https://www.docvqa.org/datasets # download converted files # train wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/train.jsonl # val wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/val.jsonl # test wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/test.jsonl cd ../.. 
``` </details> <details> <summary>Evaluation</summary> ```bash # evaluate vqa score on docvqa val split ds="docvqa_val" checkpoint=/PATH/TO/CHECKPOINT python -m torch.distributed.launch --use-env \ --nproc_per_node ${NPROC_PER_NODE:-8} \ --nnodes ${WORLD_SIZE:-1} \ --node_rank ${RANK:-0} \ --master_addr ${MASTER_ADDR:-127.0.0.1} \ --master_port ${MASTER_PORT:-12345} \ evaluate_vqa.py \ --checkpoint $checkpoint \ --dataset $ds \ --batch-size 8 \ --num-workers 2 ``` </details> ### [ChartQA](https://aclanthology.org/2022.findings-acl.177/) <details> <summary>Data Preparation</summary> ```bash mkdir -p data/chartqa && cd data/chartqa # download images from https://drive.google.com/file/d/1Lm_w6zeET1Hyl_9ks6w5nEsgpoyPHalV/view # download converted files wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_human.jsonl wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_augmented.jsonl wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_human.jsonl wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_augmented.jsonl cd ../.. 
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "chartqa_test_human" "chartqa_test_augmented"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_vqa.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/gqa && cd data/gqa

# download images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip images.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/testdev_balanced.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/train_balanced.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="gqa_testdev"
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [OCRVQA](https://ocr-vqa.github.io/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/ocrvqa && cd data/ocrvqa

# download images by following instructions at https://ocr-vqa.github.io/kvqa_ProjectFiles/README.txt

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="ocrvqa_test"
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [AI2Diagram](https://allenai.org/data/diagrams)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/ai2diagram && cd data/ai2diagram

# download images
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="ai2diagram_test"
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [ScienceQA](https://github.com/lupantech/ScienceQA)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/scienceqa/images && cd data/scienceqa/images

# download images
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip && unzip test.zip

cd ..

# download original questions
# (use the raw file URL; the /blob/ page URL returns HTML rather than the JSON file)
wget https://github.com/lupantech/ScienceQA/raw/main/data/scienceqa/problems.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/scienceqa/scienceqa_test_img.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
ds="scienceqa_test_img"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_multiple_choice.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

## Referring Expression Comprehension

### RefCOCO

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/refcoco && cd data/refcoco

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testB.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluation</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcoco_val" "refcoco_testA" "refcoco_testB"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_grounding.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### RefCOCO+

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/refcoco+ && cd data/refcoco+

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testB.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluation</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcoco+_val" "refcoco+_testA" "refcoco+_testB"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_grounding.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### RefCOCOg

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/refcocog && cd data/refcocog

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcocog_val" "refcocog_test"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_grounding.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

================================================ FILE: eval_mm/evaluate_caption.py ================================================ import argparse import itertools import json import os import random import time from functools import partial import torch from pycocoevalcap.eval import COCOEvalCap from pycocotools.coco import COCO from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer ds_collections = { 'flickr': { 'train': 'data/flickr30k/flickr30k_karpathy_test.json', 'test': 'data/flickr30k/flickr30k_karpathy_test.json', }, 'nocaps': { 'train': '', 'test': 'data/nocaps/nocaps_val.json', }, } class CaptionDataset(torch.utils.data.Dataset): def 
__init__(self, train, test, prompt, few_shot=0): self.images = json.load(open(test))['images'] self.prompt = prompt self.few_shot = few_shot if few_shot > 0: self.train = json.load(open(train))['annotations'] def __len__(self): return len(self.images) def __getitem__(self, idx): image_id, image_path = self.images[idx]['id'], self.images[idx][ 'image'] few_shot_prompt = '' if self.few_shot > 0: few_shot_samples = random.sample(self.train, self.few_shot) for sample in few_shot_samples: few_shot_prompt += self.prompt.format( sample['image']) + f" {sample['caption']}" return { 'image_id': image_id, 'input_text': few_shot_prompt + self.prompt.format(image_path) } def collate_fn(inputs, tokenizer): image_ids = [_['image_id'] for _ in inputs] input_texts = [_['input_text'] for _ in inputs] input_tokens = tokenizer(input_texts, return_tensors='pt', padding='longest') return image_ids, input_tokens.input_ids, input_tokens.attention_mask class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) parser.add_argument('--few-shot', type=int, 
default=0) parser.add_argument('--seed', type=int, default=0) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) prompt = '<img>{}</img>Describe the image in English:' model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) tokenizer.padding_side = 'left' tokenizer.pad_token_id = tokenizer.eod_id random.seed(args.seed) dataset = CaptionDataset( train=ds_collections[args.dataset]['train'], test=ds_collections[args.dataset]['test'], prompt=prompt, few_shot=args.few_shot, ) coco_karpathy_test_loader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, tokenizer=tokenizer), ) image_ids = [] captions = [] for _, (ids, input_ids, attention_mask) in tqdm(enumerate(coco_karpathy_test_loader)): pred = model.generate( input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda(), do_sample=False, num_beams=1, max_new_tokens=30, min_new_tokens=8, length_penalty=0, num_return_sequences=1, use_cache=True, pad_token_id=tokenizer.eod_id, eos_token_id=tokenizer.eod_id, ) image_ids.extend(ids) captions.extend([ tokenizer.decode(_[input_ids.size(1):].cpu(), skip_special_tokens=True).strip() for _ in pred ]) torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_ids = [None for _ in range(world_size)] merged_captions = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_ids, image_ids) torch.distributed.all_gather_object(merged_captions, captions) merged_ids = [_ for _ in itertools.chain.from_iterable(merged_ids)] merged_captions = [ _ for _ in 
itertools.chain.from_iterable(merged_captions) ] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") results = [] for image_id, caption in zip(merged_ids, merged_captions): results.append({ 'image_id': int(image_id), 'caption': caption, }) time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) results_file = f'{args.dataset}_{time_prefix}.json' json.dump(results, open(results_file, 'w')) coco = COCO(ds_collections[args.dataset]['test']) coco_result = coco.loadRes(results_file) coco_eval = COCOEvalCap(coco, coco_result) coco_eval.evaluate() print(coco_eval.eval.items()) torch.distributed.barrier() ================================================ FILE: eval_mm/evaluate_grounding.py ================================================ import argparse import itertools import json import os import re from functools import partial import torch from torchvision.ops.boxes import box_area from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer ds_collections = { 'refcoco_val': 'data/refcoco/refcoco_val.jsonl', 'refcoco_testA': 'data/refcoco/refcoco_testA.jsonl', 'refcoco_testB': 'data/refcoco/refcoco_testB.jsonl', 'refcoco+_val': 'data/refcoco+/refcoco+_val.jsonl', 'refcoco+_testA': 'data/refcoco+/refcoco+_testA.jsonl', 'refcoco+_testB': 'data/refcoco+/refcoco+_testB.jsonl', 'refcocog_val': 'data/refcocog/refcocog_val.jsonl', 'refcocog_test': 'data/refcocog/refcocog_test.jsonl', } def box_iou(boxes1, boxes2): area1 = box_area(boxes1) area2 = box_area(boxes2) lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] wh = (rb - lt).clamp(min=0) # [N,M,2] inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] union = area1[:, None] + area2 - inter iou = inter / union return iou, union def collate_fn(batches, tokenizer): texts = [_['text'] for _ in batches] bboxes = [_['bbox'] for _ in batches] hws = [_['hw'] for _ in batches] input_ids = tokenizer(texts, 
return_tensors='pt', padding='longest') return input_ids.input_ids, input_ids.attention_mask, bboxes, hws class RefCOCODataset(torch.utils.data.Dataset): def __init__(self, test, tokenizer, prompt): self.datas = open(test).readlines() self.tokenizer = tokenizer self.prompt = prompt def __len__(self): return len(self.datas) def __getitem__(self, idx): data = json.loads(self.datas[idx].strip()) image = data['image'] text = data['sent'] bbox = data['bbox'] w, h = data['width'], data['height'] return { 'text': self.prompt.format(image, text), 'bbox': bbox, 'hw': (h, w), } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) 
    tokenizer.padding_side = 'left'
    tokenizer.pad_token_id = tokenizer.eod_id

    prompt = '<img>{}</img><ref>{}</ref><box>'

    dataset = RefCOCODataset(test=ds_collections[args.dataset],
                             tokenizer=tokenizer,
                             prompt=prompt)

    dataloader = torch.utils.data.DataLoader(
        dataset=dataset,
        sampler=InferenceSampler(len(dataset)),
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        pin_memory=True,
        drop_last=True,
        collate_fn=partial(collate_fn, tokenizer=tokenizer),
    )

    outputs = []
    for _, (input_ids, attention_mask, bboxes,
            hws) in tqdm(enumerate(dataloader)):
        pred = model.generate(
            input_ids=input_ids.cuda(),
            attention_mask=attention_mask.cuda(),
            do_sample=False,
            num_beams=1,
            max_new_tokens=28,
            min_new_tokens=10,
            length_penalty=1,
            num_return_sequences=1,
            use_cache=True,
            pad_token_id=tokenizer.eod_id,
            eos_token_id=tokenizer.eod_id,
        )
        # Decode only the newly generated tokens (skip the prompt).
        answers = [
            tokenizer.decode(_[input_ids.size(1):].cpu(),
                             skip_special_tokens=True) for _ in pred
        ]

        for bbox, hw, answer in zip(bboxes, hws, answers):
            outputs.append({
                'answer': answer,
                'gt_bbox': bbox,
                'hw': hw,
            })

    torch.distributed.barrier()

    world_size = torch.distributed.get_world_size()
    merged_outputs = [None for _ in range(world_size)]
    torch.distributed.all_gather_object(merged_outputs, outputs)
    merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)]
    # Matches a box written as "(x1,y1),(x2,y2)"; each group captures one "x,y" pair.
    PATTERN = re.compile(r'\((.*?)\),\((.*?)\)')

    if torch.distributed.get_rank() == 0:
        correct = total_cnt = 0
        for i, output in enumerate(merged_outputs):
            predict_bbox = re.findall(PATTERN, output['answer'])
            try:
                if ',' not in predict_bbox[0][0] or ',' not in predict_bbox[0][
                        1]:
                    predict_bbox = (0., 0., 0., 0.)
                else:
                    x1, y1 = [
                        float(tmp) for tmp in predict_bbox[0][0].split(',')
                    ]
                    x2, y2 = [
                        float(tmp) for tmp in predict_bbox[0][1].split(',')
                    ]
                    predict_bbox = (x1, y1, x2, y2)
            except Exception:
                # No match or unparsable coordinates: fall back to an empty box.
                predict_bbox = (0., 0., 0., 0.)
target_bbox = torch.tensor(output['gt_bbox'], dtype=torch.float32).view(-1, 4) predict_bbox = torch.tensor(predict_bbox, dtype=torch.float32).view(-1, 4) / 999 predict_bbox[:, 0::2] *= output['hw'][1] predict_bbox[:, 1::2] *= output['hw'][0] iou, _ = box_iou(predict_bbox, target_bbox) iou = iou.item() total_cnt += 1 if iou >= 0.5: correct += 1 print(f"Evaluating {args.dataset} ...") print(f'Precision @ 1: {correct / total_cnt} \n') torch.distributed.barrier() ================================================ FILE: eval_mm/evaluate_multiple_choice.py ================================================ import argparse import itertools import json import os from functools import partial import torch from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer multiple_choices = ['A', 'B', 'C', 'D', 'E'] ds_collections = { 'scienceqa_test_img': { 'test': 'data/scienceqa/scienceqa_test_img.jsonl', } } def collate_fn(batches, pad_token_id): input_tokens = [_['input_tokens'] for _ in batches] target_lengths = [_['target_lengths'] for _ in batches] answers = [_['answer'] for _ in batches] chunk_sizes = [len(_) for _ in input_tokens] input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)] max_lengths = max([len(_) for _ in input_tokens]) input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _ for _ in input_tokens] input_tokens = torch.LongTensor(input_tokens) attention_mask = 1 - input_tokens.eq(pad_token_id).float() return input_tokens, attention_mask, target_lengths, answers, chunk_sizes class MultipleChoiceDataste(torch.utils.data.Dataset): def __init__(self, test, prompt, tokenizer): self.datas = open(test).readlines() self.prompt = prompt self.tokenizer = tokenizer def __len__(self): return len(self.datas) def __getitem__(self, idx): data = json.loads(self.datas[idx].strip()) image = data['image'] hint = data['hint'] if data['hint'] else 'N/A' question = data['question'] choices = data['choices'] choice_list = [] for i, c in 
enumerate(choices): choice_list.append('{}. {}'.format(multiple_choices[i], c)) choice_txt = '\n'.join(choice_list) prompt = self.prompt.format(image, hint, question, choice_txt) prompt_tokens = self.tokenizer(prompt).input_ids target_tokens = [ self.tokenizer(' ' + _).input_ids for _ in multiple_choices[:len(choices)] ] return { 'input_tokens': [prompt_tokens + _ for _ in target_tokens], 'target_lengths': [len(_) for _ in target_tokens], 'answer': data['answer'], } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) prompt = '<img>{}</img>Context: {}\nQuestion: {}\nOptions: {}\nAnswer:' dataset = 
MultipleChoiceDataste(test=ds_collections[args.dataset]['test'], prompt=prompt, tokenizer=tokenizer) dataloader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id), ) results = [] with torch.no_grad(): for _, (input_tokens, attention_mask, target_lengths, answer, chunk_sizes) in tqdm(enumerate(dataloader)): outputs = model( input_ids=input_tokens[:, :-1].cuda(), attention_mask=attention_mask[:, :-1].cuda(), return_dict=True, ) losses = torch.nn.functional.cross_entropy(outputs.logits.permute( 0, 2, 1), input_tokens[:, 1:].cuda(), reduction='none') losses = losses.split(chunk_sizes, dim=0) for loss, target_length, answer in zip(losses, target_lengths, answer): target_loss = loss.mean(-1) for _ in range(len(target_length)): target_loss[_] = loss[_, -target_length[_]:].mean() pred = target_loss.argmin().item() if pred == answer: results.append(1) else: results.append(0) torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_results = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_results, results) merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") print(f'Acc@1: {sum(merged_results) / len(merged_results)}') torch.distributed.barrier() ================================================ FILE: eval_mm/evaluate_vqa.py ================================================ import argparse import itertools import json import os import random import time from functools import partial from typing import Optional import torch from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer from vqa import VQA from vqa_eval import VQAEval ds_collections = { 'vqav2_val': { 'train': 'data/vqav2/vqav2_train.jsonl', 'test': 
'data/vqav2/vqav2_val.jsonl', 'question': 'data/vqav2/v2_OpenEnded_mscoco_val2014_questions.json', 'annotation': 'data/vqav2/v2_mscoco_val2014_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'vqav2_testdev': { 'train': 'data/vqav2/vqav2_train.jsonl', 'test': 'data/vqav2/vqav2_testdev.jsonl', 'metric': None, 'max_new_tokens': 10, }, 'okvqa_val': { 'train': 'data/okvqa/okvqa_train.jsonl', 'test': 'data/okvqa/okvqa_val.jsonl', 'question': 'data/okvqa/OpenEnded_mscoco_val2014_questions.json', 'annotation': 'data/okvqa/mscoco_val2014_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'textvqa_val': { 'train': 'data/textvqa/textvqa_train.jsonl', 'test': 'data/textvqa/textvqa_val.jsonl', 'question': 'data/textvqa/textvqa_val_questions.json', 'annotation': 'data/textvqa/textvqa_val_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'vizwiz_val': { 'train': 'data/vizwiz/vizwiz_train.jsonl', 'test': 'data/vizwiz/vizwiz_val.jsonl', 'question': 'data/vizwiz/vizwiz_val_questions.json', 'annotation': 'data/vizwiz/vizwiz_val_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'vizwiz_test': { 'train': 'data/vizwiz/vizwiz_train.jsonl', 'test': 'data/vizwiz/vizwiz_test.jsonl', 'metric': None, 'max_new_tokens': 10, }, 'docvqa_val': { 'train': 'data/docvqa/train.jsonl', 'test': 'data/docvqa/val.jsonl', 'annotation': 'data/docvqa/val/val_v1.0.json', 'metric': 'anls', 'max_new_tokens': 100, }, 'docvqa_test': { 'train': 'data/docvqa/train.jsonl', 'test': 'data/docvqa/test.jsonl', 'metric': None, 'max_new_tokens': 100, }, 'chartqa_test_human': { 'train': 'data/chartqa/train_human.jsonl', 'test': 'data/chartqa/test_human.jsonl', 'metric': 'relaxed_accuracy', 'max_new_tokens': 100, }, 'chartqa_test_augmented': { 'train': 'data/chartqa/train_augmented.jsonl', 'test': 'data/chartqa/test_augmented.jsonl', 'metric': 'relaxed_accuracy', 'max_new_tokens': 100, }, 'gqa_testdev': { 'train': 'data/gqa/train.jsonl', 'test': 
'data/gqa/testdev_balanced.jsonl', 'metric': 'accuracy', 'max_new_tokens': 10, }, 'ocrvqa_val': { 'train': 'data/ocrvqa/ocrvqa_train.jsonl', 'test': 'data/ocrvqa/ocrvqa_val.jsonl', 'metric': 'accuracy', 'max_new_tokens': 100, }, 'ocrvqa_test': { 'train': 'data/ocrvqa/ocrvqa_train.jsonl', 'test': 'data/ocrvqa/ocrvqa_test.jsonl', 'metric': 'accuracy', 'max_new_tokens': 100, }, 'ai2diagram_test': { 'train': 'data/ai2diagram/train.jsonl', 'test': 'data/ai2diagram/test.jsonl', 'metric': 'accuracy', 'max_new_tokens': 10, } } # https://github.com/google-research/pix2struct/blob/main/pix2struct/metrics.py#L81 def relaxed_correctness(target: str, prediction: str, max_relative_change: float = 0.05) -> bool: """Calculates relaxed correctness. The correctness tolerates certain error ratio defined by max_relative_change. See https://arxiv.org/pdf/2203.10244.pdf, end of section 5.1: “Following Methani et al. (2020), we use a relaxed accuracy measure for the numeric answers to allow a minor inaccuracy that may result from the automatic data extraction process. We consider an answer to be correct if it is within 5% of the gold answer. For non-numeric answers, we still need an exact match to consider an answer to be correct.” Args: target: Target string. prediction: Predicted string. max_relative_change: Maximum relative change. Returns: Whether the prediction was correct given the specified tolerance. """ def _to_float(text: str) -> Optional[float]: try: if text.endswith('%'): # Convert percentages to floats. 
return float(text.rstrip('%')) / 100.0 else: return float(text) except ValueError: return None prediction_float = _to_float(prediction) target_float = _to_float(target) if prediction_float is not None and target_float: relative_change = abs(prediction_float - target_float) / abs(target_float) return relative_change <= max_relative_change else: return prediction.lower() == target.lower() def evaluate_relaxed_accuracy(entries): scores = [] for elem in entries: if isinstance(elem['annotation'], str): elem['annotation'] = [elem['annotation']] score = max([ relaxed_correctness(elem['answer'].strip(), ann) for ann in elem['annotation'] ]) scores.append(score) return sum(scores) / len(scores) def evaluate_exact_match_accuracy(entries): scores = [] for elem in entries: if isinstance(elem['annotation'], str): elem['annotation'] = [elem['annotation']] score = max([ (1.0 if (elem['answer'].strip().lower() == ann.strip().lower()) else 0.0) for ann in elem['annotation'] ]) scores.append(score) return sum(scores) / len(scores) def collate_fn(batches, tokenizer): questions = [_['question'] for _ in batches] question_ids = [_['question_id'] for _ in batches] annotations = [_['annotation'] for _ in batches] input_ids = tokenizer(questions, return_tensors='pt', padding='longest') return question_ids, input_ids.input_ids, input_ids.attention_mask, annotations class VQADataset(torch.utils.data.Dataset): def __init__(self, train, test, prompt, few_shot): self.test = open(test).readlines() self.prompt = prompt self.few_shot = few_shot if few_shot > 0: self.train = open(train).readlines() def __len__(self): return len(self.test) def __getitem__(self, idx): data = json.loads(self.test[idx].strip()) image, question, question_id, annotation = data['image'], data[ 'question'], data['question_id'], data.get('answer', None) few_shot_prompt = '' if self.few_shot > 0: few_shot_samples = random.sample(self.train, self.few_shot) for sample in few_shot_samples: sample = json.loads(sample.strip()) 
few_shot_prompt += self.prompt.format( sample['image'], sample['question']) + f" {sample['answer']}" return { 'question': few_shot_prompt + self.prompt.format(image, question), 'question_id': question_id, 'annotation': annotation } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) parser.add_argument('--few-shot', type=int, default=0) parser.add_argument('--seed', type=int, default=0) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) tokenizer.padding_side = 'left' tokenizer.pad_token_id = tokenizer.eod_id prompt = '<img>{}</img>{} Answer:' random.seed(args.seed) dataset = VQADataset( train=ds_collections[args.dataset]['train'], 
test=ds_collections[args.dataset]['test'], prompt=prompt, few_shot=args.few_shot, ) dataloader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, tokenizer=tokenizer), ) outputs = [] for _, (question_ids, input_ids, attention_mask, annotations) in tqdm(enumerate(dataloader)): pred = model.generate( input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda(), do_sample=False, num_beams=1, max_new_tokens=ds_collections[args.dataset]['max_new_tokens'], min_new_tokens=1, length_penalty=1, num_return_sequences=1, output_hidden_states=True, use_cache=True, pad_token_id=tokenizer.eod_id, eos_token_id=tokenizer.eod_id, ) answers = [ tokenizer.decode(_[input_ids.size(1):].cpu(), skip_special_tokens=True).strip() for _ in pred ] for question_id, answer, annotation in zip(question_ids, answers, annotations): if args.dataset in ['vqav2_val', 'vqav2_testdev', 'okvqa_val', 'textvqa_val', 'vizwiz_val']: outputs.append({ 'question_id': question_id, 'answer': answer, }) elif args.dataset in ['docvqa_val', 'infographicsvqa', 'gqa_testdev', 'ocrvqa_val', 'ocrvqa_test']: outputs.append({ 'questionId': question_id, 'answer': answer, 'annotation': annotation, }) elif args.dataset in ['ai2diagram_test']: outputs.append({ 'image': question_id, 'answer': answer, 'annotation': annotation, }) elif args.dataset in ['chartqa_test_human', 'chartqa_test_augmented']: outputs.append({ 'answer': answer, 'annotation': annotation, }) elif args.dataset in ['docvqa_test']: outputs.append({ 'questionId': question_id, 'answer': answer, }) elif args.dataset in ['vizwiz_test']: outputs.append({ 'image': question_id, 'answer': answer, }) else: raise NotImplementedError torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_outputs = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_outputs, 
json.dumps(outputs)) merged_outputs = [json.loads(_) for _ in merged_outputs] merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) results_file = f'{args.dataset}_{time_prefix}_fs{args.few_shot}_s{args.seed}.json' json.dump(merged_outputs, open(results_file, 'w'), ensure_ascii=False) if ds_collections[args.dataset]['metric'] == 'vqa_score': vqa = VQA(ds_collections[args.dataset]['annotation'], ds_collections[args.dataset]['question']) results = vqa.loadRes( resFile=results_file, quesFile=ds_collections[args.dataset]['question']) vqa_scorer = VQAEval(vqa, results, n=2) vqa_scorer.evaluate() print(vqa_scorer.accuracy) elif ds_collections[args.dataset]['metric'] == 'anls': json.dump(merged_outputs, open(results_file, 'w'), ensure_ascii=False) print('python infographicsvqa_eval.py -g ' + ds_collections[args.dataset]['annotation'] + ' -s ' + results_file) os.system('python infographicsvqa_eval.py -g ' + ds_collections[args.dataset]['annotation'] + ' -s ' + results_file) elif ds_collections[args.dataset]['metric'] == 'relaxed_accuracy': print({ 'relaxed_accuracy': evaluate_relaxed_accuracy(merged_outputs) }) elif ds_collections[args.dataset]['metric'] == 'accuracy': if 'gqa' in args.dataset: for entry in merged_outputs: response = entry['answer'] response = response.strip().split('.')[0].split( ',')[0].split('!')[0].lower() if 'is ' in response: response = response.split('is ')[1] if 'are ' in response: response = response.split('are ')[1] if 'a ' in response: response = response.split('a ')[1] if 'an ' in response: response = response.split('an ')[1] if 'the ' in response: response = response.split('the ')[1] if ' of' in response: response = response.split(' of')[0] response = response.strip() entry['answer'] = response print({'accuracy': evaluate_exact_match_accuracy(merged_outputs)}) 
torch.distributed.barrier() ================================================ FILE: eval_mm/infographicsvqa_eval.py ================================================ # This file can be downloaded from: https://www.docvqa.org/datasets/infographicvqa and https://rrc.cvc.uab.es/?ch=17&com=introduction import os, json import argparse question_ids_to_exclude = [] # answer_types = {'image span': 'Image-Span', 'question span': 'Question-Span', 'multiple spans': 'Multi-Span', 'non span': 'None span', 'list': 'List'} answer_types = {'image span': 'Image-Span', 'question span': 'Question-Span', 'multiple spans': 'Multi-Span', 'non span': 'None span'} evidence_types = {'table/list': 'Table/list', 'textual': 'Text', 'photo/pciture/visual_objects': 'Visual/Layout', 'figure': 'Figure', 'map': 'Map'} reasoning_requirements = {'comparison': 'Sorting', 'arithmetic': 'Arithmetic', 'counting':'Counting'} def save_json(file_path, data): with open(file_path, 'w+') as json_file: json.dump(data, json_file) def levenshtein_distance(s1, s2): if len(s1) > len(s2): s1, s2 = s2, s1 distances = range(len(s1) + 1) for i2, c2 in enumerate(s2): distances_ = [i2+1] for i1, c1 in enumerate(s1): if c1 == c2: distances_.append(distances[i1]) else: distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1]))) distances = distances_ return distances[-1] def validate_data(gtFilePath, submFilePath): """ Method validate_data: validates that all files in the results folder are correct (have the correct name contents). Validates also that there are no missing files in the folder. 
If some error detected, the method raises the error """ gtJson = json.load(open(gtFilePath,'rb')); submJson = json.load(open(submFilePath,'rb')); if not 'data' in gtJson: raise Exception("The GT file is not valid (no data key)") if not 'dataset_name' in gtJson: raise Exception("The GT file is not valid (no dataset_name key)") if isinstance(submJson, list) == False : raise Exception("The Det file is not valid (root item must be an array)") if len(submJson) != len(gtJson['data']) : raise Exception("The Det file is not valid (invalid number of answers. Expected:" + str(len(gtJson['data'])) + " Found:" + str(len(submJson)) + ")") gtQuestions = sorted([r['questionId'] for r in gtJson['data']]) res_id_to_index = {int(r['questionId']): ix for ix, r in enumerate(submJson)} detQuestions = sorted([r['questionId'] for r in submJson]) if( (gtQuestions == detQuestions) == False ): raise Exception("The Det file is not valid. Question IDs must much GT") for gtObject in gtJson['data']: try: q_id = int(gtObject['questionId']); res_ix = res_id_to_index[q_id]; except: raise Exception("The Det file is not valid. Question " + str(gtObject['questionId']) + " not present") else: detObject = submJson[res_ix]; # if detObject['questionId'] != gtObject['questionId'] : # raise Exception("Answer #" + str(i) + " not valid (invalid question ID. Expected:" + str(gtObject['questionId']) + "Found:" + detObject['questionId'] + ")") if not 'answer' in detObject: raise Exception("Question " + str(gtObject['questionId']) + " not valid (no answer key)") if isinstance(detObject['answer'], list) == True : raise Exception("Question " + str(gtObject['questionId']) + " not valid (answer key has to be a single string)") def evaluate_method(gtFilePath, submFilePath, evaluationParams): """ Method evaluate_method: evaluate method and returns the results Results. Dictionary with the following values: - method (required) Global method metrics. 
Ex: { 'Precision':0.8,'Recall':0.9 } - samples (optional) Per sample metrics. Ex: {'sample1' : { 'Precision':0.8,'Recall':0.9 } , 'sample2' : { 'Precision':0.8,'Recall':0.9 } """ show_scores_per_answer_type = evaluationParams.answer_types gtJson = json.load(open(gtFilePath,'rb')); submJson = json.load(open(submFilePath,'rb')); res_id_to_index = {int(r['questionId']): ix for ix, r in enumerate(submJson)} perSampleMetrics = {} totalScore = 0 row = 0 if show_scores_per_answer_type: answerTypeTotalScore = {x:0 for x in answer_types.keys()} answerTypeNumQuestions = {x:0 for x in answer_types.keys()} evidenceTypeTotalScore = {x:0 for x in evidence_types.keys()} evidenceTypeNumQuestions = {x:0 for x in evidence_types.keys()} reasoningTypeTotalScore = {x:0 for x in reasoning_requirements.keys()} reasoningTypeNumQuestions = {x:0 for x in reasoning_requirements.keys()} for gtObject in gtJson['data']: q_id = int(gtObject['questionId']); res_ix = res_id_to_index[q_id]; detObject = submJson[res_ix]; if q_id in question_ids_to_exclude: question_result = 0 info = 'Question EXCLUDED from the result' else: info = '' values = [] for answer in gtObject['answers']: # preprocess both the answers - gt and prediction gt_answer = ' '.join(answer.strip().lower().split()) det_answer = ' '.join(detObject['answer'].strip().lower().split()) #dist = levenshtein_distance(answer.lower(), detObject['answer'].lower()) dist = levenshtein_distance(gt_answer,det_answer) length = max( len(answer.upper()), len(detObject['answer'].upper()) ) values.append( 0.0 if length == 0 else float(dist) / float(length) ) question_result = 1 - min(values) if (question_result < evaluationParams.anls_threshold) : question_result = 0 totalScore += question_result if show_scores_per_answer_type: for q_type in gtObject["answer_type"]: answerTypeTotalScore[q_type] += question_result answerTypeNumQuestions[q_type] += 1 for q_type in gtObject["evidence"]: evidenceTypeTotalScore[q_type] += question_result 
evidenceTypeNumQuestions[q_type] += 1 for q_type in gtObject["operation/reasoning"]: reasoningTypeTotalScore[q_type] += question_result reasoningTypeNumQuestions[q_type] += 1 perSampleMetrics[str(gtObject['questionId'])] = { 'score':question_result, 'question':gtObject['question'], 'gt':gtObject['answers'], 'det':detObject['answer'], 'info': info } row = row + 1 methodMetrics = { 'score': 0 if len(gtJson['data']) == 0 else totalScore/ (len(gtJson['data']) - len(question_ids_to_exclude) ) } answer_types_scores = {} evidence_types_scores = {} operation_types_scores = {} if show_scores_per_answer_type: for a_type, ref in answer_types.items(): answer_types_scores[ref] = 0 if len(gtJson['data']) == 0 else answerTypeTotalScore[a_type] / (answerTypeNumQuestions[a_type] ) for e_type, ref in evidence_types.items(): evidence_types_scores[ref] = 0 if len(gtJson['data']) == 0 else evidenceTypeTotalScore[e_type] / (evidenceTypeNumQuestions[e_type] ) for r_type, ref in reasoning_requirements.items(): operation_types_scores[ref] = 0 if len(gtJson['data']) == 0 else reasoningTypeTotalScore[r_type] / (reasoningTypeNumQuestions[r_type] ) resDict = { 'result': methodMetrics, 'scores_by_types': {'answer_types': answer_types_scores, 'evidence_types': evidence_types_scores, 'operation_types': operation_types_scores}, 'per_sample_result':perSampleMetrics } return resDict; def display_results(results, show_answer_types): print("\nOverall ANLS: {:2.4f}".format(results['result']['score'])) if show_answer_types: print("\nAnswer types:") for a_type in answer_types.values(): print("\t{:12s} {:2.4f}".format(a_type, results['scores_by_types']['answer_types'][a_type])) print("\nEvidence types:") for e_type in evidence_types.values(): print("\t{:12s} {:2.4f}".format(e_type, results['scores_by_types']['evidence_types'][e_type])) print("\nOperation required:") for r_type in reasoning_requirements.values(): print("\t{:12s} {:2.4f}".format(r_type, 
                                            results['scores_by_types']['operation_types'][r_type]))


if __name__=='__main__':
    parser = argparse.ArgumentParser(description="InfographVQA evaluation script.")

    parser.add_argument('-g', '--ground_truth', type=str, help="Path of the Ground Truth file.", required=True)
    parser.add_argument('-s', '--submission_file', type=str, help="Path of your method's results file.", required=True)

    parser.add_argument('-t', '--anls_threshold', type=float, default=0.5, help="ANLS threshold to use (See Scene-Text VQA paper for more info.).", required=False)
    # argparse's type=bool treats any non-empty string (including 'False') as True,
    # so the flag value is parsed explicitly instead.
    parser.add_argument('-a', '--answer_types', type=lambda v: str(v).lower() in ('1', 'true', 'yes'), default=False, help="Score break down by answer types (special gt file required).", required=False)
    parser.add_argument('-o', '--output', type=str, help="Path to a directory where to copy the file 'results.json' that contains per-sample results.", required=False)

    args = parser.parse_args()

    # Validate the format of ground truth and submission files.
    validate_data(args.ground_truth, args.submission_file)

    # Evaluate method
    results = evaluate_method(args.ground_truth, args.submission_file, args)

    display_results(results, args.answer_types)

    if args.output:
        output_dir = args.output

        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        resultsOutputname = os.path.join(output_dir, 'results.json')
        save_json(resultsOutputname, results)

        print("All results including per-sample results have been correctly saved!")

================================================
FILE: eval_mm/mmbench/MMBENCH.md
================================================
# MMBench Evaluation

## Data

```bash
/cpfs01/shared/public/shusheng.yss/workspace/23082502_qwenvl_eval_test/eval_mm/data/mmbench
```

## Dev

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds=mmbench_dev_20230712
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_multiple_choice_mmbench.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 2 \
    --num-workers 2

# the results will be saved to mmbench_dev_20230712.json

# without consistency constraint

python mmbench_evaluation.py

# with consistency constraint

python mmbench_evaluation_tricky.py
```

## Test

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds=mmbench_test_20230712
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_multiple_choice_mmbench.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 2 \
    --num-workers 2

# the results will be saved to mmbench_test_20230712.json

# convert to submission format with consistency constraint

python mmbench_predict_to_submission.py
```

================================================
FILE: eval_mm/mmbench/evaluate_multiple_choice_mmbench.py
================================================
import argparse
import itertools
import json
import os
from functools import partial

import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

multiple_choices = ['A', 'B', 'C', 'D', 'E']

ds_collections = {
    'mmbench_dev_20230712': {
        'test': 'data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.jsonl',
    },
    'mmbench_test_20230712': {
        'test': 'data/mmbench/mmbench_test_20230712/mmbench_test_20230712.jsonl',
    }
}


def collate_fn(batches, pad_token_id):

    indexes = [_['index'] for _ in batches]

    input_tokens = [_['input_tokens'] for _ in batches]
    target_lengths = [_['target_lengths'] for _ in batches]

    # number of candidate sequences (one per choice) contributed by each sample
    chunk_sizes = [len(_) for _ in input_tokens]

    input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)]

    # left-pad every candidate sequence to the longest one in the batch
    max_lengths = max([len(_) for _ in input_tokens])
    input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _
                    for _ in input_tokens]
    input_tokens = torch.LongTensor(input_tokens)

    attention_mask = 1 -
input_tokens.eq(pad_token_id).float() return input_tokens, attention_mask, target_lengths, chunk_sizes, indexes class MultipleChoiceDataste(torch.utils.data.Dataset): def __init__(self, test, prompt, tokenizer): self.datas = open(test).readlines() self.prompt = prompt self.tokenizer = tokenizer def __len__(self): return len(self.datas) def __getitem__(self, idx): data = json.loads(self.datas[idx].strip()) index = data['index'] image = data['image'] hint = data['hint'] if data['hint'] else 'N/A' question = data['question'] choices = data['choices'] choice_list = [] for i, c in enumerate(choices): choice_list.append('{}. {}'.format(multiple_choices[i], c)) choice_txt = '\n'.join(choice_list) prompt = self.prompt.format(image, hint, question, choice_txt) prompt_tokens = self.tokenizer(prompt).input_ids target_tokens = [ self.tokenizer(' ' + _).input_ids for _ in multiple_choices[:len(choices)] ] return { 'index': index, 'input_tokens': [prompt_tokens + _ for _ in target_tokens], 'target_lengths': [len(_) for _ in target_tokens], # 'answer': data['answer'], } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') 
parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) prompt = '<img>{}</img>Context: {}\nQuestion: {}\nOptions: {}\nAnswer:' dataset = MultipleChoiceDataste(test=ds_collections[args.dataset]['test'], prompt=prompt, tokenizer=tokenizer) dataloader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id), ) results = [] with torch.no_grad(): for _, (input_tokens, attention_mask, target_lengths, chunk_sizes, indexes) in tqdm(enumerate(dataloader)): outputs = model( input_ids=input_tokens[:, :-1].cuda(), attention_mask=attention_mask[:, :-1].cuda(), return_dict=True, ) losses = torch.nn.functional.cross_entropy(outputs.logits.permute( 0, 2, 1), input_tokens[:, 1:].cuda(), reduction='none') losses = losses.split(chunk_sizes, dim=0) for loss, target_length, index in zip(losses, target_lengths, indexes): target_loss = loss.mean(-1) for _ in range(len(target_length)): target_loss[_] = loss[_, -target_length[_]:].mean() pred = target_loss.argmin().item() results.append({ "index": index, "prediction": pred, }) torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_results = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_results, results) merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)] if torch.distributed.get_rank() == 0: 
        json.dump(merged_results, open(f"{args.dataset}.json", "w"))

    torch.distributed.barrier()

================================================
FILE: eval_mm/mmbench/mmbench_converter_dev.py
================================================
import pandas as pd
import io
import base64
import json
from PIL import Image

'''
This script converts the mmbench_dev tsv file to jsonl
'''

datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t')

global_choices = ['A', 'B', 'C', 'D']

def decode_base64_to_image(base64_string):
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    return image


with open('./data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.jsonl', 'w') as f:
    for idx in range(len(datas)):
        data = datas.iloc[idx]

        index = int(data['index'])
        question = data['question']
        hint = data['hint'] if not pd.isna(data['hint']) else 'N/A'

        # keep only the options that are present for this question
        choices = []
        for opt in global_choices:
            if pd.isna(data[opt]):
                continue
            choices.append(data[opt])

        answer = global_choices.index(data['answer'])

        image = decode_base64_to_image(data['image'])
        image.save("data/mmbench/mmbench_dev_20230712/images/%d.jpg" % index)

        f.write(json.dumps({
            "index": index,
            "image": "data/mmbench/mmbench_dev_20230712/images/%d.jpg" % index,
            "hint": hint,
            "question": question,
            "choices": choices,
            "answer": answer,
        }) + "\n")

================================================
FILE: eval_mm/mmbench/mmbench_converter_test.py
================================================
import pandas as pd
import io
import base64
import json
from PIL import Image

'''
This script converts the mmbench_test tsv file to jsonl
It is very similar to mmbench_converter_dev, except that the test set has no answers for accuracy calculation
'''

datas = pd.read_csv("data/mmbench/mmbench_test_20230712/mmbench_test_20230712.tsv", sep='\t')

global_choices = ['A', 'B', 'C', 'D']

def decode_base64_to_image(base64_string):
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
return image with open('./data/mmbench/mmbench_test_20230712/mmbench_test_20230712.jsonl', 'w') as f: for idx in range(len(datas)): data = datas.iloc[idx] index = int(data['index']) question = data['question'] hint = data['hint'] if not pd.isna(data['hint']) else 'N/A' choices = [] for opt in global_choices: if pd.isna(data[opt]): continue choices.append(data[opt]) # answer = global_choices.index(data['answer']) image = decode_base64_to_image(data['image']) image.save("data/mmbench/mmbench_test_20230712/images/%d.jpg" % index) f.write(json.dumps({ "index": index, "image": "data/mmbench/mmbench_test_20230712/images/%d.jpg" % index, "hint": hint, "question": question, "choices": choices, # "answer": answer, }) + "\n") ================================================ FILE: eval_mm/mmbench/mmbench_evaluation.py ================================================ import pandas as pd import json ''' This script provides `global top-1 accuracy` metric calculation for mmbench_dev. ''' predictions = json.load(open('mmbench_dev_20230712.json')) index2predictions = {} for pred in predictions: index2predictions[pred['index']] = pred['prediction'] datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t') glb_opts = ['A', 'B', 'C', 'D'] index2answer = {} for idx in range(len(datas)): data = datas.iloc[idx] index2answer[data['index']] = glb_opts.index(data['answer']) identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()])) correct = 0 total = 0 for index in identity_indexes: for _ in range(4): cycle_index = int(_ * 1e6 + index) if index2predictions.get(cycle_index, None) is not None: if index2predictions[cycle_index] == index2answer[cycle_index]: continue else: print(cycle_index) break else: correct += 1 total += 1 print(correct, total) ================================================ FILE: eval_mm/mmbench/mmbench_evaluation_tricky.py ================================================ import pandas as pd import json import random 
'''
This script provides metric calculation for mmbench_dev with the same accuracy algorithm as the OpenCompass server
'''

predictions = json.load(open('mmbench_dev_20230712.json'))

index2predictions = {}
for pred in predictions:
    index2predictions[pred['index']] = pred['prediction']


from collections import Counter

def most_common_elements(lst):
    counter = Counter(lst)
    max_count = max(counter.values())
    most_common = [element for element, count in counter.items() if count == max_count]
    return random.choice(most_common)  # break ties between equally frequent answers at random

datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t')

glb_opts = ['A', 'B', 'C', 'D']
index2answer = {}
index2choices = {}
index2rawanswer = {}
for idx in range(len(datas)):
    data = datas.iloc[idx]

    choices = []
    for opt in glb_opts:
        if not pd.isna(data[opt]):
            choices.append(data[opt])
    index2choices[data['index']] = choices

    index2answer[data['index']] = glb_opts.index(data['answer'])
    index2rawanswer[data['index']] = choices[glb_opts.index(data['answer'])]

# copies of the same question (with rotated options) share the index modulo 1e6
identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()]))

correct = 0
total = 0
for index in identity_indexes:
    raw_preds = []
    raw_answer = []
    for _ in range(4):
        cycle_index = int(_ * 1e6 + index)
        if index2predictions.get(cycle_index, None) is not None:
            raw_answer = index2rawanswer[cycle_index]
            raw_pred = index2choices[cycle_index][index2predictions[cycle_index]]
            raw_preds.append(raw_pred)

    if len(set(raw_preds)) == 1:
        if raw_preds[0] == raw_answer:
            correct += 1
    else:
        # predictions disagree across copies: score the majority answer
        result = most_common_elements(raw_preds)
        if result == raw_answer:
            correct += 1

    total += 1

print(correct, total, correct / total * 100.)
================================================
FILE: eval_mm/mmbench/mmbench_predict_to_submission.py
================================================
import pandas as pd
import json
import random

'''
This script converts the output file of our inference processor to the target format of the OpenCompass evaluation server
'''

predictions = json.load(open('mmbench_test_20230712.json'))

index2predictions = {}
for pred in predictions:
    index2predictions[pred['index']] = pred['prediction']

from collections import Counter

def most_common_elements(lst):
    counter = Counter(lst)
    max_count = max(counter.values())
    most_common = [element for element, count in counter.items() if count == max_count]
    print(most_common)
    return random.choice(most_common)
    # return most_common

datas = pd.read_csv("data/mmbench/mmbench_test_20230712/mmbench_test_20230712.tsv", sep='\t')

datas = datas.drop('image', axis=1)

glb_opts = ['A', 'B', 'C', 'D']
index2choices = {}
for idx in range(len(datas)):
    data = datas.iloc[idx]

    choices = []
    for opt in glb_opts:
        if not pd.isna(data[opt]):
            choices.append(data[opt])
    index2choices[data['index']] = choices

# copies of the same question (with rotated options) share the index modulo 1e6
identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()]))


processed_index2predictions = {}
for index in identity_indexes:
    raw_preds = []
    for _ in range(4):
        cycle_index = int(_ * 1e6 + index)
        if index2predictions.get(cycle_index, None) is not None:
            raw_pred = index2choices[cycle_index][index2predictions[cycle_index]]
            raw_preds.append(raw_pred)

    if len(set(raw_preds)) == 1:
        pred_answer = raw_preds[0]
    else:
        pred_answer = most_common_elements(raw_preds)

    print(index, pred_answer)
    # write the agreed answer back to every copy of the question
    for _ in range(4):
        cycle_index = int(_ * 1e6 + index)
        if index2predictions.get(cycle_index, None) is not None:
            processed_index2predictions[cycle_index] = index2choices[cycle_index].index(pred_answer)


predictions = []
for idx in range(len(datas)):
    data = datas.iloc[idx]
    index = data['index']
    prediction = glb_opts[processed_index2predictions[index]]
    predictions.append(prediction)

datas['prediction'] = predictions
datas.to_excel("mmbench_test_20230712_230831_constrained.xlsx", index=False)
# "constrained" means the model is forced to give the same answer every time a question is tested

================================================
FILE: eval_mm/mme/EVAL_MME.md
================================================
# MME Benchmark

[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Qwen-VL-Chat ranks first on both the perception and cognition evaluations in the tables below.

Perception Evaluation

| Rank | Model | Version | Score |
|:----:|:---------------:|:------------------------:|:-------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** | **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **1487.57** |
| 2 | Skywork-MM | Skywork-MM-13B | 1419.08 |
| 3 | MMICL | FlanT5xxl | 1376.00 |
| 4 | Lynx | vicuna-7b | 1373.23 |
| 5 | BLIVA | FlanT5xxl | 1337.73 |

Cognition Evaluation

| Rank | Model | Version | Score |
|:----:|:----------------:|:--------------:|:----------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** | **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **360.71** |
| 2 | MMICL | FlanT5xxl | 360.36 |
| 3 | Skywork-MM | Skywork-MM-13B | 356.43 |
| 4 | BLIVA | FlanT5xxl | 331.43 |
| 5 | LRV-Instruction | LRV-7B | 328.21 |

Full Metrics

```
=========== Perception ===========
total score: 1487.576330532213

existence score: 158.33333333333331
count score: 150.0
position score: 128.33333333333334
color score: 170.0
posters score: 178.57142857142856
celebrity score: 120.58823529411764
scene score: 152.25
landmark score: 164.0
artwork score: 125.5
OCR score: 140.0

=========== Cognition ===========
total score: 360.71428571428567

commonsense_reasoning score: 130.7142857142857
numerical_calculation score: 40.0
text_translation score: 147.5
code_reasoning score: 42.5
```

## How To Reproduce Results of MME Benchmark

1. Download the MME images and eval_tool from the [MME repo](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md)
2. Rearrange the images by executing `python get_images.py`
3. Generate the Qwen-VL-Chat results by executing `python eval.py`
4. Calculate the MME results by executing `python calculation.py --results_dir Qwen-VL-Chat`; the calculation script comes from the MME eval_tool.

================================================
FILE: eval_mm/mme/eval.py
================================================
import os
from tqdm import tqdm

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

checkpoint = 'Qwen/Qwen-VL-Chat'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map='cuda', trust_remote_code=True).eval()

model.generation_config = GenerationConfig.from_pretrained(checkpoint, trust_remote_code=True)
model.generation_config.top_p = 0.01

# directory with one tab-separated .txt file per MME subtask (image, question, ground truth)
root = 'Your_Results'
output = 'Qwen-VL-Chat'
os.makedirs(output, exist_ok=True)
for filename in os.listdir(root):
    with open(os.path.join(root, filename), 'r') as fin, open(os.path.join(output, filename), 'w') as fout:
        lines = fin.read().splitlines()
        filename = filename.replace('.txt', '')
        for line in tqdm(lines):
            img, question, gt = line.strip().split('\t')
            img_path = os.path.join('images', filename, img)
            assert os.path.exists(img_path), img_path
            query = f'<img>{img_path}</img>\n{question}'
            response, _ = model.chat(tokenizer, query=query, history=None)

            print(img, question, gt, response, sep='\t', file=fout)

================================================
FILE: eval_mm/mme/get_images.py ================================================ import os from tqdm import tqdm os.system('rm -rf images') os.system('mkdir images') os.system('cp -r ../MME_Benchmark_release/OCR images/') os.system('mkdir images/artwork') os.system('cp ../MME_Benchmark_release/artwork/questions_answers_YN/* images/artwork/') with open('LaVIN/artwork.txt') as fin: paths = [ line.strip().split('\t', 1)[0] for line in fin ] paths = list(set(paths)) for path in tqdm(paths): os.system(f'cp ../MME_Benchmark_release/artwork/images/toy_dataset/{path} images/artwork/{path}') os.system('mkdir images/celebrity') os.system('cp ../MME_Benchmark_release/celebrity/images/* images/celebrity/') os.system('cp ../MME_Benchmark_release/celebrity/questions_answers_YN/* images/celebrity/') os.system('cp -r ../MME_Benchmark_release/code_reasoning images/') os.system('cp -r ../MME_Benchmark_release/color images/') os.system('cp -r ../MME_Benchmark_release/commonsense_reasoning images/') os.system('cp -r ../MME_Benchmark_release/count images/') os.system('cp -r ../MME_Benchmark_release/existence images/') os.system('mkdir images/landmark') os.system('cp ../MME_Benchmark_release/landmark/images/* images/landmark/') os.system('cp ../MME_Benchmark_release/landmark/questions_answers_YN/* images/landmark/') os.system('cp -r ../MME_Benchmark_release/numerical_calculation images/') os.system('cp -r ../MME_Benchmark_release/position images/') os.system('mkdir images/posters') os.system('cp ../MME_Benchmark_release/posters/images/* images/posters/') os.system('cp ../MME_Benchmark_release/posters/questions_answers_YN/* images/posters/') os.system('mkdir images/scene') os.system('cp ../MME_Benchmark_release/scene/images/* images/scene/') os.system('cp ../MME_Benchmark_release/scene/questions_answers_YN/* images/scene/') os.system('cp -r ../MME_Benchmark_release/text_translation images/') ================================================ FILE: eval_mm/seed_bench/EVAL_SEED.md 
================================================
# Seed-Bench Evaluation

[SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both **image** and **video** understanding.

Qwen-VL and Qwen-VL-Chat achieve state-of-the-art results on this benchmark.

<img src="leaderboard.jpg"/>

## How To Process Video by Qwen-VL

Qwen-VL and Qwen-VL-Chat were not trained on any video data or video tasks, but they can understand some videos in a zero-shot way. For the video question-answering task, we use four uniformly sampled frames per video sample. These frames are treated as separate images and are stitched into the context. For example:

```
{
    "question_id": "v0",
    "prompt": "<img>video_imgs_4/v0_0.jpg</img>\n<img>video_imgs_4/v0_1.jpg</img>\n<img>video_imgs_4/v0_2.jpg</img>\n<img>video_imgs_4/v0_3.jpg</img>\nQuestion: Can you identify the action taking place in the video?\nOptions: A. pretending to take something out of something\nB. pretending to take something from somewhere\nC. feigning to insert something into something\nD. simulating putting something onto something\nAnswer:"
}
```

The JSON line above can be passed as input to `eval_mm/seed_bench/eval.py`, which outputs the following result:

```
{"question_id": "v0", "prediction": "B"}
```

Please see [eval_mm/seed_bench/eval.py](eval.py) for more inference details.

## How To Reproduce Results of Seed-Bench

1. Download all images and videos by following the [instructions](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md). Then set the root paths in `eval_mm/seed_bench/trans.py` to your own paths.
```
# path of SEED-Bench.json, download from https://huggingface.co/datasets/AILab-CVC/SEED-Bench/blob/main/SEED-Bench.json
seed_bench_input_path = 'SEED-Bench.json'

# root directory of evaluation dimension 1-9, following https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md
cc3m_dir = "/YOUR_PATH_TO/seed_bench_image"
# root directory of evaluation dimension 10
dimension10_dir = "/YOUR_PATH_TO/SSV2/videos"
# root directory of evaluation dimension 11
dimension11_dir = "/YOUR_PATH_TO/EPIC-KITCHENS/3h91syskeag572hl6tvuovwv4d/videos/test"
# root directory of evaluation dimension 12
dimension12_dir = "/YOUR_PATH_TO/BreakfastII_15fps_qvga_sync"
```

2. Generate the Qwen-VL input files in JSON Lines format.
```
cd eval_mm/seed_bench/
python trans.py
```
This script outputs two JSONL files and one directory. `image_input.jsonl` is the input file for image evaluation and `video_input_4.jsonl` is the input file for video evaluation with 4 frames per video. The directory `video_imgs_4` contains the frames extracted from each video. These names follow the `n_frames` variable in `trans.py`, so set `n_frames = 4` to produce the 4-frame files described here. We provide our [image_input.jsonl](http://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/seed_bench/image_input.jsonl) and [video_input_4.jsonl](http://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/seed_bench/video_input_4.jsonl) for reference.

3. Produce the results of Seed-Bench.
```
# The number of available GPUs
export NPROC_PER_NODE=8

# Produce the Qwen-VL-Chat results of image understanding
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    eval.py \
    --checkpoint Qwen/Qwen-VL-Chat \
    --dataset image_input.jsonl \
    --batch-size 4 \
    --num-workers 2
# Collect the result files
cat result_?.jsonl >results_chat_img.jsonl
rm result_?.jsonl

# Produce the results of video understanding
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    eval.py \
    --checkpoint Qwen/Qwen-VL-Chat \
    --dataset video_input_4.jsonl \
    --batch-size 2 \
    --num-workers 1
# Collect the result files
cat result_?.jsonl >results_chat_vid.jsonl
rm result_?.jsonl

# The file `results_chat.jsonl` can be submitted to the leaderboard
cat results_chat_img.jsonl results_chat_vid.jsonl >results_chat.jsonl
```

You can reproduce the Seed-Bench results of Qwen-VL by replacing `Qwen/Qwen-VL-Chat` with `Qwen/Qwen-VL` in the script above.
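
As a rough illustration of the video input format described in "How To Process Video by Qwen-VL", the sketch below picks four evenly spaced frame indices and builds a prompt in the same layout as the JSON example. The helper names `uniform_frame_indices` and `build_video_prompt` are hypothetical and not part of this repository; the sampling rule is modeled on the uniform scheme in `trans.py`, and the frame paths assume the `video_imgs_4/<question_id>_<i>.jpg` naming shown earlier.

```python
def uniform_frame_indices(num_frames, num_segments=4):
    """Return evenly spaced frame indices for a clip of `num_frames` frames."""
    if num_segments > num_frames:
        # Clip is shorter than the requested sample count: keep every frame.
        return list(range(num_frames))
    seg_size = (num_frames - 1) / num_segments
    start = int(seg_size / 2)
    return [start + int(round(seg_size * i)) for i in range(num_segments)]


def build_video_prompt(question_id, question, choices, n_frames=4):
    """Stitch frame references and a multiple-choice question into one prompt."""
    # Each frame is referenced as a separate image, one <img> tag per line.
    frames = ''.join(
        '<img>video_imgs_{}/{}_{}.jpg</img>\n'.format(n_frames, question_id, i)
        for i in range(n_frames))
    # Options are lettered A, B, C, D and separated by newlines.
    options = '\n'.join(
        '{}. {}'.format(chr(65 + i), c) for i, c in enumerate(choices))
    return frames + 'Question: {}\nOptions: {}\nAnswer:'.format(question, options)


# Example usage with made-up question text.
print(uniform_frame_indices(101))
print(build_video_prompt('v0', 'What happens?', ['a', 'b', 'c', 'd']))
```

The prompt ends with `Answer:` so that the evaluation script can compare the loss of each candidate letter appended to it and pick the lowest one.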
================================================ FILE: eval_mm/seed_bench/eval.py ================================================ import argparse import itertools import json import os from functools import partial import torch from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig def collate_fn(batches, pad_token_id): input_tokens = [_['input_tokens'] for _ in batches] target_lengths = [_['target_lengths'] for _ in batches] answers = [_['answer'] for _ in batches] question_id = [_['question_id'] for _ in batches] chunk_sizes = [len(_) for _ in input_tokens] input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)] max_lengths = max([len(_) for _ in input_tokens]) input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _ for _ in input_tokens] input_tokens = torch.LongTensor(input_tokens) attention_mask = 1 - input_tokens.eq(pad_token_id).float() return input_tokens, attention_mask, target_lengths, answers, chunk_sizes, question_id class MultipleChoiceDataste(torch.utils.data.Dataset): def __init__(self, test, tokenizer): self.datas = [] with open(test) as fin: for line in tqdm(fin): self.datas.append(json.loads(line.strip())) self.tokenizer = tokenizer def __len__(self): return len(self.datas) def __getitem__(self, idx): data = self.datas[idx] prompt = data['prompt'] prompt_tokens = self.tokenizer(prompt).input_ids target_tokens = [ self.tokenizer(' ' + _).input_ids for _ in ['A', 'B', 'C', 'D'] ] return { 'input_tokens': [prompt_tokens + _ for _ in target_tokens], 'target_lengths': [len(_) for _ in target_tokens], 'answer': data['answer'], 'question_id': data['question_id'], } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, 
self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) model.generation_config = GenerationConfig.from_pretrained(args.checkpoint, trust_remote_code=True) model.generation_config.top_p = 0.01 dataset = MultipleChoiceDataste(test=args.dataset, tokenizer=tokenizer) dataloader = torch.utils.data.DataLoader( dataset=dataset, # sampler=InferenceSampler(1000), sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id), ) results = [] fout = open('result_{}.jsonl'.format(torch.distributed.get_rank()), 'w') with torch.no_grad(): for _, (input_tokens, attention_mask, target_lengths, answers, chunk_sizes, question_ids) in tqdm(enumerate(dataloader)): outputs = model( input_ids=input_tokens[:, :-1].cuda(), attention_mask=attention_mask[:, :-1].cuda(), return_dict=True, ) losses = 
torch.nn.functional.cross_entropy(outputs.logits.permute( 0, 2, 1), input_tokens[:, 1:].cuda(), reduction='none') losses = losses.split(chunk_sizes, dim=0) for loss, target_length, answer, question_id in zip(losses, target_lengths, answers, question_ids): target_loss = loss.mean(-1) for _ in range(len(target_length)): target_loss[_] = loss[_, -target_length[_]:].mean() pred = target_loss.argmin().item() pred = chr(pred + 65) if pred == answer: results.append(1) else: results.append(0) answer_record = { 'question_id': question_id, 'prediction': pred } print(json.dumps(answer_record), file=fout) fout.close() torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_results = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_results, results) merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") print(f'Acc@1: {sum(merged_results) / len(merged_results)}') torch.distributed.barrier() ================================================ FILE: eval_mm/seed_bench/trans.py ================================================ import os import av import json import torch import numpy as np from PIL import Image from tqdm import tqdm from decord import VideoReader, cpu # path of SEED-Bench.json, download from https://huggingface.co/datasets/AILab-CVC/SEED-Bench/blob/main/SEED-Bench.json seed_bench_input_path = 'SEED-Bench.json' # root directory of evaluation dimension 1-9, following https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md cc3m_dir = "/YOUR_PATH_TO/seed_bench_image" # root directory of evaluation dimension 10 dimension10_dir = "/YOUR_PATH_TO/SSV2/videos" # root directory of evaluation dimension 11 dimension11_dir = "/YOUR_PATH_TO/EPIC-KITCHENS/3h91syskeag572hl6tvuovwv4d/videos/test" # root directory of evaluation dimension 12 dimension12_dir = "/YOUR_PATH_TO/BreakfastII_15fps_qvga_sync" def is_integer_string(s): 
try: int(s) return True except ValueError: return False def filter_questions(data, task='all'): if task == "image": return [q for q in data if 1 <= q["question_type_id"] <= 9] elif task == "video": return [q for q in data if 10 <= q["question_type_id"] <= 12] elif task == "all": return data elif is_integer_string(task): return [q for q in data if q["question_type_id"] == int(task)] else: raise ValueError(f"Invalid task: {task}") def get_index(num_frames, num_segments): if num_segments > num_frames: offsets = np.array([ idx for idx in range(num_frames) ]) else: # uniform sampling seg_size = float(num_frames - 1) / num_segments start = int(seg_size / 2) offsets = np.array([ start + int(np.round(seg_size * idx)) for idx in range(num_segments) ]) return offsets with open(seed_bench_input_path) as fin: qa_anno = json.load(fin)['questions'] fout = open('image_input.jsonl', 'w') i_anno = filter_questions(qa_anno, 'image') for qa_item in tqdm(i_anno): data_path = cc3m_dir + qa_item['data_id'] choices = [qa_item['choice_a'], qa_item['choice_b'], qa_item['choice_c'], qa_item['choice_d']] choice_list = [] for i, c in enumerate(choices): choice_list.append('{}. 
{}'.format(chr(i + 65), c)) choice_txt = '\n'.join(choice_list) prompt = '<img>{}</img>\nQuestion: {}\nOptions: {}\nAnswer:'.format( data_path, qa_item['question'], choice_txt) print(json.dumps({ 'question_id': qa_item['question_id'], 'prompt': prompt, 'answer': qa_item['answer'], }), file=fout) fout.close() n_frames = 8 os.system('rm -rf video_input_' + str(n_frames)) os.makedirs('video_imgs_' + str(n_frames), exist_ok=True) fout = open('video_input_{}.jsonl'.format(n_frames), 'w') v_anno = filter_questions(qa_anno, 'video') for qa_item in tqdm(v_anno): if qa_item['question_type_id'] == 12: data_path = dimension12_dir + qa_item['data_id'] elif qa_item['question_type_id'] == 11: data_path = dimension11_dir + qa_item['data_id'].split('/')[-1] elif qa_item['question_type_id'] == 10: data_path = dimension10_dir + qa_item['data_id'] else: assert False, str(qa_item) print(data_path) use_pyav = False if 'segment' in qa_item.keys(): segment = qa_item['segment'] if isinstance(segment[0], int): # using pyav for decoding videos in evaluation dimension 12 use_pyav = True start, end = segment[0], segment[1] else: start = 0.0 end = 0.0 if use_pyav: # using pyav for decoding videos in evaluation dimension 12 reader = av.open(data_path) frames = [torch.from_numpy(f.to_rgb().to_ndarray()) for f in reader.decode(video=0)] video_len = len(frames) start_frame, end_frame = start, end end_frame = min(end_frame, video_len) offset = get_index(end_frame - start_frame, n_frames) frame_indices = offset + start_frame images = torch.stack([frames[idx] for idx in frame_indices]).numpy() else: # using decord for decoding videos in evaluation dimension 10-11 try: vr = VideoReader(data_path, num_threads=1, ctx=cpu(0)) video_len = len(vr) fps = vr.get_avg_fps() if 'segment' in qa_item.keys(): # obtain start and end frame for the video segment in evaluation dimension 11 start_frame = int(min(max(start * fps, 0), video_len - 1)) end_frame = int(min(max(end * fps, 0), video_len - 1)) tot_frames = 
int(end_frame - start_frame) offset = get_index(tot_frames, n_frames) frame_indices = offset + start_frame else: # sample frames of the video in evaluation dimension 10 frame_indices = get_index(video_len - 1, n_frames) vr.seek(0) images = vr.get_batch(frame_indices).asnumpy() except Exception as e: print(json.dumps({ 'question_id': qa_item['question_id'], 'prompt': "Error" + str(e), 'answer': qa_item['answer'], }), file=fout) continue prompt = '' for i in range(images.shape[0]): data = Image.fromarray(images[i]) img_path = 'video_imgs_{}/{}_{}.jpg'.format(n_frames, qa_item['question_id'], i) data.save(img_path) prompt += '<img>' + img_path + '</img>\n' choices = [qa_item['choice_a'], qa_item['choice_b'], qa_item['choice_c'], qa_item['choice_d']] choice_list = [] for i, c in enumerate(choices): choice_list.append('{}. {}'.format(chr(i + 65), c)) choice_txt = '\n'.join(choice_list) prompt += 'Question: {}\nOptions: {}\nAnswer:'.format(qa_item['question'], choice_txt) print(json.dumps({ 'question_id': qa_item['question_id'], 'prompt': prompt, 'answer': qa_item['answer'], }), file=fout) fout.close() ================================================ FILE: eval_mm/vqa.py ================================================ """Copyright (c) 2022, salesforce.com, inc. All rights reserved. SPDX-License-Identifier: BSD-3-Clause For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause """ __author__ = 'aagrawal' __version__ = '0.9' # Interface for accessing the VQA dataset. # This code is based on the code written by Tsung-Yi Lin for MSCOCO Python API available at the following link: # (https://github.com/pdollar/coco/blob/master/PythonAPI/pycocotools/coco.py). # The following functions are defined: # VQA - VQA class that loads VQA annotation file and prepares data structures. # getQuesIds - Get question ids that satisfy given filter conditions. # getImgIds - Get image ids that satisfy given filter conditions. 
#  loadQA     - Load questions and answers with the specified question ids.
#  showQA     - Display the specified questions and answers.
#  loadRes    - Load result file and create result object.

# Help on each function can be accessed by: "help(VQA.function)"

import copy
import datetime
import json


class VQA:

    def __init__(self, annotation_file=None, question_file=None):
        """Constructor of VQA helper class for reading and visualizing
        questions and answers.

        :param annotation_file (str): location of VQA annotation file
        :return:
        """
        # load dataset
        self.dataset = {}
        self.questions = {}
        self.qa = {}
        self.qqa = {}
        self.imgToQA = {}
        if annotation_file is not None and question_file is not None:
            print('loading VQA annotations and questions into memory...')
            time_t = datetime.datetime.utcnow()
            dataset = json.load(open(annotation_file, 'r'))
            questions = json.load(open(question_file, 'r'))
            self.dataset = dataset
            self.questions = questions
            self.createIndex()

    def createIndex(self):
        # create index
        print('creating index...')
        imgToQA = {ann['image_id']: [] for ann in self.dataset['annotations']}
        qa = {ann['question_id']: [] for ann in self.dataset['annotations']}
        qqa = {ann['question_id']: [] for ann in self.dataset['annotations']}
        for ann in self.dataset['annotations']:
            imgToQA[ann['image_id']] += [ann]
            qa[ann['question_id']] = ann
        for ques in self.questions['questions']:
            qqa[ques['question_id']] = ques
        print('index created!')

        # create class members
        self.qa = qa
        self.qqa = qqa
        self.imgToQA = imgToQA

    def info(self):
        """Print information about the VQA annotation file.

        :return:
        """
        # Fixed attribute name: the data lives in self.dataset.
        for key, value in self.dataset['info'].items():
            print('%s: %s' % (key, value))

    def getQuesIds(self, imgIds=[], quesTypes=[], ansTypes=[]):
        """Get question ids that satisfy given filter conditions. default skips that filter.
:param imgIds (int array) : get question ids for given imgs quesTypes (str array) : get question ids for given question types ansTypes (str array) : get question ids for given answer types :return: ids (int array) : integer array of question ids """ imgIds = imgIds if type(imgIds) == list else [imgIds] quesTypes = quesTypes if type(quesTypes) == list else [quesTypes] ansTypes = ansTypes if type(ansTypes) == list else [ansTypes] if len(imgIds) == len(quesTypes) == len(ansTypes) == 0: anns = self.dataset['annotations'] else: if not len(imgIds) == 0: anns = sum( [ self.imgToQA[imgId] for imgId in imgIds if imgId in self.imgToQA ], [], ) else: anns = self.dataset['annotations'] anns = (anns if len(quesTypes) == 0 else [ann for ann in anns if ann['question_type'] in quesTypes]) anns = (anns if len(ansTypes) == 0 else [ann for ann in anns if ann['answer_type'] in ansTypes]) ids = [ann['question_id'] for ann in anns] return ids def getImgIds(self, quesIds=[], quesTypes=[], ansTypes=[]): """Get image ids that satisfy given filter conditions. default skips that filter. 
:param quesIds (int array) : get image ids for given question ids quesTypes (str array) : get image ids for given question types ansTypes (str array) : get image ids for given answer types :return: ids (int array) : integer array of image ids """ quesIds = quesIds if type(quesIds) == list else [quesIds] quesTypes = quesTypes if type(quesTypes) == list else [quesTypes] ansTypes = ansTypes if type(ansTypes) == list else [ansTypes] if len(quesIds) == len(quesTypes) == len(ansTypes) == 0: anns = self.dataset['annotations'] else: if not len(quesIds) == 0: anns = sum([ self.qa[quesId] for quesId in quesIds if quesId in self.qa ], []) else: anns = self.dataset['annotations'] anns = (anns if len(quesTypes) == 0 else [ann for ann in anns if ann['question_type'] in quesTypes]) anns = (anns if len(ansTypes) == 0 else [ann for ann in anns if ann['answer_type'] in ansTypes]) ids = [ann['image_id'] for ann in anns] return ids def loadQA(self, ids=[]): """Load questions and answers with the specified question ids. :param ids (int array) : integer ids specifying question ids :return: qa (object array) : loaded qa objects """ if type(ids) == list: return [self.qa[id] for id in ids] elif type(ids) == int: return [self.qa[ids]] def showQA(self, anns): """Display the specified annotations. :param anns (array of object): annotations to display :return: None """ if len(anns) == 0: return 0 for ann in anns: quesId = ann['question_id'] print('Question: %s' % (self.qqa[quesId]['question'])) for ans in ann['answers']: print('Answer %d: %s' % (ans['answer_id'], ans['answer'])) def loadRes(self, resFile, quesFile): """Load result file and return a result object. 
:param resFile (str) : file name of result file :return: res (obj) : result api object """ res = VQA() res.questions = json.load(open(quesFile)) res.dataset['info'] = copy.deepcopy(self.questions['info']) res.dataset['task_type'] = copy.deepcopy(self.questions['task_type']) res.dataset['data_type'] = copy.deepcopy(self.questions['data_type']) res.dataset['data_subtype'] = copy.deepcopy( self.questions['data_subtype']) res.dataset['license'] = copy.deepcopy(self.questions['license']) print('Loading and preparing results... ') time_t = datetime.datetime.utcnow() anns = json.load(open(resFile)) assert type(anns) == list, 'results is not an array of objects' annsQuesIds = [ann['question_id'] for ann in anns] assert set(annsQuesIds) == set( self.getQuesIds() ), 'Results do not correspond to current VQA set. Either the results do not have predictions for all question ids in annotation file or there is atleast one question id that does not belong to the question ids in the annotation file.' for ann in anns: quesId = ann['question_id'] if res.dataset['task_type'] == 'Multiple Choice': assert ( ann['answer'] in self.qqa[quesId]['multiple_choices'] ), 'predicted answer is not one of the multiple choices' qaAnn = self.qa[quesId] ann['image_id'] = qaAnn['image_id'] ann['question_type'] = qaAnn['question_type'] ann['answer_type'] = qaAnn['answer_type'] print('DONE (t=%0.2fs)' % ((datetime.datetime.utcnow() - time_t).total_seconds())) res.dataset['annotations'] = anns res.createIndex() return res ================================================ FILE: eval_mm/vqa_eval.py ================================================ """Copyright (c) 2022, salesforce.com, inc. All rights reserved. 
SPDX-License-Identifier: BSD-3-Clause For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause """ # coding=utf-8 __author__ = 'aagrawal' import re # This code is based on the code written by Tsung-Yi Lin for MSCOCO Python API available at the following link: # (https://github.com/tylin/coco-caption/blob/master/pycocoevalcap/eval.py). import sys class VQAEval: def __init__(self, vqa=None, vqaRes=None, n=2): self.n = n self.accuracy = {} self.evalQA = {} self.evalQuesType = {} self.evalAnsType = {} self.vqa = vqa self.vqaRes = vqaRes if vqa is not None: self.params = {'question_id': vqa.getQuesIds()} self.contractions = { 'aint': "ain't", 'arent': "aren't", 'cant': "can't", 'couldve': "could've", 'couldnt': "couldn't", "couldn'tve": "couldn't've", "couldnt've": "couldn't've", 'didnt': "didn't", 'doesnt': "doesn't", 'dont': "don't", 'hadnt': "hadn't", "hadnt've": "hadn't've", "hadn'tve": "hadn't've", 'hasnt': "hasn't", 'havent': "haven't", 'hed': "he'd", "hed've": "he'd've", "he'dve": "he'd've", 'hes': "he's", 'howd': "how'd", 'howll': "how'll", 'hows': "how's", "Id've": "I'd've", "I'dve": "I'd've", 'Im': "I'm", 'Ive': "I've", 'isnt': "isn't", 'itd': "it'd", "itd've": "it'd've", "it'dve": "it'd've", 'itll': "it'll", "let's": "let's", 'maam': "ma'am", 'mightnt': "mightn't", "mightnt've": "mightn't've", "mightn'tve": "mightn't've", 'mightve': "might've", 'mustnt': "mustn't", 'mustve': "must've", 'neednt': "needn't", 'notve': "not've", 'oclock': "o'clock", 'oughtnt': "oughtn't", "ow's'at": "'ow's'at", "'ows'at": "'ow's'at", "'ow'sat": "'ow's'at", 'shant': "shan't", "shed've": "she'd've", "she'dve": "she'd've", "she's": "she's", 'shouldve': "should've", 'shouldnt': "shouldn't", "shouldnt've": "shouldn't've", "shouldn'tve": "shouldn't've", "somebody'd": 'somebodyd', "somebodyd've": "somebody'd've", "somebody'dve": "somebody'd've", 'somebodyll': "somebody'll", 'somebodys': "somebody's", 'someoned': "someone'd", 
"someoned've": "someone'd've", "someone'dve": "someone'd've", 'someonell': "someone'll", 'someones': "someone's", 'somethingd': "something'd", "somethingd've": "something'd've", "something'dve": "something'd've", 'somethingll': "something'll", 'thats': "that's", 'thered': "there'd", "thered've": "there'd've", "there'dve": "there'd've", 'therere': "there're", 'theres': "there's", 'theyd': "they'd", "theyd've": "they'd've", "they'dve": "they'd've", 'theyll': "they'll", 'theyre': "they're", 'theyve': "they've", 'twas': "'twas", 'wasnt': "wasn't", "wed've": "we'd've", "we'dve": "we'd've", 'weve': "we've", 'werent': "weren't", 'whatll': "what'll", 'whatre': "what're", 'whats': "what's", 'whatve': "what've", 'whens': "when's", 'whered': "where'd", 'wheres': "where's", 'whereve': "where've", 'whod': "who'd", "whod've": "who'd've", "who'dve": "who'd've", 'wholl': "who'll", 'whos': "who's", 'whove': "who've", 'whyll': "why'll", 'whyre': "why're", 'whys': "why's", 'wont': "won't", 'wouldve': "would've", 'wouldnt': "wouldn't", "wouldnt've": "wouldn't've", "wouldn'tve": "wouldn't've", 'yall': "y'all", "yall'll": "y'all'll", "y'allll": "y'all'll", "yall'd've": "y'all'd've", "y'alld've": "y'all'd've", "y'all'dve": "y'all'd've", 'youd': "you'd", "youd've": "you'd've", "you'dve": "you'd've", 'youll': "you'll", 'youre': "you're", 'youve': "you've", } self.manualMap = { 'none': '0', 'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'ten': '10', } self.articles = ['a', 'an', 'the'] self.periodStrip = re.compile('(?!<=\d)(\.)(?!\d)') self.commaStrip = re.compile('(\d)(,)(\d)') self.punct = [ ';', r'/', '[', ']', '"', '{', '}', '(', ')', '=', '+', '\\', '_', '-', '>', '<', '@', '`', ',', '?', '!', ] def evaluate(self, quesIds=None): if quesIds == None: quesIds = [quesId for quesId in self.params['question_id']] gts = {} res = {} for quesId in quesIds: gts[quesId] = self.vqa.qa[quesId] res[quesId] = 
self.vqaRes.qa[quesId] # ================================================= # Compute accuracy # ================================================= accQA = [] accQuesType = {} accAnsType = {} print('computing accuracy') step = 0 for quesId in quesIds: resAns = res[quesId]['answer'] resAns = resAns.replace('\n', ' ') resAns = resAns.replace('\t', ' ') resAns = resAns.strip() resAns = self.processPunctuation(resAns) resAns = self.processDigitArticle(resAns) gtAcc = [] gtAnswers = [ans['answer'] for ans in gts[quesId]['answers']] if len(set(gtAnswers)) > 1: for ansDic in gts[quesId]['answers']: ansDic['answer'] = self.processPunctuation( ansDic['answer']) for gtAnsDatum in gts[quesId]['answers']: otherGTAns = [ item for item in gts[quesId]['answers'] if item != gtAnsDatum ] matchingAns = [ item for item in otherGTAns if item['answer'] == resAns ] acc = min(1, float(len(matchingAns)) / 3) gtAcc.append(acc) quesType = gts[quesId]['question_type'] ansType = gts[quesId]['answer_type'] avgGTAcc = float(sum(gtAcc)) / len(gtAcc) accQA.append(avgGTAcc) if quesType not in accQuesType: accQuesType[quesType] = [] accQuesType[quesType].append(avgGTAcc) if ansType not in accAnsType: accAnsType[ansType] = [] accAnsType[ansType].append(avgGTAcc) self.setEvalQA(quesId, avgGTAcc) self.setEvalQuesType(quesId, quesType, avgGTAcc) self.setEvalAnsType(quesId, ansType, avgGTAcc) if step % 100 == 0: self.updateProgress(step / float(len(quesIds))) step = step + 1 self.setAccuracy(accQA, accQuesType, accAnsType) print('Done computing accuracy') def processPunctuation(self, inText): outText = inText for p in self.punct: if (p + ' ' in inText or ' ' + p in inText) or (re.search(self.commaStrip, inText) != None): outText = outText.replace(p, '') else: outText = outText.replace(p, ' ') outText = self.periodStrip.sub('', outText, re.UNICODE) return outText def processDigitArticle(self, inText): outText = [] tempText = inText.lower().split() for word in tempText: word = 
self.manualMap.setdefault(word, word) if word not in self.articles: outText.append(word) else: pass for wordId, word in enumerate(outText): if word in self.contractions: outText[wordId] = self.contractions[word] outText = ' '.join(outText) return outText def setAccuracy(self, accQA, accQuesType, accAnsType): self.accuracy['overall'] = round(100 * float(sum(accQA)) / len(accQA), self.n) self.accuracy['perQuestionType'] = { quesType: round( 100 * float(sum(accQuesType[quesType])) / len(accQuesType[quesType]), self.n, ) for quesType in accQuesType } self.accuracy['perAnswerType'] = { ansType: round( 100 * float(sum(accAnsType[ansType])) / len(accAnsType[ansType]), self.n) for ansType in accAnsType } def setEvalQA(self, quesId, acc): self.evalQA[quesId] = round(100 * acc, self.n) def setEvalQuesType(self, quesId, quesType, acc): if quesType not in self.evalQuesType: self.evalQuesType[quesType] = {} self.evalQuesType[quesType][quesId] = round(100 * acc, self.n) def setEvalAnsType(self, quesId, ansType, acc): if ansType not in self.evalAnsType: self.evalAnsType[ansType] = {} self.evalAnsType[ansType][quesId] = round(100 * acc, self.n) def updateProgress(self, progress): barLength = 20 status = '' if isinstance(progress, int): progress = float(progress) if not isinstance(progress, float): progress = 0 status = 'error: progress var must be float\r\n' if progress < 0: progress = 0 status = 'Halt...\r\n' if progress >= 1: progress = 1 status = 'Done...\r\n' block = int(round(barLength * progress)) text = '\rFinshed Percent: [{0}] {1}% {2}'.format( '#' * block + '-' * (barLength - block), int(progress * 100), status) sys.stdout.write(text) sys.stdout.flush() ================================================ FILE: finetune/ds_config_zero2.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { 
"type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "none", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: finetune/ds_config_zero3.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "pin_memory": true }, "offload_param": { "device": "none", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: finetune/finetune_ds.sh 
================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` GPUS_PER_NODE=8 NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost MASTER_PORT=6001 MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. DATA="path_to_data" DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ --node_rank $NODE_RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT " torchrun $DISTRIBUTED_ARGS finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --bf16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 16 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --gradient_checkpointing True \ --lazy_preprocess True \ --deepspeed finetune/ds_config_zero3.json ================================================ FILE: finetune/finetune_lora_ds.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` GPUS_PER_NODE=8 NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost MASTER_PORT=6001 MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ --node_rank $NODE_RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT " torchrun $DISTRIBUTED_ARGS finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --bf16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 2 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --use_lora \ --gradient_checkpointing \ --deepspeed finetune/ds_config_zero2.json ================================================ FILE: finetune/finetune_lora_single_gpu.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" export CUDA_VISIBLE_DEVICES=0 python finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --bf16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --gradient_checkpointing \ --use_lora ================================================ FILE: finetune/finetune_qlora_ds.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` GPUS_PER_NODE=8 NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost MASTER_PORT=6001 MODEL="Qwen/Qwen-VL-Chat-Int4" # Qwen/Qwen-VL-Chat-Int4 Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ --node_rank $NODE_RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT " # Remember to use --fp16 instead of --bf16 due to autogptq torchrun $DISTRIBUTED_ARGS finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --fp16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 2 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --use_lora \ --q_lora \ --gradient_checkpointing \ --deepspeed finetune/ds_config_zero2.json ================================================ FILE: finetune/finetune_qlora_single_gpu.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` MODEL="Qwen/Qwen-VL-Chat-Int4" # Qwen/Qwen-VL-Chat-Int4 Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" export CUDA_VISIBLE_DEVICES=0 # Remember to use --fp16 instead of --bf16 due to autogptq python finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --fp16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --gradient_checkpointing \ --use_lora \ --q_lora \ --deepspeed finetune/ds_config_zero2.json ================================================ FILE: finetune.py ================================================ # This code is based on the revised code from fastchat based on tatsu-lab/stanford_alpaca. from dataclasses import dataclass, field import json import math import logging import os from typing import Dict, Optional, List import torch from torch.utils.data import Dataset from deepspeed import zero from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus import transformers from transformers import Trainer, GPTQConfig, deepspeed from transformers.trainer_pt_utils import LabelSmoother from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training from accelerate.utils import DistributedType IGNORE_TOKEN_ID = LabelSmoother.ignore_index @dataclass class ModelArguments: model_name_or_path: Optional[str] = field(default="Qwen/Qwen-7B") @dataclass class DataArguments: data_path: str = field( default=None, metadata={"help": "Path to the training data."} ) eval_data_path: str = field( default=None, metadata={"help": "Path to the evaluation data."} ) lazy_preprocess: bool = False @dataclass class TrainingArguments(transformers.TrainingArguments): cache_dir: Optional[str] = 
field(default=None) optim: str = field(default="adamw_torch") model_max_length: int = field( default=8192, metadata={ "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)." }, ) use_lora: bool = False fix_vit: bool = True @dataclass class LoraArguments: lora_r: int = 64 lora_alpha: int = 16 lora_dropout: float = 0.05 lora_target_modules: List[str] = field( default_factory=lambda: ["c_attn", "attn.c_proj", "w1", "w2"] ##["in_proj","out_proj","c_fc"] ) lora_weight_path: str = "" lora_bias: str = "none" q_lora: bool = False def maybe_zero_3(param): if hasattr(param, "ds_id"): assert param.ds_status == ZeroParamStatus.NOT_AVAILABLE with zero.GatheredParameters([param]): param = param.data.detach().cpu().clone() else: param = param.detach().cpu().clone() return param # Borrowed from peft.utils.get_peft_model_state_dict def get_peft_state_maybe_zero_3(named_params, bias): if bias == "none": to_return = {k: t for k, t in named_params if "lora_" in k} elif bias == "all": to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k} elif bias == "lora_only": to_return = {} maybe_lora_bias = {} lora_bias_names = set() for k, t in named_params: if "lora_" in k: to_return[k] = t bias_name = k.split("lora_")[0] + "bias" lora_bias_names.add(bias_name) elif "bias" in k: maybe_lora_bias[k] = t # Iterate over (name, tensor) pairs and keep only biases that belong to a LoRA-wrapped module for k, t in maybe_lora_bias.items(): if k in lora_bias_names: to_return[k] = t else: raise NotImplementedError to_return = {k: maybe_zero_3(v) for k, v in to_return.items()} return to_return local_rank = None def rank0_print(*args): if local_rank == 0: print(*args) def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str, bias="none"): """Collects the state dict and dumps it to disk.""" # check if zero3 mode enabled if deepspeed.is_deepspeed_zero3_enabled(): state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict() else: if trainer.args.use_lora: state_dict = get_peft_state_maybe_zero_3(
trainer.model.named_parameters(), bias ) else: state_dict = trainer.model.state_dict() if trainer.args.should_save and trainer.args.local_rank == 0: trainer._save(output_dir, state_dict=state_dict) def preprocess( sources, tokenizer: transformers.PreTrainedTokenizer, max_len: int, system_message: str = "You are a helpful assistant." ) -> Dict: roles = {"user": "<|im_start|>user", "assistant": "<|im_start|>assistant"} im_start = tokenizer.im_start_id im_end = tokenizer.im_end_id nl_tokens = tokenizer('\n').input_ids _system = tokenizer('system').input_ids + nl_tokens _user = tokenizer('user').input_ids + nl_tokens _assistant = tokenizer('assistant').input_ids + nl_tokens # Apply prompt templates input_ids, targets = [], [] for i, source in enumerate(sources): if roles[source[0]["from"]] != roles["user"]: source = source[1:] input_id, target = [], [] system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens input_id += system target += [im_start] + [IGNORE_TOKEN_ID] * (len(system)-3) + [im_end] + nl_tokens assert len(input_id) == len(target) for j, sentence in enumerate(source): role = roles[sentence["from"]] _input_id = tokenizer(role).input_ids + nl_tokens + \ tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens input_id += _input_id if role == '<|im_start|>user': _target = [im_start] + [IGNORE_TOKEN_ID] * (len(_input_id)-3) + [im_end] + nl_tokens elif role == '<|im_start|>assistant': _target = [im_start] + [IGNORE_TOKEN_ID] * len(tokenizer(role).input_ids) + \ _input_id[len(tokenizer(role).input_ids)+1:-2] + [im_end] + nl_tokens else: raise NotImplementedError target += _target assert len(input_id) == len(target) input_id += [tokenizer.pad_token_id] * (max_len - len(input_id)) target += [IGNORE_TOKEN_ID] * (max_len - len(target)) input_ids.append(input_id[:max_len]) targets.append(target[:max_len]) input_ids = torch.tensor(input_ids, dtype=torch.int) targets = torch.tensor(targets, dtype=torch.int) return dict( 
input_ids=input_ids, labels=targets, attention_mask=input_ids.ne(tokenizer.pad_token_id), ) class SupervisedDataset(Dataset): """Dataset for supervised fine-tuning.""" def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int): super(SupervisedDataset, self).__init__() rank0_print("Formatting inputs...") sources = [example["conversations"] for example in raw_data] data_dict = preprocess(sources, tokenizer, max_len) self.input_ids = data_dict["input_ids"] self.labels = data_dict["labels"] self.attention_mask = data_dict["attention_mask"] def __len__(self): return len(self.input_ids) def __getitem__(self, i) -> Dict[str, torch.Tensor]: return dict( input_ids=self.input_ids[i], labels=self.labels[i], attention_mask=self.attention_mask[i], ) class LazySupervisedDataset(Dataset): """Dataset for supervised fine-tuning.""" def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int): super(LazySupervisedDataset, self).__init__() self.tokenizer = tokenizer self.max_len = max_len rank0_print("Formatting inputs...Skip in lazy mode") self.tokenizer = tokenizer self.raw_data = raw_data self.cached_data_dict = {} def __len__(self): return len(self.raw_data) def __getitem__(self, i) -> Dict[str, torch.Tensor]: if i in self.cached_data_dict: return self.cached_data_dict[i] ret = preprocess([self.raw_data[i]["conversations"]], self.tokenizer, self.max_len) ret = dict( input_ids=ret["input_ids"][0], labels=ret["labels"][0], attention_mask=ret["attention_mask"][0], ) self.cached_data_dict[i] = ret return ret def make_supervised_data_module( tokenizer: transformers.PreTrainedTokenizer, data_args, max_len, ) -> Dict: """Make dataset and collator for supervised fine-tuning.""" dataset_cls = ( LazySupervisedDataset if data_args.lazy_preprocess else SupervisedDataset ) rank0_print("Loading data...") train_json = json.load(open(data_args.data_path, "r")) train_dataset = dataset_cls(train_json, tokenizer=tokenizer, max_len=max_len) if 
data_args.eval_data_path: eval_json = json.load(open(data_args.eval_data_path, "r")) eval_dataset = dataset_cls(eval_json, tokenizer=tokenizer, max_len=max_len) else: eval_dataset = None return dict(train_dataset=train_dataset, eval_dataset=eval_dataset) def train(): global local_rank parser = transformers.HfArgumentParser( (ModelArguments, DataArguments, TrainingArguments, LoraArguments) ) ( model_args, data_args, training_args, lora_args, ) = parser.parse_args_into_dataclasses() if getattr(training_args, 'deepspeed', None) and getattr(lora_args, 'q_lora', False): training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED compute_dtype = ( torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32) ) local_rank = training_args.local_rank device_map = None world_size = int(os.environ.get("WORLD_SIZE", 1)) ddp = world_size != 1 if lora_args.q_lora: device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled(): logging.warning( "FSDP and ZeRO3 are incompatible with QLoRA."
) # Set RoPE scaling factor config = transformers.AutoConfig.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, trust_remote_code=True, ) config.use_cache = False # Load model and tokenizer model = transformers.AutoModelForCausalLM.from_pretrained( model_args.model_name_or_path, config=config, cache_dir=training_args.cache_dir, device_map=device_map, trust_remote_code=True, quantization_config=GPTQConfig( bits=4, disable_exllama=True ) if training_args.use_lora and lora_args.q_lora else None, ) if not training_args.use_lora: if training_args.fix_vit and hasattr(model,'transformer') and hasattr(model.transformer,'visual'): model.transformer.visual.requires_grad_(False) if hasattr(model.transformer.visual,'attn_pool'): model.transformer.visual.attn_pool.requires_grad_(True) tokenizer = transformers.AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right", use_fast=False, trust_remote_code=True, ) tokenizer.pad_token_id = tokenizer.eod_id if training_args.use_lora: if lora_args.q_lora or "chat" in model_args.model_name_or_path.lower(): modules_to_save = None else: modules_to_save = ["wte", "lm_head"] lora_config = LoraConfig( r=lora_args.lora_r, lora_alpha=lora_args.lora_alpha, target_modules=lora_args.lora_target_modules, lora_dropout=lora_args.lora_dropout, bias=lora_args.lora_bias, task_type="CAUSAL_LM", modules_to_save=modules_to_save # This argument serves for adding new tokens. 
) if lora_args.q_lora: model = prepare_model_for_kbit_training( model, use_gradient_checkpointing=training_args.gradient_checkpointing ) model = get_peft_model(model, lora_config) if training_args.gradient_checkpointing: model.enable_input_require_grads() # Load data data_module = make_supervised_data_module( tokenizer=tokenizer, data_args=data_args, max_len=training_args.model_max_length ) # Start trainer trainer = Trainer( model=model, tokenizer=tokenizer, args=training_args, **data_module ) trainer.train() trainer.save_state() safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir, bias=lora_args.lora_bias) if __name__ == "__main__": train() ================================================ FILE: openai_api.py ================================================ # coding=utf-8 # Implements API for Qwen-7B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat) # Usage: python openai_api.py # Visit http://localhost:8000/docs for documentation. import re import copy import json import time from argparse import ArgumentParser from contextlib import asynccontextmanager from typing import Dict, List, Literal, Optional, Union import torch import uvicorn from fastapi import FastAPI, HTTPException from fastapi.middleware.cors import CORSMiddleware from pydantic import BaseModel, Field from sse_starlette.sse import EventSourceResponse from transformers import AutoTokenizer, AutoModelForCausalLM from transformers.generation import GenerationConfig @asynccontextmanager async def lifespan(app: FastAPI): # release cached GPU memory on shutdown yield if torch.cuda.is_available(): torch.cuda.empty_cache() torch.cuda.ipc_collect() app = FastAPI(lifespan=lifespan) app.add_middleware( CORSMiddleware, allow_origins=["*"], allow_credentials=True, allow_methods=["*"], allow_headers=["*"], ) class ModelCard(BaseModel): id: str object: str = "model" created: int = Field(default_factory=lambda: int(time.time())) owned_by: str = "owner" root: Optional[str] = None
parent: Optional[str] = None permission: Optional[list] = None class ModelList(BaseModel): object: str = "list" data: List[ModelCard] = [] class ChatMessage(BaseModel): role: Literal["user", "assistant", "system", "function"] content: Optional[str] function_call: Optional[Dict] = None class DeltaMessage(BaseModel): role: Optional[Literal["user", "assistant", "system"]] = None content: Optional[str] = None class ChatCompletionRequest(BaseModel): model: str messages: List[ChatMessage] functions: Optional[List[Dict]] = None temperature: Optional[float] = None top_p: Optional[float] = None max_length: Optional[int] = None stream: Optional[bool] = False stop: Optional[List[str]] = None class ChatCompletionResponseChoice(BaseModel): index: int message: ChatMessage finish_reason: Literal["stop", "length", "function_call"] class ChatCompletionResponseStreamChoice(BaseModel): index: int delta: DeltaMessage finish_reason: Optional[Literal["stop", "length"]] class ChatCompletionResponse(BaseModel): model: str object: Literal["chat.completion", "chat.completion.chunk"] choices: List[ Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice] ] created: Optional[int] = Field(default_factory=lambda: int(time.time())) @app.get("/v1/models", response_model=ModelList) async def list_models(): global model_args model_card = ModelCard(id="gpt-3.5-turbo") return ModelList(data=[model_card]) # To work around that unpleasant leading-\n tokenization issue! def add_extra_stop_words(stop_words): if stop_words: _stop_words = [] _stop_words.extend(stop_words) for x in stop_words: s = x.lstrip("\n") if s and (s not in _stop_words): _stop_words.append(s) return _stop_words return stop_words def trim_stop_words(response, stop_words): if stop_words: for stop in stop_words: idx = response.find(stop) if idx != -1: response = response[:idx] return response TOOL_DESC = """{name_for_model}: Call this tool to interact with the {name_for_human} API. 
What is the {name_for_human} API useful for? {description_for_model} Parameters: {parameters}""" REACT_INSTRUCTION = """Answer the following questions as best you can. You have access to the following APIs: {tools_text} Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [{tools_name_text}] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can be repeated zero or more times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin!""" _TEXT_COMPLETION_CMD = object() # # Temporarily, the system role does not work as expected. # We advise that you write the setups for role-play in your query, # i.e., use the user role instead of the system role. # # TODO: Use real system role when the model is ready. # def parse_messages(messages, functions): if all(m.role != "user" for m in messages): raise HTTPException( status_code=400, detail=f"Invalid request: Expecting at least one user message.", ) messages = copy.deepcopy(messages) default_system = "You are a helpful assistant." system = "" if messages[0].role == "system": system = messages.pop(0).content.lstrip("\n").rstrip() if system == default_system: system = "" if functions: tools_text = [] tools_name_text = [] for func_info in functions: name = func_info.get("name", "") name_m = func_info.get("name_for_model", name) name_h = func_info.get("name_for_human", name) desc = func_info.get("description", "") desc_m = func_info.get("description_for_model", desc) tool = TOOL_DESC.format( name_for_model=name_m, name_for_human=name_h, # Hint: You can add the following format requirements in description: # "Format the arguments as a JSON object." # "Enclose the code within triple backticks (`) at the beginning and end of the code." 
description_for_model=desc_m, parameters=json.dumps(func_info["parameters"], ensure_ascii=False), ) tools_text.append(tool) tools_name_text.append(name_m) tools_text = "\n\n".join(tools_text) tools_name_text = ", ".join(tools_name_text) system += "\n\n" + REACT_INSTRUCTION.format( tools_text=tools_text, tools_name_text=tools_name_text, ) system = system.lstrip("\n").rstrip() dummy_thought = { "en": "\nThought: I now know the final answer.\nFinal answer: ", "zh": "\nThought: 我会作答了。\nFinal answer: ", } _messages = messages messages = [] for m_idx, m in enumerate(_messages): role, content, func_call = m.role, m.content, m.function_call if content: content = content.lstrip("\n").rstrip() if role == "function": if (len(messages) == 0) or (messages[-1].role != "assistant"): raise HTTPException( status_code=400, detail=f"Invalid request: Expecting role assistant before role function.", ) messages[-1].content += f"\nObservation: {content}" if m_idx == len(_messages) - 1: messages[-1].content += "\nThought:" elif role == "assistant": if len(messages) == 0: raise HTTPException( status_code=400, detail=f"Invalid request: Expecting role user before role assistant.", ) last_msg = messages[-1].content last_msg_has_zh = len(re.findall(r"[\u4e00-\u9fff]+", last_msg)) > 0 if func_call is None: if functions: content = dummy_thought["zh" if last_msg_has_zh else "en"] + content else: f_name, f_args = func_call["name"], func_call["arguments"] if not content: if last_msg_has_zh: content = f"Thought: 我可以使用 {f_name} API。" else: content = f"Thought: I can use {f_name}." content = f"\n{content}\nAction: {f_name}\nAction Input: {f_args}" if messages[-1].role == "user": messages.append( ChatMessage(role="assistant", content=content.lstrip("\n").rstrip()) ) else: messages[-1].content += content elif role == "user": messages.append( ChatMessage(role="user", content=content.lstrip("\n").rstrip()) ) else: raise HTTPException( status_code=400, detail=f"Invalid request: Incorrect role {role}." 
) query = _TEXT_COMPLETION_CMD if messages[-1].role == "user": query = messages[-1].content messages = messages[:-1] if len(messages) % 2 != 0: raise HTTPException(status_code=400, detail="Invalid request") history = [] # [(Q1, A1), (Q2, A2), ..., (Q_last_turn, A_last_turn)] for i in range(0, len(messages), 2): if messages[i].role == "user" and messages[i + 1].role == "assistant": usr_msg = messages[i].content.lstrip("\n").rstrip() bot_msg = messages[i + 1].content.lstrip("\n").rstrip() if system and (i == len(messages) - 2): usr_msg = f"{system}\n\nQuestion: {usr_msg}" system = "" for t in dummy_thought.values(): t = t.lstrip("\n") if bot_msg.startswith(t) and ("\nAction: " in bot_msg): bot_msg = bot_msg[len(t) :] history.append([usr_msg, bot_msg]) else: raise HTTPException( status_code=400, detail="Invalid request: Expecting exactly one user (or function) role before every assistant role.", ) if system: assert query is not _TEXT_COMPLETION_CMD query = f"{system}\n\nQuestion: {query}" return query, history def parse_response(response): func_name, func_args = "", "" i = response.rfind("\nAction:") j = response.rfind("\nAction Input:") k = response.rfind("\nObservation:") if 0 <= i < j: # If the text has `Action` and `Action input`, if k < j: # but does not contain `Observation`, # then it is likely that `Observation` is omitted by the LLM, # because the output text may have discarded the stop word. response = response.rstrip() + "\nObservation:" # Add it back. 
k = response.rfind("\nObservation:") func_name = response[i + len("\nAction:") : j].strip() func_args = response[j + len("\nAction Input:") : k].strip() if func_name: choice_data = ChatCompletionResponseChoice( index=0, message=ChatMessage( role="assistant", content=response[:i], function_call={"name": func_name, "arguments": func_args}, ), finish_reason="function_call", ) return choice_data z = response.rfind("\nFinal Answer: ") if z >= 0: response = response[z + len("\nFinal Answer: ") :] choice_data = ChatCompletionResponseChoice( index=0, message=ChatMessage(role="assistant", content=response), finish_reason="stop", ) return choice_data # completion mode, not chat mode def text_complete_last_message(history, stop_words_ids): im_start = "<|im_start|>" im_end = "<|im_end|>" prompt = f"{im_start}system\nYou are a helpful assistant.{im_end}" for i, (query, response) in enumerate(history): query = query.lstrip("\n").rstrip() response = response.lstrip("\n").rstrip() prompt += f"\n{im_start}user\n{query}{im_end}" prompt += f"\n{im_start}assistant\n{response}{im_end}" prompt = prompt[: -len(im_end)] _stop_words_ids = [tokenizer.encode(im_end)] if stop_words_ids: for s in stop_words_ids: _stop_words_ids.append(s) stop_words_ids = _stop_words_ids input_ids = torch.tensor([tokenizer.encode(prompt)]).to(model.device) output = model.generate(input_ids, stop_words_ids=stop_words_ids).tolist()[0] output = tokenizer.decode(output, errors="ignore") assert output.startswith(prompt) output = output[len(prompt) :] output = trim_stop_words(output, ["<|endoftext|>", im_end]) print(f"<completion>\n{prompt}\n\n{output}\n</completion>") return output @app.post("/v1/chat/completions", response_model=ChatCompletionResponse) async def create_chat_completion(request: ChatCompletionRequest): global model, tokenizer stop_words = add_extra_stop_words(request.stop) if request.functions: stop_words = stop_words or [] if "Observation:" not in stop_words: stop_words.append("Observation:") query, 
history = parse_messages(request.messages, request.functions) if request.stream: if request.functions: raise HTTPException( status_code=400, detail="Invalid request: Function calling is not yet implemented for stream mode.", ) # generate = predict(query, history, request.model, stop_words) # return EventSourceResponse(generate, media_type="text/event-stream") raise HTTPException(status_code=400, detail="Stream request is not supported currently.") stop_words_ids = [tokenizer.encode(s) for s in stop_words] if stop_words else None if query is _TEXT_COMPLETION_CMD: response = text_complete_last_message(history, stop_words_ids=stop_words_ids) else: response, _ = model.chat( tokenizer, query, history=history, stop_words_ids=stop_words_ids, append_history=False, top_p=request.top_p, temperature=request.temperature, ) print(f"<chat>\n{history}\n{query}\n\n{response}\n</chat>") response = trim_stop_words(response, stop_words) if request.functions: choice_data = parse_response(response) else: choice_data = ChatCompletionResponseChoice( index=0, message=ChatMessage(role="assistant", content=response), finish_reason="stop", ) return ChatCompletionResponse( model=request.model, choices=[choice_data], object="chat.completion" ) async def predict( query: str, history: List[List[str]], model_id: str, stop_words: List[str] ): global model, tokenizer choice_data = ChatCompletionResponseStreamChoice( index=0, delta=DeltaMessage(role="assistant"), finish_reason=None ) chunk = ChatCompletionResponse( model=model_id, choices=[choice_data], object="chat.completion.chunk" ) yield "{}".format(chunk.model_dump_json(exclude_unset=True)) current_length = 0 stop_words_ids = [tokenizer.encode(s) for s in stop_words] if stop_words else None if stop_words: # TODO: It's a little bit tricky to trim stop words in the stream mode. 
raise HTTPException( status_code=400, detail="Invalid request: custom stop words are not yet supported for stream mode.", ) response_generator = model.chat_stream( tokenizer, query, history=history, stop_words_ids=stop_words_ids ) for new_response in response_generator: if len(new_response) == current_length: continue new_text = new_response[current_length:] current_length = len(new_response) choice_data = ChatCompletionResponseStreamChoice( index=0, delta=DeltaMessage(content=new_text), finish_reason=None ) chunk = ChatCompletionResponse( model=model_id, choices=[choice_data], object="chat.completion.chunk" ) yield "{}".format(chunk.model_dump_json(exclude_unset=True)) choice_data = ChatCompletionResponseStreamChoice( index=0, delta=DeltaMessage(), finish_reason="stop" ) chunk = ChatCompletionResponse( model=model_id, choices=[choice_data], object="chat.completion.chunk" ) yield "{}".format(chunk.model_dump_json(exclude_unset=True)) yield "[DONE]" def _get_args(): parser = ArgumentParser() parser.add_argument( "-c", "--checkpoint-path", type=str, default="QWen/QWen-7B-Chat", help="Checkpoint name or path, default to %(default)r", ) parser.add_argument( "--cpu-only", action="store_true", help="Run demo with CPU only" ) parser.add_argument( "--server-port", type=int, default=8000, help="Demo server port." ) parser.add_argument( "--server-name", type=str, default="127.0.0.1", help="Demo server name. Default: 127.0.0.1, which is only visible from the local computer." 
" If you want other computers to access your server, use 0.0.0.0 instead.", ) args = parser.parse_args() return args if __name__ == "__main__": args = _get_args() tokenizer = AutoTokenizer.from_pretrained( args.checkpoint_path, trust_remote_code=True, resume_download=True, ) if args.cpu_only: device_map = "cpu" else: device_map = "auto" model = AutoModelForCausalLM.from_pretrained( args.checkpoint_path, device_map=device_map, trust_remote_code=True, resume_download=True, ).eval() model.generation_config = GenerationConfig.from_pretrained( args.checkpoint_path, trust_remote_code=True, resume_download=True, ) uvicorn.run(app, host=args.server_name, port=args.server_port, workers=1) ================================================ FILE: requirements.txt ================================================ transformers==4.32.0 accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib ================================================ FILE: requirements_openai_api.txt ================================================ fastapi uvicorn openai pydantic sse_starlette ================================================ FILE: requirements_web_demo.txt ================================================ gradio modelscope ================================================ FILE: touchstone/README.md ================================================ <img src="../assets/touchstone_logo.png" width="300"/> <a href="../touchstone/README_CN.md">中文</a> ｜ English ｜ <a href="../touchstone/README_JA.md">日本語</a>｜ <a href="../touchstone/README_KO.md">한국어</a> **TOUCHSTONE** is a comprehensive assessment of multimodal language models, encompassing not only basic recognition and comprehension but also extending to literary creation. 
By automating the evaluation process and converting multimodal information into text, our TouchStone allows for efficient and accurate assessment of dialogue quality, leveraging the power of advanced language models without the need for manual intervention.

## DATASET

To evaluate the abilities of LVLMs, we construct a diverse and comprehensive dataset that covers five key dimensions: basic descriptive ability, visual recognition ability, visual comprehension ability, visual storytelling ability, and multi-image analysis ability.

- **Basic Descriptive Ability** Image description involves the ability of a model to describe the information contained in an image, including simple and detailed descriptions. Simple descriptions are typically short phrases that describe the main subject and action of the image, while detailed descriptions provide more in-depth information about the image scene, their attributes, and relationships.
- **Visual Recognition Ability** Image recognition is the task of recognizing objects or scenes within an image and inferring relevant information. This area can be further divided into several sub-tasks, including attribute QA, movie/TV recognition, art recognition, landmark recognition, celebrity recognition, emotion recognition, text recognition, object recognition, and structure content recognition.
- **Visual Comprehension Ability** Image understanding involves the ability of a model to understand the meaning of an image and associated tasks. This area encompasses several sub-tasks, such as style appreciation, abstract image understanding, meme understanding, image analysis, chart analysis, general problem-solving, and reasoning QA.
- **Visual Storytelling Ability** The visual storytelling ability is the process of literary creation based on visual content, including writing emails, poetry, stories, ads/commodity recommendations, and brainstorming.
- **Multi-Image Analysis Ability** Multi-image analysis is the task of analyzing and comparing multiple images. This area includes tasks such as comparing two/multiple images, summarizing multiple image information, comparing commodities, and step-by-step analysis of images.

<img src="../assets/touchstone_datasets.jpg" width="600"/>

We comprehensively evaluate the model's ability along five dimensions. The figure above shows examples from the 27 subtasks. Difficulty increases from perception to cognition to creativity, and so do the demands placed on models. Currently, LVLM capabilities are in their early stages. Our dataset contains 800+ questions and 27 categories.

## Methods

We apply a powerful LLM as a judge to enable automated evaluation. So that the judge can effectively comprehend the contents of an image, we manually substitute the actual image input with fine-grained textual annotations. By inputting these annotations and the corresponding questions to a powerful LLM like GPT4, we obtain reference answers.

For the evaluation of the LVLMs, we provide actual images and questions as input and obtain their respective answers. Finally, we employ GPT4 to score the answers generated by the LVLMs based on the fine-grained annotations and questions. The scoring instructions require the model to assess the usefulness, relevance, and accuracy of the answers, considering the annotations as the content of the images. To ensure fairness in the evaluation, each model's answer is compared against a consistent reference answer from GPT4. The average score of the model in all questions is taken as the final score.

To eliminate the influence of answer position, we perform a second scoring round by swapping the positions of the answers and then compute the average of the two scores obtained. This approach aims to mitigate any bias introduced by the placement of the answers.
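
The two-round scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TouchStone implementation: the `Judge` callable, `position_debiased_score`, `final_score`, and `toy_judge` are hypothetical names introduced here, and `toy_judge` is a deterministic stand-in for the GPT4 judge so that the example runs offline.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (not part of this repository): given the question,
# the textual annotation standing in for the image, and two answers in order,
# it returns one score per answer, in that same order.
Judge = Callable[[str, str, str, str], Tuple[float, float]]


def position_debiased_score(judge: Judge, question: str, annotation: str,
                            model_answer: str, reference_answer: str) -> float:
    """Score the model answer twice with the answer order swapped, then average."""
    # Round 1: the model answer occupies the first slot.
    first_round, _ = judge(question, annotation, model_answer, reference_answer)
    # Round 2: the answers are swapped, so the model answer sits in the second slot.
    _, second_round = judge(question, annotation, reference_answer, model_answer)
    return (first_round + second_round) / 2


def final_score(per_question_scores: Sequence[float]) -> float:
    """Average the per-question scores, as the text describes."""
    return mean(per_question_scores)


def toy_judge(question: str, annotation: str, first: str, second: str) -> Tuple[float, float]:
    """Stand-in for the LLM judge: counts annotation words found in each answer
    and adds a +1 bonus to the first slot to mimic position bias."""
    vocab = set(annotation.lower().split())

    def overlap(answer: str) -> float:
        return float(len(vocab & set(answer.lower().split())))

    return overlap(first) + 1.0, overlap(second)


if __name__ == "__main__":
    score = position_debiased_score(
        toy_judge,
        question="What is on the table?",
        annotation="a red apple on a table",
        model_answer="red apple",
        reference_answer="a red apple on a wooden table",
    )
    print(score)  # 2.5: the +1 first-slot bonus is halved by averaging both orders
```

With the toy judge, the model answer scores 3.0 when it is listed first and 2.0 when listed second, so the averaged score of 2.5 no longer depends on which slot it happened to occupy.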
<img src="../assets/touchstone_eval.png" width="600"/>

### Evaluation

#### Evaluation in English-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### Evaluation in Chinese-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: touchstone/README_CN.md
================================================
<img src="../assets/touchstone_logo.png" width="300"/>

中文｜ <a href="../touchstone/README.md">English</a> ｜ <a href="../touchstone/README_JA.md">日本語</a>

**TOUCHSTONE** 是一种针对多模态语言模型（LVLM）的自动化综合评估方法，评估不仅包括基本的认知和理解，还延伸到文学创作。通过人类注解将多模态信息转换为文本，我们的 TouchStone 可以利用SOTA的语言模型来自动化地完成对LVLMs的多模态对话质量评估。

## 数据集

为了评估 LVLMs 的能力，我们构建了一个多样化且全面的数据集，涵盖五个关键维度：基本描述能力、视觉识别能力、视觉理解能力、视觉叙事能力和多图分析能力。

- **基本描述能力** 图像描述考验模型总结图片信息的能力，包括简单描述和详细描述。简单描述通常是描述图像的主要内容和关系的简短短语，而详细描述则提供有关图像场景、其属性和关系的更深入的信息。

- **视觉识别能力** 图像识别考察模型提取图像中内容的属性以及关联到知识库的能力。为了考察这方面能力，测试的问题包括属性QA、影视识别、艺术识别、地标识别、名人识别、情感识别、文本识别、物体识别和结构内容识别。

- **视觉理解能力** 图像理解需要模型理解图像内容并完成推理进行相关任务。这方面包含了例如风格欣赏、抽象图像理解、模因理解、图像分析、图表分析、一般问题解决和推理问答等任务。

- **视觉叙事能力** 视觉叙事能力是基于视觉内容的文学创作能力，包括撰写电子邮件、诗歌、故事、广告/商品推荐、头脑风暴等。

- **多图分析能力** 多图分析是分析和比较多幅图像的任务。该领域包括比较两个/多个图像、总结多个图像信息、比较商品以及逐步分析图像等任务。

<img src="../assets/touchstone_datasets.jpg" width="600"/>

我们从五个维度综合评估了模型的能力。如上图所示，给出了27个子任务的示例。从感知到认知，再到创造力，随着难度的增加，对模型的要求也越来越高。目前，LVLM的能力还处于早期阶段。我们的数据集包含800+道题目、27个类别。

## 测评方式

我们应用SOTA的LLM进行自动化评估。为了有效地理解图像的内容，我们人工用细粒度的文本注释替换实际的图像输入。通过将这些注释和相应的问题输入到像GPT4这样强LLM中，我们可以获得参考答案。

对于待测评的LVLM，我们提供实际图像和问题作为输入并获得各自的答案。最后，我们使用GPT4根据细粒度注释和问题对LVLM生成的答案进行评分。评分指令要求模型评估答案的有用性、相关性和准确性，并将人工注解视为图像的内容。为了确保评估的公平性，每个模型的答案都会与 GPT4生成的参考答案进行比较。模型在所有问题上的平均得分作为最终得分。

为了消除答案位置的影响，我们通过交换答案的位置来进行第二轮评分，然后计算获得的两次分数的平均值。

<img src="../assets/touchstone_eval.png" width="600"/>

## 测评结果

#### 英文版本测评

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### 中文版本测评

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: touchstone/README_JA.md
================================================
<img src="../assets/touchstone_logo.png" width="300"/>

<a href="../touchstone/README_CN.md">中文</a> ｜ <a href="../touchstone/README.md">English</a>｜日本語

**TOUCHSTONE** は、マルチモーダル言語モデルの包括的な評価であり、基本的な認識や理解だけでなく、文学的な創作にまで及びます。評価プロセスを自動化し、マルチモーダル情報をテキストに変換することで、私達の TouchStone は、人手を介することなく高度な言語モデルの力を活用し、対話の質を効率的かつ正確に評価することができます。

## DATASET

LVLMの能力を評価するために、基本的な記述能力、視覚認識能力、視覚理解能力、視覚ストーリーテリング能力、複数画像解析能力の5つの主要な次元をカバーする多様で包括的なデータセットを構築する。

- **基本的描写力** 画像記述には、単純な記述と詳細な記述を含め、画像に含まれる情報を記述するモデルの能力が含まれる。単純な記述は、通常、画像の主な主題とアクションを記述する短いフレーズであり、詳細な記述は、画像のシーン、それらの属性、および関係についてのより詳細な情報を提供します。

- **視覚認識能力** 画像認識とは、画像内のオブジェクトやシーンを認識し、関連情報を推論するタスクである。この分野はさらに、属性QA、映画/テレビ認識、アート認識、ランドマーク認識、有名人認識、感情認識、テキスト認識、オブジェクト認識、構造コンテンツ認識など、いくつかのサブタスクに分けることができる。

- **視覚理解能力** 画像理解とは、モデルが画像の意味や関連するタスクを理解する能力のことである。この分野には、スタイル理解、抽象画像理解、ミーム理解、画像分析、チャート分析、一般的な問題解決、推論QAなど、いくつかのサブタスクが含まれる。

- **視覚的ストーリーテリング能力** ビジュアルストーリーテリング能力とは、メール、詩、物語、広告／商品推薦、ブレーンストーミングの執筆など、ビジュアルコンテンツに基づいた文学創作のプロセスである。

- **マルチ画像解析能力** 複数画像解析とは、複数の画像を解析・比較する作業である。この分野には、2つまたは複数の画像を比較する、複数の画像情報を要約する、商品を比較する、画像を段階的に分析するなどのタスクが含まれます。

<img src="../assets/touchstone_datasets.jpg" width="600"/>

モデルの能力を 5 つの次元から総合的に評価する。上図のように、27 のサブタスクの例を示す。知覚から認知、創造性まで、難易度が上がるにつれて、モデルに求められる要件もどんどん高くなっている。現在、LVLM の機能は初期段階にある。我々のデータセットには 800 以上の質問と 27 のカテゴリーが含まれている。

## 方法

自動評価を可能にするために、強力な LLM を判定器として適用する。画像の内容を効果的に理解するために、実際の画像入力をきめ細かいテキスト注釈に手動で置き換える。これらの注釈と対応する質問を GPT4 のような強力な LLM に入力することで、参照解答を得る。
LVLMの評価には、実際の画像と質問を入力として与え、それぞれの回答を得る。最後に、GPT4を用いて、LVLMが生成した回答を、細かいアノテーションと質問に基づいてスコアリングする。スコアリングの指示は、注釈を画像の内容とみなして、回答の有用性、関連性、正確性を評価するようモデルに要求する。評価の公平性を確保するため、各モデルの回答はGPT4の一貫した参照回答と比較されます。全問題におけるモデルの平均スコアを最終スコアとする。

解答位置の影響を排除するために、解答位置を入れ替えて2回目の採点ラウンドを行い、得られた2つのスコアの平均を計算します。このアプローチは、解答の配置によって生じるバイアスを軽減することを目的としています。

<img src="../assets/touchstone_eval.png" width="600"/>

### 評価

#### 英語ベースのマルチモーダル対話における評価

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### 中国語ベースのマルチモーダル対話における評価

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: touchstone/README_KO.md
================================================
<img src="../assets/touchstone_logo.png" width="300"/>

<a href="../touchstone/README_CN.md">中文</a> ｜ <a href="../touchstone/README.md">English</a> ｜ <a href="../touchstone/README_JA.md">日本語</a> ｜ 한국어

**터치스톤, TOUCHSTONE**은 기본적인 인식과 이해력뿐만 아니라 문학 창작까지 아우르는 종합적인 멀티모달 언어 모델 평가입니다. 평가 프로세스를 자동화하고 멀티모달 정보를 텍스트로 변환하는 터치스톤은 수동 개입 없이도 고급 언어 모델의 성능을 활용하여 대화 품질을 효율적이고 정확하게 평가할 수 있도록 지원합니다.

## DATASET

LVLM의 능력을 평가하기 위해 기본 설명 능력, 시각 인식 능력, 시각 이해 능력, 시각 스토리텔링 능력, 다중 이미지 분석 능력 등 5가지 주요 차원을 포괄하는 다양하고 광범위한 데이터 세트를 구축합니다.

- **기본 설명 능력, Basic Descriptive Ability** 이미지 설명에는 단순 설명과 상세 설명을 포함하여 이미지에 포함된 정보를 설명하는 모델의 능력이 포함됩니다. 단순 설명은 일반적으로 이미지의 주요 주제와 동작을 설명하는 짧은 문구이며, 상세 설명은 이미지 장면, 속성 및 관계에 대한 보다 심층적인 정보를 제공합니다.

- **시각적 인식 능력, Visual Recognition Ability** 이미지 인식은 이미지 내의 사물이나 장면을 인식하고 관련 정보를 추론하는 작업입니다. 이 영역은 속성 QA, 영화/TV 인식, 예술 인식, 랜드마크 인식, 유명인 인식, 감정 인식, 텍스트 인식, 사물 인식, 구조물 내용 인식 등 여러 하위 작업으로 세분화할 수 있습니다.

- **시각적 이해 능력, Visual Comprehension Ability** 이미지 이해에는 이미지의 의미와 관련 작업을 이해하는 모델의 능력이 포함됩니다. 이 영역에는 스타일 감상, 추상적 이미지 이해, 밈 이해, 이미지 분석, 차트 분석, 일반적인 문제 해결, 추론 QA와 같은 여러 하위 작업이 포함됩니다.
- **시각적 스토리텔링 능력, Visual Storytelling Ability** 시각적 스토리텔링 능력은 이메일, 시, 스토리, 광고/상품 추천, 브레인스토밍 등 시각적 콘텐츠를 기반으로 문학적 창작을 하는 과정입니다.

- **다중 이미지 분석 능력, Multi-Image Analysis Ability** 다중 이미지 분석은 여러 이미지를 분석하고 비교하는 작업입니다. 이 영역에는 두 개/여러 개의 이미지 비교, 여러 이미지 정보 요약, 상품 비교, 이미지의 단계별 분석 등의 작업이 포함됩니다.

<img src="../assets/touchstone_datasets.jpg" width="600"/>

5가지 측면에서 모델의 능력을 종합적으로 평가합니다. 위 그림과 같이 27개의 하위 과제를 예로 들었습니다. 지각부터 인지, 창의력까지 난이도가 높아질수록 모델에 대한 요구 사항도 점점 더 높아지고 있습니다. 현재 LVLM 기능은 초기 단계에 있습니다. 데이터 세트에는 800개 이상의 질문과 27개 카테고리가 포함되어 있습니다.

## Methods

당사는 자동화된 평가를 위해 강력한 LLM을 심사자로 적용합니다. 이미지의 내용을 효과적으로 이해하기 위해 실제 이미지 입력을 세분화된 텍스트 주석으로 수동으로 대체합니다. 이러한 주석과 해당 질문을 GPT4와 같은 강력한 LLM에 입력하면 참조 답변을 얻을 수 있습니다.

LVLM의 평가를 위해 실제 이미지와 질문을 입력으로 제공하고 각각의 답변을 얻습니다. 마지막으로, 세분화된 주석과 질문을 기반으로 LVLM이 생성한 답변에 GPT4를 사용하여 점수를 매깁니다. 채점 지침에 따라 모델은 주석을 이미지의 콘텐츠로 간주하여 답변의 유용성, 관련성 및 정확성을 평가해야 합니다. 평가의 공정성을 보장하기 위해 각 모델의 답변은 GPT4의 일관된 참조 답변과 비교됩니다. 모든 문제에서 모델의 평균 점수가 최종 점수로 사용됩니다.

답안 위치의 영향을 제거하기 위해 답안 위치를 바꿔서 두 번째 채점 라운드를 수행한 다음 얻은 두 점수의 평균을 계산합니다. 이 접근 방식은 답안 배치로 인해 발생하는 편향을 완화하는 것을 목표로 합니다.

<img src="../assets/touchstone_eval.png" width="600"/>

### Evaluation

#### Evaluation in English-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### Evaluation in Chinese-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: web_demo_mm.py
================================================
# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

"""A simple web interactive chat demo based on gradio."""

from argparse import ArgumentParser
from pathlib import Path

import copy
import gradio as gr
import os
import re
import secrets
import tempfile
from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)

DEFAULT_CKPT_PATH = 'qwen/Qwen-VL-Chat'
BOX_TAG_PATTERN = r"<box>([\s\S]*?)</box>"
PUNCTUATION = "！？。＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."


def _get_args():
    parser = ArgumentParser()
    parser.add_argument("-c", "--checkpoint-path", type=str, default=DEFAULT_CKPT_PATH,
                        help="Checkpoint name or path, default to %(default)r")
    parser.add_argument("--cpu-only", action="store_true", help="Run demo with CPU only")

    parser.add_argument("--share", action="store_true", default=False,
                        help="Create a publicly shareable link for the interface.")
    parser.add_argument("--inbrowser", action="store_true", default=False,
                        help="Automatically launch the interface in a new tab on the default browser.")
    parser.add_argument("--server-port", type=int, default=8000,
                        help="Demo server port.")
    parser.add_argument("--server-name", type=str, default="127.0.0.1",
                        help="Demo server name.")

    args = parser.parse_args()
    return args


def _load_model_tokenizer(args):
    tokenizer = AutoTokenizer.from_pretrained(
        args.checkpoint_path, trust_remote_code=True, resume_download=True, revision='master',
    )

    if args.cpu_only:
        device_map = "cpu"
    else:
        device_map = "cuda"

    model = AutoModelForCausalLM.from_pretrained(
        args.checkpoint_path,
        device_map=device_map,
        trust_remote_code=True,
        resume_download=True,
        revision='master',
    ).eval()
    model.generation_config = GenerationConfig.from_pretrained(
        args.checkpoint_path, trust_remote_code=True, resume_download=True, revision='master',
    )

    return model, tokenizer


def _parse_text(text):
    # Convert model output to HTML for the chat window: fenced code becomes
    # <pre><code> blocks and every other line gets a leading <br>.
    lines = text.split("\n")
    lines = [line for line in lines if line != ""]
    count = 0
    for i, line in enumerate(lines):
        if "```" in line:
            count += 1
            items = line.split("`")
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = f"<br></code></pre>"
        else:
            if i > 0:
                if count % 2 == 1:
                    # Inside a code block: escape characters that Markdown or
                    # HTML would otherwise interpret.
                    line = line.replace("`", r"\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>" + line
    text = "".join(lines)
    return text


def _remove_image_special(text):
    # Drop grounding markup (<ref> tags and <box> coordinates) before display.
    text = text.replace('<ref>', '').replace('</ref>', '')
    return re.sub(r'<box>.*?(</box>|$)', '', text)


def _launch_demo(args, model, tokenizer):
    uploaded_file_dir = os.environ.get("GRADIO_TEMP_DIR") or str(
        Path(tempfile.gettempdir()) / "gradio"
    )

    def predict(_chatbot, task_history):
        chat_query = _chatbot[-1][0]
        query = task_history[-1][0]
        print("User: " + _parse_text(query))
        history_cp = copy.deepcopy(task_history)
        full_response = ""

        # Fold uploaded images into the text turn that follows them.
        history_filter = []
        pic_idx = 1
        pre = ""
        for i, (q, a) in enumerate(history_cp):
            if isinstance(q, (tuple, list)):
                q = f'Picture {pic_idx}: <img>{q[0]}</img>'
                pre += q + '\n'
                pic_idx += 1
            else:
                pre += q
                history_filter.append((pre, a))
                pre = ""
        history, message = history_filter[:-1], history_filter[-1][0]
        # response, history = model.chat(tokenizer, message, history=history)
        for response in model.chat_stream(tokenizer, message, history=history):
            _chatbot[-1] = (_parse_text(chat_query), _remove_image_special(_parse_text(response)))

            yield _chatbot
            full_response = _parse_text(response)

        response = full_response
        history.append((message, response))
        image = tokenizer.draw_bbox_on_latest_picture(response, history)
        if image is not None:
            # Save the image with drawn bounding boxes and show it as a new chat item.
            temp_dir = secrets.token_hex(20)
            temp_dir = Path(uploaded_file_dir) / temp_dir
            temp_dir.mkdir(exist_ok=True, parents=True)
            name = f"tmp{secrets.token_hex(5)}.jpg"
            filename = temp_dir / name
            image.save(str(filename))
            _chatbot.append((None, (str(filename),)))
        else:
            _chatbot[-1] = (_parse_text(chat_query), response)
        # full_response = _parse_text(response)

        task_history[-1] = (query, full_response)
        print("Qwen-VL-Chat: " + _parse_text(full_response))
        yield _chatbot

    def regenerate(_chatbot, task_history):
        if not task_history:
            return _chatbot
        item = task_history[-1]
        if item[1] is None:
            return _chatbot
        task_history[-1] = (item[0], None)
        chatbot_item = _chatbot.pop(-1)
        if chatbot_item[0] is None:
            _chatbot[-1] = (_chatbot[-1][0], None)
        else:
            _chatbot.append((chatbot_item[0], None))
        return predict(_chatbot, task_history)

    def add_text(history, task_history, text):
        task_text = text
        # Strip a single trailing punctuation mark from the query sent to the model.
        if len(text) >= 2 and text[-1] in PUNCTUATION and text[-2] not in PUNCTUATION:
            task_text = text[:-1]
        history = history + [(_parse_text(text), None)]
        task_history = task_history + [(task_text, None)]
        return history, task_history, ""

    def add_file(history, task_history, file):
        history = history + [((file.name,), None)]
        task_history = task_history + [((file.name,), None)]
        return history, task_history

    def reset_user_input():
        return gr.update(value="")

    def reset_state(task_history):
        task_history.clear()
        return []

    with gr.Blocks() as demo:
        gr.Markdown("""\
<img src="https://modelscope.cn/api/v1/models/qwen/Qwen-7B-Chat/repo?
Revision=master&FilePath=assets/logo.jpeg&View=true" style="height: 80px"/>""")
        gr.Markdown("""<center>Qwen-VL-Chat Bot</center>""")
        gr.Markdown(
            """\
<center>This WebUI is based on Qwen-VL-Chat, developed by Alibaba Cloud. \
(本WebUI基于Qwen-VL-Chat打造，实现聊天机器人功能。)</center>""")
        gr.Markdown("""\
<center>Qwen-VL <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖 </a>
| <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> ｜
Qwen-VL-Chat <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖 </a> |
<a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> ｜
<a href="https://github.com/QwenLM/Qwen-VL">Github</a></center>""")

        chatbot = gr.Chatbot(label='Qwen-VL-Chat', elem_classes="control-height", height=750)
        query = gr.Textbox(lines=2, label='Input')
        task_history = gr.State([])

        with gr.Row():
            empty_bin = gr.Button("🧹 Clear History (清除历史)")
            submit_btn = gr.Button("🚀 Submit (发送)")
            regen_btn = gr.Button("🤔️ Regenerate (重试)")
            addfile_btn = gr.UploadButton("📁 Upload (上传文件)", file_types=["image"])

        submit_btn.click(add_text, [chatbot, task_history, query], [chatbot, task_history]).then(
            predict, [chatbot, task_history], [chatbot], show_progress=True
        )
        submit_btn.click(reset_user_input, [], [query])
        empty_bin.click(reset_state, [task_history], [chatbot], show_progress=True)
        regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)
        addfile_btn.upload(add_file, [chatbot, task_history, addfile_btn], [chatbot, task_history], show_progress=True)

        gr.Markdown("""\
Note: This demo is governed by the original license of Qwen-VL. \
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, \
including hate speech, violence, pornography, deception, etc. \
(注：本演示受Qwen-VL的许可协议限制。我们强烈建议，用户不应传播及不应允许他人传播以下内容，\
包括但不限于仇恨言论、暴力、色情、欺诈相关的有害信息。)""")

    demo.queue().launch(
        share=args.share,
        inbrowser=args.inbrowser,
        server_port=args.server_port,
        server_name=args.server_name,
    )


def main():
    args = _get_args()

    model, tokenizer = _load_model_tokenizer(args)

    _launch_demo(args, model, tokenizer)


if __name__ == '__main_
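
The history handling inside `predict` in `web_demo_mm.py` can be looked at in isolation. The sketch below restates that loop as a standalone function with no model or Gradio dependency; the helper name `fold_history` and the sample file paths are illustrative assumptions and are not part of the demo itself.

```python
from typing import List, Optional, Tuple, Union

Turn = Tuple[Union[str, tuple, list], Optional[str]]


def fold_history(task_history: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Merge uploaded-image turns into the text turn that follows them."""
    folded = []
    pic_idx = 1
    pre = ""
    for q, a in task_history:
        if isinstance(q, (tuple, list)):
            # An upload is stored as (path,); render it as a numbered <img> tag.
            pre += f"Picture {pic_idx}: <img>{q[0]}</img>\n"
            pic_idx += 1
        else:
            # A text turn closes the pending image prefix and forms one query.
            folded.append((pre + q, a))
            pre = ""
    return folded


# Hypothetical history: one uploaded image followed by a question about it.
history = [(("/tmp/cat.jpg",), None), ("What is this?", None)]
print(fold_history(history))
```

Image uploads never reach the model as separate turns; each one is held back and prepended to the next text query, which is why the last folded entry can be used directly as the message.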

Repository: QwenLM/Qwen-VL Branch: master Commit: aa00ed04091e Files: 61 Total size: 375.0 KB Directory structure: gitextract_e0pgouhu/ ├── .github/ │ └── ISSUE_TEMPLATE/ │ ├── bug_report.yaml │ ├── config.yaml │ └── feature_request.yaml ├── .gitignore ├── BUILD.md ├── Dockerfile.qwendemo ├── Dockerfile.qwenint4openai ├── Dockerfile.qwenopenai ├── FAQ.md ├── FAQ_ja.md ├── FAQ_ko.md ├── FAQ_zh.md ├── LICENSE ├── NOTICE ├── README.md ├── README_CN.md ├── README_JA.md ├── README_KO.md ├── TUTORIAL.md ├── TUTORIAL_ja.md ├── TUTORIAL_ko.md ├── TUTORIAL_zh.md ├── assets/ │ └── mm_tutorial/ │ └── TUTORIAL.ipynb ├── eval_mm/ │ ├── EVALUATION.md │ ├── evaluate_caption.py │ ├── evaluate_grounding.py │ ├── evaluate_multiple_choice.py │ ├── evaluate_vqa.py │ ├── infographicsvqa_eval.py │ ├── mmbench/ │ │ ├── MMBENCH.md │ │ ├── evaluate_multiple_choice_mmbench.py │ │ ├── mmbench_converter_dev.py │ │ ├── mmbench_converter_test.py │ │ ├── mmbench_evaluation.py │ │ ├── mmbench_evaluation_tricky.py │ │ └── mmbench_predict_to_submission.py │ ├── mme/ │ │ ├── EVAL_MME.md │ │ ├── eval.py │ │ └── get_images.py │ ├── seed_bench/ │ │ ├── EVAL_SEED.md │ │ ├── eval.py │ │ └── trans.py │ ├── vqa.py │ └── vqa_eval.py ├── finetune/ │ ├── ds_config_zero2.json │ ├── ds_config_zero3.json │ ├── finetune_ds.sh │ ├── finetune_lora_ds.sh │ ├── finetune_lora_single_gpu.sh │ ├── finetune_qlora_ds.sh │ └── finetune_qlora_single_gpu.sh ├── finetune.py ├── openai_api.py ├── requirements.txt ├── requirements_openai_api.txt ├── requirements_web_demo.txt ├── touchstone/ │ ├── README.md │ ├── README_CN.md │ ├── README_JA.md │ └── README_KO.md └── web_demo_mm.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/ISSUE_TEMPLATE/bug_report.yaml ================================================ name: 🐞 Bug description: 提交错误报告 | File a bug/issue title: "[BUG] " labels: [] body: - type: 
checkboxes attributes: label: 是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this? description: | 请先搜索您遇到的错误是否在已有的issues或讨论中提到过。 Please search to see if an issue / discussion already exists for the bug you encountered. [Issues](https://github.com/QwenLM/Qwen-7B/issues) [Discussions](https://github.com/QwenLM/Qwen-7B/discussions) options: - label: 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions required: true - type: checkboxes attributes: label: 该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ? description: | 请先搜索您遇到的错误是否已在FAQ中有相关解答。 Please search to see if an answer already exists in FAQ for the bug you encountered. [FAQ-en](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ.md) [FAQ-zh](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ_zh.md) options: - label: 我已经搜索过FAQ | I have searched FAQ required: true - type: textarea attributes: label: 当前行为 | Current Behavior description: | 准确描述遇到的行为。 A concise description of what you're experiencing. validations: required: false - type: textarea attributes: label: 期望行为 | Expected Behavior description: | 准确描述预期的行为。 A concise description of what you expected to happen. validations: required: false - type: textarea attributes: label: 复现方法 | Steps To Reproduce description: | 复现当前行为的详细步骤。 Steps to reproduce the behavior. placeholder: | 1. In this environment... 2. With this config... 3. Run '...' 4. See error... validations: required: false - type: textarea attributes: label: 运行环境 | Environment description: | examples: - **OS**: Ubuntu 20.04 - **Python**: 3.8 - **Transformers**: 4.31.0 - **PyTorch**: 2.0.1 - **CUDA**: 11.4 value: | - OS: - Python: - Transformers: - PyTorch: - CUDA (`python -c 'import torch; print(torch.version.cuda)'`): render: Markdown validations: required: false - type: textarea attributes: label: 备注 | Anything else? description: | 您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。 Links? References? 
Anything that will give us more context about the issue you are encountering! Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in. validations: required: false ================================================ FILE: .github/ISSUE_TEMPLATE/config.yaml ================================================ blank_issues_enabled: true ================================================ FILE: .github/ISSUE_TEMPLATE/feature_request.yaml ================================================ name: "💡 Feature Request" description: 创建新功能请求 | Create a new ticket for a new feature request title: "💡 [REQUEST] - <title>" labels: [ "question" ] body: - type: input id: start_date attributes: label: "起始日期 | Start Date" description: | 起始开发日期 Start of development placeholder: "month/day/year" validations: required: false - type: textarea id: implementation_pr attributes: label: "实现PR | Implementation PR" description: | 实现该功能的Pull request Pull request used placeholder: "#Pull Request ID" validations: required: false - type: textarea id: reference_issues attributes: label: "相关Issues | Reference Issues" description: | 与该功能相关的issues Common issues placeholder: "#Issues IDs" validations: required: false - type: textarea id: summary attributes: label: "摘要 | Summary" description: | 简要描述新功能的特点 Provide a brief explanation of the feature placeholder: | Describe in a few lines your feature request validations: required: true - type: textarea id: basic_example attributes: label: "基本示例 | Basic Example" description: Indicate here some basic examples of your feature. placeholder: A few specific words about your feature request. validations: required: true - type: textarea id: drawbacks attributes: label: "缺陷 | Drawbacks" description: | 该新功能有哪些缺陷/可能造成哪些影响？ What are the drawbacks/impacts of your feature request ? 
placeholder: | Identify the drawbacks and impacts while being neutral on your feature request validations: required: true - type: textarea id: unresolved_question attributes: label: "未解决问题 | Unresolved questions" description: | 有哪些尚未解决的问题？ What questions still remain unresolved ? placeholder: | Identify any unresolved issues. validations: required: false ================================================ FILE: .gitignore ================================================ __pycache__ *.so build .coverage_* *.egg-info *~ .vscode/ .idea/ .DS_Store /private/ Qwen-VL-Chat/ Qwen-VL-Chat-Int4/ SimSun.ttf ================================================ FILE: BUILD.md ================================================ ## qwen web demo ### build ``` docker build -t qwen-vl-chat:webdemo --platform linux/amd64 -f Dockerfile.qwendemo . ``` ### run ``` docker run -it --gpus device=0 -d --restart always -v /var/run/docker.sock:/var/run/docker.sock --name qwen-vl-chat -p 8000:8000 --user=20001:20001 --platform linux/amd64 qwen-vl-chat:webdemo ``` ## qwen openai api ### build ``` docker build -t qwen-vl-chat:openai --platform linux/amd64 -f Dockerfile.qwenopenai . ``` ### run ``` docker run -it --gpus device=0 -d --restart always -v /var/run/docker.sock:/var/run/docker.sock --name qwen-vl-chat -p 8080:8080 --user=20001:20001 --platform linux/amd64 qwen-vl-chat:openai ``` ## qwen-int4 openai api ### build ``` docker build -t qwen-vl-chat:int4-openai --platform linux/amd64 -f Dockerfile.qwenint4openai . 
``` ### run ``` docker run -it --gpus device=0 -d --restart always -v /var/run/docker.sock:/var/run/docker.sock --name qwen-vl-chat-int4 -p 8080:8080 --user=20001:20001 --platform linux/amd64 qwen-vl-chat:int4-openai ``` ================================================ FILE: Dockerfile.qwendemo ================================================ # python 3.8 and above # pytorch 1.12 and above, 2.0 and above are recommended # CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.) # based on modelscope docker image # registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 # registry.cn-beijing.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 ARG workdir=/var/app RUN mkdir -p ${workdir} RUN git lfs install WORKDIR ${workdir} COPY requirements.txt requirements_web_demo.txt ./ # Install Qwen dependencies RUN pip install -r requirements.txt # Install webUI dependencies WORKDIR ${workdir} RUN pip install -r requirements_web_demo.txt # Offline mode, check https://huggingface.co/docs/transformers/v4.15.0/installation#offline-mode ENV HF_DATASETS_OFFLINE=1 ENV TRANSFORMERS_OFFLINE=1 # set TZ, make logs dir, and expose port 8080 ENV TZ=Asia/Shanghai RUN mkdir -p ${workdir}/logs && chmod 777 ${workdir}/logs VOLUME /var/app/logs # create user 20001 RUN useradd -r -m appuser -u 20001 -g 0 WORKDIR ${workdir} # copy model RUN git clone https://huggingface.co/Qwen/Qwen-VL-Chat # COPY --chown=20001:20001 Qwen-VL-Chat ./Qwen-VL-Chat # copy fonts ADD --chown=20001:20001 https://github.com/StellarCN/scp_zh/raw/master/fonts/SimSun.ttf ./ # COPY --chown=20001:20001 SimSun.ttf ./ # copy main app COPY --chown=20001:20001 web_demo_mm.py ./ EXPOSE 8000 CMD ["python3", "web_demo_mm.py", "-c", "./Qwen-VL-Chat", "--server-name", "0.0.0.0", 
"--server-port", "8000"] ================================================ FILE: Dockerfile.qwenint4openai ================================================ # python 3.8 and above # pytorch 1.12 and above, 2.0 and above are recommended # CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.) # based on modelscope docker image # registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 # registry.cn-beijing.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 ARG workdir=/var/app RUN mkdir -p ${workdir} RUN git lfs install WORKDIR ${workdir} COPY requirements.txt requirements_web_demo.txt ./ # Install Qwen dependencies RUN pip install -r requirements.txt # Install webUI dependencies WORKDIR ${workdir} RUN pip install -r requirements_web_demo.txt # Offline mode, check https://huggingface.co/docs/transformers/v4.15.0/installation#offline-mode ENV HF_DATASETS_OFFLINE=1 ENV TRANSFORMERS_OFFLINE=1 # set TZ, make logs dir, and expose port 8080 ENV TZ=Asia/Shanghai RUN mkdir -p ${workdir}/logs && chmod 777 ${workdir}/logs VOLUME /var/app/logs # create user 20001 RUN useradd -r -m appuser -u 20001 -g 0 WORKDIR ${workdir} # copy model RUN git clone https://huggingface.co/Qwen/Qwen-VL-Chat-Int4 # COPY --chown=20001:20001 Qwen-VL-Chat-Int4 ./Qwen-VL-Chat-Int4 # Install AutoGPTQ RUN pip install optimum # RUN git clone https://github.com/JustinLin610/AutoGPTQ.git && \ # cd AutoGPTQ && \ # pip install -v . 
RUN pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/ # Install OpenAI API dependencies WORKDIR ${workdir} COPY requirements_openai_api.txt ./ RUN pip install -r requirements_openai_api.txt # copy fonts ADD --chown=20001:20001 https://github.com/StellarCN/scp_zh/raw/master/fonts/SimSun.ttf ./ # COPY --chown=20001:20001 SimSun.ttf ./ # copy main app COPY --chown=20001:20001 openai_api.py ./ EXPOSE 8080 # CMD ["python3", "openai_api.py", "-c", "./Qwen-VL-Chat", "--server-name", "0.0.0.0", "--server-port", "8080"] CMD ["python3", "openai_api.py", "-c", "./Qwen-VL-Chat-Int4", "--server-name", "0.0.0.0", "--server-port", "8080"] ================================================ FILE: Dockerfile.qwenopenai ================================================ # python 3.8 and above # pytorch 1.12 and above, 2.0 and above are recommended # CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.) # based on modelscope docker image # registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 # registry.cn-beijing.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0 ARG workdir=/var/app RUN mkdir -p ${workdir} RUN git lfs install WORKDIR ${workdir} COPY requirements.txt requirements_web_demo.txt ./ # Install Qwen dependencies RUN pip install -r requirements.txt # Install webUI dependencies WORKDIR ${workdir} RUN pip install -r requirements_web_demo.txt # Offline mode, check https://huggingface.co/docs/transformers/v4.15.0/installation#offline-mode ENV HF_DATASETS_OFFLINE=1 ENV TRANSFORMERS_OFFLINE=1 # set TZ, make logs dir, and expose port 8080 ENV TZ=Asia/Shanghai RUN mkdir -p ${workdir}/logs && chmod 777 ${workdir}/logs VOLUME /var/app/logs # create user 20001 RUN useradd -r -m appuser -u 
20001 -g 0 WORKDIR ${workdir} # copy model RUN git clone https://huggingface.co/Qwen/Qwen-VL-Chat # COPY --chown=20001:20001 Qwen-VL-Chat ./Qwen-VL-Chat # Install OpenAI API dependencies WORKDIR ${workdir} COPY requirements_openai_api.txt ./ RUN pip install -r requirements_openai_api.txt # copy fonts ADD --chown=20001:20001 https://github.com/StellarCN/scp_zh/raw/master/fonts/SimSun.ttf ./ # COPY --chown=20001:20001 SimSun.ttf ./ # copy main app COPY --chown=20001:20001 openai_api.py ./ EXPOSE 8080 CMD ["python3", "openai_api.py", "-c", "./Qwen-VL-Chat", "--server-name", "0.0.0.0", "--server-port", "8080"] ================================================ FILE: FAQ.md ================================================ # FAQ ## Installation & Environment #### Which version of transformers should I use? 4.31.0 is preferred. #### I downloaded the codes and checkpoints but I can't load the model locally. What should I do? Please check if you have updated the code to the latest, and correctly downloaded all the sharded checkpoint files. #### `qwen.tiktoken` is not found. What is it? This is the merge file of the tokenizer. You have to download it. Note that if you just git clone the repo without [git-lfs](https://git-lfs.com), you cannot download this file. #### transformers_stream_generator/tiktoken/accelerate not found Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt). ## Demo & Inference #### Is there any demo? Yes, see `web_demo_mm.py` for web demo. See README for more information. #### Can Qwen-VL support streaming? No. We do not support streaming yet. #### It seems that the generation is not related to the instruction... Please check if you are loading Qwen-VL-Chat instead of Qwen-VL. Qwen-VL is the base model without alignment, which behaves differently from the SFT/Chat model. #### Is quantization supported? No. 
We plan to support quantization as soon as possible.

#### Unsatisfactory performance in processing long sequences

Please ensure that NTK is applied. `use_dynamic_ntk` and `use_logn_attn` in `config.json` should be set to `true` (`true` by default).

## Tokenizer

#### bos_id/eos_id/pad_id not found

In our training, we only use `<|endoftext|>` as the separator and padding token. You can set bos_id, eos_id, and pad_id to tokenizer.eod_id. Learn more about our tokenizer from our documents about the tokenizer.

================================================
FILE: FAQ_ja.md
================================================
# FAQ

## インストールと環境

#### transformers のバージョンは？

4.31.0 が望ましいです。

#### コードとチェックポイントをダウンロードしましたが、モデルをローカルにロードできません。どうすればよいでしょうか？

コードを最新のものに更新し、すべてのシャードされたチェックポイントファイルを正しくダウンロードしたかどうか確認してください。

#### `qwen.tiktoken` が見つかりません。これは何ですか？

これは tokenizer のマージファイルです。ダウンロードする必要があります。[git-lfs](https://git-lfs.com) を使わずにリポジトリを git clone しただけでは、このファイルをダウンロードできないことに注意してください。

#### transformers_stream_generator/tiktoken/accelerate が見つかりません。

コマンド `pip install -r requirements.txt` を実行してください。このファイルは [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt) にあります。

## デモと推論

#### デモはありますか？

ウェブデモは `web_demo_mm.py` を参照してください。詳細は README を参照してください。

#### Qwen-VLはストリーミングに対応していますか？

いいえ、まだサポートしていません。

#### 生成結果が指示と関係ないようですが...

Qwen-VL ではなく Qwen-VL-Chat を読み込んでいるか確認してください。Qwen-VL はアライメントなしのベースモデルで、SFT/Chat モデルとは動作が異なります。

#### 量子化はサポートされていますか？

いいえ。早急に量子化をサポートするつもりです。

#### 長いシーケンスの処理で不満足なパフォーマンス

NTK が適用されていることを確認してください。`config.json` の `use_dynamic_ntk` と `use_logn_attn` を `true` に設定する必要があります（デフォルトでは `true`）。

## Tokenizer

#### bos_id/eos_id/pad_id が見つかりません。

私たちのトレーニングでは、セパレータとパディングトークンとして `<|endoftext|>` のみを使用しています。bos_id、eos_id、pad_id は tokenizer.eod_id に設定できます。私たちの tokenizer について詳しくは、tokenizer についてのドキュメントをご覧ください。

================================================
FILE: FAQ_ko.md
================================================
# FAQ

## 설치 및 환경

#### 어떤 버전의 transformers를 사용해야 하나요?

4.31.0 버전을 사용하는 것을 선호합니다.

#### 코드와 체크포인트를 다운로드했는데 모델을 로컬에서 불러올 수 없어요. 어떻게 해야 하나요?

코드를 최신 버전으로 업데이트했는지, 그리고 모든 샤드 체크포인트 파일을 올바르게 다운로드했는지 확인해 주세요.

#### `qwen.tiktoken`을 찾을 수 없어요. 이게 무엇인가요?

이것은 토크나이저의 병합 파일입니다. 이 파일을 다운로드해야 합니다. [git-lfs](https://git-lfs.com) 없이 단순히 깃 저장소를 복제했다면 이 파일을 다운로드할 수 없습니다.

#### transformers_stream_generator/tiktoken/accelerate not found 오류

`pip install -r requirements.txt` 명령을 실행하세요. 이 파일은 [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt)에서 찾을 수 있습니다.

## Demo & Inference

#### 데모가 있나요?

네, 웹 데모는 `web_demo_mm.py`를 참고하세요. 더 많은 정보는 README 파일에서 확인할 수 있습니다.

#### Qwen-VL은 스트리밍을 지원하나요?

아니요. 아직 스트리밍을 지원하지 않습니다.

#### 생성된 내용이 지시사항과 관련 없는 것 같습니다.

Qwen-VL 대신 Qwen-VL-Chat을 로드하고 있는지 확인해 주세요. Qwen-VL은 SFT/Chat 모델과 달리 정렬이 없는 기본 모델이므로 다르게 작동합니다.

#### 양자화를 지원하나요?

아니요. 가능한 빨리 양자화를 지원할 예정입니다.

#### 긴 시퀀스 처리에서 만족스럽지 못한 성능

NTK가 적용되었는지 확인해 주세요. `config.json`의 `use_dynamic_ntk`와 `use_logn_attn`은 `true`로 설정되어야 합니다(`true`가 기본값).

## Tokenizer

#### bos_id/eos_id/pad_id not found 오류

저희 훈련에서는 `<|endoftext|>`만을 구분자 및 패딩 토큰으로 사용합니다. bos_id, eos_id, pad_id를 tokenizer.eod_id로 설정할 수 있습니다. 토크나이저에 대한 문서에서 토크나이저에 대해 더 알아보세요.
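================================================
FILE: FAQ_zh.md
================================================
# FAQ

## Installation & Environment

#### Which version of transformers should I use?

4.31.0 is recommended.

#### I downloaded the model and code locally, but they do not work when I follow the tutorial. What should I do?

First check that your code is updated to the latest version, then confirm that you have downloaded the complete model checkpoint.

#### `qwen.tiktoken` is not found. What should I do?

This is the merge file of our tokenizer. You must download it to use our tokenizer. Note that if you use git clone without git-lfs, this file will not be downloaded. If you are not familiar with git-lfs, see the [official website](https://git-lfs.com/).

#### transformers_stream_generator/tiktoken/accelerate are reported as not found. What should I do?

Run the following command: `pip install -r requirements.txt`. The dependencies are listed at [https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-VL/blob/main/requirements.txt).

## Demo & Inference

#### Is a demo provided?

`web_demo_mm.py` provides a Web UI. See the relevant part of the README for more information.

#### Does Qwen-VL support streaming inference?

Qwen-VL does not currently support streaming inference.

#### The model output looks unrelated to the input, does not follow the instruction, or seems dull

Please check that you are loading the Qwen-VL-Chat model for inference. Qwen-VL is a pretrained base model without alignment and is not expected to respond to user instructions. In the latest version of the model, we have added a check inside the `chat` interface to prevent you from mistakenly using the pretrained model as an SFT/Chat model.

#### Is there a quantized version of the model?

Qwen-VL does not currently support quantization. We will support an efficient quantized inference implementation later.

#### Results are poor when processing long sequences

Please confirm that NTK is enabled. To enable these techniques, set `use_dynamic_ntk` and `use_logn_attn` in `config.json` to `true`. The latest code sets them to `true` by default.

## Tokenizer

#### The token ids bos_id/eos_id/pad_id do not exist. Why?

During training, we only use the `<|endoftext|>` token as the separator between samples/documents and as the padding placeholder. You can point bos_id, eos_id, and pad_id to tokenizer.eod_id. Please read our documents about the tokenizer to learn how to set these ids.

================================================
FILE: LICENSE
================================================
Tongyi Qianwen LICENSE AGREEMENT

Tongyi Qianwen Release Date: August 23, 2023

By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.

1. Definitions
a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
b. "We"(or "Us") shall mean Alibaba Cloud.
c.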
"You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use. d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You. e. "Tongyi Qianwen" shall mean the large language models (including Qwen-VL model and Qwen-VL-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us. f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement. g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files. h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. 2. Grant of Rights You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials. 3. Redistribution You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement; b. You shall cause any modified files to carry prominent notices stating that You changed the files; c. 
You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement. 4. Restrictions If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization. 5. Rules of use a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials. b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof). 6. Intellectual Property a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications. b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials. c. 
If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought. 7. Disclaimer of Warranty and Limitation of Liability a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto. b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM. c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED. d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials. 8. Survival and Termination. a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement. 9. Governing Law and Jurisdiction. a. 
This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement. ================================================ FILE: NOTICE ================================================ ------------- LICENSE FOR NVIDIA Megatron-LM code -------------- Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ------------- LICENSE FOR OpenAI tiktoken code -------------- MIT License Copyright (c) 2022 OpenAI, Shantanu Jain Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
================================================
FILE: README.md
================================================
<a href="README_CN.md">中文</a> ｜ English ｜ <a href="README_JA.md">日本語</a> ｜ <a href="README_KO.md">한국어</a>

<img src="assets/logo.jpg" width="400"/>

Qwen-VL <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖</a> ｜ Qwen-VL-Chat <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖</a> (Int4: <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary">🤖</a> ) ｜ Qwen-VL-Plus <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">🤖</a> ｜ Qwen-VL-Max <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>

<a href="https://tongyi.aliyun.com/qianwen">Web</a> | <a href="http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg">APP</a> | <a href="https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start">API</a> | <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a href="https://discord.gg/CV4E9rpNSD">Discord</a> | <a href="https://arxiv.org/abs/2308.12966">Paper</a> | <a href="TUTORIAL.md">Tutorial</a>

---

## Qwen-VL-Plus & Qwen-VL-Max

Qwen-VL-Plus and Qwen-VL-Max are the latest, upgraded versions of the Qwen-VL model family. Both are currently free to access through <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a>, <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>, [Web pages](https://qianwen.aliyun.com), [APP](http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg) and [APIs](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start/).

| Model name | Model description |
| --- | --- |
| Qwen-VL-Plus | Qwen's **Enhanced Large Visual Language Model**. Significantly upgraded for detailed recognition capabilities and text recognition abilities, supporting ultra-high pixel resolutions up to millions of pixels and extreme aspect ratios for image input. It delivers **significant** performance across a broad range of visual tasks. |
| Qwen-VL-Max | Qwen's **Most Capable Large Visual Language Model**. Compared to the enhanced version, further improvements have been made to visual reasoning and instruction-following capabilities, offering a higher level of visual perception and cognitive understanding. It delivers **optimal** performance on an even broader range of complex tasks. |

The key technical advancements in these versions include:
- A substantial boost in image-related **reasoning capabilities**;
- Considerable enhancement in recognizing, extracting, and analyzing **details of images**, especially for text-oriented tasks;
- Support for **high-definition images** with resolutions above one million pixels and extreme aspect ratios.

These two models not only significantly surpass all previous best results from open-source LVLM models, but also perform on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks. Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google on Chinese question answering and Chinese text comprehension tasks. This breakthrough underscores the model's advanced capabilities and its potential to set new standards in the field of multimodal AI research and application.

<table> <thead> <tr> <th>Model</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>TextVQA</th> <th>MMMU</th> <th>MathVista</th> <th>MM-Bench-CN</th> </tr> </thead> <tbody align="center"> <tr> <td>Other Best Open-source LVLM</td> <td>81.6% (CogAgent)</td> <td>68.4% (CogAgent)</td> <td>73.7% (Fuyu-Medium)</td> <td>76.1% (CogAgent)</td> <td>45.9% (Yi-VL-34B)</td> <td>36.7% (SPHINX-V2)</td> <td>72.4% (InternLM-XComposer-VL)</td> </tr> <tr> <td>Gemini Pro</td> <td>88.1%</td> <td>74.1%</td> <td>73.9%</td> <td>74.6%</td> <td>47.9%</td> <td>45.2%</td> <td>74.3%</td> </tr> <tr> <td>Gemini Ultra</td> <td>90.9%</td> <td>80.8% 1</td> <td>79.5% 1</td> <td>82.3% 1</td> <td>59.4% 1</td> <td>53.0% 1</td> <td>-</td> </tr> <tr> <td>GPT-4V</td> <td>88.4%</td> <td>78.5%</td> <td>78.2%</td> <td>78.0%</td> <td>56.8%</td> <td>49.9%</td> <td>73.9%</td> </tr> <tr> <td>Qwen-VL-Plus</td> <td>91.4%</td> <td>78.1%</td> <td>75.9%</td> <td>78.9%</td> <td>45.2%</td> <td>43.3%</td> <td>68.0%</td> </tr> <tr> <td>Qwen-VL-Max</td> <td>93.1% 1</td> <td>79.8% 2</td> <td>79.3% 2</td> <td>79.5% 2</td> <td>51.4% 3</td> <td>51.0% 2</td> <td>75.1% 1</td> </tr> </tbody> </table> All numbers are obtained without any use of external OCR tools ('pixel only'). --- ## News and Updates * ```2024.01.18``` 💥💥💥 We introduce Qwen-VL-Max, our most capable model that significantly surpasses all previous open-source LVLM models, and it performs on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks. You can enjoy the new model by directly visiting our [web pages](https://qianwen.aliyun.com), <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a> and <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>. * ```2023.11.28``` 🏆🏆🏆 Qwen-VL-Plus achieved the best performance in [DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) by using a single model, surpassing GPT4V and PALI-X, without using model ensemble or OCR-pipeline. 
It is also a general-purpose model that can analyze and understand a variety of tasks directly from input images.
* ```2023.9.25``` 🚀🚀🚀 We update Qwen-VL-Chat with more robust Chinese instruction-following ability, improved understanding of web pages and table images, and better dialogue performance (Touchstone: CN: 401.2->481.7, EN: 645.2->711.6).
* ```2023.9.12``` 😃😃😃 We now support finetuning on the Qwen-VL models, including full-parameter finetuning, LoRA and Q-LoRA.
* ```2023.9.8``` 👍👍👍 Thanks to [camenduru](https://github.com/camenduru) for contributing the wonderful [Colab](https://github.com/camenduru/Qwen-VL-Chat-colab). Everyone can use it as a local or online Qwen-VL-Chat-Int4 demo tutorial on one 12G GPU.
* ```2023.9.5``` 👏👏👏 Qwen-VL-Chat achieves SOTA results on the [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks.
* ```2023.9.4``` ⭐⭐⭐ The Qwen-VL series achieves SOTA results on [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, including both image and video understanding.
* ```2023.9.1``` 🔥🔥🔥 We release the [TouchStone](https://github.com/OFA-Sys/TouchStone) Evaluation, a comprehensive assessment of multimodal language models that covers not only basic recognition and comprehension but also literary creation. It uses strong LLMs as judges by converting multimodal information into text.
* ```2023.8.31``` 🌟🌟🌟 We release the Int4 quantized model for Qwen-VL-Chat, **Qwen-VL-Chat-Int4**, which lowers memory cost while improving inference speed. There is no significant performance degradation on the benchmark evaluation.
* ```2023.8.22``` 🎉🎉🎉 We release both **Qwen-VL** and **Qwen-VL-Chat** on ModelScope and Hugging Face. We also provide a [paper](https://arxiv.org/abs/2308.12966) for more details about the model, including training details and model performance.

---

## Qwen-VL

**Qwen-VL** (Qwen Large Vision Language Model) is the multimodal version of the large model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and outputs text and bounding boxes. The features of Qwen-VL include:

- **Strong performance**: It significantly surpasses existing open-sourced Large Vision Language Models (LVLM) under a similar model scale on multiple English evaluation benchmarks (including Zero-shot Captioning, VQA, DocVQA, and Grounding).
- **Multi-lingual LVLM supporting text recognition**: Qwen-VL naturally supports English, Chinese, and multi-lingual conversation, and it enables end-to-end recognition of Chinese and English bi-lingual text in images.
- **Multi-image interleaved conversations**: This feature allows for the input and comparison of multiple images, as well as the ability to specify questions related to the images and engage in multi-image storytelling.
- **First generalist model supporting grounding in Chinese**: Detecting bounding boxes through open-domain language expression in both Chinese and English.
- **Fine-grained recognition and understanding**: Compared to the 224\*224 resolution currently used by other open-sourced LVLMs, the 448\*448 resolution improves fine-grained text recognition, document QA, and bounding box annotation.

<img src="assets/demo_vl.gif" width="400"/>

We release two models of the Qwen-VL series:

- Qwen-VL: The pre-trained LVLM model uses Qwen-7B as the initialization of the LLM, and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) as the initialization of the visual encoder. The two are connected by a randomly initialized cross-attention layer.
- Qwen-VL-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities.

## Evaluation

We evaluated the model's abilities from three perspectives:

1. **Standard Benchmarks**: We evaluate the model's basic task capabilities on four major categories of multimodal tasks:
   - Zero-shot Captioning: Evaluate the model's zero-shot image captioning ability on unseen datasets;
   - General VQA: Evaluate general question answering about pictures, such as judgment, color, number, category, etc;
   - Text-based VQA: Evaluate the model's ability to recognize text in pictures, such as document QA, chart QA, etc;
   - Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.

2. **TouchStone**: To evaluate the overall text-image dialogue capability and alignment level with humans, we have constructed a benchmark called [TouchStone](https://github.com/OFA-Sys/TouchStone), which is based on scoring with GPT4 to evaluate the LVLM model.
   - The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, math problem solving, etc;
   - To work around the current limitation of GPT4 in terms of direct image input, TouchStone provides fine-grained image annotations by human labeling. These detailed annotations, along with the questions and the model's output, are then presented to GPT4 for scoring.
   - The benchmark includes both English and Chinese versions.

3.
**Other Multimodal Benchmarks**: We also evaluated our model's capabilities in other multimodal benchmarks: - [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), a comprehensive evaluation benchmark for multimodal large language models. Qwen-VL-Chat achieves SOTAs on both perception and cognition tracks. - [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs. Qwen series achieves SOTAs on this benchmark. The results of the evaluation are as follows: Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and has a more comprehensive coverage in terms of capability range. <img src="assets/radar.png" width="600"/> ### Zero-shot Captioning & General VQA <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr> <tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2dev</th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr> <tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td> <td>31.6</td> </tr> <tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr> <tr> <td>Kosmos-2</td> <td>-</td> <td>80.5</td> <td>51.1</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr> <tr> <td>InstructBLIP 
(Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr> <tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>  <tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td> <td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>  <tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP -FlanT5-XL)</td> <td>86.1 (PALI-X -55B)</td> <td>66.1 (PALI-X -55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+ GPT-4)</td> <td>70.9 (PALI-X -55B)</td> </tr> </tbody> </table> - For zero-shot image captioning, Qwen-VL achieves the **SOTA** on Flickr30K and competitive results on Nocaps with InstructBlip. - For general VQA, Qwen-VL achieves the **SOTA** under the same generalist LVLM scale settings. 
### Text-oriented VQA (Focused on text understanding capabilities in images) <table> <thead> <tr> <th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="5">Generalist Models</td> <td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td> <td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td> </tr> <tr> <td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td> </tr> <tr> <td>Specialist SOTAs (Specialist/Finetuned)</td> <td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td> </tr> </tbody> </table> - In text-related recognition/QA evaluation, Qwen-VL achieves the SOTA under the generalist LVLM scale settings. - Resolution is important for several above evaluations. While most open-sourced LVLM models with 224 resolution are incapable of these evaluations or can only solve these by cutting images, Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. Qwen-VL even outperforms Pix2Struct-Large models of 1024 resolution on some tasks. 
### Referring Expression Comprehension <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr> <tr> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr> <tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr> <tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr> <tr> <td>VisionLLM-H</td> <td></td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24 </td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr> <tr> <td>Shikra-13B</td> <td>87.83 </td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td> <td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td> </tr> <tr> <td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td> </tr> <tr> <td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td> <tr> <td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td> <td>G-DINO-L</td> <td>90.56</td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td> </tr> <tr> <td>UNINEXT-H</td> <td>92.64 </td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td> 
<td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td> </tr> <tr> <td>ONE-PEACE</td> <td>92.58 </td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td> </tr> </tbody> </table>

- Qwen-VL achieves the **SOTA** in all above referring expression comprehension benchmarks.
- Qwen-VL has not been trained on any Chinese grounding data, but it can still generalize to Chinese grounding tasks in a zero-shot way by training on Chinese caption data and English grounding data.

We provide all of the above evaluation scripts for reproducing our experimental results. Please read [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) for more information.

### Chat evaluation

TouchStone is a benchmark based on scoring with GPT4 to evaluate the abilities of the LVLM model on text-image dialogue and alignment levels with humans. It covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, math problem solving, etc. Please read [touchstone/README.md](touchstone/README.md) for more information.

#### English evaluation

| Model            | Score |
| ---------------- | ----- |
| PandaGPT         | 488.5 |
| MiniGPT4         | 531.7 |
| InstructBLIP     | 552.4 |
| LLaMA-AdapterV2  | 590.1 |
| LLaVA            | 602.7 |
| mPLUG-Owl        | 605.4 |
| Qwen-VL-Chat     | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |

#### Chinese evaluation

| Model            | Score |
| ---------------- | ----- |
| VisualGLM        | 247.1 |
| Qwen-VL-Chat     | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |

Qwen-VL-Chat has achieved the best results in both Chinese and English alignment evaluation.

### Other Benchmarks

#### MME Benchmark

[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models.
It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Qwen-VL-Chat achieves SOTA results on both perception and cognition evaluation. See more details [here](eval_mm/mme/EVAL_MME.md).

<img src="eval_mm/mme/perception.jpg" width="600"/>
<img src="eval_mm/mme/cognition.jpg" width="600"/>

#### SEED-Bench

[SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both **image** and **video** understanding. See more details [here](eval_mm/seed_bench/EVAL_SEED.md).

Qwen-VL and Qwen-VL-Chat achieve SOTA results on this benchmark.

<img src="eval_mm/seed_bench/leaderboard.jpg"/>

## Requirements

* python 3.8 and above
* pytorch 1.12 and above (2.0 and above is recommended)
* CUDA 11.4 and above is recommended (this is for GPU users)

## Quickstart

Below, we provide simple examples to show how to use Qwen-VL and Qwen-VL-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment so that it meets the above requirements, and then install the dependent libraries.

```bash
pip install -r requirements.txt
```

Now you can start with ModelScope or Transformers. For more usage of the vision encoder, please refer to the [tutorial](TUTORIAL.md).

#### 🤗 Transformers

To use Qwen-VL-Chat for inference, all you need to do is to input a few lines of code as demonstrated below.
However, **please make sure that you are using the latest code.** ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) # Note: The default behavior now has injection attack prevention off. tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # 1st dialogue turn query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': '这是什么?'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) # 图中是一名女子在沙滩上和狗玩耍，旁边是一只拉布拉多犬，它们处于沙滩上。 # 2nd dialogue turn response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history) print(response) # <ref>击掌</ref><box>(536,509),(588,602)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('1.jpg') else: print("no box") ``` <img src="assets/demo_highfive.jpg" width="500"/> <details> <summary>Running Qwen-VL</summary> Running Qwen-VL pretrained base model is also simple. 
```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0) # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': 'Generate the caption in English with grounding:'}, ]) inputs = tokenizer(query, return_tensors='pt') inputs = inputs.to(model.device) pred = model.generate(**inputs) response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False) print(response) # <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|> image = tokenizer.draw_bbox_on_latest_picture(response) if image: image.save('2.jpg') else: print("no box") ``` <img src="assets/demo_spotting_caption.jpg" width="500"/> </details> In the event of a network issue while attempting to download model checkpoints and codes from HuggingFace, an alternative approach is to initially fetch the 
checkpoint from ModelScope and then load it from the local directory as outlined below:

```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-VL')
model_dir = snapshot_download('qwen/Qwen-VL-Chat')

# Loading local checkpoints
# trust_remote_code is still set as True since we still load code from the local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
```

#### 🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:

```python
from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch

model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.0.0'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir

# Pick exactly one of the loading options below; loading twice wastes time and memory.
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use auto
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>=4.32.0)
# model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)
# 1st dialogue turn
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，与人互动。

# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('output_chat.jpg')
else:
    print("no box")
```

<img src="assets/demo_highfive.jpg" width="500"/>

## Quantization

### Usage

We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release an Int4 quantized model for Qwen-VL-Chat, [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4), which is nearly lossless in model quality while reducing memory cost and improving inference speed.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

Then you can load the quantized model and run inference as usual:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is loaded from the same Int4 checkpoint as the model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg' response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None) print(response) ``` ### Performance We illustrate the model performance of both BF16 and Int4 models on the benchmark **[TouchStone](https://github.com/OFA-Sys/TouchStone)**, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below: | Quantization | ZH | EN | | ------------ | :--------: | :-----------: | | BF16 | 401.2 | 645.2 | | Int4 | 386.6 | 651.4 | ### Inference Speed We measured the average inference speed (tokens/s) of generating 1792 (2048-258) and 7934 (8192-258) tokens with the context of an image (which takes 258 tokens) under BF16 precision and Int4 quantization, respectively. | Quantization | Speed (2048 tokens) | Speed (8192 tokens) | | ------------ | :-----------------: | :-----------------: | | BF16 | 28.87 | 24.32 | | Int4 | 37.79 | 34.34 | The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. ### GPU Memory Usage We also profile the peak GPU memory usage for encoding 1792 (2048-258) tokens (including an image) as context (and generating single token) and generating 7934 (8192-258) tokens (with an image as context) under BF16 or Int4 quantization level, respectively. The results are shown below. | Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | | ------------ | :---------------------------------: | :-----------------------------------: | | BF16 | 22.60GB | 28.01GB | | Int4 | 11.82GB | 17.23GB | The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py). ## Finetuning Now we provide the official training script, `finetune.py`, for users to finetune the pretrained model for downstream applications in a simple fashion. 
Additionally, we provide shell scripts to launch finetuning with no worries. This script supports training with DeepSpeed and FSDP. The shell scripts that we provide use DeepSpeed, and thus we advise you to install DeepSpeed before you start:

```bash
pip install deepspeed
```

### Data preparation

To prepare your training data, you need to put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list of conversation turns. Below is a simple example list with 3 samples:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是Qwen-VL,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  {
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
```

For the VL tasks, several special tokens are used, including `<img> </img> <ref> </ref> <box> </box>`.

A picture is represented as `Picture id: <img>img_path</img>\n{your prompt}`, where `id` indicates the position of the image in the conversation, starting from 1. The `img_path` can be a local file path or a web link.

A coordinate box is expressed as `<box>(x1,y1),(x2,y2)</box>`, where `(x1, y1)` and `(x2, y2)` are normalized values in the range `[0, 1000)`. Its corresponding text description is marked with `<ref>text_caption</ref>`.

After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.
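The box and picture conventions above can be sketched in a few lines of Python. This is a minimal illustration and not part of the repository: the helper names `to_box_token`, `parse_boxes`, and `make_sample` are hypothetical, and the rounding rule (truncate, then clamp to 999) is an assumption, since the text only states that coordinates are normalized to `[0, 1000)`.

```python
import json
import re

def to_box_token(x1, y1, x2, y2, width, height):
    """Scale a pixel-space box to the [0, 1000) grid and format it as a <box> token.

    Assumption: values are truncated and clamped to 999 so that a coordinate
    on the right or bottom edge still falls inside [0, 1000).
    """
    def scale(value, size):
        return min(int(value * 1000 / size), 999)
    return (f"<box>({scale(x1, width)},{scale(y1, height)}),"
            f"({scale(x2, width)},{scale(y2, height)})</box>")

# Matches one "<ref>caption</ref><box>(x1,y1),(x2,y2)</box>" group.
BOX_PATTERN = re.compile(
    r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)

def parse_boxes(text, width, height):
    """Return (caption, pixel_box) pairs found in a response string."""
    results = []
    for caption, x1, y1, x2, y2 in BOX_PATTERN.findall(text):
        box = (int(x1) * width / 1000, int(y1) * height / 1000,
               int(x2) * width / 1000, int(y2) * height / 1000)
        results.append((caption, box))
    return results

def make_sample(sample_id, image_path, question, answer):
    """Build one single-image training sample in the list format shown above."""
    return {
        "id": sample_id,
        "conversations": [
            {"from": "user", "value": f"Picture 1: <img>{image_path}</img>\n{question}"},
            {"from": "assistant", "value": answer},
        ],
    }

if __name__ == "__main__":
    # Hypothetical image size and box, used only for illustration.
    answer = "<ref>shirt</ref>" + to_box_token(100, 50, 400, 300, width=800, height=600)
    samples = [make_sample("identity_0", "assets/demo.jpeg", "Box the shirt.", answer)]
    print(json.dumps(samples, ensure_ascii=False, indent=2))
    print(parse_boxes(answer, width=800, height=600))
```

Writing the samples with `ensure_ascii=False` keeps Chinese text readable in the resulting json file; whether the training script requires this is not stated in the source.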
The finetuning scripts allow you to perform:

- Full-parameter finetuning
- LoRA
- Q-LoRA

### Full-parameter finetuning

Full-parameter finetuning requires updating all parameters of the LLM throughout the training process. In our experiments, freezing the parameters of the ViT during finetuning achieves better performance. To launch your training, run the following script:

```bash
sh finetune/finetune_ds.sh
```

Remember to specify the correct model name or path, the data path, and the output directory in the shell scripts. If you want to make changes, just remove the argument `--deepspeed` or edit the DeepSpeed configuration json file based on your requirements. Additionally, this script supports mixed-precision training, so you can use `--bf16 True` or `--fp16 True`. We advise you to use bf16 if your machine supports it, to keep your training consistent with our pretraining and alignment; it is therefore the default.

### LoRA

Similarly, to run LoRA, use another script as shown below. Before you start, make sure that you have installed `peft`. You also need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pretrained model, because LoRA only saves the adapter, and the path stored in the adapter configuration json file is used to locate the pretrained model to load.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of adapter layers and keeps the original large language model layers frozen. This requires much less memory and thus less computation.
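To make the saving concrete, the sketch below counts trainable parameters for a single linear layer under full finetuning versus a LoRA adapter of rank `r`, which adds two low-rank matrices of shapes `d_in x r` and `r x d_out`. The layer size and rank are illustrative assumptions, not values taken from the training scripts.

```python
def full_trainable_params(d_in, d_out):
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank adapter matrices A (d_in x r) and B (r x d_out)."""
    return r * (d_in + d_out)

if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d_in, d_out, r = 4096, 4096, 64
    full = full_trainable_params(d_in, d_out)
    lora = lora_trainable_params(d_in, d_out, r)
    print(f"full: {full}, lora: {lora}, ratio: {lora / full:.2%}")
```

Fewer trainable parameters means fewer gradients and optimizer states to store, which is where most of the memory saving comes from; the frozen base weights still have to be held in memory.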
Note that if you use LoRA to finetune the base language model, e.g., Qwen-VL, instead of the chat models, e.g., Qwen-VL-Chat, the script automatically switches the embedding and output layers to trainable parameters. This is because the base language model has no knowledge of the special tokens introduced by the ChatML format, so these layers must be updated for the model to understand and predict those tokens. In other words, if your training introduces special tokens in LoRA, you should make these layers trainable by setting `modules_to_save` inside the code. Additionally, we find a significant gap between the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA finetune the chat models. Check the profile below for more information.

### Q-LoRA

However, if you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model and other techniques such as paged optimizers to reduce memory costs even further. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-VL-Chat-Int4. You **SHOULD NOT** use the bf16 models. Unlike full-parameter finetuning and LoRA, Q-LoRA supports only fp16. The special-token issue described for LoRA also applies to Q-LoRA. However, since we only provide Int4 versions of the chat models, whose language model has already learned the special tokens of the ChatML format, you do not need to worry about these layers. Note that the layers of the Int4 model should not be trainable, so if you introduce new special tokens in your training, Q-LoRA might not work.
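The rule described in the LoRA and Q-LoRA sections (train the embedding and output layers only for base models under plain LoRA) can be sketched as a small helper. This is an illustration under stated assumptions, not the logic of `finetune.py`: the function name is hypothetical, detecting a chat model by the substring `chat` in its path is an assumption, and the layer names `wte` and `lm_head` are assumed names for the embedding and output layers.

```python
def choose_modules_to_save(model_name_or_path, use_q_lora):
    """Decide which full layers to train alongside the LoRA adapters.

    Returns None when only the adapters should be trained, or a list of layer
    names to pass as `modules_to_save` in the adapter configuration.
    """
    is_chat_model = "chat" in model_name_or_path.lower()  # assumption: name-based check
    if use_q_lora or is_chat_model:
        # Chat models already know the ChatML special tokens, and the layers of
        # an Int4 model should not be trainable.
        return None
    # Base model with plain LoRA: the embedding and output layers must learn
    # the special tokens. Layer names here are assumptions.
    return ["wte", "lm_head"]

if __name__ == "__main__":
    for name, q_lora in [("Qwen/Qwen-VL", False),
                         ("Qwen/Qwen-VL-Chat", False),
                         ("Qwen/Qwen-VL-Chat-Int4", True)]:
        print(name, q_lora, choose_modules_to_save(name, q_lora))
```

The returned value would typically be handed to the `modules_to_save` argument of `peft`'s `LoraConfig`; a non-empty list is what drives the larger memory footprint of LoRA on the base model.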
Unlike full-parameter finetuning, training with LoRA or Q-LoRA saves only the adapter parameters. You can load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

If you want to merge the adapters and save the finetuned model as a standalone model (you can only do this with LoRA, and you CANNOT merge the parameters from Q-LoRA), you can run the following code:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe_serialization are optional.
# They control checkpoint sharding and saving the model in safetensors format, respectively.
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```

Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. Besides, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed.

### Profiling of Memory and Speed

We profile the GPU memory and training speed of both LoRA and Q-LoRA in a single-GPU training setup. LoRA (Base) refers to training with a trainable embedding and output layer, while LoRA (Chat) has no trainable embedding and output layer. In this test, we experiment on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. We uniformly use a batch size of 1 and gradient accumulation of 8. Each sample contains an image. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 384, 512, 1024, and 2048.
The statistics are listed below: <table> <tr> <th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th> </tr> <tr> <th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th> </tr> <tr> <td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td> </tr> <tr> <td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G / 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td> </tr> <tr> <td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td> </tr> </table> ## Demo ### Web UI We provide code for users to build a web UI demo. Before you start, make sure you install the following packages: ``` pip install -r requirements_web_demo.txt ``` Then run the command below and click on the generated link: ``` python web_demo_mm.py ``` ## FAQ If you meet problems, please refer to [FAQ](FAQ.md) and the issues first to search a solution before you launch a new issue. ## License Agreement Researchers and developers are free to use the codes and model weights of both Qwen-VL and Qwen-VL-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. 
## Citation If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :) ```BibTeX @article{Qwen-VL, title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond}, author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren}, journal={arXiv preprint arXiv:2308.12966}, year={2023} } ``` ## Contact Us If you are interested to leave a message to either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com. ================================================ FILE: README_CN.md ================================================ 中文</a> ｜ <a href="README.md">English</a> ｜ <a href="README_JA.md">日本語</a> ｜ <a href="README_KO.md">한국어</a> <img src="assets/logo.jpg" width="400"/> Qwen-VL <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖</a> ｜ Qwen-VL-Chat <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖</a> (Int4: <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a> <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary">🤖</a> ) ｜ Qwen-VL-Plus <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">🤖</a> ｜ Qwen-VL-Max <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a> <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a> <a href="https://tongyi.aliyun.com/qianwen">Web</a> | <a href="http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg">APP</a> | <a href="https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start">API</a> | <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a 
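href="https://discord.gg/z3GAxXZ9Ce">Discord</a> | <a href="https://arxiv.org/abs/2308.12966">Paper</a> | <a href="TUTORIAL.md">Tutorial</a>

---

## Qwen-VL-Plus & Qwen-VL-Max

The Qwen-VL series has received another major upgrade: we are releasing two enhanced models, Qwen-VL-Plus and Qwen-VL-Max. They are currently free to access through <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Max">🤗</a>, <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Max/summary">🤖</a>, the [web page](https://qianwen.aliyun.com), the [APP](http://ofasys-wlcb.oss-accelerate-overseas.aliyuncs.com/QwenVL/blog/app_qrcode.jpg), and the [API](https://help.aliyun.com/zh/dashscope/developer-reference/vl-plus-quick-start).

| Model name | Model description |
| --- | --- |
| Qwen-VL-Plus | The enhanced version of the Qwen large vision-language model. It significantly improves detail recognition and text recognition, and supports images with resolutions above one million pixels and arbitrary aspect ratios. It delivers **excellent** performance across a broad range of visual tasks. |
| Qwen-VL-Max | The most capable Qwen large vision-language model. Compared with the enhanced version, it further improves visual reasoning and instruction following, and provides a higher level of visual perception and cognition. It delivers the **best** performance on a wider range of complex tasks. |

The main technical upgrades in these two versions are:

- Substantially stronger image-related reasoning;
- Substantially better recognition, extraction, and analysis of details and text in images;
- Support for high-definition images above one million pixels and for images of various aspect ratios.

These two models not only significantly surpass all previous best results from open-source LVLM models, but also perform on par with Gemini Ultra and GPT-4V on multiple text-image multimodal benchmarks. Qwen-VL-Max even outperforms OpenAI's GPT-4V and Google's Gemini Pro on tasks related to Chinese question answering and Chinese text comprehension.

<table> <thead>
<tr> <th>Model</th> <th>DocVQA (document understanding)</th> <th>ChartQA (chart understanding)</th> <th>AI2D (science diagrams)</th> <th>TextVQA (text reading)</th> <th>MMMU (multi-discipline problems)</th> <th>MathVista (mathematical reasoning)</th> <th>MM-Bench-CN (Chinese question answering)</th> </tr>
</thead> <tbody align="center">
<tr> <td>Other Best Open-source LVLM</td> <td>81.6% (CogAgent)</td> <td>68.4% (CogAgent)</td> <td>73.7% (Fuyu-Medium)</td> <td>76.1% (CogAgent)</td> <td>45.9% (Yi-VL-34B)</td> <td>36.7% (SPHINX-V2)</td> <td>72.4% (InternLM-XComposer-VL)</td> </tr>
<tr> <td>Gemini Pro</td> <td>88.1%</td> <td>74.1%</td> <td>73.9%</td> <td>74.6%</td> <td>47.9%</td> <td>45.2%</td> <td>74.3%</td> </tr>
<tr> <td>Gemini Ultra</td> <td>90.9%</td> <td>80.8% <sup>1</sup></td> <td>79.5% <sup>1</sup></td> <td>82.3% <sup>1</sup></td> <td>59.4% <sup>1</sup></td> <td>53.0% <sup>1</sup></td> <td>-</td> </tr>
<tr> <td>GPT-4V</td> <td>88.4%</td> <td>78.5%</td> <td>78.2%</td> <td>78.0%</td> <td>56.8%</td> <td>49.9%</td> <td>73.9%</td> </tr>
<tr> <td>Qwen-VL-Plus</td> <td>91.4%</td>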
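<td>78.1%</td> <td>75.9%</td> <td>78.9%</td> <td>44.0%</td> <td>43.3%</td> <td>68.0%</td> </tr>
<tr> <td>Qwen-VL-Max</td> <td>92.5% <sup>1</sup></td> <td>79.8% <sup>2</sup></td> <td>79.3% <sup>2</sup></td> <td>79.5% <sup>2</sup></td> <td>51.4% <sup>3</sup></td> <td>51.0% <sup>2</sup></td> <td>75.1% <sup>1</sup></td> </tr>
</tbody> </table>

All results are obtained without using any external OCR tool ("only pixel").

---

## News

* 2024.01.18: We release Qwen-VL-Max, which significantly surpasses all previous best results from open-source LVLM models and performs on par with Gemini Ultra and GPT-4V on multiple text-image multimodal benchmarks. You can try the new model directly on the [Qwen web page or APP](https://qianwen.aliyun.com).
* 2023.11.28: A single Qwen-VL model achieves the best result on [DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1), surpassing GPT4V and PALI-X. It is also a general-purpose model: given an image as direct input, it can analyze and understand a wide range of tasks.
* 2023.9.12: We update the Qwen-VL-Chat model with more robust Chinese instruction following, better understanding and question answering over web page and table images, and better dialogue performance (TouchStone: Chinese 401.2->481.7, English 645.2->711.6).
* 2023.9.12: We now support finetuning of Qwen-VL and Qwen-VL-Chat, including full-parameter finetuning, LoRA, and Q-LoRA.
* 2023.9.8: Thanks to [camenduru](https://github.com/camenduru) for contributing the [Colab](https://github.com/camenduru/Qwen-VL-Chat-colab) example. Everyone can use it as a tutorial to run a local or online demo on a 12G GPU.
* 2023.9.5: Qwen-VL-Chat achieves the current best results on both the perception and cognition tracks of the community multimodal leaderboard [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
* 2023.9.4: Qwen-VL achieves the current best results for both image and video understanding on the community multimodal leaderboard [SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard).
* 2023.9.1: We release the [TouchStone](https://github.com/OFA-Sys/TouchStone) evaluation, a comprehensive assessment of LVLM capabilities. It examines not only visual description and reasoning, but also literary creation based on visual content. It works by expressing multimodal information as text and using LLMs for evaluation.
* 2023.8.31: We release the quantized Qwen-VL-Chat model, **Qwen-VL-Chat-Int4**, which has low GPU memory usage, significantly faster inference than the half-precision model, and only a small loss on benchmark evaluations.
* 2023.8.22: We release Qwen-VL and Qwen-VL-Chat simultaneously on ModelScope and Hugging Face. We also provide a [paper](https://arxiv.org/abs/2308.12966) describing the model architecture, training details, and model performance.

---

## Qwen-VL

**Qwen-VL** is a Large Vision Language Model (LVLM) developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as input, and outputs text and bounding boxes. The features of the Qwen-VL series include:

- **Strong performance**: On standard English benchmarks across four major categories of multimodal tasks (Zero-shot Captioning/VQA/DocVQA/Grounding), it achieves the best results among generalist models of comparable size;
- **Multilingual dialogue model**: It natively supports dialogue in English, Chinese, and other languages, and supports end-to-end recognition of long bilingual Chinese-English text in images;
- **Multi-image interleaved dialogue**: It supports multi-image input and comparison, question answering about a specified image, multi-image literary creation, and more;
- **The first generalist model supporting open-domain grounding in Chinese**: It annotates bounding boxes from open-domain language expressions in Chinese;
- **Fine-grained recognition and understanding**: Compared with the 224 resolution used by other open-source LVLMs, Qwen-VL is the first open-source LVLM with a 448 resolution. The higher resolution improves fine-grained text recognition, document question answering, and bounding box annotation.

<img src="assets/demo_vl.gif" width="400"/>

We currently provide two models in the Qwen-VL series:

- Qwen-VL: Qwen-VL uses the pretrained Qwen-7B model to initialize the language model and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) to initialize the visual encoder, with a single randomly initialized cross-attention layer added in between. It is trained on about 1.5B image-text pairs. The final image input resolution is 448.
- Qwen-VL-Chat: Building on Qwen-VL, we use alignment techniques to create Qwen-VL-Chat, an LLM-based visual AI assistant that supports more flexible interaction, including multiple images, multi-turn question answering, and creative tasks.

## Evaluation

We evaluate the model's capabilities from three perspectives:

1. **Standard English benchmarks**, which evaluate the model's basic task capabilities. We currently evaluate four major categories of multimodal tasks:
   - Zero-shot Captioning: evaluates zero-shot image captioning on unseen datasets;
   - General VQA: evaluates general question answering, such as yes/no questions, color, counting, and category;
   - Text-based VQA: evaluates recognition and question answering about text in images, such as document QA, chart QA, and text QA;
   - Referring Expression Comprehension: evaluates the ability to draw a bounding box for a given object description.

2. **TouchStone**: To evaluate the model's overall text-image dialogue ability and its level of alignment with humans, we built TouchStone, a benchmark that uses GPT4 scoring to evaluate LVLM models. In TouchStone-v0.1:
   - The benchmark covers a total of 300+ images, 800+ questions, and 27 categories, including attribute-based Q&A, celebrity and landmark Q&A, film and TV Q&A, visual reasoning, counterfactual reasoning, writing poetry, story writing, product comparison, and image-based problem solving, covering **as broad a range of categories as possible**.
   - To work around the fact that GPT4 cannot currently read images directly, we provide **detailed, human-annotated descriptions** for all evaluated images, and pass the detailed description, the question, and the model's output together to GPT4 for scoring.
   - The evaluation includes both English and Chinese versions.

3. **Other multimodal leaderboards**: We also evaluate the model on other general multimodal leaderboards:
   - MME Benchmark: a comprehensive evaluation benchmark for multimodal large language models. It evaluates **perception and cognition** abilities on a total of 14 subtasks. Qwen-VL-Chat achieves the current best results on both overall dimensions.
   - SEED-Bench: a multimodal benchmark of 19K multiple-choice questions with human annotations for evaluating multimodal LLMs, covering 12 evaluation dimensions including **image and video understanding**. Qwen-VL and Qwen-VL-Chat achieve the current best results on this benchmark.

The evaluation results are as follows:

Qwen-VL shows a clear advantage over current SOTA generalist models on multiple VL tasks and covers a more comprehensive range of capabilities.

<img src="assets/radar.png" width="600"/>

### Zero-shot Image Captioning & General VQA

<table> <thead>
<tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr>
<tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2dev</th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr>
</thead> <tbody align="center">
<tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr>
<tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td> <td>31.6</td> </tr>
<tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr>
<tr> <td>Kosmos-2</td> <td>-</td> <td>66.7</td> <td>45.6</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr>
<tr> <td>InstructBLIP (Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr>
<tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>
<tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td>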
<td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>
<tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP -FlanT5-XL)</td> <td>86.1 (PALI-X -55B)</td> <td>66.1 (PALI-X -55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+ GPT-4)</td> <td>70.9 (PALI-X -55B)</td> </tr>
</tbody> </table>

- For zero-shot captioning, Qwen-VL achieves the **SOTA** on Flickr30K and results competitive with InstructBlip on Nocaps.
- For general VQA, Qwen-VL achieves the **SOTA** among LVLM models of comparable scale under the same settings.

### Text-oriented VQA

<table> <thead>
<tr> <th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th> </tr>
</thead> <tbody align="center">
<tr> <td rowspan="5">Generalist Models</td> <td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td> <td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td> </tr>
<tr> <td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td> </tr>
<tr> <td>Specialist SOTAs (Specialist/Finetuned)</td> <td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td> </tr>
</tbody> </table>

- On text-related recognition and question answering benchmarks, Qwen-VL achieves the best results among generalist LVLMs at its scale.
- Resolution matters a great deal for several of the benchmarks above. Most open-source LVLM models with a 224 resolution either cannot complete these evaluations or can only do so by cutting the image into tiles. Qwen-VL raises the resolution to 448, so it can run these evaluations directly in an end-to-end manner. On many tasks Qwen-VL even outperforms the Pix2Struct-Large model, which uses a 1024 resolution.

### Referring Expression Comprehension

<table> <thead>
<tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr>
<tr> <th>val</th> <th>test-A</th> <th>test-B</th>
<th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr>
</thead> <tbody align="center">
<tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr>
<tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr>
<tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr>
<tr> <td>VisionLLM-H</td> <td></td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24 </td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr>
<tr> <td>Shikra-13B</td> <td>87.83 </td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td> <td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td> </tr>
<tr> <td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td> </tr>
<tr> <td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td> </tr>
<tr> <td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td> <td>G-DINO-L</td> <td>90.56 </td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td> </tr>
<tr> <td>UNINEXT-H</td> <td>92.64 </td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td> <td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td> </tr>
<tr> <td>ONE-PEACE</td> <td>92.58 </td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td> </tr>
</tbody> </table>

- On grounding tasks, Qwen-VL outperforms Shikra-13B across the board and achieves the current **SOTA** among generalist LVLM models on Refcoco.
- Qwen-VL has not been trained on any Chinese grounding data, but by training on Chinese caption data and English grounding data it generalizes to Chinese grounding in a zero-shot way.

We provide **all** of the above evaluation scripts for reproducing our experimental results. Please read [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) for more information.

### Chat evaluation

TouchStone is a benchmark that uses GPT4 scoring to evaluate an LVLM's text-image dialogue ability and its level of alignment with humans. It covers 300+ images, 800+ questions, and 27 categories, including basic attributes, celebrities and landmarks, visual reasoning, writing poetry, story writing, product comparison, and image-based problem solving, covering **as broad a range of categories as possible**. For a detailed introduction to TouchStone, please read [touchstone/README_CN.md](touchstone/README_CN.md).

#### English

| Model | Score |
| ---------------- | ----- |
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| LLaVA | 602.7 |
| mPLUG-Owl | 605.4 |
| Qwen-VL-Chat | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |

#### Chinese

| Model | Score |
| ---------------- | ----- |
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |

Qwen-VL-Chat achieves the best results among current LVLM models in both the Chinese and English alignment evaluations.

### Other Benchmarks

#### MME Benchmark

MME is a comprehensive evaluation benchmark for multimodal large language models. It evaluates **perception and cognition** abilities on a total of 14 subtasks. Qwen-VL-Chat achieves SOTAs on this benchmark. For full reproduction details, see [here](eval_mm/mme/EVAL_MME.md).
<img src="eval_mm/mme/perception.jpg" width="600"/>
<img src="eval_mm/mme/cognition.jpg" width="600"/>

#### SEED-Bench

SEED-Bench is a multimodal benchmark of 19K multiple-choice questions with human annotations for evaluating multimodal LLMs. It covers 12 evaluation dimensions, including **image and video understanding**. Qwen-VL and Qwen-VL-Chat achieve SOTA results on this benchmark. See [here](eval_mm/seed_bench/EVAL_SEED.md) for full reproduction details.

<img src="eval_mm/seed_bench/leaderboard.jpg"/>

## Requirements

* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above recommended
* CUDA 11.4 and above recommended (for GPU users)

## Quickstart

We provide simple examples that show how to use Qwen-VL and Qwen-VL-Chat with 🤖 ModelScope and 🤗 Transformers. Before you start, make sure your environment is set up and the required packages are installed. Above all, make sure you meet the requirements listed above, then install the dependencies.

```bash
pip install -r requirements.txt
```

You can now start using the models with Transformers or ModelScope. For more usage of the vision module, see the [tutorial](TUTORIAL_zh.md).

#### 🤗 Transformers

To run inference with Qwen-VL-Chat, all you need is the few lines of code shown below. **Make sure that you are using the latest code.** ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) # Note: The default behavior of the tokenizer now has special-token injection attack prevention turned off. tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # use bf16 (recommended on A100, H100, RTX3060, RTX3070, etc. to save GPU memory) # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 (recommended on V100, P100, T4, etc. to save GPU memory) # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() # use CPU only (needs about 32GB of RAM) # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval() # use the GPU by default (needs about 24GB of GPU memory) model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() # Specify generation hyperparameters such as length and top_p (not needed for transformers 4.32.0 and above) # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # 1st dialogue turn query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': '这是什么?'}, ]) 
response, history = model.chat(tokenizer, query=query, history=None) print(response) # 图中是一名女子在沙滩上和狗玩耍，旁边是一只拉布拉多犬，它们处于沙滩上。 # 2nd dialogue turn response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history) print(response) # <ref>击掌</ref><box>(536,509),(588,602)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('1.jpg') else: print("no box") ```

<img src="assets/demo_highfive.jpg" width="500"/>

Running the Qwen-VL pretrained base model is just as simple. ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) # use bf16 (recommended on A100, H100, RTX3060, RTX3070, etc. to save GPU memory) # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 (recommended on V100, P100, T4, etc. to save GPU memory) # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval() # use CPU only (needs about 32GB of RAM) # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval() # use the GPU by default (needs about 24GB of GPU memory) model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval() # Specify generation hyperparameters such as length and top_p (not needed for transformers 4.32.0 and above) # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': 'Generate the caption in English with grounding:'}, ]) inputs = tokenizer(query, return_tensors='pt') inputs = inputs.to(model.device) pred = model.generate(**inputs) response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False) print(response) # <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the 
caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|> image = tokenizer.draw_bbox_on_latest_picture(response) if image: image.save('2.jpg') else: print("no box") ```

<img src="assets/demo_spotting_caption.jpg" width="500"/>

If you cannot fetch the model and code from HuggingFace when running the code above, you can first download them from ModelScope to a local directory and then load the model from there:

```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-VL')
model_dir = snapshot_download('qwen/Qwen-VL-Chat')

# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
```

#### 🤖 ModelScope

ModelScope is an open-source Model-as-a-Service (MaaS) platform that offers AI developers flexible, easy-to-use, and cost-effective one-stop model services. Using ModelScope is just as simple, as shown below: ```python from modelscope import ( snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig ) import torch model_id = 'qwen/Qwen-VL-Chat' revision = 'v1.0.0' model_dir = snapshot_download(model_id, revision=revision) torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) if not hasattr(tokenizer, 'model_dir'): tokenizer.model_dir = model_dir # use bf16 (recommended on A100, H100, RTX3060, RTX3070, etc. to save GPU memory) # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 (recommended on V100, P100, T4, etc. to save GPU memory) # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval() # use CPU only (needs about 32GB of RAM) # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval() # use the GPU by default (needs about 24GB of GPU memory) model = 
AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval() # Specify generation hyperparameters (not needed for transformers 4.32.0 and above) # model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True) # 1st dialogue turn # Either a local path or an url between <img></img> tags. image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg' response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None) print(response) # 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，与人互动。 # 2nd dialogue turn response, history = model.chat(tokenizer, '输出击掌的检测框', history=history) print(response) # <ref>"击掌"</ref><box>(211,412),(577,891)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('output_chat.jpg') else: print("no box") ```

## Quantization

### Usage

We provide a quantization solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release an Int4 quantized version of Qwen-VL-Chat, [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4). The model is nearly lossless in benchmark evaluation while offering clear advantages in memory usage and inference speed.

The following shows how to use the quantized model. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above) and install the required packages:

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

If you run into problems installing `auto-gptq`, we advise you to check the official [repo](https://github.com/PanQiWei/AutoGPTQ) for a suitable wheel.

You can then load the quantized model and run inference in the same way as above: ```python from transformers import AutoModelForCausalLM, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen-VL-Chat-Int4", device_map="auto", trust_remote_code=True ).eval() # Either a local path or an url between <img></img> tags. 
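image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg' response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None) print(response) ```

### Performance

We report model performance at different precisions on the **[TouchStone](https://github.com/OFA-Sys/TouchStone)** benchmark and find that the quantized model shows no significant performance degradation. Results are shown below:

| Quantization | ZH         | EN            |
| ------------ | :--------: | :-----------: |
| BF16         | 401.2      | 645.2         |
| Int4         | 386.6      | 651.4         |

### Inference Speed

We measured the average inference speed (tokens/s) of the BF16 and Int4 models when generating 1792 (2048-258) and 7934 (8192-258) tokens with one image (which takes 258 tokens) as context.

| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16         | 28.87               | 24.32               |
| Int4         | 37.79               | 34.34               |

The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.

### GPU Memory Usage

We also measured the GPU memory needed by the BF16 and Int4 models to generate 1792 (2048-258) and 7934 (8192-258) tokens with one image as input. Results are shown below:

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16         | 22.60GB                             | 28.01GB                               |
| Int4         | 11.82GB                             | 17.23GB                               |

The speed and memory profiling above was done with [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py).

## Finetuning

We provide the script `finetune.py` so that users can finetune the model on their own data for downstream tasks. We also provide shell scripts to reduce the workload. The script supports [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts we provide use DeepSpeed, so we recommend installing DeepSpeed first.

First, prepare your training data. Put all samples into a list and save it to a JSON file. Each sample is a dictionary with an `id` and a `conversations` list. An example is shown below:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是Qwen-VL,一个支持视觉输入的大模型。"
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？"
      },
      {
        "from": "assistant",
        "value": "图中是一只拉布拉多犬。"
      },
      {
        "from": "user",
        "value": "框出图中的格子衬衫"
      },
      {
        "from": "assistant",
        "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>"
      }
    ]
  },
  {
    "id": "identity_2",
    "conversations": [
      {
        "from": "user",
        "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪"
      },
      {
        "from": "assistant",
        "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。"
      }
    ]
  }
]
```

To support diverse VL tasks, we added the following special tokens: `<img> </img> <ref> </ref> <box> </box>`.

Content with image input is written as `Picture id: <img>img_path</img>\n{your prompt}`, where `id` is the position of the image in the conversation. `img_path` can be a local image path or a web URL.

A bounding box in the conversation is written as `<box>(x1,y1),(x2,y2)</box>`, where `(x1, y1)` and `(x2, y2)` are the top-left and bottom-right corners, normalized to the range `[0, 1000)`. The text description that a box refers to can be given as `<ref>text_caption</ref>`.

Once the data is ready, you can finetune with the shell scripts we provide. Remember to specify the path to your data in the script.

The finetuning scripts support:

- Full-parameter finetuning
- LoRA
- Q-LoRA

### Full-parameter Finetuning

By default, full-parameter finetuning updates all parameters of the LLM during training. In our experiments, freezing the ViT parameters during finetuning gives better results. Run this script to start training:

```bash
# Distributed training. We do not provide a single-GPU training script because insufficient GPU memory would make single-GPU training fail.
sh finetune/finetune_ds.sh
```

Note that you need to specify the correct model name or path, the data path, and the output directory in the script. If you want to change the DeepSpeed setup, remove the `--deepspeed` argument or edit the DeepSpeed configuration JSON file as needed. The script also supports mixed-precision training, so you can set `--bf16 True` or `--fp16 True`. Empirically, if your machine supports bf16, we recommend bf16 so that training stays consistent with our pretraining and alignment, which is why it is the default.

### LoRA

Running LoRA is similar to full-parameter finetuning. Before you start, make sure `peft` is installed, and remember to set the correct model, data, and output paths. We recommend an absolute path for the model, because LoRA stores only the adapter parameters and the adapter configuration JSON file records the path of the pretrained model, which is used to load its weights. As before, you can set bf16 or fp16.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

Unlike full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) updates only the adapter layers and leaves the original language model parameters untouched. This lets you train with a lower memory footprint and a lower compute cost.

Note that if you apply LoRA to the pretrained base model rather than the chat model, the embedding and output layers are set as trainable. The base model has never seen the special tokens of the ChatML format, so these parameters must be trainable for the model to learn to understand and predict them. It also means that if your training introduces new special tokens, you need to make these layers trainable through `modules_to_save` in the code. If you want to save memory, consider applying LoRA to the chat model, which lowers memory usage substantially. The memory and speed records later in this document give the details.

### Q-LoRA

If you still run out of memory, consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)). It uses a 4-bit quantized model together with techniques such as paged attention to lower memory usage further. To run Q-LoRA, run the following script: ```bash # Single GPU training sh finetune/finetune_qlora_single_gpu.sh # Distributed training sh 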
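finetune/finetune_qlora_ds.sh ```

We recommend training with the Int4 quantized model we provide, i.e., Qwen-VL-Chat-Int4. Please do **NOT** use the non-quantized models! Unlike full-parameter finetuning and LoRA, Q-LoRA supports fp16 only. The special-token issue described for LoRA also applies to Q-LoRA, and the parameters of an Int4 model cannot be set as trainable. Fortunately, we only release an Int4 model for the chat model, so you do not need to worry about this. If you insist on introducing new special tokens with Q-LoRA, however, we cannot guarantee that training will succeed.

Unlike full-parameter finetuning, LoRA and Q-LoRA training saves only the adapter parameters. To use the model after LoRA training, load it with the following code:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

If loading the adapter in one step does not suit you or gets in the way of your downstream application, you can merge the adapter into the model and save it first (LoRA supports merging, Q-LoRA does not), then load the new model in the usual way. For example:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe serialization are not necessary.
# They respectively work for sharding checkpoint and save the model to safetensors
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```

Note: for distributed training, set the distributed training hyperparameters according to your needs and your machines. Also set the data length with `--model_max_length`, based on your data, available memory, and expected training speed.

### Memory Usage and Training Speed

The table below records the memory usage and training speed of Qwen-VL on a single GPU with LoRA and Q-LoRA for different input lengths. LoRA (Base) means the embedding and output layers are trained, while LoRA (Chat) leaves them frozen. The measurements were taken on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. All runs use a batch size of 1 and gradient accumulation of 8, and every sample contains one image. We record memory usage (GB) and training speed (s/iter) for input lengths of 384, 512, 1024, and 2048:

<table>
<tr> <th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th> </tr>
<tr> <th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th> </tr>
<tr> <td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td> </tr>
<tr> <td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G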
/ 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td> </tr>
<tr> <td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td> </tr>
</table>

## Demo

### Web UI

We provide a Web UI demo. Before you start, make sure the following packages are installed:

```
pip install -r requirements_web_demo.txt
```

Then run the command below and click the generated link:

```
python web_demo_mm.py
```

## FAQ

If you run into problems, please check the [FAQ](FAQ_zh.md) and the existing issues first. Open a new issue only if the problem remains unsolved.

## License Agreement

Researchers and developers are free to use Qwen-VL and Qwen-VL-Chat or to build on them. Commercial use is also allowed; see [LICENSE](LICENSE) for details. For commercial use, please fill in the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.

## Citation

If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

## Contact Us

If you would like to leave a message for our research or product team, please email us at qianwen_opensource@alibabacloud.com.

================================================
FILE: README_JA.md
================================================
<a href="README_CN.md">中文</a> ｜ <a href="README.md">English</a> ｜日本語

<img src="assets/logo.jpg" width="400"/>

Qwen-VL <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> ｜ Qwen-VL-Chat <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> ｜ Qwen-VL-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a>

<a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a> | <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">Demo</a> ｜ <a href="https://arxiv.org/abs/2308.12966">Paper</a> |
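<a href="https://github.com/camenduru/Qwen-VL-Chat-colab">Colab</a> | <a href="TUTORIAL_ja.md">Tutorial</a>

Japanese document maintainer: <a href="https://github.com/eltociear">Ikko Eltociear Ashimine</a>

**Qwen-VL** (Qwen Large Vision Language Model) is the multimodal version of Qwen (abbr. Tongyi Qianwen), the large model series proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as input, and outputs text and bounding boxes. Its features include:

- **Strong performance**: It significantly surpasses existing open-source Large Vision Language Models (LVLMs) of similar scale on multiple English evaluation benchmarks (including Zero-shot Captioning, VQA, DocVQA, and Grounding).
- **Multilingual LVLM supporting text recognition**: Qwen-VL naturally supports English, Chinese, and multilingual conversation, and promotes end-to-end recognition of Chinese and English bilingual text in images.
- **Multi-image interleaved conversations**: Multiple images can be input and compared. You can also ask questions about the images and do multi-image storytelling.
- **First generalist model supporting grounding in Chinese**: Bounding boxes are detected from open-domain language expressions in both Chinese and English.
- **Fine-grained recognition and understanding**: Compared with the 224\*224 resolution currently used by other open-source LVLMs, the 448\*448 resolution promotes fine-grained text recognition, document QA, and bounding box annotation.

<img src="assets/demo_vl.gif" width="400"/>

We release two models of the Qwen-VL series:

- Qwen-VL: The pretrained LVLM, which uses Qwen-7B to initialize the LLM and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) to initialize the visual encoder, and connects them with a randomly initialized cross-attention layer.
- Qwen-VL-Chat: A multimodal LLM-based AI assistant. Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities.

## News and Updates

* 2023.11.28 Qwen-VL achieves top-level single-model results on [DOCVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1), surpassing GPT4V and PALI-X, and is a general-purpose model that can analyze and understand a variety of tasks directly from image input. You can try the new model in the multimodal tab at https://qianwen.aliyun.com.
* 2023.9.25 The Qwen-VL-Chat model has been updated with more robust Chinese instruction following, better understanding and question answering on web-page and table images, and better dialogue performance (TouchStone: Chinese 401.2->481.7, English 645.2->711.6).
* 2023.9.12 We now support finetuning of the Qwen-VL models, including full-parameter finetuning, LoRA, and Q-LoRA.
* 2023.9.8 Thanks to [camenduru](https://github.com/camenduru) for contributing the [Colab](https://github.com/camenduru/Qwen-VL-Chat-colab) sample. You can use it as a tutorial to run a local or online demo on a 12G GPU.
* 2023.9.4 The Qwen-VL series achieves SOTA on [Seed-Bench](eval_mm/seed_bench/EVAL_SEED.md), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs, covering both image and video understanding.
* 2023.9.1 We release the [TouchStone](https://github.com/OFA-Sys/TouchStone) evaluation, a comprehensive assessment of multimodal language models that covers not only basic recognition and comprehension but also literary creation. It uses strong LLMs as judges and converts multimodal information into text.
* 2023.8.31 We release **Qwen-VL-Chat-Int4**, the Int4 quantized model for Qwen-VL-Chat, which achieves faster inference at a low memory cost, with no significant performance degradation in benchmark evaluation.
* 2023.8.22 We release **Qwen-VL** and **Qwen-VL-Chat** on ModelScope and Hugging Face. We also provide a [paper](https://arxiv.org/abs/2308.12966) with more details about the models, including training details and model performance.

## Evaluation

We evaluated the models' abilities from two perspectives:

1. **Standard Benchmarks**: We evaluate the models' basic task capabilities on four major categories of multimodal tasks:

   - Zero-shot Captioning: Evaluate the model's zero-shot image captioning ability on unseen datasets;
   - General VQA: Evaluate general question-answering ability on images, such as judgment, color, number, and category;
   - Text-based VQA: Evaluate the model's ability to recognize text in pictures, such as document QA and chart QA;
   - Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.
2. **TouchStone**: To evaluate overall text-image dialogue capability and the level of alignment with humans, we built a benchmark called TouchStone, which scores LVLMs with GPT4.

   - The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, poetry writing, summarizing multiple images, product comparison, and math problem solving;
   - To work around GPT4's current limitation of not accepting images directly, TouchStone provides fine-grained image annotations produced by human labeling. These detailed annotations, together with the questions and the model's output, are presented to GPT4 for scoring.
   - The benchmark includes both English and Chinese versions.

The evaluation results are as follows:

Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and has more comprehensive coverage in terms of capability range.

<img src="assets/radar.png" width="600"/>

### Zero-shot Captioning & General VQA

<table>
<thead>
<tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr>
<tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2dev</th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr>
<tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td>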
<td>31.6</td> </tr>
<tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr>
<tr> <td>Kosmos-2</td> <td>-</td> <td>80.5</td> <td>51.1</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr>
<tr> <td>InstructBLIP (Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr>
<tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>
<tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td> <td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>
<tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP -FlanT5-XL)</td> <td>86.1 (PALI-X -55B)</td> <td>66.1 (PALI-X -55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+ GPT-4)</td> <td>70.9 (PALI-X -55B)</td> </tr>
</tbody> </table>

- For zero-shot image captioning, Qwen-VL achieves **SOTA** on Flickr30K and results on NoCaps that are competitive with InstructBLIP.
- For general VQA, Qwen-VL achieves **SOTA** under the same generalist LVLM scale settings.

### Text-oriented VQA (Focused on text understanding abilities in images)

<table>
<thead>
<tr> <th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="5">Generalist Models</td> <td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td>
<td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td> </tr>
<tr> <td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td> </tr>
<tr> <td>Specialist SOTAs (Specialist/Finetuned)</td> <td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td> </tr>
</tbody> </table>

- In text-related recognition/QA evaluation, Qwen-VL achieves SOTA under the generalist LVLM scale settings.
- Resolution matters for several of the evaluations above. Most open-source LVLMs with 224 resolution either cannot run these evaluations or can only handle them by cutting images, whereas Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. On some tasks Qwen-VL even outperforms the Pix2Struct-Large model, which uses 1024 resolution.

### Referring Expression Comprehension

<table>
<thead>
<tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr>
<tr> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr>
<tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr>
<tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr>
<tr> <td>VisionLLM-H</td> <td></td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24</td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr>
<tr> <td>Shikra-13B</td> <td>87.83</td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td>
<td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td> </tr>
<tr> <td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td> </tr>
<tr> <td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td> </tr>
<tr> <td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td> <td>G-DINO-L</td> <td>90.56</td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td> </tr>
<tr> <td>UNINEXT-H</td> <td>92.64</td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td> <td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td> </tr>
<tr> <td>ONE-PEACE</td> <td>92.58</td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td> </tr>
</tbody> </table>

- Qwen-VL achieves **SOTA** among the generalist models on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks above.
- Qwen-VL has not been trained on any Chinese grounding data, but by training on Chinese caption data and English grounding data it can still generalize to Chinese grounding tasks in a zero-shot way.

We provide all of the evaluation scripts above so that our results can be reproduced. See [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) for more information.

### Chat Evaluation

TouchStone is a benchmark based on scoring with GPT4 that evaluates LVLMs' abilities in text-image dialogue and their level of alignment with humans. It covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, poetry writing, summarizing multiple images, product comparison, and math problem solving. See [touchstone/README_JA.md](touchstone/README_JA.md) for details.

#### English

| Model            | Score |
| ---------------- | ----- |
| PandaGPT         | 488.5 |
| MiniGPT4         | 531.7 |
| InstructBLIP     | 552.4 |
| LLaMA-AdapterV2  | 590.1 |
| LLaVA            | 602.7 |
| mPLUG-Owl        | 605.4 |
| Qwen-VL-Chat     | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |

#### Chinese

| Model            | Score |
| ---------------- | ----- |
| VisualGLM        | 247.1 |
| Qwen-VL-Chat     | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |

Qwen-VL-Chat achieves the best results in both the Chinese and English alignment evaluations.

### Other Benchmarks

#### SEED-Bench

SEED-Bench is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs. It covers 12 evaluation dimensions, including both **image** and **video** understanding. See [here](eval_mm/seed_bench/EVAL_SEED.md) for more details.

Qwen-VL and Qwen-VL-Chat achieve SOTA on this benchmark.

<img src="eval_mm/seed_bench/leaderboard.jpg"/>

## Requirements

* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above recommended
* CUDA 11.4 and above recommended (for GPU users)

## Quickstart

Below, we provide simple examples that show how to use Qwen-VL and Qwen-VL-Chat with 🤖 ModelScope and 🤗 Transformers.

Before running the code, make sure you have set up the environment and installed the required packages. Check that you meet the requirements above, then install the dependent libraries.

```bash
pip install -r requirements.txt
```

You can now start with ModelScope or Transformers. For more usage of the vision encoder, see the [tutorial](TUTORIAL_ja.md).

#### 🤗 Transformers

To use Qwen-VL-Chat for inference, all you need is the few lines of code shown below. However, **make sure that you are using the latest code.** ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) # Note: The default behavior now has injection attack prevention off. tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # 1st dialogue turn query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': '这是什么?'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) # (translated) The picture shows a woman playing with her dog on the beach, next to a Labrador; they are on the sand. # 2nd dialogue turn response, 
history = model.chat(tokenizer, '框出图中击掌的位置', history=history) print(response) # <ref>击掌</ref><box>(536,509),(588,602)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('1.jpg') else: print("no box") ```

<img src="assets/demo_highfive.jpg" width="500"/>

<details>
<summary>Running Qwen-VL</summary>

Running the Qwen-VL pretrained base model is also simple. ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': 'Generate the caption in English with grounding:'}, ]) inputs = tokenizer(query, return_tensors='pt') inputs = inputs.to(model.device) pred = model.generate(**inputs) response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False) print(response) # <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> playing on the beach<|endoftext|> image = 
tokenizer.draw_bbox_on_latest_picture(response) if image: image.save('2.jpg') else: print("no box") ```

<img src="assets/demo_spotting_caption.jpg" width="500"/>

</details>

If you run into network problems while downloading the model checkpoints and code from HuggingFace, you can fetch the checkpoint from ModelScope instead, as shown below.

```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-VL')
model_dir = snapshot_download('qwen/Qwen-VL-Chat')

# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="cuda",
    trust_remote_code=True
).eval()
```

#### 🤖 ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS) that provides flexible and cost-effective model services to AI developers. Similarly, you can run the models with ModelScope as shown below: ```python from modelscope import ( snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig ) import torch model_id = 'qwen/Qwen-VL-Chat' revision = 'v1.0.0' model_dir = snapshot_download(model_id, revision=revision) torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) if not hasattr(tokenizer, 'model_dir'): tokenizer.model_dir = model_dir # use bf16 # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval() # use auto model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval() # Specify hyperparameters for generation model.generation_config = 
GenerationConfig.from_pretrained(model_dir, trust_remote_code=True) # 1st dialogue turn # Either a local path or an url between <img></img> tags. image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg' response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None) print(response) # (translated) The picture shows a young woman playing with her dog, a Labrador, on the beach. They are sitting on the sand, and the dog is raising its front leg to interact with her. # 2nd dialogue turn response, history = model.chat(tokenizer, '输出击掌的检测框', history=history) print(response) # <ref>"击掌"</ref><box>(211,412),(577,891)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('output_chat.jpg') else: print("no box") ```

<img src="assets/demo_highfive.jpg" width="500"/>

## Quantization

### Usage

We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release an Int4 quantized model for Qwen-VL-Chat, [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4). The model is nearly lossless in quality while improving both memory cost and inference speed.

Here we show how to use the quantized model for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above) and install the required packages:

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

If you have problems installing `auto-gptq`, we advise you to check the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

You can then load the quantized model easily and run inference as usual: ```python from transformers import AutoModelForCausalLM, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( "Qwen/Qwen-VL-Chat-Int4", device_map="auto", trust_remote_code=True ).eval() # Either a local path or an url between <img></img> tags. 
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg' response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None) print(response) ``` ### 性能ベンチマーク **[TouchStone](https://github.com/OFA-Sys/TouchStone)** において、BF16 モデルと Int4 モデルの両方のモデル性能を例示し、量子化モデルが大きな性能劣化に悩まされないことを見出しました。結果を以下に示します： | Quantization | ZH | EN | | ------------ | :--------: | :-----------: | | BF16 | 401.2 | 645.2 | | Int4 | 386.6 | 651.4 | ### 推論スピード BF16 精度と Int4 量子化の下で、画像（258 トークンを要する）のコンテキストで 1792（2048-258）トークンと 7934（8192-258）トークンを生成する平均推論速度（トークン/秒）をそれぞれ測定した。 | Quantization | Speed (2048 tokens) | Speed (8192 tokens) | | ------------ | :-----------------: | :-----------------: | | BF16 | 28.87 | 24.32 | | Int4 | 37.79 | 34.34 | プロファイリングは、PyTorch 2.0.1 と CUDA 11.4 を搭載したシングル A100-SXM4-80G GPU で実行されます。 ### GPU メモリ使用量また、1792 (2048-258) 個のトークン (画像を含む) をコンテキストとしてエンコードする場合 (および単一のトークンを生成する場合) と、7934 (8192-258) 個のトークン (画像をコンテキストとして生成する場合) をそれぞれ BF16 または Int4 量子化レベルでエンコードする場合の GPU メモリ使用量のピーク値をプロファイリングしました。結果を以下に示します。 | Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens | | ------------ | :---------------------------------: | :-----------------------------------: | | BF16 | 22.60GB | 28.01GB | | Int4 | 11.82GB | 17.23GB | 上記のスピードとメモリーのプロファイリングは、[このスクリプト](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py)を使用しています。 ## ファインチューニング現在、公式のトレーニングスクリプト `finetune.py` を提供しています。さらに、finetune.py のシェルスクリプトを提供し、finetune.py を実行することで、finetune.py を起動することができる。さらに、安心してファインチューニングを開始するためのシェルスクリプトも提供しています。このスクリプトは、[DeepSpeed](https://github.com/microsoft/DeepSpeed) および [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) を使用したトレーニングをサポートします。弊社が提供するシェル・スクリプトは DeepSpeed を使用するため、事前に DeepSpeed をインストールすることをお勧めします: 学習データを準備するには、すべてのサンプルをリストにまとめ、json ファイルに保存する必要があります。各サンプルは id と会話リストで構成される辞書です。以下は 1 つのサンプルを含む単純なリストの例です: ```json [ { "id": "identity_0", "conversations": [ { "from": "user", "value": "你好" }, { 
"from": "assistant", "value": "我是Qwen-VL,一个支持视觉输入的大模型。" } ] }, { "id": "identity_1", "conversations": [ { "from": "user", "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？" }, { "from": "assistant", "value": "图中是一只拉布拉多犬。" }, { "from": "user", "value": "框出图中的格子衬衫" }, { "from": "assistant", "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>" } ] }, { "id": "identity_2", "conversations": [ { "from": "user", "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪" }, { "from": "assistant", "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。" } ] } ] ``` VL タスクの場合、`<img> </img> <ref> </ref> <box> </box>` などの特別なトークンが使用されます。画像は「画像 ID: `<img>img_path</img>\n{your prompt}`」として表されます。ここで、「id」は会話内の画像の位置を 1 から示します。「img_path」はローカルファイルパスまたは Web リンク。座標ボックスは `<box>(x1,y1),(x2,y2)</box>`・として表されます。ここで、`(x1, y1)` と `(x2, y2)` は範囲内の正規化された値です。 `[0, 1000)`。対応するテキスト説明は `<ref>text_caption</ref>` によって識別できます。データ準備の後、提供されているシェルスクリプトを使って微調整を実行することができる。データファイルのパス `$DATA` を忘れずに指定してください。ファインチューニングのスクリプトを使用することで、以下のことが可能になる： - フルパラメーター・ファインチューニング - LoRA - Q-LoRA ### フルパラメーターファインチューニングフルパラメータパラメータのファインチューニングを行うには、トレーニングプロセス全体ですべてのパラメータを更新する必要があります。トレーニングを開始するには、以下のスクリプトを実行します： ```bash # 分散トレーニング。GPU メモリが不足するとトレーニングが破綻するため、シングル GPU のトレーニングスクリプトは提供していません。 sh finetune/finetune_ds.sh ``` シェルスクリプトでは、正しいモデル名またはパス、データパス、出力ディレクトリを指定することを忘れないでください。変更したい場合は、引数 `--deepspeed` を削除するか、要件に基づいて DeepSpeed 設定 json ファイルを変更してください。さらに、このスクリプトは混合精度のトレーニングに対応しており、`--bf16 True` または `--fp16 True` を使用することができます。経験的に、あなたのマシンがbf16をサポートしている場合、私たちのプリトレーニングとアライメントを整合させるためにbf16を使用することをお勧めします。 ### LoRA 同様に、LoRA を実行するには、以下のように別のスクリプトを使って実行する。始める前に、`peft` がインストールされていることを確認してください。また、モデル、データ、出力へのパスを指定する必要があります。学習済みモデルには絶対パスを使用することをお勧めします。なぜなら、LoRA はアダプタのみを保存し、アダプタ設定 json ファイルの絶対パスは、ロードする事前学習済みモデルを見つけるために使用されるからです。また、このスクリプトは bf16 と fp16 の両方をサポートしている。 ```bash # シングル GPU トレーニング sh 
finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

Compared with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This greatly reduces memory cost and also reduces computation cost.

Note that if you use LoRA to finetune the base language model (e.g., Qwen-VL) instead of a chat model (e.g., Qwen-VL-Chat), the script automatically switches the embedding and output layers to trainable parameters. This is because the base language model has no knowledge of the special tokens introduced by the ChatML format, so these layers must be updated for the model to understand and predict the tokens. In other words, if your LoRA training introduces special tokens, you should make these layers trainable by setting `modules_to_save` inside the code. We also find a significant gap in the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA-finetune the chat models. Check the profile below for details.

### Q-LoRA

If you still run out of memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model together with other techniques such as paged attention to run at an even lower memory cost. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-VL-Chat-Int4. Note that, unlike full-parameter finetuning and LoRA, Q-LoRA supports only fp16.

Unlike full-parameter finetuning, the training of both LoRA and Q-LoRA saves only the adapter parameters. You can load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

If you want to merge the adapters and save the finetuned model as a standalone model (this is only possible with LoRA; you cannot merge the parameters from Q-LoRA), run the following code:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size and safe serialization are not necessary.
# They control checkpoint sharding and saving the model in safetensors format, respectively.
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```

Note: For multi-GPU training, you need to specify hyperparameters for distributed training that suit your machine. We also recommend setting the maximum sequence length with the `--model_max_length` argument, taking your data, memory footprint, and training speed into account.

### Profiling of Memory and Speed

We profile the GPU memory and training speed of LoRA and Q-LoRA in a single-GPU training setup. LoRA (Base) trains the embedding and output layers, while LoRA (Chat) does not. The tests run on a single A100-SXM4-80G GPU with CUDA 11.8 and PyTorch 2.0. Each sample contains an image. We profile the memory (GB) and speed (s/iter) for input lengths of 384, 512, 1024, and 2048. The statistics are listed below:

<table>
<tr> <th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th> </tr>
<tr> <th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th> </tr>
<tr> <td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td> </tr>
<tr> <td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G / 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td> </tr>
<tr> <td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td> </tr>
</table>

The shell scripts use `torchrun` to run single-GPU or multi-GPU training, so you need to specify hyperparameters for distributed training that suit your machine.

## Demo

### Web UI

We provide code for building a web UI demo. Before you start, make sure the following packages are installed:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```bash
python web_demo_mm.py
```

## FAQ

If you run into problems, please check the [FAQ](FAQ_ja.md) and the existing issues for a solution before opening a new issue.

## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen-VL and Qwen-VL-Chat. Commercial use is also permitted. See [LICENSE](LICENSE) for details.

## Citation

If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

## Contact Us

If you would like to leave a message for our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.

================================================
FILE: README_KO.md
================================================

<a href="README_CN.md">中文</a> ｜ English ｜ <a href="README_JA.md">日本語</a> ｜ <a href="README_KO.md">한국어</a>

<img src="assets/logo.jpg" width="400"/>

Qwen-VL <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖 </a> | <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> ｜ Qwen-VL-Chat <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖 </a>| <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> ｜ Qwen-VL-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-VL-Chat-Int4">🤗</a>

<a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat</a> | <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a> | <a href="https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary">Demo</a> ｜ <a href="https://arxiv.org/abs/2308.12966">Paper</a> | <a href="https://github.com/camenduru/Qwen-VL-Chat-colab">Colab</a> | <a href="TUTORIAL.md">Tutorial</a>

---

**Qwen-VL** (Qwen Large Vision Language Model) is the multimodal version of Qwen (abbr. Tongyi Qianwen), the large model series proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as input and outputs text and bounding boxes. The features of Qwen-VL include:

- **Strong performance**: It significantly surpasses existing open-source Large Vision Language Models (LVLMs) of the same model scale on English evaluation benchmarks (including Zero-shot Captioning, VQA, DocVQA, and Grounding).
- **Multilingual LVLM supporting text recognition**: Qwen-VL naturally supports English, Chinese, and multilingual conversation, and it improves end-to-end recognition of bilingual Chinese-English text in images.
- **Multi-image interleaved conversations**: This feature allows the input and comparison of multiple images, as well as the ability to ask questions about specific images and engage in multi-image storytelling.
- **First generalist model supporting grounding in Chinese**: It detects bounding boxes through open-domain language expressions in both Chinese and English.
- **Fine-grained recognition and understanding**: Compared with the 224\*224 resolution currently used by other open-source LVLMs, the 448\*448 resolution improves fine-grained text recognition, document QA, and bounding box annotation.

<img src="assets/demo_vl.gif" width="400"/>

We release two models of the Qwen-VL series:

- Qwen-VL: The pretrained LVLM model. It uses Qwen-7B to initialize the LLM and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) to initialize the visual encoder, and connects them with a randomly initialized cross-attention layer.
- Qwen-VL-Chat: A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multiple image inputs, multi-round question answering, and creative capabilities.

## News and Updates

* ```2023.9.25``` 🚀🚀🚀 Qwen-VL-Chat has been updated with stronger Chinese instruction-following ability, improved understanding of web pages and table images, and better dialogue performance (TouchStone: CN: 401.2->481.7, EN: 645.2->711.6).
* ```2023.9.12``` 😃😃😃 We now support finetuning of the Qwen-VL models, including full-parameter finetuning, LoRA, and Q-LoRA.
* ```2023.9.8``` 👍👍👍 Thanks to camenduru for contributing the wonderful Colab. Everyone can use it as a local or online Qwen-VL-Chat-Int4 demo tutorial on a 12G GPU.
* ```2023.9.5``` 👏👏👏 Qwen-VL-Chat achieves SOTAs on the MME Benchmark, a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks.
* ```2023.9.4``` ⭐⭐⭐ The Qwen-VL series achieves SOTAs on Seed-Bench, a multimodal benchmark of 19K multiple-choice questions with accurate human annotations that evaluates image and video understanding.
* ```2023.9.1``` 🔥🔥🔥 We release the [TouchStone](https://github.com/OFA-Sys/TouchStone) evaluation, a comprehensive assessment of multimodal language models that covers not only basic recognition and comprehension but also literary creation. It uses strong LLMs as judges and converts multimodal information into text for evaluation.
* ```2023.8.31``` 🌟🌟🌟 We release **Qwen-VL-Chat-Int4**, the Int4 quantized model for Qwen-VL-Chat, which lowers memory cost and improves inference speed. There is also no significant performance degradation on the benchmark evaluation.
* ```2023.8.22``` 🎉🎉🎉 We release both **Qwen-VL** and **Qwen-VL-Chat** on ModelScope and Hugging Face. More details about the models, including training and model performance, are available in the [paper](https://arxiv.org/abs/2308.12966).

## Evaluation

We evaluated the models' abilities from three perspectives:

1. **Standard Benchmarks**: We evaluate the models' basic task capabilities on four major categories of multimodal tasks:

   - Zero-shot Captioning: Evaluates the model's zero-shot image captioning ability on unseen datasets.
   - General VQA: Evaluates the ability to answer general questions about pictures, such as judgment, color, number, and category.
   - Text-based VQA: Evaluates the model's ability to recognize text in pictures, such as document QA and chart QA.
   - Referring Expression Comprehension: Evaluates the ability to localize a target object in an image described by a referring expression.

2. **TouchStone**: To evaluate the overall text-image dialogue capability and the level of alignment with humans, we built a benchmark called [TouchStone](https://github.com/OFA-Sys/TouchStone), which evaluates LVLM models using GPT4 as the scorer.

   - The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories, including attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and solving math problems.
   - To work around GPT4's current limitation on direct image input, TouchStone provides fine-grained image annotations labeled by humans. These detailed annotations, together with the questions and the model's output, are presented to GPT4 for scoring.
   - The benchmark includes both English and Chinese versions.

3. **Other Multimodal Benchmarks**: We also evaluated the models on other multimodal benchmarks:

   - [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), a comprehensive evaluation benchmark for multimodal large language models. Qwen-VL-Chat achieves SOTA on both the perception and cognition tracks.
   - [Seed-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard), a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs. The Qwen series achieves SOTA on this benchmark.

The evaluation results are as follows. Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and covers a more comprehensive range of capabilities.

<img src="assets/radar.png" width="600"/> ### Zero-shot Captioning & General VQA <table> <thead> <tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="2">Zero-shot Captioning</th> <th colspan="5">General VQA</th> </tr> <tr> <th>NoCaps</th> <th>Flickr30K</th> <th>VQAv2dev</th> <th>OK-VQA</th> <th>GQA</th> <th>SciQA-Img (0-shot)</th> <th>VizWiz (0-shot)</th> </tr> </thead> <tbody align="center"> <tr> <td rowspan="10">Generalist Models</td> <td>Flamingo-9B</td> <td>-</td> <td>61.5</td> <td>51.8</td> <td>44.7</td> <td>-</td> <td>-</td> <td>28.8</td> </tr> <tr> <td>Flamingo-80B</td> <td>-</td> <td>67.2</td> <td>56.3</td> <td>50.6</td> <td>-</td> <td>-</td> <td>31.6</td> </tr> <tr> <td>Unified-IO-XL</td> <td>100.0</td> <td>-</td> <td>77.9</td> <td>54.0</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Kosmos-1</td> <td>-</td> <td>67.1</td> <td>51.0</td> <td>-</td> <td>-</td> <td>-</td> <td>29.2</td> </tr> <tr> <td>Kosmos-2</td> <td>-</td> <td>80.5</td> <td>51.1</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>BLIP-2 (Vicuna-13B)</td> <td>103.9</td> <td>71.6</td> <td>65.0</td> <td>45.9</td> <td>32.3</td> <td>61.0</td> <td>19.6</td> </tr> <tr> <td>InstructBLIP (Vicuna-13B)</td> <td>121.9</td> <td>82.8</td> <td>-</td> <td>-</td> <td>49.5</td> <td>63.1</td> <td>33.4</td> </tr> <tr> <td>Shikra (Vicuna-13B)</td> <td>-</td> <td>73.9</td> <td>77.36</td> <td>47.16</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Qwen-VL (Qwen-7B)</td> <td>121.4</td> <td>85.8</td> <td>78.8</td> <td>58.6</td> <td>59.3</td> <td>67.1</td> <td>35.2</td> </tr>  <tr> <td>Qwen-VL-Chat</td> <td>120.2</td> <td>81.0</td> <td>78.2</td> <td>56.6</td> <td>57.5</td> <td>68.2</td> <td>38.9</td> </tr>  <tr> <td>Previous SOTA (Per Task Fine-tuning)</td> <td>-</td> <td>127.0 (PALI-17B)</td> <td>84.5 (InstructBLIP -FlanT5-XL)</td> <td>86.1 (PALI-X -55B)</td> <td>66.1 (PALI-X -55B)</td> <td>72.1 (CFR)</td> <td>92.53 (LLaVa+ GPT-4)</td> <td>70.9 (PALI-X -55B)</td> </tr> 
</tbody>
</table>

- For zero-shot image captioning, Qwen-VL achieves **SOTA** on Flickr30K and results on NoCaps that are competitive with InstructBLIP.
- For general VQA, Qwen-VL achieves **SOTA** under the same generalist LVLM scale settings.

### Text-oriented VQA (Focused on text understanding capabilities in images)

<table>
<thead>
<tr> <th>Model type</th> <th>Model</th> <th>TextVQA</th> <th>DocVQA</th> <th>ChartQA</th> <th>AI2D</th> <th>OCR-VQA</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="5">Generalist Models</td> <td>BLIP-2 (Vicuna-13B)</td> <td>42.4</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>InstructBLIP (Vicuna-13B)</td> <td>50.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>mPLUG-DocOwl (LLaMA-7B)</td> <td>52.6</td> <td>62.2</td> <td>57.4</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Pix2Struct-Large (1.3B)</td> <td>-</td> <td>76.6</td> <td>58.6</td> <td>42.1</td> <td>71.3</td> </tr>
<tr> <td>Qwen-VL (Qwen-7B)</td> <td>63.8</td> <td>65.1</td> <td>65.7</td> <td>62.3</td> <td>75.7</td> </tr>
<tr> <td>Specialist SOTAs (Specialist/Finetuned)</td> <td>PALI-X-55B (Single-task FT) (Without OCR Pipeline)</td> <td>71.44</td> <td>80.0</td> <td>70.0</td> <td>81.2</td> <td>75.0</td> </tr>
</tbody>
</table>

- In text-related recognition/QA evaluation, Qwen-VL achieves SOTA under the generalist LVLM scale settings.
- Resolution is important for several of the evaluations above. Most open-source LVLM models with a 224 resolution either cannot run these evaluations or can only handle them by cropping images, whereas Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. On some tasks, Qwen-VL even outperforms the Pix2Struct-Large model, which uses a 1024 resolution.

### Referring Expression Comprehension

<table>
<thead>
<tr> <th rowspan="2">Model type</th> <th rowspan="2">Model</th> <th colspan="3">RefCOCO</th> <th colspan="3">RefCOCO+</th> <th colspan="2">RefCOCOg</th> <th>GRIT</th> </tr>
<tr> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val</th> <th>test-A</th> <th>test-B</th> <th>val-u</th> <th>test-u</th> <th>refexp</th> </tr>
</thead>
<tbody align="center">
<tr> <td rowspan="8">Generalist Models</td> <td>GPV-2</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>51.50</td> </tr>
<tr> <td>OFA-L*</td> <td>79.96</td> <td>83.67</td> <td>76.39</td> <td>68.29</td> <td>76.00</td> <td>61.75</td> <td>67.57</td> <td>67.58</td> <td>61.70</td> </tr>
<tr> <td>Unified-IO</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>78.61</td> </tr>
<tr> <td>VisionLLM-H</td> <td>-</td> <td>86.70</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Shikra-7B</td> <td>87.01</td> <td>90.61</td> <td>80.24</td> <td>81.60</td> <td>87.36</td> <td>72.12</td> <td>82.27</td> <td>82.19</td> <td>69.34</td> </tr>
<tr> <td>Shikra-13B</td> <td>87.83</td> <td>91.11</td> <td>81.81</td> <td>82.89</td> <td>87.79</td> <td>74.41</td> <td>82.64</td> <td>83.16</td> <td>69.03</td> </tr>
<tr> <td>Qwen-VL-7B</td> <td>89.36</td> <td>92.26</td> <td>85.34</td> <td>83.12</td> <td>88.25</td> <td>77.21</td> <td>85.58</td> <td>85.48</td> <td>78.22</td> </tr>
<tr> <td>Qwen-VL-7B-Chat</td> <td>88.55</td> <td>92.27</td> <td>84.51</td> <td>82.82</td> <td>88.59</td> <td>76.79</td> <td>85.96</td> <td>86.32</td> <td>-</td> </tr>
<tr> <td rowspan="3">Specialist SOTAs (Specialist/Finetuned)</td> <td>G-DINO-L</td> <td>90.56</td> <td>93.19</td> <td>88.24</td> <td>82.75</td> <td>88.95</td> <td>75.92</td> <td>86.13</td> <td>87.02</td> <td>-</td> </tr>
<tr> <td>UNINEXT-H</td> <td>92.64</td> <td>94.33</td> <td>91.46</td> <td>85.24</td> <td>89.63</td>
<td>79.79</td> <td>88.73</td> <td>89.37</td> <td>-</td> </tr>
<tr> <td>ONE-PEACE</td> <td>92.58</td> <td>94.18</td> <td>89.26</td> <td>88.77</td> <td>92.21</td> <td>83.23</td> <td>89.22</td> <td>89.27</td> <td>-</td> </tr>
</tbody>
</table>

- Qwen-VL achieves **SOTA** on all of the above referring expression comprehension benchmarks.
- Qwen-VL has not been trained on any Chinese grounding data, but by training on Chinese caption data and English grounding data it can still generalize to Chinese grounding tasks in a zero-shot way.

We provide all of the above evaluation scripts so that the experimental results can be reproduced. See [eval_mm/EVALUATION.md](eval_mm/EVALUATION.md) for details.

### Chat evaluation

TouchStone is a benchmark scored by GPT4 that evaluates the ability of LVLM models in text-image dialogue and their level of alignment with humans. It covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and solving math problems. See [touchstone/README.md](touchstone/README.md) for details.

#### English evaluation

| Model            | Score |
| ---------------- | ----- |
| PandaGPT         | 488.5 |
| MiniGPT4         | 531.7 |
| InstructBLIP     | 552.4 |
| LLaMA-AdapterV2  | 590.1 |
| LLaVA            | 602.7 |
| mPLUG-Owl        | 605.4 |
| Qwen-VL-Chat     | 645.2 |
| Qwen-VL-Chat-1.1 | 711.6 |

#### Chinese evaluation

| Model            | Score |
| ---------------- | ----- |
| VisualGLM        | 247.1 |
| Qwen-VL-Chat     | 401.2 |
| Qwen-VL-Chat-1.1 | 481.7 |

Qwen-VL-Chat achieves the best results in both the Chinese and English alignment evaluations.

### Other Benchmarks

#### MME Benchmark

[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Qwen-VL-Chat achieves SOTA on both the perception and cognition evaluations. See more details [here](eval_mm/mme/EVAL_MME.md).

<img src="eval_mm/mme/perception.jpg" width="600"/>
<img src="eval_mm/mme/cognition.jpg" width="600"/>

#### SEED-Bench

[SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs. It covers 12 evaluation dimensions, including both **image** and **video** understanding. See more details [here](eval_mm/seed_bench/EVAL_SEED.md).

Qwen-VL and Qwen-VL-Chat achieve SOTA on this benchmark.

<img src="eval_mm/seed_bench/leaderboard.jpg"/> ## Requirements * python 3.8 and above * pytorch 1.12 and above, 2.0 and above are recommended * CUDA 11.4 and above are recommended (this is for GPU users) ## Quickstart 아래에서는 🤖 모델스코프 및 🤗 트랜스포머와 함께 Qwen-VL 및 Qwen-VL-Chat을 사용하는 방법을 보여주는 간단한 예제를 제공합니다. 코드를 실행하기 전에 환경을 설정하고 필요한 패키지를 설치했는지 확인하세요. 위의 요구 사항을 충족하는지 확인한 다음 종속 라이브러리를 설치하세요. ```bash pip install -r requirements.txt ``` 이제 모델스코프 또는 트랜스포머로 시작할 수 있습니다. 비전 인코더에 대한 자세한 사용법은 [튜토리얼](TUTORIAL.md)을 참조하세요. #### 🤗 Transformers 추론에 Qwen-VL-Chat을 사용하려면 아래에 설명된 대로 몇 줄의 코드를 입력하기만 하면 됩니다. 단, **최신 코드를 사용하고 있는지 확인하세요**. ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) # Note: The default behavior now has injection attack prevention off. tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) # 1st dialogue turn query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': '这是什么?'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) # 图中是一名女子在沙滩上和狗玩耍，旁边是一只拉布拉多犬，它们处于沙滩上。 # 2nd dialogue turn response, history = 
model.chat(tokenizer, '框出图中击掌的位置', history=history) print(response) # <ref>击掌</ref><box>(536,509),(588,602)</box> image = tokenizer.draw_bbox_on_latest_picture(response, history) if image: image.save('1.jpg') else: print("no box") ``` <img src="assets/demo_highfive.jpg" width="500"/> <details> <summary>Running Qwen-VL</summary> Qwen-VL pretrained base model을 실행하는 것도 매우 간단합니다. ```python from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig import torch torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) # use bf16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu only # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cpu", trust_remote_code=True).eval() # use cuda device model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True).eval() # Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0) # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True) query = tokenizer.from_list_format([ {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url {'text': 'Generate the caption in English with grounding:'}, ]) inputs = tokenizer(query, return_tensors='pt') inputs = inputs.to(model.device) pred = model.generate(**inputs) response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False) print(response) # <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>Generate the caption in English with grounding:<ref> Woman</ref><box>(451,379),(731,806)</box> and<ref> her dog</ref><box>(219,424),(576,896)</box> 
playing on the beach<|endoftext|> image = tokenizer.draw_bbox_on_latest_picture(response) if image: image.save('2.jpg') else: print("no box") ``` <img src="assets/demo_spotting_caption.jpg" width="500"/> </details> HuggingFace에서 모델 체크포인트와 코드를 다운로드하는 동안 네트워크 문제가 발생하는 경우, 아래에 설명된 대로 모델스코프에서 체크포인트를 먼저 가져온 다음 로컬 디렉터리에서 로드하는 방법을 사용할 수 있습니다. ```python from modelscope import snapshot_download from transformers import AutoModelForCausalLM, AutoTokenizer # Downloading model checkpoint to a local dir model_dir # model_dir = snapshot_download('qwen/Qwen-VL') model_dir = snapshot_download('qwen/Qwen-VL-Chat') # Loading local checkpoints # trust_remote_code is still set as True since we still load codes from local dir instead of transformers tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( model_dir, device_map="cuda", trust_remote_code=True ).eval() ``` #### 🤖 ModelScope ModelScope는 서비스형 모델(MaaS)을 위한 오픈소스 플랫폼으로, AI 개발자에게 유연하고 비용 효율적인 모델 서비스를 제공합니다. 마찬가지로 아래와 같이 ModelScope로 모델을 실행할 수 있습니다. 
```python from modelscope import ( snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig ) import torch model_id = 'qwen/Qwen-VL-Chat' revision = 'v1.0.0' model_dir = snapshot_download(model_id, revision=revision) torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) if not hasattr(tokenizer, 'model_dir'): tokenizer.model_dir = model_dir # use bf16 # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval() # use fp16 model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval() # use cpu # model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval() # use auto model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval() # Specify hyperparameters for generation (No need to do this if you are using transformers>=4.32.0) # model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True) # 1st dialogue turn # Either a local path or an url between <img></img> tags. 
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，与人互动。

# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出击掌的检测框', history=history)
print(response)
# <ref>"击掌"</ref><box>(211,412),(577,891)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('output_chat.jpg')
else:
    print("no box")
```

<img src="assets/demo_highfive.jpg" width="500"/>

## Quantization

### Usage

We provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) and release [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4), an Int4 quantized model for Qwen-VL-Chat that achieves nearly lossless model quality while improving both memory cost and inference speed.

Here we show how to use the provided quantized model for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above) and have installed the required packages:

```bash
pip install optimum
git clone https://github.com/JustinLin610/AutoGPTQ.git && cd AutoGPTQ
pip install -v .
```

If you have problems installing `auto-gptq`, we advise you to check the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.

You can then load the quantized model easily and run inference as usual:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is loaded in the same way as for the non-quantized model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
# Either a local path or an url between <img></img> tags.
image_path = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
response, history = model.chat(tokenizer, query=f'<img>{image_path}</img>这是什么', history=None)
print(response)
```

### Performance

We compared the BF16 and Int4 models on the [TouchStone](https://github.com/OFA-Sys/TouchStone) benchmark and found that the quantized model does not suffer from significant performance degradation. The results are shown below:

| Quantization | ZH         | EN            |
| ------------ | :--------: | :-----------: |
| BF16         | 401.2      | 645.2         |
| Int4         | 386.6      | 651.4         |

### Inference Speed

We measured the average inference speed (tokens/s) of generating 1792 (2048-258) and 7934 (8192-258) tokens with an image as context (which takes 258 tokens), under BF16 precision and Int4 quantization respectively.

| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16         | 28.87               | 24.32               |
| Int4         | 37.79               | 34.34               |

The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.

### GPU Memory Usage

We also profiled the peak GPU memory usage under BF16 precision and Int4 quantization in two settings: encoding 1792 (2048-258) tokens (including an image) as context and generating a single token, and generating 7934 (8192-258) tokens with an image as context. The results are shown below.

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16         | 22.60GB                             | 28.01GB                               |
| Int4         | 11.82GB                             | 17.23GB                               |

The above speed and memory profiling was conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile_mm.py).

## Finetuning

We now provide the official training script, `finetune.py`, so that users can finetune the pretrained model for downstream applications in a simple way. We also provide shell scripts for launching finetuning without extra setup. The script supports training with DeepSpeed and FSDP. The provided shell scripts use DeepSpeed, so we recommend installing DeepSpeed before you start:

```bash
pip install deepspeed
```

### Data preparation

To prepare your training data, put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list of conversations. Below is a simple example list with three samples:

```json [ { "id": "identity_0", "conversations": [ { "from": "user", "value": "你好" }, { "from": "assistant", "value": "我是Qwen-VL,一个支持视觉输入的大模型。" } ] }, { "id": "identity_1", "conversations": [ { "from": "user", "value": "Picture 1: <img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\n图中的狗是什么品种？" }, { "from": "assistant", "value": "图中是一只拉布拉多犬。" }, { "from": "user", "value": "框出图中的格子衬衫" }, { "from": "assistant", "value": "<ref>格子衬衫</ref><box>(588,499),(725,789)</box>" } ] }, { "id": "identity_2", "conversations": [ { "from": "user", "value": "Picture 1: <img>assets/mm_tutorial/Chongqing.jpeg</img>\nPicture 2: <img>assets/mm_tutorial/Beijing.jpeg</img>\n图中都是哪" }, { "from": "assistant", "value": "第一张图片是重庆的城市天际线，第二张图片是北京的天际线。" } ] } ] ``` VL 작업에서는 `<img> </img> <ref> </ref> <box> </box>`등과 같은 특수 토큰이 사용됩니다. 이미지는 `Picture id: <img>img_path</img>\n{your prompt}`로 표시되며, 여기서 `id`는 대화에서 이미지의 위치(1부터 시작)를 나타냅니다. `img_path`는 로컬 파일 경로 또는 웹 링크일 수 있습니다. 박스의 좌표는 `<box>(x1,y1),(x2,y2)</box>`로 표시되는데, 여기에서 `(x1, y1)`과 `(x2, y2)`의 좌표는 `[0, 1000)`으로 정규화되게 됩니다. 해당 텍스트 설명은 `<ref>text_caption</ref>`과 같은 방법으로 식별할 수 있습니다. 데이터 준비 후 제공된 셸 스크립트를 사용하여 미세 조정을 실행할 수 있습니다. 데이터 파일 경로인 `$DATA`를 지정하는 것을 잊지 마세요. 미세 조정 스크립트를 통해 다음을 수행할 수 있습니다. - Full-parameter finetuning - LoRA - Q-LoRA ### Full-parameter finetuning 전체 파라미터를 미세 조정하려면 전체 훈련 과정에서 LLM의 모든 파라미터를 업데이트해야 합니다. 실험 결과, 미세 조정 단계에서 **ViT의 파라미터를 동결(frozening)하면 더 나은 성능을 얻을 수 있었습니다.** 훈련을 시작하려면 다음 스크립트를 실행합니다. ```bash sh finetune/finetune_ds.sh ``` 셸 스크립트에서 올바른 모델 이름 또는 경로, 데이터 경로, 출력 디렉터리를 지정하는 것을 잊지 마세요. 변경하려면 `--deepspeed` 인수를 제거하거나 요구 사항에 따라 DeepSpeed 구성 json 파일을 변경하면 됩니다. 또한, 이 스크립트는 혼합 정밀도 훈련을 지원하므로 `--bf16 True` 또는 `--fp16 True`를 사용할 수 있습니다. 경험적으로 머신이 bf16을 지원하는 경우 사전 훈련 및 정렬과 일관된 훈련을 위해 bf16을 사용하는 것이 좋으며, 따라서 기본값으로 사용됩니다. ### LoRA 마찬가지로 LoRA를 실행하려면 아래와 같이 다른 스크립트를 사용하여 실행합니다. 시작하기 전에 `peft`를 설치했는지 확인하세요. 또한 모델, 데이터, 출력에 대한 경로를 지정해야 합니다. 사전 학습된 모델에는 절대 경로를 사용하는 것이 좋습니다. 
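This is because LoRA saves only the adapter, and the absolute path in the adapter configuration json file is used to locate the pretrained model to load.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

Compared with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of the adapter layers and keeps the original large language model layers frozen. This results in much lower memory cost and lower computation cost.

Note that if you use LoRA to finetune the base language model (e.g., Qwen-VL) instead of a chat model, the script automatically switches the embedding and output layers to trainable parameters. This is because the base language model has no knowledge of the special tokens introduced by the ChatML format, so these layers must be updated for the model to understand and predict the tokens. In other words, if your LoRA training introduces special tokens, you should make these layers trainable by setting `modules_to_save` inside the code. We also find a significant gap in the memory usage of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA-finetune the chat models. Check the profile below for details.

### Q-LoRA

If you still run out of memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses a quantized large language model together with other techniques such as paged attention to reduce memory cost even further. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-VL-Chat-Int4. You should not use the bf16 models. Unlike full-parameter finetuning and LoRA, Q-LoRA supports only fp16. In addition, the special-token issue described for LoRA still applies to Q-LoRA. However, since we only provide Int4 models for the chat models, whose language model has already learned the special tokens of the ChatML format, you do not need to worry about these layers. Note that the layers of the Int4 model cannot be made trainable, so Q-LoRA might not work if your training introduces special tokens.

Unlike full-parameter finetuning, the training of both LoRA and Q-LoRA saves only the adapter parameters. You can load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter, # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

If you want to merge the adapters and save the finetuned model as a standalone model (this is only possible with LoRA; you cannot merge the parameters from Q-LoRA), run the following code:
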

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter,  # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()

merged_model = model.merge_and_unload()
# max_shard_size와 safe_serialization은 필수 인수가 아닙니다.
# 각각 체크포인트를 샤딩하고, 모델을 safetensors 형식으로 저장하는 데 사용됩니다.
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```

참고: 멀티 GPU 트레이닝의 경우, 머신에 따라 분산 트레이닝에 적합한 하이퍼파라미터를 지정해야 합니다. 또한 데이터, 메모리 사용량, 훈련 속도 등을 고려하여 `--model_max_length` 인수로 최대 시퀀스 길이를 지정하는 것이 좋습니다.

### Profiling of Memory and Speed

단일 GPU 트레이닝 설정에서 LoRA(Base), LoRA(Chat), Q-LoRA의 GPU 메모리 사용량과 트레이닝 속도를 프로파일링했습니다. 여기서 LoRA(Base)는 임베딩 및 출력 레이어를 함께 트레이닝하는 경우이고, LoRA(Chat)은 임베딩 및 출력 레이어를 트레이닝하지 않는 경우입니다. 실험에는 단일 A100-SXM4-80G GPU, CUDA 11.8, PyTorch 2.0을 사용했습니다. 모든 실험에서 배치 크기는 1, 그래디언트 누적(gradient accumulation)은 8로 동일하게 설정했으며, 각 샘플에는 이미지가 포함됩니다. 384, 512, 1024, 2048 길이의 입력에 대한 메모리(GB)와 속도(s/iter)를 측정했습니다. 통계는 아래와 같습니다.

<table>
    <tr>
      <th rowspan="2">Method</th><th colspan="4" align="center">Sequence Length</th>
    </tr>
    <tr>
        <th align="center">384</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th>
    </tr>
    <tr>
      <td>LoRA (Base)</td><td align="center">37.1G / 2.3s/it</td><td align="center">37.3G / 2.4s/it</td><td align="center">38.7G / 3.6s/it</td><td align="center">38.7G / 6.1s/it</td>
    </tr>
    <tr>
      <td>LoRA (Chat)</td><td align="center">23.3G / 2.2s/it</td><td align="center">23.6G / 2.3s/it</td><td align="center">25.1G / 3.5s/it</td><td align="center">27.3G / 5.9s/it</td>
    </tr>
    <tr>
        <td>Q-LoRA</td><td align="center">17.0G / 4.2s/it</td><td align="center">17.2G / 4.5s/it</td><td align="center">18.2G / 5.5s/it</td><td align="center">19.3G / 7.9s/it</td>
    </tr>
</table>

## Demo

### Web UI

사용자가 웹 UI 데모를 빌드할 수 있는 코드를 제공합니다. 시작하기 전에 다음 패키지를 설치해야 합니다.

```
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```
python web_demo_mm.py
```

## FAQ

문제가 발생하면 새 이슈를 시작하기 전에 먼저 [자주 묻는 질문](FAQ.md)과 이슈를 참조하여 해결 방법을 찾아보시기 바랍니다.

## License Agreement

연구자와 개발자는 Qwen-VL과 Qwen-VL-Chat의 코드와 모델 가중치를 자유롭게 사용할 수 있습니다. 또한 상업적 사용도 허용됩니다. 자세한 내용은 [LICENSE](LICENSE)에서 라이선스를 확인하세요.

## Citation

저희 논문과 코드가 여러분의 연구에 도움이 되었다면 star :star: 와 인용 :pencil: 을 해주시면 감사드리겠습니다. :)

```BibTeX
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

## Contact Us

연구팀이나 제품팀에 메시지를 남기고 싶으시면 언제든지 이메일(qianwen_opensource@alibabacloud.com)을 보내주세요.

================================================
FILE: TUTORIAL.md
================================================

# Qwen-VL-Chat Tutorial

Qwen-VL-Chat is a generalist multimodal large-scale language model that can perform a wide range of vision-language tasks. In this tutorial, we give some concise examples to demonstrate the capabilities of Qwen-VL-Chat in **Visual Question Answering, Text Understanding, Mathematical Reasoning with Diagrams, Multi-Figure Reasoning, and Grounding**. Please note that the examples shown are far from the limit of Qwen-VL-Chat's capabilities: **you can further explore Qwen-VL-Chat's capabilities by changing the input images and prompts!**

## Initializing the Qwen-VL-Chat model

Before using Qwen-VL-Chat, you first need to initialize its tokenizer and its model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# If you expect the results to be reproducible, set a random seed.
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```

After executing the above code, ```tokenizer``` holds the tokenizer used by Qwen-VL-Chat, which preprocesses the interleaved multimodal inputs, while ```model``` holds the Qwen-VL-Chat model itself.

## Using Qwen-VL-Chat

### **Multi-round visual question answering**

#### **The first question**

Let's get started with a simple example. As shown below, the file ```assets/mm_tutorial/Rebecca_(1939_poster).jpeg``` is a poster for the 1940 film Rebecca.

![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg)

Let's ask Qwen-VL-Chat for the name of the film on the poster. First, we use ```tokenizer.from_list_format``` to preprocess and tokenize the input:

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'},
    {'text': 'What is the name of the movie in the poster?'},
])
```

Next, we use ```model.chat``` to put the question to the Qwen-VL-Chat model and get its response. Note that for the first question the dialogue history is empty, so we use ```history=None```.

```python
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should get an output similar to the following:

> The name of the movie in the poster is "Rebecca."

This shows that the model answered the question correctly! According to the poster, the title of the film is indeed **Rebecca**.

#### **Multi-round question answering**

We can also continue to ask the model other questions, such as who is the director of the film.
For subsequent questions the dialogue history is no longer empty, so we use ```history=history``` to pass the history of the previous conversation to ```model.chat```:

```python
query = tokenizer.from_list_format([
    {'text': 'Who directed this movie?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

> The movie "Rebecca" was directed by Alfred Hitchcock.

Again, the model answered the question correctly! According to the poster, the director of the film is Alfred Hitchcock.

### **Text Understanding**

Qwen-VL-Chat can also understand images containing dense text. As shown below, the file ```assets/mm_tutorial/Hospital.jpg``` is a hospital sign containing dense text.

![](assets/mm_tutorial/Hospital_Small.jpg)

We can ask questions about the location of different departments in the hospital. Since this starts a new dialogue, the history is empty and we use ```history=None```.

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Hospital.jpg'},
    {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should get an output similar to the following:

> The Department of Otorhinolaryngology is located on the 4th floor.

You can also ask further questions. In this case you need to use ```history=history``` to pass the history of the previous conversation to ```model.chat```.

```python
query = tokenizer.from_list_format([
    {'text': 'Based on the photo, which floor is the Department of Surgery on?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

> The Department of Surgery is located on the 3rd floor.
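
The two multi-round examples above follow the same pattern: build a query with ```tokenizer.from_list_format```, call ```model.chat```, and feed the returned history into the next call. The sketch below wraps that pattern in a small helper. The name ```run_dialogue``` and its signature are hypothetical and not part of Qwen-VL-Chat; the only calls it assumes are the two shown in this tutorial.

```python
from typing import Any, Dict, List, Optional, Tuple


def run_dialogue(model: Any, tokenizer: Any,
                 turns: List[List[Dict[str, str]]],
                 history: Optional[Any] = None) -> Tuple[List[str], Any]:
    """Ask each turn in order, threading the dialogue history through.

    Hypothetical helper: `turns` is a list of list-format inputs such as
    [{'image': path}, {'text': question}], one entry per round.
    """
    responses = []
    for items in turns:
        # Preprocess the interleaved image/text items for this round.
        query = tokenizer.from_list_format(items)
        # history is None in the first round and the returned value afterwards.
        response, history = model.chat(tokenizer, query=query, history=history)
        responses.append(response)
    return responses, history


# Usage with the model and tokenizer initialized earlier (not run here):
# turns = [
#     [{'image': 'assets/mm_tutorial/Hospital.jpg'},
#      {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'}],
#     [{'text': 'Based on the photo, which floor is the Department of Surgery on?'}],
# ]
# responses, history = run_dialogue(model, tokenizer, turns)
```

Passing ```history=None``` starts a fresh conversation, while passing the history returned by an earlier call continues that conversation.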

### **Mathematical Reasoning with Diagrams**

Using the model's diagram comprehension and mathematical reasoning capabilities, Qwen-VL-Chat can also perform more complex tasks! As shown below, the file ```assets/mm_tutorial/Menu.jpeg``` is the menu of a restaurant. Now we want to know how much it would cost to purchase two Salmon Burgers and three Meat Lover's Pizzas.

![](assets/mm_tutorial/Menu.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Menu.jpeg'},
    {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? Think carefully step by step.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

```Think carefully step by step.``` is a common prompt that guides the model through a complex task one step at a time. If you have a complex task to complete, try adding it to improve the model's accuracy. You should get an output similar to the following:

> To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following:
>
> 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20
> 2. For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36
>
> Therefore, the total cost would be $56.

### **Multi-Figure Reasoning and Chinese Input**

In the previous examples, we demonstrated Qwen-VL-Chat's question-answering capability for a single image and English questions. However, Qwen-VL-Chat is actually a multilingual model that supports Chinese input and multiple images!
In the following example, we let Qwen-VL-Chat compare the photos of two cities (Chongqing and Beijing) for us (```assets/mm_tutorial/Chongqing.jpeg``` and ```assets/mm_tutorial/Beijing.jpeg```) in Chinese:

![](assets/mm_tutorial/Chongqing_Small.jpeg)

![](assets/mm_tutorial/Beijing_Small.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Chongqing.jpeg'},
    {'image': 'assets/mm_tutorial/Beijing.jpeg'},
    {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'},
])
torch.manual_seed(5678)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You are expected to get an output similar to the following:

> 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。

**Please note that comparing cities is a fairly subjective question, so the responses generated by the model may be subject to a high degree of randomness. If you do not set the random seed using ```torch.manual_seed(5678)```, the output will be different each time. Even if you set the random seed, the results obtained may still differ from this tutorial due to differences in hardware and software environments.**

### **Grounding Capability**

In the last section of the tutorial, we demonstrate the ability of the Qwen-VL-Chat model to produce a bounding box. Qwen-VL-Chat can frame a specified area of an image with a rectangular box according to your language description. This may be a bit abstract, so let's look at the following example. As shown below, the file ```assets/mm_tutorial/Shanghai.jpg``` is a photo of Shanghai, and we'll start by asking the model to describe the image with a regular prompt.

![](assets/mm_tutorial/Shanghai_Small.jpeg)

```python
torch.manual_seed(1234)
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Shanghai.jpg'},
    {'text': '图里有啥'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should get an output similar to the following:

> 图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。

Next, let's prompt the model with ```请给我框出图中上海环球金融中心和东方明珠``` ("Please draw boxes around the Shanghai World Financial Center and the Oriental Pearl Tower in the picture") and see what happens. Note that at this point you need to pass the history of the previous conversation to ```model.chat``` using ```history=history```.

```python
query = tokenizer.from_list_format([
    {'text': '请给我框出图中上海环球金融中心和东方明珠'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

```xml
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box>
```

Qwen-VL-Chat has no hands to draw with, but it doesn't reject your request either. Instead, it outputs something that looks "strange": the output gives the locations of the 上海环球金融中心 (Shanghai World Financial Center) and the 东方明珠 (Oriental Pearl Tower) in a markup format.
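
If you want the boxes as data rather than as a rendered image, the markup can be parsed with a regular expression. The following is a minimal sketch, not part of the Qwen-VL-Chat API: ```parse_grounding``` and ```to_pixels``` are hypothetical helper names. It assumes the response follows the ```<ref>label</ref><box>(x1,y1),(x2,y2)</box>``` pattern shown above and that coordinates are normalized to ```[0, 1000)```, as the fine-tuning notes in the README state. The floor-division rounding is also an assumption.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# One <ref>label</ref> followed by one or more <box>(x1,y1),(x2,y2)</box> groups.
_REF_BOXES = re.compile(r'<ref>(.*?)</ref>((?:<box>\(\d+,\d+\),\(\d+,\d+\)</box>)+)')
_BOX = re.compile(r'<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>')


def parse_grounding(response: str) -> List[Tuple[str, Box]]:
    """Return (label, (x1, y1, x2, y2)) pairs in normalized coordinates."""
    results = []
    for label, boxes in _REF_BOXES.findall(response):
        for match in _BOX.finditer(boxes):
            x1, y1, x2, y2 = (int(v) for v in match.groups())
            results.append((label.strip(), (x1, y1, x2, y2)))
    return results


def to_pixels(box: Box, width: int, height: int, scale: int = 1000) -> Box:
    """Map a normalized box to pixel coordinates (assumes a 0..scale range)."""
    x1, y1, x2, y2 = box
    return (x1 * width // scale, y1 * height // scale,
            x2 * width // scale, y2 * height // scale)


example = ('<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>'
           '和<ref>东方明珠</ref><box>(506,75),(582,946)</box>')
for label, box in parse_grounding(example):
    print(label, box)
```

To draw or crop with the parsed boxes, pass each one through ```to_pixels``` together with the width and height of the original image.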

You can visualise these boxes using the following code:

```python
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('Shanghai_Output.jpg')
```

The saved ```Shanghai_Output.jpg``` will look similar to the screenshot below:

![](assets/mm_tutorial/Shanghai_Output_Small.jpeg)

After that, you can still chat with Qwen-VL-Chat as before:

```python
query = tokenizer.from_list_format([
    {'text': '帮我写个这座城市的旅游计划'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

> 好的，以下是一个简单的上海旅游计划：
>
> 第一天：
> 上午：抵达上海，前往酒店办理入住手续。
> 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。
> 下午：游览南京路步行街，购买特色礼品或品尝当地美食。
> 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。
>
> 第二天：
> 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。
> 中午：在科技馆附近的餐厅享用午餐。
> 下午：游览世纪公园，欣赏美景并放松身心。
> 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。
>
> 第三天：
> 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。
> 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。
> 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。
> 晚上：在酒店附近享用晚餐，然后离开上海。
>
> 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。

**Please note that travel planning is a fairly subjective question, so the responses generated by the model may be subject to a high degree of randomness. If you do not set the random seed using ```torch.manual_seed(1234)```, the output will be different each time. Even if you set the random seed, the results obtained may still differ from this tutorial due to differences in hardware and software environments.**

### Grounded Captioning

Qwen-VL can output the bounding box information of the subject while captioning the image.
For example:

```
img_url = 'assets/apple.jpeg'
query = tokenizer.from_list_format([
    {'image': img_url},
    {'text': 'Generate the caption in English with grounding:'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image is not None:
    image.save('apple.jpg')
```

The saved ```apple.jpg``` will look similar to the screenshot below:

<img src="assets/apple_r.jpeg" width="600"/>

#### How to get the caption without any box-like annotations

Sometimes you may want the response without any box-like annotations. In that case, you can reliably obtain the cleaned text with the following post-processing.

```
# response = '<ref> Two apples</ref><box>(302,257),(582,671)</box><box>(603,252),(878,642)</box> and<ref> a bowl</ref><box>(2,269),(304,674)</box>'
import re

# Keep the text inside <ref>...</ref> and drop any <box>/<quad> groups that follow it.
clean_response = re.sub(r'<ref>(.*?)</ref>(?:<box>.*?</box>)*(?:<quad>.*?</quad>)*', r'\1', response).strip()
print(clean_response)
# clean_response = 'Two apples and a bowl'
```

================================================
FILE: TUTORIAL_ja.md
================================================

# Qwen-VL-Chat チュートリアル

Qwen-VL-Chat は汎用のマルチモーダル大規模言語モデルであり、幅広い視覚言語タスクを実行できます。このチュートリアルでは、Qwen-VL-Chat の**視覚的質問応答、テキスト理解、図を用いた数学的推論、多視点推論、およびグラウンディング**の機能について、いくつかの簡潔な例を挙げて説明します。Qwen-VL-Chat は、入力画像やプロンプトを変更することで、Qwen-VL-Chat の能力をさらに引き出すことができます。

## Qwen-VL-Chat モデルの初期化

Qwen-VL-Chat を使用する前に、まず Qwen-VL-Chat のトークナイザと Qwen-VL-Chat のモデルを初期化する必要があります:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# 結果の再現性を期待する場合は、ランダムシードを設定する。
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
``` 上記のコードを実行すると、```tokenizer``` は Qwen-VL-Chat で使用される分類器に対応し、```model``` は Qwen-VL-Chat のモデルに対応します。```tokenizer``` はインターリーブされたマルチモーダル入力の前処理に使用され、```model``` は Qwen-VL-Chat のモデルそのものです。 ## Qwen-VL-Chat を使う ### **複数ラウンドのビジュアル質問回答** #### **最初の質問** 簡単な例から始めましょう。以下に示すように、```assets/mm_tutorial/Rebecca_(1939_poster).jpeg``` は 1940 年の映画レベッカのポスターです。 ![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg) Qwen-VL-Chat のポスターに描かれている映画の名前を聞いてみよう。まず初めに、入力を前処理してトークン化する ```tokenizer.from_list_format``` を使用します: ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'}, {'text': 'What is the name of the movie in the poster?'}, ]) ``` 次に、```model.chat``` を使って Qwen-VL-Chat モデルに質問をし、その回答を得ることができます。最初の質問では、ダイアログの履歴は空なので、```history=None``` を使用することに注意してください。 ```python response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 以下のような出力が期待されます: > The name of the movie in the poster is "Rebecca." これは、モデルが与えられた問題に正しく答えたことを示しています！ポスターをみると、映画のタイトルは確かに**レベッカ**です。 #### **複数ラウンドの質問回答** また、映画の監督は誰かなど、他の質問をモデルに続けることもできます。そのため、```history=history``` を使って、以前の会話の履歴を ``model.chat`` に渡します: ```python query = tokenizer.from_list_format([ {'text': 'Who directed this movie?'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 以下のような出力が期待されます: > The movie "Rebecca" was directed by Alfred Hitchcock. 
再びこのモデルは与えられた問題に正解しました！ポスターによると、この映画の監督はアルフレッド・ヒッチコックです。 ### **テキスト理解** Qwen-VL-Chat には、高密度なテキストを含む画像を理解する機能もあります。下図に示すように、```assets/mm_tutorial/Hospital.jpeg``` というファイルは、濃いテキストを含む病院の看板です。 ![](assets/mm_tutorial/Hospital_Small.jpg) 病院内のさまざまな診療科の場所について質問することができます。対話の履歴は空なので、```history=None``` を使用します。 ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Hospital.jpg'}, {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 以下のような出力が期待されます: > The Department of Otorhinolaryngology is located on the 4th floor. さらに質問をすることもできます。この場合、```history=history``` を使用して、以前の会話の履歴を ```model.chat``` に渡す必要があります。 ```python query = tokenizer.from_list_format([ {'text': 'Based on the photo, which floor is the Department of Surgery on?'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 以下のような出力が期待されます: > The Department of Surgery is located on the 3rd floor. ### **ダイアグラムによる数学的推論** Qwen-VL-Chat は、このモデルのダイアグラム理解能力と数学的推論能力を使って、より複雑なタスクを実行することもできます！下に示すように、```assets/mm_tutorial/Menu.jpeg``` というファイルはレストランのメニューです。では、Salmon Burger 2 個と Meat Lover's Pizza 3 枚を購入した場合の値段を知りたい。 ![](assets/mm_tutorial/Menu.jpeg) ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Menu.jpeg'}, {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? Think carefully step by step.'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` ステップバイステップで注意深く考えてください(```Think carefully step by step.```)」は、複雑なタスクを一歩ずつでモデルをガイドする一般的なプロンプトです。複雑なタスクをこなさなければならない場合、このプロンプトを使ってモデルの精度を上げてみてください。以下のような出力が期待されます: > To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following: > > 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20 > 2. 
For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36 > > Therefore, the total cost would be $56. ### **多視点推論と中国語入力** これまでの例では、Qwen-VL-Chat が 1 つの画像と英語の質問に対して質問応答ができることを示しました。しかし、実際には Qwen-VL-Chat は中国語入力と複数の画像をサポートする多言語モデルです！以下の例では、Qwen-VL-Chat に 2 つの都市（重慶と北京）の写真（```assets/mm_tutorial/Chongqing.jpeg``` と ```assets/mm_tutorial/Beijing.jpeg```）を中国語で比較させています: ![](assets/mm_tutorial/Chongqing_Small.jpeg) ![](assets/mm_tutorial/Beijing_Small.jpeg) ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Chongqing.jpeg'}, {'image': 'assets/mm_tutorial/Beijing.jpeg'}, {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'}, ]) torch.manual_seed(5678) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 以下のような出力が期待されます: > 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。 **都市の比較はかなり主観的な質問であるため、モデルによって生成される回答は高度なランダム性を持つ可能性があることに注意してください。```torch.manual_seed(5678)``` を使用してランダムシードを設定しない場合、出力は毎回異なります。ランダムシードを設定した場合でも、ハードウェアやソフトウェアの環境の違いにより、得られる結果がこのチュートリアルと異なる場合があります。** ### **グラウンディング能力** チュートリアルの最後のセクションでは、Qwen-VL-Chat モデルがバウンディングボックスを生成する機能を紹介します。Qwen-VL-Chat は、言語記述に従って、画像の指定された領域を矩形の枠で囲むことができます。少し抽象的なので、次の例を見てみましょう。下図のように、ファイル ```assets/mm_tutorial/Shanghai.jpg``` は上海の写真です。まず、通常のプロンプトでモデルに画像を記述してもらいます。 ![](assets/mm_tutorial/Shanghai_Small.jpeg) ```python torch.manual_seed(1234) query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Shanghai.jpg'}, {'text': '图里有啥'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 以下のような出力が期待されます: > 图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。次に、プロンプト ```请给我框出图中上海环球金融中心和东方明珠``` を使ってモデルと会話してみましょう。このとき、```history=history``` を使って、以前の会話の履歴を ```model.chat``` に渡す必要があることに注意してください。 ```python query = tokenizer.from_list_format([ {'text': '请给我框出图中上海环球金融中心和东方明珠'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 以下のような出力が期待されます: ```xml 
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box> ``` Qwen-VL-Chat モデルには手はありませんが、だからといってリクエストを拒否することもありません。その代わりに、"奇妙な"ものが出力されます - 実際、モデルの出力は上海环球金融中心（上海ワールド・フィナンシャル・センター）と东方明珠（東方テレビタワー）の位置をマークアップ言語で示しています。次のコードで視覚化できます: ```python image = tokenizer.draw_bbox_on_latest_picture(response, history) image.save('Shanghai_Output.jpg') ``` 保存された ```Shanghai_Output.jpg``` は以下のスクリーンショットのようになります: ![](assets/mm_tutorial/Shanghai_Output_Small.jpeg) その後、Qwen-VL-Chat で以前と同じようにチャットすることができます: ```python query = tokenizer.from_list_format([ {'text': '帮我写个这座城市的旅游计划'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 以下のような出力が期待されます: > 好的，以下是一个简单的上海旅游计划： > > 第一天： > 上午：抵达上海，前往酒店办理入住手续。 > 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。 > 下午：游览南京路步行街，购买特色礼品或品尝当地美食。 > 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。 > > 第二天： > 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。 > 中午：在科技馆附近的餐厅享用午餐。 > 下午：游览世纪公园，欣赏美景并放松身心。 > 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。 > > 第三天： > 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。 > 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。 > 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。 > 晚上：在酒店附近享用晚餐，然后离开上海。 > > 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。 **旅行計画はかなり主観的な質問であるため、モデルによって生成される回答は高いランダム性を持つ可能性があることに注意してください。```torch.manual_seed(1234)``` を使用してランダムシードを設定しない場合、出力は毎回異なります。ランダムシードを設定した場合でも、ハードウェアやソフトウェアの環境の違いにより、得られる結果がこのチュートリアルと異なる場合があります。** ================================================ FILE: TUTORIAL_ko.md ================================================ # Qwen-VL-Chat Tutorial Qwen-VL-Chat은 범용 멀티모달 대규모 언어 모델이며 광범위한 시각 언어 작업을 수행할 수 있습니다. 이 튜토리얼에서는 **시각적 질문 답변, 텍스트 이해, 다이어그램을 사용한 수학적 추론, 다중 그림 추론 및 그라운딩(Grounding) 작업**에서 Qwen-VL-Chat의 기능을 보여주는 몇 가지 간결한 예제를 제시합니다. Qwen-VL-Chat의 기능의 한계가 아니며, **입력 이미지와 프롬프트를 변경하여 Qwen-VL-Chat의 기능**을 더 자세히 살펴보실 수도 있습니다. ## Initializing the Qwen-VL-Chat model Qwen-VL-Chat을 사용하기 전에 먼저 Qwen-VL-Chat의 Tokenizer와 Qwen-VL-Chat의 모델을 초기화해야 합니다. 
```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig # If you expect the results to be reproducible, set a random seed. # torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) ``` 위 코드를 실행하시면 ```tokenizer```변수에 Qwen-VL-Chat에서 사용하는 분류기(classifier)가 할당되고, ```model```변수에는 Qwen-VL-Chat의 모델을 할당하게 됩니다. ```tokenizer```는 인터리브된 멀티모달 입력(interleaved multimodal inputs)을 전처리하는 데 사용되며, ``model``은 Qwen-VL-Chat 모델입니다. ## Using Qwen-VL-Chat ### **Multi-round visual question answering** #### **첫 질문하기** 간단한 예제를 확인해보겠습니다. 아래에서 볼 수 있듯이, ```assets/mm_tutorial/Rebecca_(1939_poster).jpeg``` 파일은 1940년 영화 <레베카>의 포스터입니다. ![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg) Qwen-VL-Chat 포스터에 있는 영화 제목이 무엇인지 물어봅시다. 우선, 입력을 전처리하고 토큰화할 수 있는 ```tokenizer.from_list_format```을 사용합니다. ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'}, {'text': 'What is the name of the movie in the poster?'}, ]) ``` 다음으로, ```model.chat```을 사용하여 Qwen-VL-Chat 모델에 질문하고 응답을 얻을 수 있습니다. 첫 번째 질문의 경우 대화 기록이 비어 있으므로 ``history=None``을 사용합니다. ```python response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 다음과 비슷한 출력이 나올 것입니다. > The name of the movie in the poster is "Rebecca." 모델이 주어진 질문에 정답을 맞혔습니다. 포스터에 따르면, 영화의 제목은 실제로 **레베카**입니다. #### **Multi-round question answering** 또한 모델에게 영화 감독이 누구인지와 같은 다른 질문을 계속할 수도 있습니다. 
대화 기록은 후속 질문을 위해 비어 있지 않으므로 ``history=history``를 사용하여 이전 대화의 기록을 ``model.chat``에 전달합니다: ```python query = tokenizer.from_list_format([ {'text': 'Who directed this movie?'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 다음과 비슷한 출력이 나올 것입니다. > The movie "Rebecca" was directed by Alfred Hitchcock. 다시 한 번, 모델이 주어진 질문에 대한 정답을 맞혔습니다. 포스터에 따르면 이 영화의 감독은 <알프레드 히치콕>입니다. ### **Text Understanding** Qwen-VL-Chat은 촘촘한 텍스트가 포함된 이미지도 이해할 수 있습니다. 아래 그림과 같이 ``assets/mm_tutorial/Hospital.jpeg`` 파일은 촘촘한 텍스트가 포함된 병원 간판입니다. ![](assets/mm_tutorial/Hospital_Small.jpg) 병원 내 여러 부서의 위치에 대해 질문할 수 있습니다. 첫 질문으로 대화에 대한 이전 기록이 없으므로 ```history=None```을 사용합니다. ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Hospital.jpg'}, {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 다음과 비슷한 출력이 나올 것입니다. > The Department of Otorhinolaryngology is located on the 4th floor. 추가 질문을 하실 수도 있습니다. 이 경우 ```history=history```를 사용하여 이전 대화의 기록을 ```model.chat```에 전달해야 합니다. ```python query = tokenizer.from_list_format([ {'text': 'Based on the photo, which floor is the Department of Surgery on?'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 다음과 비슷한 출력이 나올 것입니다. > The Department of Surgery is located on the 3rd floor. ### **Mathematical Reasoning with Diagram** 모델의 다이어그램 이해와 수학적 추론 기능을 사용하여 Qwen-VL-Chat은 좀 더 복잡한 작업도 수행할 수 있습니다. 아래에서 볼 수 있듯이 ``assets/mm_tutorial/Menu.jpeg`` 파일은 레스토랑의 메뉴 이미지 입니다. 이제 연어 버거 두 개와 미트 러버스 피자 세 개를 구매하는 데 드는 비용을 알아봅시다. ![](assets/mm_tutorial/Menu.jpeg) ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Menu.jpeg'}, {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? 
Think carefully step by step.'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` ``단계별로 신중하게 생각하세요``는 복잡한 작업을 단계별로 모델에 안내하는 일반적인 프롬프트입니다. 따라서 완료해야 할 복잡한 작업이 있는 경우에는 이 프롬프트를 사용하여 모델의 정확도를 향상시켜 보세요. 다음과 유사한 출력이 나올 것입니다. > To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following: > > 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20 > 2. For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36 > > Therefore, the total cost would be $56. ### **Multi-Figure Reasoning and Chinese Input** 이전 예제에서는 단일 이미지와 영어 질문에 대한 Qwen-VL-Chat의 질문 답변 기능을 시연했습니다. 하지만 실제로는 중국어 입력과 여러 이미지를 지원하는 다국어 모델입니다. 다음 예제에서는 두 도시(충칭과 베이징)의 사진(`assets/mm_tutorial/Chongqing.jpeg` 및 `assets/mm_tutorial/Beijing.jpeg`)을 중국어로 비교하도록 Qwen-VL-Chat을 설정했습니다. ![](assets/mm_tutorial/Chongqing_Small.jpeg) ![](assets/mm_tutorial/Beijing_Small.jpeg) ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Chongqing.jpeg'}, {'image': 'assets/mm_tutorial/Beijing.jpeg'}, {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'}, ]) torch.manual_seed(5678) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 다음과 유사한 출력이 나올 것입니다. > 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。 **도시 비교는 상당히 주관적인 질문이므로 모델에 의해 생성된 응답에는 매우 다양하게 무작위의 시드가 적용될 수 있다는 점을 유의하세요. ``torch.manual_seed(5678)```를 사용하여 무작위 시드를 설정하지 않으면 매번 출력이 달라집니다. 랜덤 시드를 설정하더라도 하드웨어 및 소프트웨어 환경의 차이로 인해 얻은 결과가 이 튜토리얼과 다를 수 있습니다**. ### **Grounding Capability** 튜토리얼의 마지막 섹션에서는 Qwen-VL-Chat 모델이 바운딩 박스를 생성하는 기능을 보여드립니다. Qwen-VL-Chat은 언어 설명에 따라 직사각형 상자로 이미지의 지정된 영역에 프레임을 지정할 수 있습니다. 다소 추상적일 수 있으므로 다음 예제를 살펴보겠습니다. 아래 그림과 같이 ```assets/mm_tutorial/Shanghai.jpg`` 파일은 상하이의 사진이며, 모델에게 일반 프롬프트로 이미지를 설명하도록 요청하는 것으로 시작하겠습니다. 
![](assets/mm_tutorial/Shanghai_Small.jpeg) ```python torch.manual_seed(1234) query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Shanghai.jpg'}, {'text': '图里有啥'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) ``` 다음과 유사한 출력을 보실 수 있습니다. > 图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。 다음으로 '``请给我框出图中上海环球金融中心和东方明珠``라는 프롬프트를 사용하여 모델과 대화하고 어떤 일이 발생하는지 살펴봅시다. 이 시점에서 ``history=history``를 사용하여 이전 대화의 기록을 ``model.chat``에 전달해야 합니다. ```python query = tokenizer.from_list_format([ {'text': '请给我框出图中上海环球金融中心和东方明珠'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 다음과 유사한 출력을 보실 수 있습니다. ```xml <ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box> ``` Qwen-VL-Chat 모델에는 손이 없지만 사용자의 요청을 거부하지도 않습니다. 대신 "이상한" 결과를 출력하는데, 실제로 이 모델의 출력은 上海环球金融中心(상하이 월드 파이낸셜 센터) 와 东方明珠(동방명주) 의 위치를 마크업 언어로 제공합니다. 다음 코드를 사용하여 시각화할 수 있습니다. ```python image = tokenizer.draw_bbox_on_latest_picture(response, history) image.save('Shanghai_Output.jpg') ``` The saved ```Shanghai_Output.jpg``` will look similar to the screenshot below: ![](assets/mm_tutorial/Shanghai_Output_Small.jpeg) 그 후에도 이전처럼 Qwen-VL-Chat으로 계속 채팅할 수 있습니다. ```python query = tokenizer.from_list_format([ {'text': '帮我写个这座城市的旅游计划'}, ]) response, history = model.chat(tokenizer, query=query, history=history) print(response) ``` 다음과 유사한 출력을 보실 수 있습니다. 
> 好的，以下是一个简单的上海旅游计划： > > 第一天： > 上午：抵达上海，前往酒店办理入住手续。 > 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。 > 下午：游览南京路步行街，购买特色礼品或品尝当地美食。 > 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。 > > 第二天： > 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。 > 中午：在科技馆附近的餐厅享用午餐。 > 下午：游览世纪公园，欣赏美景并放松身心。 > 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。 > > 第三天： > 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。 > 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。 > 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。 > 晚上：在酒店附近享用晚餐，然后离开上海。 > > 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。 **여행 계획은 상당히 주관적인 질문이므로 모델에 의해 생성된 응답에는 높은 수준의 랜덤 시드가 적용될 수 있다는 점에 유의하세요. ``torch.manual_seed(1234)``를 사용하여 무작위 시드를 설정하지 않으면 매번 다른 출력이 나오게 됩니다. 랜덤 시드를 일정하게 설정하더라도 하드웨어 및 소프트웨어 환경의 차이로 인해 얻은 결과가 이 튜토리얼과 다를 수 있습니다**. ### Grounded Captioning Qwen-VL은 다음과 같이 이미지를 캡쳐하는 동안 피사체의 바운딩 박스 정보를 출력할 수 있습니다. ``` img_url = 'assets/apple.jpeg' query = tokenizer.from_list_format([ {'image': img_url}, {'text': 'Generate the caption in English with grounding:'}, ]) response, history = model.chat(tokenizer, query=query, history=None) print(response) image = tokenizer.draw_bbox_on_latest_picture(response, history) if image is not None: image.save('apple.jpg') ``` 저장된 ``사과.jpg``는 이미지는 아래 스크린샷과 비슷하게 보이게 될 것입니다. <img src="assets/apple_r.jpeg" width="600"/> #### How to get the caption without any box-like annotations 때로는 응답에 박스형 주석이 없을 수도 있습니다. 이 경우 다음과 같은 후처리를 통해 안정적으로 정리된 텍스트를 얻을 수 있습니다. 
``` # response = '<ref> Two apples</ref><box>(302,257),(582,671)</box><box>(603,252),(878,642)</box> and<ref> a bowl</ref><box>(2,269),(304,674)</box>' import re clean_response = re.sub(r'<ref>(.*?)</ref>(?:<box>.*?</box>)*(?:<quad>.*?</quad>)*', r'\1', response).strip() print(clean_response) # clean_response = 'Two apples and a bowl' ``` ================================================ FILE: TUTORIAL_zh.md ================================================ # Qwen-VL-Chat使用教程 Qwen-VL-Chat是通用多模态大规模语言模型，因此它可以完成多种视觉语言任务。在本教程之中，我们会给出一些简明的例子，用以展示Qwen-VL-Chat在**视觉问答，文字理解，图表数学推理，多图理解和Grounding**(根据指令标注图片中指定区域的包围框)等多方面的能力。请注意，展示的例子远非Qwen-VL-Chat能力的极限，**您可以通过更换不同的输入图像和提示词（Prompt），来进一步挖掘Qwen-VL-Chat的能力！** ## 初始化Qwen-VL-Chat模型在使用Qwen-VL-Chat之前，您首先需要初始化Qwen-VL-Chat的分词器（Tokenizer）和Qwen-VL-Chat的模型： ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig # 如果您希望结果可复现，可以设置随机数种子。 # torch.manual_seed(1234) tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval() model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True) ``` 在执行完上述代码后，```tokenizer```将对应Qwen-VL-Chat使用的分词器，而```model```将对应Qwen-VL-Chat的模型。```tokenizer```用于对图文混排输入进行分词和预处理，而```model```则是Qwen-VL-Chat模型本身。 ## 使用Qwen-VL-Chat ### **多轮视觉问答** #### **第一个问题** 首先我们来看一个最简单的例子，如下图所示，文件```assets/mm_tutorial/Rebecca_(1939_poster).jpeg```是1940年电影Rebecca的于1939发布的海报。 ![](assets/mm_tutorial/Rebecca_(1939_poster)_Small.jpeg) 我们来问一问Qwen-VL-Chat海报上电影的名称是什么。首先，我们使用tokenizer.from_list_format可以对图文混排输入进行分词与处理： ```python query = tokenizer.from_list_format([ {'image': 'assets/mm_tutorial/Rebecca_(1939_poster).jpeg'}, {'text': 'What is the name of the movie in the poster?'}, ]) ``` 接下来，我们可以使用```model.chat```向Qwen-VL-Chat模型提问并获得回复。注意在第一次提问时，对话历史为空，因此我们使用```history=None```。 ```python 
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should get an output similar to the following:

> The name of the movie in the poster is "Rebecca."

This shows that the model answered the question correctly! According to the poster, the name of the movie is indeed **Rebecca**.

#### **Multi-round Question Answering**

We can keep asking the model questions, for example who directed the movie. For follow-up questions the dialogue history is no longer empty, so we pass the previous history to ```model.chat``` with ```history=history```:

```python
query = tokenizer.from_list_format([
    {'text': 'Who directed this movie?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

> The movie "Rebecca" was directed by Alfred Hitchcock.

The model answered correctly again! According to the poster, the director of the movie is Alfred Hitchcock.

### **Text Understanding**

Qwen-VL-Chat also has some ability to understand images that contain dense text. As shown below, the file ```assets/mm_tutorial/Hospital.jpg``` is a hospital sign containing dense text.

![](assets/mm_tutorial/Hospital_Small.jpg)

As before, we can ask the model where the hospital's departments are located. The dialogue history is empty, so we use ```history=None```.

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Hospital.jpg'},
    {'text': 'Based on the photo, which floor is the Department of Otorhinolaryngology on?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should get an output similar to the following:

> The Department of Otorhinolaryngology is located on the 4th floor.

You can also ask follow-up questions. In that case, pass the previous dialogue history to ```model.chat``` with ```history=history```.

```python
query = tokenizer.from_list_format([
    {'text': 'Based on the photo, which floor is the Department of Surgery on?'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

> The Department of Surgery is located on the 3rd floor.

### **Mathematical Reasoning with Diagrams**

Using its diagram comprehension and mathematical reasoning capabilities, Qwen-VL-Chat can also handle more complex tasks! As shown below, the file ```assets/mm_tutorial/Menu.jpeg``` shows a restaurant menu. Now we want to know how much it would cost to buy two Salmon Burgers and three Meat Lover's Pizzas.

![](assets/mm_tutorial/Menu.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Menu.jpeg'},
    {'text': 'How much would I pay if I want to order two Salmon Burger and three Meat Lover\'s Pizza? \
Think carefully step by step.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

```Think carefully step by step.``` is a common prompt that guides the model through a complex task step by step. If your task is complex, try using it to improve accuracy. You should get an output similar to the following:

> To order two Salmon Burgers and three Meat Lover's Pizzas, you would need to pay the following:
>
> 1. For two Salmon Burgers: x2 Salmon Burgers at $10 each = $20
> 2. For three Meat Lover's Pizzas: x3 Meat Lover's Pizzas at $12 each = $36
>
> Therefore, the total cost would be $56.

### **Multi-image Understanding and Chinese Input**

In the previous examples, we mainly demonstrated Qwen-VL-Chat's question answering on a single image with English questions. In fact, Qwen-VL-Chat is a multilingual model that supports Chinese input, and it also accepts multiple images! In the following example, we ask Qwen-VL-Chat in Chinese to compare photos of two cities, Chongqing and Beijing (```assets/mm_tutorial/Chongqing.jpeg``` and ```assets/mm_tutorial/Beijing.jpeg```):

![](assets/mm_tutorial/Chongqing_Small.jpeg)

![](assets/mm_tutorial/Beijing_Small.jpeg)

```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Chongqing.jpeg'},
    {'image': 'assets/mm_tutorial/Beijing.jpeg'},
    {'text': '上面两张图片分别是哪两个城市？请对它们进行对比。'},
])
torch.manual_seed(5678)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should get an output similar to the following:

> 第一张图片是重庆的城市天际线，它反映了现代都市的繁华与喧嚣。第二张图片是北京的天际线，它象征着中国首都的现代化和国际化。两座城市都是中国的重要城市，拥有独特的文化和发展历史。

**Please note that comparing cities is a fairly subjective question, so the model's responses may be highly random. If you do not set the random seed with ```torch.manual_seed(5678)```, the output will differ each time. Even if you set the random seed, the results may still differ from those in this document because of differences in hardware and software environments.**

### **Grounding Capability**

Finally, we demonstrate Qwen-VL-Chat's ability to produce bounding boxes. Qwen-VL-Chat can mark a specified region of an image with a rectangular box according to your description. This may sound a bit abstract, so let's look at the following example. As shown below, the file ```assets/mm_tutorial/Shanghai.jpg``` is a photo of Shanghai. We first use a regular prompt to ask the model what is in the image.

![](assets/mm_tutorial/Shanghai_Small.jpeg)

```python
torch.manual_seed(1234)
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Shanghai.jpg'},
    {'text': '图里有啥'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

You should get an output similar to the following:

>
> 图中是中国上海的天际线，包括了上海塔、金茂大厦、上海环球金融中心、海洋大厦等著名建筑。

Next, we talk to the model using the prompt ```请给我框出图中上海环球金融中心和东方明珠``` (a request to draw boxes around the Shanghai World Financial Center and the Oriental Pearl Tower in the image) and see what happens. Note that here you need to pass the previous dialogue history to ```model.chat``` with ```history=history```.

```python
query = tokenizer.from_list_format([
    {'text': '请给我框出图中上海环球金融中心和东方明珠'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

```xml
<ref>上海环球金融中心</ref><box>(667,437),(760,874)</box>和<ref>东方明珠</ref><box>(506,75),(582,946)</box>
```

The Qwen-VL-Chat model has no hands, yet it did not refuse your request. The output may look strange at first, but it actually gives the exact locations of the Shanghai World Financial Center and the Oriental Pearl Tower in the image, expressed in a markup language. You can visualize it with the following code:

```python
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('Shanghai_Output.jpg')
```

The saved ```Shanghai_Output.jpg``` will look similar to the screenshot below:

![](assets/mm_tutorial/Shanghai_Output_Small.jpeg)

After that, you can continue chatting with Qwen-VL-Chat as usual:

```python
query = tokenizer.from_list_format([
    {'text': '帮我写个这座城市的旅游计划'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)
```

You should get an output similar to the following:

> 好的，以下是一个简单的上海旅游计划：
>
> 第一天：
> 上午：抵达上海，前往酒店办理入住手续。
> 中午：享用午餐后，前往外滩，欣赏黄浦江畔的美景，游览上海地标性建筑如浦发银行大楼、汇丰银行大楼等。
> 下午：游览南京路步行街，购买特色礼品或品尝当地美食。
> 晚上：在南京路附近的餐厅享用晚餐，然后去看上海的夜景。
>
> 第二天：
> 上午：前往上海科技馆，了解科技发展历史，观看各种科技展览。
> 中午：在科技馆附近的餐厅享用午餐。
> 下午：游览世纪公园，欣赏美景并放松身心。
> 晚上：在南京路或附近的陆家嘴地区享用晚餐，然后去看上海的夜景。
>
> 第三天：
> 上午：游览上海迪士尼乐园或上海海昌海洋公园，与各种迪士尼角色互动，或者在海洋公园观看海洋生物表演。
> 中午：在迪士尼乐园或海洋公园附近的餐厅享用午餐。
> 下午：自由活动，可以去购物、品尝当地美食或者去博物馆等。
> 晚上：在酒店附近享用晚餐，然后离开上海。
>
> 当然，以上只是一个简单的计划，上海有许多其他景点和活动，例如参观上海博物馆、游览田子坊、观看上海话剧等。具体计划可以根据个人兴趣和时间进行调整。

**Please note that a travel plan is a fairly subjective question, so the model's responses may be highly random. If you do not set the random seed with ```torch.manual_seed(1234)```, the output will differ each time. Even if you set the random seed, the results may still differ from those in this document because of differences in hardware and software environments.**

================================================
FILE: assets/mm_tutorial/TUTORIAL.ipynb
================================================

================================================
FILE: eval_mm/EVALUATION.md
================================================
# Evaluation

## Dependencies

```bash
pip install pycocoevalcap tqdm
```

## Image Caption

###
[Flickr30K](https://bryanplummer.com/Flickr30kEntities/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/flickr && cd data/flickr

# download images from https://bryanplummer.com/Flickr30kEntities/

# karpathy split annotations can be downloaded from https://cs.stanford.edu/people/karpathy/deepimagesent/

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/flickr30k/flickr30k_karpathy_test.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/flickr30k/flickr30k_karpathy_train.json

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
ds="flickr"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_caption.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [Nocaps](https://nocaps.org/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/nocaps && cd data/nocaps

# download images from https://nocaps.org/download

# original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/nocaps/nocaps_val.json

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
ds="nocaps"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_caption.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

## [COCO](https://cocodataset.org/)

> COCO images are used in VQAv2/OK-VQA/RefCOCO/RefCOCO+/RefCOCOg. Make sure you have downloaded the COCO images before evaluating on these benchmarks.

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/coco && cd data/coco

# download coco2014 images
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip

cd ../..
```

</details>

## General VQA

### [VQAv2](https://visualqa.org/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/vqav2 && cd data/vqav2

# make sure you have downloaded COCO images

# download questions and annotations
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip && unzip v2_Annotations_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip && unzip v2_Questions_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip && unzip v2_Annotations_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip && unzip v2_Questions_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip && unzip v2_Questions_Test_mscoco.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_train.jsonl
wget \
https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_testdev.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "vqav2_val" "vqav2_testdev"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_vqa.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### [OKVQA](https://okvqa.allenai.org/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/okvqa && cd data/okvqa

# download annotations and questions
wget https://okvqa.allenai.org/static/data/mscoco_train2014_annotations.json.zip && unzip mscoco_train2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip && unzip OpenEnded_mscoco_train2014_questions.json.zip
wget https://okvqa.allenai.org/static/data/mscoco_val2014_annotations.json.zip && unzip mscoco_val2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_val2014_questions.json.zip && unzip OpenEnded_mscoco_val2014_questions.json.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_val.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
ds="okvqa_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [TextVQA](https://textvqa.org/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/textvqa && cd data/textvqa

# download images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip train_val_images.zip

# download annotations and questions
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
ds="textvqa_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [VizWiz](https://vizwiz.org/tasks-and-datasets/vqa/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/vizwiz && cd data/vizwiz

# download images
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/train.zip && unzip train.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/val.zip && unzip val.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip && unzip test.zip

# download annotations
wget https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip && unzip Annotations.zip

# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluation</summary>

```bash
# evaluate vqa score on vizwiz val split
ds="vizwiz_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [DocVQA](https://www.docvqa.org/datasets)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/docvqa && cd data/docvqa

# download images and annotations from https://www.docvqa.org/datasets

# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluation</summary>

```bash
# evaluate vqa score on docvqa val split
ds="docvqa_val"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [ChartQA](https://aclanthology.org/2022.findings-acl.177/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/chartqa && cd data/chartqa

# download images from https://drive.google.com/file/d/1Lm_w6zeET1Hyl_9ks6w5nEsgpoyPHalV/view

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_augmented.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_augmented.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "chartqa_test_human" "chartqa_test_augmented"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_vqa.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/gqa && cd data/gqa

# download images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip images.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/testdev_balanced.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/train_balanced.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="gqa_testdev"
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [OCRVQA](https://ocr-vqa.github.io/)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/ocrvqa && cd data/ocrvqa

# download images by following instructions at https://ocr-vqa.github.io/kvqa_ProjectFiles/README.txt

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="ocrvqa_test"
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [AI2Diagram](https://allenai.org/data/diagrams)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/ai2diagram && cd data/ai2diagram

# download images
wget https://ai2-public-datasets.s3.amazonaws.com/diagrams/ai2d-all.zip

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ai2diagram/test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
ds="ai2diagram_test"
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_vqa.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

### [ScienceQA](https://github.com/lupantech/ScienceQA)

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/scienceqa/images && cd data/scienceqa/images

# download images
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip && unzip test.zip

cd ..

# download original questions (use the raw file URL; the /blob/ URL returns an HTML page)
wget https://github.com/lupantech/ScienceQA/raw/main/data/scienceqa/problems.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/scienceqa/scienceqa_test_img.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
ds="scienceqa_test_img"
checkpoint=/PATH/TO/CHECKPOINT
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    evaluate_multiple_choice.py \
    --checkpoint $checkpoint \
    --dataset $ds \
    --batch-size 8 \
    --num-workers 2
```

</details>

## Referring Expression Comprehension

### RefCOCO

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/refcoco && cd data/refcoco

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testB.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluation</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcoco_val" "refcoco_testA" "refcoco_testB"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_grounding.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### RefCOCO+

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/refcoco+ && cd data/refcoco+

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testB.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcoco+_val" "refcoco+_testA" "refcoco+_testB"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_grounding.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

### RefCOCOg

<details>
<summary>Data Preparation</summary>

```bash
mkdir -p data/refcocog && cd data/refcocog

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_test.jsonl

cd ../..
```

</details>

<details>
<summary>Evaluate</summary>

```bash
checkpoint=/PATH/TO/CHECKPOINT
for ds in "refcocog_val" "refcocog_test"
do
    python -m torch.distributed.launch --use-env \
        --nproc_per_node ${NPROC_PER_NODE:-8} \
        --nnodes ${WORLD_SIZE:-1} \
        --node_rank ${RANK:-0} \
        --master_addr ${MASTER_ADDR:-127.0.0.1} \
        --master_port ${MASTER_PORT:-12345} \
        evaluate_grounding.py \
        --checkpoint $checkpoint \
        --dataset $ds \
        --batch-size 8 \
        --num-workers 2
done
```

</details>

================================================
FILE: eval_mm/evaluate_caption.py
================================================
import argparse
import itertools
import json
import os
import random
import time
from functools import partial

import torch
from pycocoevalcap.eval import COCOEvalCap
from pycocotools.coco import COCO
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_collections = {
    'flickr': {
        'train': 'data/flickr30k/flickr30k_karpathy_test.json',
        'test': 'data/flickr30k/flickr30k_karpathy_test.json',
    },
    'nocaps': {
        'train': '',
        'test': 'data/nocaps/nocaps_val.json',
    },
}


class CaptionDataset(torch.utils.data.Dataset):

    def
__init__(self, train, test, prompt, few_shot=0): self.images = json.load(open(test))['images'] self.prompt = prompt self.few_shot = few_shot if few_shot > 0: self.train = json.load(open(train))['annotations'] def __len__(self): return len(self.images) def __getitem__(self, idx): image_id, image_path = self.images[idx]['id'], self.images[idx][ 'image'] few_shot_prompt = '' if self.few_shot > 0: few_shot_samples = random.sample(self.train, self.few_shot) for sample in few_shot_samples: few_shot_prompt += self.prompt.format( sample['image']) + f" {sample['caption']}" return { 'image_id': image_id, 'input_text': few_shot_prompt + self.prompt.format(image_path) } def collate_fn(inputs, tokenizer): image_ids = [_['image_id'] for _ in inputs] input_texts = [_['input_text'] for _ in inputs] input_tokens = tokenizer(input_texts, return_tensors='pt', padding='longest') return image_ids, input_tokens.input_ids, input_tokens.attention_mask class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) parser.add_argument('--few-shot', type=int, 
default=0) parser.add_argument('--seed', type=int, default=0) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) prompt = '<img>{}</img>Describe the image in English:' model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) tokenizer.padding_side = 'left' tokenizer.pad_token_id = tokenizer.eod_id random.seed(args.seed) dataset = CaptionDataset( train=ds_collections[args.dataset]['train'], test=ds_collections[args.dataset]['test'], prompt=prompt, few_shot=args.few_shot, ) coco_karpathy_test_loader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, tokenizer=tokenizer), ) image_ids = [] captions = [] for _, (ids, input_ids, attention_mask) in tqdm(enumerate(coco_karpathy_test_loader)): pred = model.generate( input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda(), do_sample=False, num_beams=1, max_new_tokens=30, min_new_tokens=8, length_penalty=0, num_return_sequences=1, use_cache=True, pad_token_id=tokenizer.eod_id, eos_token_id=tokenizer.eod_id, ) image_ids.extend(ids) captions.extend([ tokenizer.decode(_[input_ids.size(1):].cpu(), skip_special_tokens=True).strip() for _ in pred ]) torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_ids = [None for _ in range(world_size)] merged_captions = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_ids, image_ids) torch.distributed.all_gather_object(merged_captions, captions) merged_ids = [_ for _ in itertools.chain.from_iterable(merged_ids)] merged_captions = [ _ for _ in 
itertools.chain.from_iterable(merged_captions) ] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") results = [] for image_id, caption in zip(merged_ids, merged_captions): results.append({ 'image_id': int(image_id), 'caption': caption, }) time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) results_file = f'{args.dataset}_{time_prefix}.json' json.dump(results, open(results_file, 'w')) coco = COCO(ds_collections[args.dataset]['test']) coco_result = coco.loadRes(results_file) coco_eval = COCOEvalCap(coco, coco_result) coco_eval.evaluate() print(coco_eval.eval.items()) torch.distributed.barrier() ================================================ FILE: eval_mm/evaluate_grounding.py ================================================ import argparse import itertools import json import os import re from functools import partial import torch from torchvision.ops.boxes import box_area from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer ds_collections = { 'refcoco_val': 'data/refcoco/refcoco_val.jsonl', 'refcoco_testA': 'data/refcoco/refcoco_testA.jsonl', 'refcoco_testB': 'data/refcoco/refcoco_testB.jsonl', 'refcoco+_val': 'data/refcoco+/refcoco+_val.jsonl', 'refcoco+_testA': 'data/refcoco+/refcoco+_testA.jsonl', 'refcoco+_testB': 'data/refcoco+/refcoco+_testB.jsonl', 'refcocog_val': 'data/refcocog/refcocog_val.jsonl', 'refcocog_test': 'data/refcocog/refcocog_test.jsonl', } def box_iou(boxes1, boxes2): area1 = box_area(boxes1) area2 = box_area(boxes2) lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] wh = (rb - lt).clamp(min=0) # [N,M,2] inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] union = area1[:, None] + area2 - inter iou = inter / union return iou, union def collate_fn(batches, tokenizer): texts = [_['text'] for _ in batches] bboxes = [_['bbox'] for _ in batches] hws = [_['hw'] for _ in batches] input_ids = tokenizer(texts, 
return_tensors='pt', padding='longest') return input_ids.input_ids, input_ids.attention_mask, bboxes, hws class RefCOCODataset(torch.utils.data.Dataset): def __init__(self, test, tokenizer, prompt): self.datas = open(test).readlines() self.tokenizer = tokenizer self.prompt = prompt def __len__(self): return len(self.datas) def __getitem__(self, idx): data = json.loads(self.datas[idx].strip()) image = data['image'] text = data['sent'] bbox = data['bbox'] w, h = data['width'], data['height'] return { 'text': self.prompt.format(image, text), 'bbox': bbox, 'hw': (h, w), } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) 
tokenizer.padding_side = 'left' tokenizer.pad_token_id = tokenizer.eod_id prompt = '<img>{}</img><ref>{}</ref><box>' dataset = RefCOCODataset(test=ds_collections[args.dataset], tokenizer=tokenizer, prompt=prompt) dataloader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=True, collate_fn=partial(collate_fn, tokenizer=tokenizer), ) outputs = [] for _, (input_ids, attention_mask, bboxes, hws) in tqdm(enumerate(dataloader)): pred = model.generate( input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda(), do_sample=False, num_beams=1, max_new_tokens=28, min_new_tokens=10, length_penalty=1, num_return_sequences=1, use_cache=True, pad_token_id=tokenizer.eod_id, eos_token_id=tokenizer.eod_id, ) answers = [ tokenizer.decode(_[input_ids.size(1):].cpu(), skip_special_tokens=True) for _ in pred ] for bbox, hw, answer in zip(bboxes, hws, answers): outputs.append({ 'answer': answer, 'gt_bbox': bbox, 'hw': hw, }) torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_outputs = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_outputs, outputs) merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] PATTERN = re.compile(r'\((.*?)\),\((.*?)\)') if torch.distributed.get_rank() == 0: correct = total_cnt = 0 for i, output in enumerate(merged_outputs): predict_bbox = re.findall(PATTERN, output['answer']) try: if ',' not in predict_bbox[0][0] or ',' not in predict_bbox[0][ 1]: predict_bbox = (0., 0., 0., 0.) else: x1, y1 = [ float(tmp) for tmp in predict_bbox[0][0].split(',') ] x2, y2 = [ float(tmp) for tmp in predict_bbox[0][1].split(',') ] predict_bbox = (x1, y1, x2, y2) except: predict_bbox = (0., 0., 0., 0.) 
target_bbox = torch.tensor(output['gt_bbox'], dtype=torch.float32).view(-1, 4) predict_bbox = torch.tensor(predict_bbox, dtype=torch.float32).view(-1, 4) / 999 predict_bbox[:, 0::2] *= output['hw'][1] predict_bbox[:, 1::2] *= output['hw'][0] iou, _ = box_iou(predict_bbox, target_bbox) iou = iou.item() total_cnt += 1 if iou >= 0.5: correct += 1 print(f"Evaluating {args.dataset} ...") print(f'Precision @ 1: {correct / total_cnt} \n') torch.distributed.barrier() ================================================ FILE: eval_mm/evaluate_multiple_choice.py ================================================ import argparse import itertools import json import os from functools import partial import torch from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer multiple_choices = ['A', 'B', 'C', 'D', 'E'] ds_collections = { 'scienceqa_test_img': { 'test': 'data/scienceqa/scienceqa_test_img.jsonl', } } def collate_fn(batches, pad_token_id): input_tokens = [_['input_tokens'] for _ in batches] target_lengths = [_['target_lengths'] for _ in batches] answers = [_['answer'] for _ in batches] chunk_sizes = [len(_) for _ in input_tokens] input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)] max_lengths = max([len(_) for _ in input_tokens]) input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _ for _ in input_tokens] input_tokens = torch.LongTensor(input_tokens) attention_mask = 1 - input_tokens.eq(pad_token_id).float() return input_tokens, attention_mask, target_lengths, answers, chunk_sizes class MultipleChoiceDataste(torch.utils.data.Dataset): def __init__(self, test, prompt, tokenizer): self.datas = open(test).readlines() self.prompt = prompt self.tokenizer = tokenizer def __len__(self): return len(self.datas) def __getitem__(self, idx): data = json.loads(self.datas[idx].strip()) image = data['image'] hint = data['hint'] if data['hint'] else 'N/A' question = data['question'] choices = data['choices'] choice_list = [] for i, c in 
enumerate(choices): choice_list.append('{}. {}'.format(multiple_choices[i], c)) choice_txt = '\n'.join(choice_list) prompt = self.prompt.format(image, hint, question, choice_txt) prompt_tokens = self.tokenizer(prompt).input_ids target_tokens = [ self.tokenizer(' ' + _).input_ids for _ in multiple_choices[:len(choices)] ] return { 'input_tokens': [prompt_tokens + _ for _ in target_tokens], 'target_lengths': [len(_) for _ in target_tokens], 'answer': data['answer'], } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) prompt = '<img>{}</img>Context: {}\nQuestion: {}\nOptions: {}\nAnswer:' dataset = 
MultipleChoiceDataste(test=ds_collections[args.dataset]['test'], prompt=prompt, tokenizer=tokenizer) dataloader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id), ) results = [] with torch.no_grad(): for _, (input_tokens, attention_mask, target_lengths, answer, chunk_sizes) in tqdm(enumerate(dataloader)): outputs = model( input_ids=input_tokens[:, :-1].cuda(), attention_mask=attention_mask[:, :-1].cuda(), return_dict=True, ) losses = torch.nn.functional.cross_entropy(outputs.logits.permute( 0, 2, 1), input_tokens[:, 1:].cuda(), reduction='none') losses = losses.split(chunk_sizes, dim=0) for loss, target_length, answer in zip(losses, target_lengths, answer): target_loss = loss.mean(-1) for _ in range(len(target_length)): target_loss[_] = loss[_, -target_length[_]:].mean() pred = target_loss.argmin().item() if pred == answer: results.append(1) else: results.append(0) torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_results = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_results, results) merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") print(f'Acc@1: {sum(merged_results) / len(merged_results)}') torch.distributed.barrier() ================================================ FILE: eval_mm/evaluate_vqa.py ================================================ import argparse import itertools import json import os import random import time from functools import partial from typing import Optional import torch from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer from vqa import VQA from vqa_eval import VQAEval ds_collections = { 'vqav2_val': { 'train': 'data/vqav2/vqav2_train.jsonl', 'test': 
'data/vqav2/vqav2_val.jsonl', 'question': 'data/vqav2/v2_OpenEnded_mscoco_val2014_questions.json', 'annotation': 'data/vqav2/v2_mscoco_val2014_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'vqav2_testdev': { 'train': 'data/vqav2/vqav2_train.jsonl', 'test': 'data/vqav2/vqav2_testdev.jsonl', 'metric': None, 'max_new_tokens': 10, }, 'okvqa_val': { 'train': 'data/okvqa/okvqa_train.jsonl', 'test': 'data/okvqa/okvqa_val.jsonl', 'question': 'data/okvqa/OpenEnded_mscoco_val2014_questions.json', 'annotation': 'data/okvqa/mscoco_val2014_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'textvqa_val': { 'train': 'data/textvqa/textvqa_train.jsonl', 'test': 'data/textvqa/textvqa_val.jsonl', 'question': 'data/textvqa/textvqa_val_questions.json', 'annotation': 'data/textvqa/textvqa_val_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'vizwiz_val': { 'train': 'data/vizwiz/vizwiz_train.jsonl', 'test': 'data/vizwiz/vizwiz_val.jsonl', 'question': 'data/vizwiz/vizwiz_val_questions.json', 'annotation': 'data/vizwiz/vizwiz_val_annotations.json', 'metric': 'vqa_score', 'max_new_tokens': 10, }, 'vizwiz_test': { 'train': 'data/vizwiz/vizwiz_train.jsonl', 'test': 'data/vizwiz/vizwiz_test.jsonl', 'metric': None, 'max_new_tokens': 10, }, 'docvqa_val': { 'train': 'data/docvqa/train.jsonl', 'test': 'data/docvqa/val.jsonl', 'annotation': 'data/docvqa/val/val_v1.0.json', 'metric': 'anls', 'max_new_tokens': 100, }, 'docvqa_test': { 'train': 'data/docvqa/train.jsonl', 'test': 'data/docvqa/test.jsonl', 'metric': None, 'max_new_tokens': 100, }, 'chartqa_test_human': { 'train': 'data/chartqa/train_human.jsonl', 'test': 'data/chartqa/test_human.jsonl', 'metric': 'relaxed_accuracy', 'max_new_tokens': 100, }, 'chartqa_test_augmented': { 'train': 'data/chartqa/train_augmented.jsonl', 'test': 'data/chartqa/test_augmented.jsonl', 'metric': 'relaxed_accuracy', 'max_new_tokens': 100, }, 'gqa_testdev': { 'train': 'data/gqa/train.jsonl', 'test': 
'data/gqa/testdev_balanced.jsonl', 'metric': 'accuracy', 'max_new_tokens': 10, }, 'ocrvqa_val': { 'train': 'data/ocrvqa/ocrvqa_train.jsonl', 'test': 'data/ocrvqa/ocrvqa_val.jsonl', 'metric': 'accuracy', 'max_new_tokens': 100, }, 'ocrvqa_test': { 'train': 'data/ocrvqa/ocrvqa_train.jsonl', 'test': 'data/ocrvqa/ocrvqa_test.jsonl', 'metric': 'accuracy', 'max_new_tokens': 100, }, 'ai2diagram_test': { 'train': 'data/ai2diagram/train.jsonl', 'test': 'data/ai2diagram/test.jsonl', 'metric': 'accuracy', 'max_new_tokens': 10, } } # https://github.com/google-research/pix2struct/blob/main/pix2struct/metrics.py#L81 def relaxed_correctness(target: str, prediction: str, max_relative_change: float = 0.05) -> bool: """Calculates relaxed correctness. The correctness tolerates certain error ratio defined by max_relative_change. See https://arxiv.org/pdf/2203.10244.pdf, end of section 5.1: “Following Methani et al. (2020), we use a relaxed accuracy measure for the numeric answers to allow a minor inaccuracy that may result from the automatic data extraction process. We consider an answer to be correct if it is within 5% of the gold answer. For non-numeric answers, we still need an exact match to consider an answer to be correct.” Args: target: Target string. prediction: Predicted string. max_relative_change: Maximum relative change. Returns: Whether the prediction was correct given the specified tolerance. """ def _to_float(text: str) -> Optional[float]: try: if text.endswith('%'): # Convert percentages to floats. 
return float(text.rstrip('%')) / 100.0 else: return float(text) except ValueError: return None prediction_float = _to_float(prediction) target_float = _to_float(target) if prediction_float is not None and target_float: relative_change = abs(prediction_float - target_float) / abs(target_float) return relative_change <= max_relative_change else: return prediction.lower() == target.lower() def evaluate_relaxed_accuracy(entries): scores = [] for elem in entries: if isinstance(elem['annotation'], str): elem['annotation'] = [elem['annotation']] score = max([ relaxed_correctness(elem['answer'].strip(), ann) for ann in elem['annotation'] ]) scores.append(score) return sum(scores) / len(scores) def evaluate_exact_match_accuracy(entries): scores = [] for elem in entries: if isinstance(elem['annotation'], str): elem['annotation'] = [elem['annotation']] score = max([ (1.0 if (elem['answer'].strip().lower() == ann.strip().lower()) else 0.0) for ann in elem['annotation'] ]) scores.append(score) return sum(scores) / len(scores) def collate_fn(batches, tokenizer): questions = [_['question'] for _ in batches] question_ids = [_['question_id'] for _ in batches] annotations = [_['annotation'] for _ in batches] input_ids = tokenizer(questions, return_tensors='pt', padding='longest') return question_ids, input_ids.input_ids, input_ids.attention_mask, annotations class VQADataset(torch.utils.data.Dataset): def __init__(self, train, test, prompt, few_shot): self.test = open(test).readlines() self.prompt = prompt self.few_shot = few_shot if few_shot > 0: self.train = open(train).readlines() def __len__(self): return len(self.test) def __getitem__(self, idx): data = json.loads(self.test[idx].strip()) image, question, question_id, annotation = data['image'], data[ 'question'], data['question_id'], data.get('answer', None) few_shot_prompt = '' if self.few_shot > 0: few_shot_samples = random.sample(self.train, self.few_shot) for sample in few_shot_samples: sample = json.loads(sample.strip()) 
few_shot_prompt += self.prompt.format( sample['image'], sample['question']) + f" {sample['answer']}" return { 'question': few_shot_prompt + self.prompt.format(image, question), 'question_id': question_id, 'annotation': annotation } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) parser.add_argument('--few-shot', type=int, default=0) parser.add_argument('--seed', type=int, default=0) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) tokenizer.padding_side = 'left' tokenizer.pad_token_id = tokenizer.eod_id prompt = '<img>{}</img>{} Answer:' random.seed(args.seed) dataset = VQADataset( train=ds_collections[args.dataset]['train'], 
test=ds_collections[args.dataset]['test'], prompt=prompt, few_shot=args.few_shot, ) dataloader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, tokenizer=tokenizer), ) outputs = [] for _, (question_ids, input_ids, attention_mask, annotations) in tqdm(enumerate(dataloader)): pred = model.generate( input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda(), do_sample=False, num_beams=1, max_new_tokens=ds_collections[args.dataset]['max_new_tokens'], min_new_tokens=1, length_penalty=1, num_return_sequences=1, output_hidden_states=True, use_cache=True, pad_token_id=tokenizer.eod_id, eos_token_id=tokenizer.eod_id, ) answers = [ tokenizer.decode(_[input_ids.size(1):].cpu(), skip_special_tokens=True).strip() for _ in pred ] for question_id, answer, annotation in zip(question_ids, answers, annotations): if args.dataset in ['vqav2_val', 'vqav2_testdev', 'okvqa_val', 'textvqa_val', 'vizwiz_val']: outputs.append({ 'question_id': question_id, 'answer': answer, }) elif args.dataset in ['docvqa_val', 'infographicsvqa', 'gqa_testdev', 'ocrvqa_val', 'ocrvqa_test']: outputs.append({ 'questionId': question_id, 'answer': answer, 'annotation': annotation, }) elif args.dataset in ['ai2diagram_test']: outputs.append({ 'image': question_id, 'answer': answer, 'annotation': annotation, }) elif args.dataset in ['chartqa_test_human', 'chartqa_test_augmented']: outputs.append({ 'answer': answer, 'annotation': annotation, }) elif args.dataset in ['docvqa_test']: outputs.append({ 'questionId': question_id, 'answer': answer, }) elif args.dataset in ['vizwiz_test']: outputs.append({ 'image': question_id, 'answer': answer, }) else: raise NotImplementedError torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_outputs = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_outputs, 
json.dumps(outputs)) merged_outputs = [json.loads(_) for _ in merged_outputs] merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) results_file = f'{args.dataset}_{time_prefix}_fs{args.few_shot}_s{args.seed}.json' json.dump(merged_outputs, open(results_file, 'w'), ensure_ascii=False) if ds_collections[args.dataset]['metric'] == 'vqa_score': vqa = VQA(ds_collections[args.dataset]['annotation'], ds_collections[args.dataset]['question']) results = vqa.loadRes( resFile=results_file, quesFile=ds_collections[args.dataset]['question']) vqa_scorer = VQAEval(vqa, results, n=2) vqa_scorer.evaluate() print(vqa_scorer.accuracy) elif ds_collections[args.dataset]['metric'] == 'anls': json.dump(merged_outputs, open(results_file, 'w'), ensure_ascii=False) print('python infographicsvqa_eval.py -g ' + ds_collections[args.dataset]['annotation'] + ' -s ' + results_file) os.system('python infographicsvqa_eval.py -g ' + ds_collections[args.dataset]['annotation'] + ' -s ' + results_file) elif ds_collections[args.dataset]['metric'] == 'relaxed_accuracy': print({ 'relaxed_accuracy': evaluate_relaxed_accuracy(merged_outputs) }) elif ds_collections[args.dataset]['metric'] == 'accuracy': if 'gqa' in args.dataset: for entry in merged_outputs: response = entry['answer'] response = response.strip().split('.')[0].split( ',')[0].split('!')[0].lower() if 'is ' in response: response = response.split('is ')[1] if 'are ' in response: response = response.split('are ')[1] if 'a ' in response: response = response.split('a ')[1] if 'an ' in response: response = response.split('an ')[1] if 'the ' in response: response = response.split('the ')[1] if ' of' in response: response = response.split(' of')[0] response = response.strip() entry['answer'] = response print({'accuracy': evaluate_exact_match_accuracy(merged_outputs)}) 
torch.distributed.barrier() ================================================ FILE: eval_mm/infographicsvqa_eval.py ================================================ # This file can be downloaded from: https://www.docvqa.org/datasets/infographicvqa and https://rrc.cvc.uab.es/?ch=17&com=introduction import os, json import argparse question_ids_to_exclude = [] # answer_types = {'image span': 'Image-Span', 'question span': 'Question-Span', 'multiple spans': 'Multi-Span', 'non span': 'None span', 'list': 'List'} answer_types = {'image span': 'Image-Span', 'question span': 'Question-Span', 'multiple spans': 'Multi-Span', 'non span': 'None span'} evidence_types = {'table/list': 'Table/list', 'textual': 'Text', 'photo/pciture/visual_objects': 'Visual/Layout', 'figure': 'Figure', 'map': 'Map'} reasoning_requirements = {'comparison': 'Sorting', 'arithmetic': 'Arithmetic', 'counting':'Counting'} def save_json(file_path, data): with open(file_path, 'w+') as json_file: json.dump(data, json_file) def levenshtein_distance(s1, s2): if len(s1) > len(s2): s1, s2 = s2, s1 distances = range(len(s1) + 1) for i2, c2 in enumerate(s2): distances_ = [i2+1] for i1, c1 in enumerate(s1): if c1 == c2: distances_.append(distances[i1]) else: distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1]))) distances = distances_ return distances[-1] def validate_data(gtFilePath, submFilePath): """ Method validate_data: validates that all files in the results folder are correct (have the correct name contents). Validates also that there are no missing files in the folder. 
If an error is detected, the method raises it. """ gtJson = json.load(open(gtFilePath,'rb')); submJson = json.load(open(submFilePath,'rb')); if not 'data' in gtJson: raise Exception("The GT file is not valid (no data key)") if not 'dataset_name' in gtJson: raise Exception("The GT file is not valid (no dataset_name key)") if isinstance(submJson, list) == False : raise Exception("The Det file is not valid (root item must be an array)") if len(submJson) != len(gtJson['data']) : raise Exception("The Det file is not valid (invalid number of answers. Expected:" + str(len(gtJson['data'])) + " Found:" + str(len(submJson)) + ")") gtQuestions = sorted([r['questionId'] for r in gtJson['data']]) res_id_to_index = {int(r['questionId']): ix for ix, r in enumerate(submJson)} detQuestions = sorted([r['questionId'] for r in submJson]) if( (gtQuestions == detQuestions) == False ): raise Exception("The Det file is not valid. Question IDs must match GT") for gtObject in gtJson['data']: try: q_id = int(gtObject['questionId']); res_ix = res_id_to_index[q_id]; except: raise Exception("The Det file is not valid. Question " + str(gtObject['questionId']) + " not present") else: detObject = submJson[res_ix]; # if detObject['questionId'] != gtObject['questionId'] : # raise Exception("Answer #" + str(i) + " not valid (invalid question ID. Expected:" + str(gtObject['questionId']) + "Found:" + detObject['questionId'] + ")") if not 'answer' in detObject: raise Exception("Question " + str(gtObject['questionId']) + " not valid (no answer key)") if isinstance(detObject['answer'], list) == True : raise Exception("Question " + str(gtObject['questionId']) + " not valid (answer key has to be a single string)") def evaluate_method(gtFilePath, submFilePath, evaluationParams): """ Method evaluate_method: evaluates the method and returns the results. Results: dictionary with the following values: - method (required) Global method metrics. 
Ex: { 'Precision':0.8,'Recall':0.9 } - samples (optional) Per sample metrics. Ex: {'sample1' : { 'Precision':0.8,'Recall':0.9 } , 'sample2' : { 'Precision':0.8,'Recall':0.9 } """ show_scores_per_answer_type = evaluationParams.answer_types gtJson = json.load(open(gtFilePath,'rb')); submJson = json.load(open(submFilePath,'rb')); res_id_to_index = {int(r['questionId']): ix for ix, r in enumerate(submJson)} perSampleMetrics = {} totalScore = 0 row = 0 if show_scores_per_answer_type: answerTypeTotalScore = {x:0 for x in answer_types.keys()} answerTypeNumQuestions = {x:0 for x in answer_types.keys()} evidenceTypeTotalScore = {x:0 for x in evidence_types.keys()} evidenceTypeNumQuestions = {x:0 for x in evidence_types.keys()} reasoningTypeTotalScore = {x:0 for x in reasoning_requirements.keys()} reasoningTypeNumQuestions = {x:0 for x in reasoning_requirements.keys()} for gtObject in gtJson['data']: q_id = int(gtObject['questionId']); res_ix = res_id_to_index[q_id]; detObject = submJson[res_ix]; if q_id in question_ids_to_exclude: question_result = 0 info = 'Question EXCLUDED from the result' else: info = '' values = [] for answer in gtObject['answers']: # preprocess both the answers - gt and prediction gt_answer = ' '.join(answer.strip().lower().split()) det_answer = ' '.join(detObject['answer'].strip().lower().split()) #dist = levenshtein_distance(answer.lower(), detObject['answer'].lower()) dist = levenshtein_distance(gt_answer,det_answer) length = max( len(answer.upper()), len(detObject['answer'].upper()) ) values.append( 0.0 if length == 0 else float(dist) / float(length) ) question_result = 1 - min(values) if (question_result < evaluationParams.anls_threshold) : question_result = 0 totalScore += question_result if show_scores_per_answer_type: for q_type in gtObject["answer_type"]: answerTypeTotalScore[q_type] += question_result answerTypeNumQuestions[q_type] += 1 for q_type in gtObject["evidence"]: evidenceTypeTotalScore[q_type] += question_result 
evidenceTypeNumQuestions[q_type] += 1 for q_type in gtObject["operation/reasoning"]: reasoningTypeTotalScore[q_type] += question_result reasoningTypeNumQuestions[q_type] += 1 perSampleMetrics[str(gtObject['questionId'])] = { 'score':question_result, 'question':gtObject['question'], 'gt':gtObject['answers'], 'det':detObject['answer'], 'info': info } row = row + 1 methodMetrics = { 'score': 0 if len(gtJson['data']) == 0 else totalScore/ (len(gtJson['data']) - len(question_ids_to_exclude) ) } answer_types_scores = {} evidence_types_scores = {} operation_types_scores = {} if show_scores_per_answer_type: for a_type, ref in answer_types.items(): answer_types_scores[ref] = 0 if len(gtJson['data']) == 0 else answerTypeTotalScore[a_type] / (answerTypeNumQuestions[a_type] ) for e_type, ref in evidence_types.items(): evidence_types_scores[ref] = 0 if len(gtJson['data']) == 0 else evidenceTypeTotalScore[e_type] / (evidenceTypeNumQuestions[e_type] ) for r_type, ref in reasoning_requirements.items(): operation_types_scores[ref] = 0 if len(gtJson['data']) == 0 else reasoningTypeTotalScore[r_type] / (reasoningTypeNumQuestions[r_type] ) resDict = { 'result': methodMetrics, 'scores_by_types': {'answer_types': answer_types_scores, 'evidence_types': evidence_types_scores, 'operation_types': operation_types_scores}, 'per_sample_result':perSampleMetrics } return resDict; def display_results(results, show_answer_types): print("\nOverall ANLS: {:2.4f}".format(results['result']['score'])) if show_answer_types: print("\nAnswer types:") for a_type in answer_types.values(): print("\t{:12s} {:2.4f}".format(a_type, results['scores_by_types']['answer_types'][a_type])) print("\nEvidence types:") for e_type in evidence_types.values(): print("\t{:12s} {:2.4f}".format(e_type, results['scores_by_types']['evidence_types'][e_type])) print("\nOperation required:") for r_type in reasoning_requirements.values(): print("\t{:12s} {:2.4f}".format(r_type, 
results['scores_by_types']['operation_types'][r_type])) if __name__=='__main__': parser = argparse.ArgumentParser(description="InfographVQA evaluation script.") parser.add_argument('-g', '--ground_truth', type=str, help="Path of the Ground Truth file.", required=True) parser.add_argument('-s', '--submission_file', type=str, help="Path of your method's results file.", required=True) parser.add_argument('-t', '--anls_threshold', type=float, default=0.5, help="ANLS threshold to use (See Scene-Text VQA paper for more info.).", required=False) parser.add_argument('-a', '--answer_types', type=lambda s: s.lower() in ('1', 'true', 'yes'), default=False, help="Score breakdown by answer types (special gt file required). Pass true/false; a plain type=bool would treat any non-empty string, including 'False', as True.", required=False) parser.add_argument('-o', '--output', type=str, help="Path to a directory where to copy the file 'results.json' that contains per-sample results.", required=False) args = parser.parse_args() # Validate the format of ground truth and submission files. validate_data(args.ground_truth, args.submission_file) # Evaluate method results = evaluate_method(args.ground_truth, args.submission_file, args) display_results(results, args.answer_types) if args.output: output_dir = args.output if not os.path.exists(output_dir): os.makedirs(output_dir) resultsOutputname = os.path.join(output_dir, 'results.json') save_json(resultsOutputname, results) print("All results, including per-sample results, have been saved.") ================================================ FILE: eval_mm/mmbench/MMBENCH.md ================================================ # MMBench Evaluation ## Data ```bash /cpfs01/shared/public/shusheng.yss/workspace/23082502_qwenvl_eval_test/eval_mm/data/mmbench ``` ## Dev ```bash checkpoint=/PATH/TO/CHECKPOINT ds=mmbench_dev_20230712 python -m torch.distributed.launch --use-env \ --nproc_per_node ${NPROC_PER_NODE:-8} \ --nnodes ${WORLD_SIZE:-1} \ --node_rank ${RANK:-0} \ --master_addr ${MASTER_ADDR:-127.0.0.1} \ --master_port ${MASTER_PORT:-12345} \ 
evaluate_multiple_choice_mmbench.py \ --checkpoint $checkpoint \ --dataset $ds \ --batch-size 2 \ --num-workers 2 # the results will be saved to mmbench_dev_20230712.json # without the consistency constraint python mmbench_evaluation.py # with the consistency constraint python mmbench_evaluation_tricky.py ``` ## Test ```bash checkpoint=/PATH/TO/CHECKPOINT ds=mmbench_test_20230712 python -m torch.distributed.launch --use-env \ --nproc_per_node ${NPROC_PER_NODE:-8} \ --nnodes ${WORLD_SIZE:-1} \ --node_rank ${RANK:-0} \ --master_addr ${MASTER_ADDR:-127.0.0.1} \ --master_port ${MASTER_PORT:-12345} \ evaluate_multiple_choice_mmbench.py \ --checkpoint $checkpoint \ --dataset $ds \ --batch-size 2 \ --num-workers 2 # the results will be saved to mmbench_test_20230712.json # convert to the submission format, with the consistency constraint python mmbench_predict_to_submission.py ``` ================================================ FILE: eval_mm/mmbench/evaluate_multiple_choice_mmbench.py ================================================ import argparse import itertools import json import os from functools import partial import torch from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer multiple_choices = ['A', 'B', 'C', 'D', 'E'] ds_collections = { 'mmbench_dev_20230712': { 'test': 'data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.jsonl', }, 'mmbench_test_20230712': { 'test': 'data/mmbench/mmbench_test_20230712/mmbench_test_20230712.jsonl', } } def collate_fn(batches, pad_token_id): indexes = [_['index'] for _ in batches] input_tokens = [_['input_tokens'] for _ in batches] target_lengths = [_['target_lengths'] for _ in batches] chunk_sizes = [len(_) for _ in input_tokens] input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)] max_lengths = max([len(_) for _ in input_tokens]) input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _ for _ in input_tokens] input_tokens = torch.LongTensor(input_tokens) attention_mask = 1 - 
input_tokens.eq(pad_token_id).float() return input_tokens, attention_mask, target_lengths, chunk_sizes, indexes class MultipleChoiceDataste(torch.utils.data.Dataset): def __init__(self, test, prompt, tokenizer): self.datas = open(test).readlines() self.prompt = prompt self.tokenizer = tokenizer def __len__(self): return len(self.datas) def __getitem__(self, idx): data = json.loads(self.datas[idx].strip()) index = data['index'] image = data['image'] hint = data['hint'] if data['hint'] else 'N/A' question = data['question'] choices = data['choices'] choice_list = [] for i, c in enumerate(choices): choice_list.append('{}. {}'.format(multiple_choices[i], c)) choice_txt = '\n'.join(choice_list) prompt = self.prompt.format(image, hint, question, choice_txt) prompt_tokens = self.tokenizer(prompt).input_ids target_tokens = [ self.tokenizer(' ' + _).input_ids for _ in multiple_choices[:len(choices)] ] return { 'index': index, 'input_tokens': [prompt_tokens + _ for _ in target_tokens], 'target_lengths': [len(_) for _ in target_tokens], # 'answer': data['answer'], } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') 
parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) prompt = '<img>{}</img>Context: {}\nQuestion: {}\nOptions: {}\nAnswer:' dataset = MultipleChoiceDataste(test=ds_collections[args.dataset]['test'], prompt=prompt, tokenizer=tokenizer) dataloader = torch.utils.data.DataLoader( dataset=dataset, sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id), ) results = [] with torch.no_grad(): for _, (input_tokens, attention_mask, target_lengths, chunk_sizes, indexes) in tqdm(enumerate(dataloader)): outputs = model( input_ids=input_tokens[:, :-1].cuda(), attention_mask=attention_mask[:, :-1].cuda(), return_dict=True, ) losses = torch.nn.functional.cross_entropy(outputs.logits.permute( 0, 2, 1), input_tokens[:, 1:].cuda(), reduction='none') losses = losses.split(chunk_sizes, dim=0) for loss, target_length, index in zip(losses, target_lengths, indexes): target_loss = loss.mean(-1) for _ in range(len(target_length)): target_loss[_] = loss[_, -target_length[_]:].mean() pred = target_loss.argmin().item() results.append({ "index": index, "prediction": pred, }) torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_results = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_results, results) merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)] if torch.distributed.get_rank() == 0: 
        json.dump(merged_results, open(f"{args.dataset}.json", "w"))
    torch.distributed.barrier()

================================================
FILE: eval_mm/mmbench/mmbench_converter_dev.py
================================================
import pandas as pd
import io
import base64
import json
from PIL import Image

'''
This script converts the mmbench_dev tsv file to jsonl
'''

datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t')

global_choices = ['A', 'B', 'C', 'D']

def decode_base64_to_image(base64_string):
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    return image


with open('./data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.jsonl', 'w') as f:
    for idx in range(len(datas)):
        data = datas.iloc[idx]

        index = int(data['index'])
        question = data['question']
        hint = data['hint'] if not pd.isna(data['hint']) else 'N/A'

        choices = []
        for opt in global_choices:
            if pd.isna(data[opt]):
                continue
            choices.append(data[opt])

        answer = global_choices.index(data['answer'])

        image = decode_base64_to_image(data['image'])
        image.save("data/mmbench/mmbench_dev_20230712/images/%d.jpg" % index)

        f.write(json.dumps({
            "index": index,
            "image": "data/mmbench/mmbench_dev_20230712/images/%d.jpg" % index,
            "hint": hint,
            "question": question,
            "choices": choices,
            "answer": answer,
        }) + "\n")

================================================
FILE: eval_mm/mmbench/mmbench_converter_test.py
================================================
import pandas as pd
import io
import base64
import json
from PIL import Image

'''
This script converts the mmbench_test tsv file to jsonl
It is very similar to mmbench_converter_dev, except that there is no answer for accuracy calculation
'''

datas = pd.read_csv("data/mmbench/mmbench_test_20230712/mmbench_test_20230712.tsv", sep='\t')

global_choices = ['A', 'B', 'C', 'D']

def decode_base64_to_image(base64_string):
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
return image with open('./data/mmbench/mmbench_test_20230712/mmbench_test_20230712.jsonl', 'w') as f: for idx in range(len(datas)): data = datas.iloc[idx] index = int(data['index']) question = data['question'] hint = data['hint'] if not pd.isna(data['hint']) else 'N/A' choices = [] for opt in global_choices: if pd.isna(data[opt]): continue choices.append(data[opt]) # answer = global_choices.index(data['answer']) image = decode_base64_to_image(data['image']) image.save("data/mmbench/mmbench_test_20230712/images/%d.jpg" % index) f.write(json.dumps({ "index": index, "image": "data/mmbench/mmbench_test_20230712/images/%d.jpg" % index, "hint": hint, "question": question, "choices": choices, # "answer": answer, }) + "\n") ================================================ FILE: eval_mm/mmbench/mmbench_evaluation.py ================================================ import pandas as pd import json ''' This script provides `global top-1 accuracy` metric calculation for mmbench_dev. ''' predictions = json.load(open('mmbench_dev_20230712.json')) index2predictions = {} for pred in predictions: index2predictions[pred['index']] = pred['prediction'] datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t') glb_opts = ['A', 'B', 'C', 'D'] index2answer = {} for idx in range(len(datas)): data = datas.iloc[idx] index2answer[data['index']] = glb_opts.index(data['answer']) identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()])) correct = 0 total = 0 for index in identity_indexes: for _ in range(4): cycle_index = int(_ * 1e6 + index) if index2predictions.get(cycle_index, None) is not None: if index2predictions[cycle_index] == index2answer[cycle_index]: continue else: print(cycle_index) break else: correct += 1 total += 1 print(correct, total) ================================================ FILE: eval_mm/mmbench/mmbench_evaluation_tricky.py ================================================ import pandas as pd import json import random 
'''
This script provides metric calculation for mmbench_dev with the same accuracy algo as OpenCompass server
'''

predictions = json.load(open('mmbench_dev_20230712.json'))

index2predictions = {}
for pred in predictions:
    index2predictions[pred['index']] = pred['prediction']


from collections import Counter

def most_common_elements(lst):
    counter = Counter(lst)
    max_count = max(counter.values())
    most_common = [element for element, count in counter.items() if count == max_count]
    return random.choice(most_common) # break ties between equally frequent answers at random

datas = pd.read_csv("data/mmbench/mmbench_dev_20230712/mmbench_dev_20230712.tsv", sep='\t')

glb_opts = ['A', 'B', 'C', 'D']
index2answer = {}
index2choices = {}
index2rawanswer = {}
for idx in range(len(datas)):
    data = datas.iloc[idx]

    choices = []
    for opt in glb_opts:
        if not pd.isna(data[opt]):
            choices.append(data[opt])
    index2choices[data['index']] = choices

    index2answer[data['index']] = glb_opts.index(data['answer'])
    index2rawanswer[data['index']] = choices[glb_opts.index(data['answer'])]

identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()]))

correct = 0
total = 0
for index in identity_indexes:
    raw_preds = []
    raw_answer = []
    for _ in range(4):
        cycle_index = int(_ * 1e6 + index)
        if index2predictions.get(cycle_index, None) is not None:
            raw_answer = index2rawanswer[cycle_index]
            raw_pred = index2choices[cycle_index][index2predictions[cycle_index]]
            raw_preds.append(raw_pred)

    if len(set(raw_preds)) == 1:
        if raw_preds[0] == raw_answer:
            correct += 1
    else:
        result = most_common_elements(raw_preds)
        if result == raw_answer:
            correct += 1

    total += 1

print(correct, total, correct / total * 100.)
================================================
FILE: eval_mm/mmbench/mmbench_predict_to_submission.py
================================================
import pandas as pd
import json
import random

'''
This script converts the output file of our inference processor to the target format of the OpenCompass evaluator server
'''

predictions = json.load(open('mmbench_test_20230712.json'))

index2predictions = {}
for pred in predictions:
    index2predictions[pred['index']] = pred['prediction']

from collections import Counter

def most_common_elements(lst):
    counter = Counter(lst)
    max_count = max(counter.values())
    most_common = [element for element, count in counter.items() if count == max_count]
    print(most_common)
    return random.choice(most_common)
    # return most_common

datas = pd.read_csv("data/mmbench/mmbench_test_20230712/mmbench_test_20230712.tsv", sep='\t')

datas = datas.drop('image', axis=1)

glb_opts = ['A', 'B', 'C', 'D']
index2choices = {}
for idx in range(len(datas)):
    data = datas.iloc[idx]

    choices = []
    for opt in glb_opts:
        if not pd.isna(data[opt]):
            choices.append(data[opt])
    index2choices[data['index']] = choices

identity_indexes = list(set([int(_ % 1e6) for _ in index2predictions.keys()]))


processed_index2predictions = {}
for index in identity_indexes:
    raw_preds = []
    for _ in range(4):
        cycle_index = int(_ * 1e6 + index)
        if index2predictions.get(cycle_index, None) is not None:
            raw_pred = index2choices[cycle_index][index2predictions[cycle_index]]
            raw_preds.append(raw_pred)

    if len(set(raw_preds)) == 1:
        pred_answer = raw_preds[0]
    else:
        pred_answer = most_common_elements(raw_preds)

    print(index, pred_answer)
    for _ in range(4):
        cycle_index = int(_ * 1e6 + index)
        if index2predictions.get(cycle_index, None) is not None:
            processed_index2predictions[cycle_index] = index2choices[cycle_index].index(pred_answer)


predictions = []
for idx in range(len(datas)):
    data = datas.iloc[idx]
    index = data['index']
    prediction = glb_opts[processed_index2predictions[index]]
    predictions.append(prediction)

datas['prediction'] = predictions
datas.to_excel("mmbench_test_20230712_230831_constrained.xlsx", index=False)
# "constrained" means we force the model to predict the same answer when it is tested on the same question multiple times

================================================
FILE: eval_mm/mme/EVAL_MME.md
================================================
# MME Benchmark

[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Qwen-VL-Chat achieves state-of-the-art (SOTA) results on both perception and cognition evaluation.

Perception Evaluation

| Rank | Model | Version | Score |
|:----:|:---------------:|:------------------------:|:-------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** | **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **1487.57** |
| 2 | Skywork-MM | Skywork-MM-13B | 1419.08 |
| 3 | MMICL | FlanT5xxl | 1376.00 |
| 4 | Lynx | vicuna-7b | 1373.23 |
| 5 | BLIVA | FlanT5xxl | 1337.73 |

Cognition Evaluation

| Rank | Model | Version | Score |
|:----:|:----------------:|:--------------:|:----------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** | **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **360.71** |
| 2 | MMICL | FlanT5xxl | 360.36 |
| 3 | Skywork-MM | Skywork-MM-13B | 356.43 |
| 4 | BLIVA | FlanT5xxl | 331.43 |
| 5 | LRV-Instruction | LRV-7B | 328.21 |

Full Metrics

```
=========== Perception ===========
total score: 1487.576330532213
existence score: 158.33333333333331
count score: 150.0
position score: 128.33333333333334
color score: 170.0
posters score: 178.57142857142856
celebrity score: 120.58823529411764
scene score: 152.25
landmark score: 164.0
artwork score: 125.5
OCR score: 140.0

=========== Cognition ===========
total score: 360.71428571428567
commonsense_reasoning score: 130.7142857142857
numerical_calculation score: 40.0
text_translation score: 147.5
code_reasoning score: 42.5
```

## How To Reproduce Results of MME Benchmark

1. Download MME images and eval_tool from the [MME repo](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md)
2. Rearrange images by executing `python get_images.py`
3. Evaluate Qwen-VL-Chat results by executing `python eval.py`
4. Calculate MME results by executing `python calculation.py --results_dir Qwen-VL-Chat`. The calculation script comes from the MME eval_tool.

================================================
FILE: eval_mm/mme/eval.py
================================================
import os
from tqdm import tqdm

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

checkpoint = 'Qwen/Qwen-VL-Chat'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map='cuda', trust_remote_code=True).eval()

model.generation_config = GenerationConfig.from_pretrained(checkpoint, trust_remote_code=True)
model.generation_config.top_p = 0.01

root = 'Your_Results'
output = 'Qwen-VL-Chat'
os.makedirs(output, exist_ok=True)
for filename in os.listdir(root):
    with open(os.path.join(root, filename), 'r') as fin, open(os.path.join(output, filename), 'w') as fout:
        lines = fin.read().splitlines()
        filename = filename.replace('.txt', '')
        for line in tqdm(lines):
            # each line is: image name, question, ground-truth answer (tab-separated)
            img, question, gt = line.strip().split('\t')
            img_path = os.path.join('images', filename, img)
            assert os.path.exists(img_path), img_path
            query = f'<img>{img_path}</img>\n{question}'
            response, _ = model.chat(tokenizer, query=query, history=None)

            print(img, question, gt, response, sep='\t', file=fout)

================================================
FILE: eval_mm/mme/get_images.py ================================================ import os from tqdm import tqdm os.system('rm -rf images') os.system('mkdir images') os.system('cp -r ../MME_Benchmark_release/OCR images/') os.system('mkdir images/artwork') os.system('cp ../MME_Benchmark_release/artwork/questions_answers_YN/* images/artwork/') with open('LaVIN/artwork.txt') as fin: paths = [ line.strip().split('\t', 1)[0] for line in fin ] paths = list(set(paths)) for path in tqdm(paths): os.system(f'cp ../MME_Benchmark_release/artwork/images/toy_dataset/{path} images/artwork/{path}') os.system('mkdir images/celebrity') os.system('cp ../MME_Benchmark_release/celebrity/images/* images/celebrity/') os.system('cp ../MME_Benchmark_release/celebrity/questions_answers_YN/* images/celebrity/') os.system('cp -r ../MME_Benchmark_release/code_reasoning images/') os.system('cp -r ../MME_Benchmark_release/color images/') os.system('cp -r ../MME_Benchmark_release/commonsense_reasoning images/') os.system('cp -r ../MME_Benchmark_release/count images/') os.system('cp -r ../MME_Benchmark_release/existence images/') os.system('mkdir images/landmark') os.system('cp ../MME_Benchmark_release/landmark/images/* images/landmark/') os.system('cp ../MME_Benchmark_release/landmark/questions_answers_YN/* images/landmark/') os.system('cp -r ../MME_Benchmark_release/numerical_calculation images/') os.system('cp -r ../MME_Benchmark_release/position images/') os.system('mkdir images/posters') os.system('cp ../MME_Benchmark_release/posters/images/* images/posters/') os.system('cp ../MME_Benchmark_release/posters/questions_answers_YN/* images/posters/') os.system('mkdir images/scene') os.system('cp ../MME_Benchmark_release/scene/images/* images/scene/') os.system('cp ../MME_Benchmark_release/scene/questions_answers_YN/* images/scene/') os.system('cp -r ../MME_Benchmark_release/text_translation images/') ================================================ FILE: eval_mm/seed_bench/EVAL_SEED.md 
================================================
# Seed-Bench Evaluation

[SEED-Bench](https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions including both **image** and **video** understanding.

Qwen-VL and Qwen-VL-Chat achieve state-of-the-art (SOTA) results on this benchmark.

<img src="leaderboard.jpg"/>

## How To Process Video by Qwen-VL

Qwen-VL and Qwen-VL-Chat were not trained on any video data or video tasks, but they can understand some videos in a zero-shot way. For the video question-answering task, we utilize four uniformly sampled frames per video sample. These frames are treated as separate images and are stitched into the context. For example:

```
{
    "question_id": "v0",
    "prompt": "<img>video_imgs_4/v0_0.jpg</img>\n<img>video_imgs_4/v0_1.jpg</img>\n<img>video_imgs_4/v0_2.jpg</img>\n<img>video_imgs_4/v0_3.jpg</img>\nQuestion: Can you identify the action taking place in the video?\nOptions: A. pretending to take something out of something\nB. pretending to take something from somewhere\nC. feigning to insert something into something\nD. simulating putting something onto something\nAnswer:"
}
```

The above JSON line can be used as the input to `eval_mm/seed_bench/eval.py`, which outputs the following result:
```
{"question_id": "v0", "prediction": "B"}
```

Please see [eval_mm/seed_bench/eval.py](eval.py) for more inference details.

## How To Reproduce Results of Seed-Bench

1. Download all images and videos by following the [instruction](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md). Then modify the root path in `eval_mm/seed_bench/trans.py` with your customized path.
```
# path of SEED-Bench.json, download from https://huggingface.co/datasets/AILab-CVC/SEED-Bench/blob/main/SEED-Bench.json
seed_bench_input_path = 'SEED-Bench.json'
# root directory of evaluation dimension 1-9, following https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md
cc3m_dir = "/YOUR_PATH_TO/seed_bench_image"
# root directory of evaluation dimension 10
dimension10_dir = "/YOUR_PATH_TO/SSV2/videos"
# root directory of evaluation dimension 11
dimension11_dir = "/YOUR_PATH_TO/EPIC-KITCHENS/3h91syskeag572hl6tvuovwv4d/videos/test"
# root directory of evaluation dimension 12
dimension12_dir = "/YOUR_PATH_TO/BreakfastII_15fps_qvga_sync"
```

2. Generate the input files for Qwen-VL in JSON Lines format.
```
cd eval_mm/seed_bench/
python trans.py
```
This script will output two JSONL files and one directory. `image_input.jsonl` is the input file for image evaluation and `video_input_4.jsonl` is the input file for video evaluation with 4 frames per video. The directory `video_imgs_4` contains the 4 frames extracted from each video. We provide our [image_input.jsonl](http://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/seed_bench/image_input.jsonl) and [video_input_4.jsonl](http://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/seed_bench/video_input_4.jsonl) here for reference.

3. Produce the results of Seed-Bench.
```
# The number of available GPUs
export NPROC_PER_NODE=8

# Produce the Qwen-VL-Chat results of image understanding
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    eval.py \
    --checkpoint Qwen/Qwen-VL-Chat \
    --dataset image_input.jsonl \
    --batch-size 4 \
    --num-workers 2
# Collect the result files
cat result_?.jsonl >results_chat_img.jsonl
rm result_?.jsonl

# Produce the results of video understanding
python -m torch.distributed.launch --use-env \
    --nproc_per_node ${NPROC_PER_NODE:-8} \
    --nnodes ${WORLD_SIZE:-1} \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-12345} \
    eval.py \
    --checkpoint Qwen/Qwen-VL-Chat \
    --dataset video_input_4.jsonl \
    --batch-size 2 \
    --num-workers 1
# Collect the result files
cat result_?.jsonl >results_chat_vid.jsonl
rm result_?.jsonl

# The file `results_chat.jsonl` can be submitted to the leaderboard
cat results_chat_img.jsonl results_chat_vid.jsonl >results_chat.jsonl
```

You can reproduce the Seed-Bench results of Qwen-VL by replacing `Qwen/Qwen-VL-Chat` with `Qwen/Qwen-VL` in the above script.
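
Before submitting, it can be useful to check the collected predictions locally. The following is a minimal sketch, not part of the repository: it assumes the input file carries `question_id` and `answer` fields (as written by `trans.py`) and the result file carries `question_id` and `prediction` fields (as written by `eval.py`). The helper names `load_jsonl` and `accuracy` are hypothetical.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path) as fin:
        return [json.loads(line) for line in fin if line.strip()]


def accuracy(input_path, result_path):
    """Return (correct, total) by joining predictions to answers on question_id.

    Assumption: both files use the field names produced by trans.py / eval.py.
    Predictions whose question_id is missing from the input file are ignored.
    """
    answers = {row['question_id']: row['answer'] for row in load_jsonl(input_path)}
    correct = 0
    total = 0
    for row in load_jsonl(result_path):
        qid = row['question_id']
        if qid not in answers:
            continue
        total += 1
        correct += int(row['prediction'] == answers[qid])
    return correct, total
```

For example, `accuracy('image_input.jsonl', 'results_chat_img.jsonl')` would return the number of correct image predictions and the number of scored questions; dividing the two should agree with the `Acc@1` value printed by `eval.py`, provided the result files from all ranks were concatenated.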
================================================ FILE: eval_mm/seed_bench/eval.py ================================================ import argparse import itertools import json import os from functools import partial import torch from tqdm import tqdm from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig def collate_fn(batches, pad_token_id): input_tokens = [_['input_tokens'] for _ in batches] target_lengths = [_['target_lengths'] for _ in batches] answers = [_['answer'] for _ in batches] question_id = [_['question_id'] for _ in batches] chunk_sizes = [len(_) for _ in input_tokens] input_tokens = [_ for _ in itertools.chain.from_iterable(input_tokens)] max_lengths = max([len(_) for _ in input_tokens]) input_tokens = [[pad_token_id] * (max_lengths - len(_)) + _ for _ in input_tokens] input_tokens = torch.LongTensor(input_tokens) attention_mask = 1 - input_tokens.eq(pad_token_id).float() return input_tokens, attention_mask, target_lengths, answers, chunk_sizes, question_id class MultipleChoiceDataste(torch.utils.data.Dataset): def __init__(self, test, tokenizer): self.datas = [] with open(test) as fin: for line in tqdm(fin): self.datas.append(json.loads(line.strip())) self.tokenizer = tokenizer def __len__(self): return len(self.datas) def __getitem__(self, idx): data = self.datas[idx] prompt = data['prompt'] prompt_tokens = self.tokenizer(prompt).input_ids target_tokens = [ self.tokenizer(' ' + _).input_ids for _ in ['A', 'B', 'C', 'D'] ] return { 'input_tokens': [prompt_tokens + _ for _ in target_tokens], 'target_lengths': [len(_) for _ in target_tokens], 'answer': data['answer'], 'question_id': data['question_id'], } class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): self._size = int(size) assert size > 0 self._rank = torch.distributed.get_rank() self._world_size = torch.distributed.get_world_size() self._local_indices = self._get_local_indices(size, self._world_size, 
self._rank) @staticmethod def _get_local_indices(total_size, world_size, rank): shard_size = total_size // world_size left = total_size % world_size shard_sizes = [shard_size + int(r < left) for r in range(world_size)] begin = sum(shard_sizes[:rank]) end = min(sum(shard_sizes[:rank + 1]), total_size) return range(begin, end) def __iter__(self): yield from self._local_indices def __len__(self): return len(self._local_indices) if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--checkpoint', type=str, default='') parser.add_argument('--dataset', type=str, default='') parser.add_argument('--batch-size', type=int, default=1) parser.add_argument('--num-workers', type=int, default=1) args = parser.parse_args() torch.distributed.init_process_group( backend='nccl', world_size=int(os.getenv('WORLD_SIZE', '1')), rank=int(os.getenv('RANK', '0')), ) torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) model = AutoModelForCausalLM.from_pretrained( args.checkpoint, device_map='cuda', trust_remote_code=True).eval() tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True) model.generation_config = GenerationConfig.from_pretrained(args.checkpoint, trust_remote_code=True) model.generation_config.top_p = 0.01 dataset = MultipleChoiceDataste(test=args.dataset, tokenizer=tokenizer) dataloader = torch.utils.data.DataLoader( dataset=dataset, # sampler=InferenceSampler(1000), sampler=InferenceSampler(len(dataset)), batch_size=args.batch_size, num_workers=args.num_workers, pin_memory=True, drop_last=False, collate_fn=partial(collate_fn, pad_token_id=tokenizer.eod_id), ) results = [] fout = open('result_{}.jsonl'.format(torch.distributed.get_rank()), 'w') with torch.no_grad(): for _, (input_tokens, attention_mask, target_lengths, answers, chunk_sizes, question_ids) in tqdm(enumerate(dataloader)): outputs = model( input_ids=input_tokens[:, :-1].cuda(), attention_mask=attention_mask[:, :-1].cuda(), return_dict=True, ) losses = 
torch.nn.functional.cross_entropy(outputs.logits.permute( 0, 2, 1), input_tokens[:, 1:].cuda(), reduction='none') losses = losses.split(chunk_sizes, dim=0) for loss, target_length, answer, question_id in zip(losses, target_lengths, answers, question_ids): target_loss = loss.mean(-1) for _ in range(len(target_length)): target_loss[_] = loss[_, -target_length[_]:].mean() pred = target_loss.argmin().item() pred = chr(pred + 65) if pred == answer: results.append(1) else: results.append(0) answer_record = { 'question_id': question_id, 'prediction': pred } print(json.dumps(answer_record), file=fout) fout.close() torch.distributed.barrier() world_size = torch.distributed.get_world_size() merged_results = [None for _ in range(world_size)] torch.distributed.all_gather_object(merged_results, results) merged_results = [_ for _ in itertools.chain.from_iterable(merged_results)] if torch.distributed.get_rank() == 0: print(f"Evaluating {args.dataset} ...") print(f'Acc@1: {sum(merged_results) / len(merged_results)}') torch.distributed.barrier() ================================================ FILE: eval_mm/seed_bench/trans.py ================================================ import os import av import json import torch import numpy as np from PIL import Image from tqdm import tqdm from decord import VideoReader, cpu # path of SEED-Bench.json, download from https://huggingface.co/datasets/AILab-CVC/SEED-Bench/blob/main/SEED-Bench.json seed_bench_input_path = 'SEED-Bench.json' # root directory of evaluation dimension 1-9, following https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md cc3m_dir = "/YOUR_PATH_TO/seed_bench_image" # root directory of evaluation dimension 10 dimension10_dir = "/YOUR_PATH_TO/SSV2/videos" # root directory of evaluation dimension 11 dimension11_dir = "/YOUR_PATH_TO/EPIC-KITCHENS/3h91syskeag572hl6tvuovwv4d/videos/test" # root directory of evaluation dimension 12 dimension12_dir = "/YOUR_PATH_TO/BreakfastII_15fps_qvga_sync" def is_integer_string(s): 
try: int(s) return True except ValueError: return False def filter_questions(data, task='all'): if task == "image": return [q for q in data if 1 <= q["question_type_id"] <= 9] elif task == "video": return [q for q in data if 10 <= q["question_type_id"] <= 12] elif task == "all": return data elif is_integer_string(task): return [q for q in data if q["question_type_id"] == int(task)] else: raise ValueError(f"Invalid task: {task}") def get_index(num_frames, num_segments): if num_segments > num_frames: offsets = np.array([ idx for idx in range(num_frames) ]) else: # uniform sampling seg_size = float(num_frames - 1) / num_segments start = int(seg_size / 2) offsets = np.array([ start + int(np.round(seg_size * idx)) for idx in range(num_segments) ]) return offsets with open(seed_bench_input_path) as fin: qa_anno = json.load(fin)['questions'] fout = open('image_input.jsonl', 'w') i_anno = filter_questions(qa_anno, 'image') for qa_item in tqdm(i_anno): data_path = cc3m_dir + qa_item['data_id'] choices = [qa_item['choice_a'], qa_item['choice_b'], qa_item['choice_c'], qa_item['choice_d']] choice_list = [] for i, c in enumerate(choices): choice_list.append('{}. 
{}'.format(chr(i + 65), c))
    choice_txt = '\n'.join(choice_list)

    prompt = '<img>{}</img>\nQuestion: {}\nOptions: {}\nAnswer:'.format(
        data_path, qa_item['question'], choice_txt)
    print(json.dumps({
        'question_id': qa_item['question_id'],
        'prompt': prompt,
        'answer': qa_item['answer'],
    }), file=fout)
fout.close()

# EVAL_SEED.md documents 4 frames per video (video_input_4.jsonl, video_imgs_4)
n_frames = 4
# clear frames left over from a previous run before re-extracting
os.system('rm -rf video_imgs_' + str(n_frames))
os.makedirs('video_imgs_' + str(n_frames), exist_ok=True)

fout = open('video_input_{}.jsonl'.format(n_frames), 'w')
v_anno = filter_questions(qa_anno, 'video')
for qa_item in tqdm(v_anno):
    if qa_item['question_type_id'] == 12:
        data_path = dimension12_dir + qa_item['data_id']
    elif qa_item['question_type_id'] == 11:
        data_path = dimension11_dir + qa_item['data_id'].split('/')[-1]
    elif qa_item['question_type_id'] == 10:
        data_path = dimension10_dir + qa_item['data_id']
    else:
        assert False, str(qa_item)
    print(data_path)

    use_pyav = False
    if 'segment' in qa_item.keys():
        segment = qa_item['segment']
        if isinstance(segment[0], int):
            # using pyav for decoding videos in evaluation dimension 12
            use_pyav = True
        start, end = segment[0], segment[1]
    else:
        start = 0.0
        end = 0.0

    if use_pyav:
        # using pyav for decoding videos in evaluation dimension 12
        reader = av.open(data_path)
        frames = [torch.from_numpy(f.to_rgb().to_ndarray()) for f in reader.decode(video=0)]
        video_len = len(frames)
        start_frame, end_frame = start, end
        end_frame = min(end_frame, video_len)
        offset = get_index(end_frame - start_frame, n_frames)
        frame_indices = offset + start_frame
        images = torch.stack([frames[idx] for idx in frame_indices]).numpy()
    else:
        # using decord for decoding videos in evaluation dimension 10-11
        try:
            vr = VideoReader(data_path, num_threads=1, ctx=cpu(0))
            video_len = len(vr)
            fps = vr.get_avg_fps()
            if 'segment' in qa_item.keys():
                # obtain start and end frame for the video segment in evaluation dimension 11
                start_frame = int(min(max(start * fps, 0), video_len - 1))
                end_frame = int(min(max(end * fps, 0), video_len - 1))
                tot_frames =
int(end_frame - start_frame) offset = get_index(tot_frames, n_frames) frame_indices = offset + start_frame else: # sample frames of the video in evaluation dimension 10 frame_indices = get_index(video_len - 1, n_frames) vr.seek(0) images = vr.get_batch(frame_indices).asnumpy() except Exception as e: print(json.dumps({ 'question_id': qa_item['question_id'], 'prompt': "Error" + str(e), 'answer': qa_item['answer'], }), file=fout) continue prompt = '' for i in range(images.shape[0]): data = Image.fromarray(images[i]) img_path = 'video_imgs_{}/{}_{}.jpg'.format(n_frames, qa_item['question_id'], i) data.save(img_path) prompt += '<img>' + img_path + '</img>\n' choices = [qa_item['choice_a'], qa_item['choice_b'], qa_item['choice_c'], qa_item['choice_d']] choice_list = [] for i, c in enumerate(choices): choice_list.append('{}. {}'.format(chr(i + 65), c)) choice_txt = '\n'.join(choice_list) prompt += 'Question: {}\nOptions: {}\nAnswer:'.format(qa_item['question'], choice_txt) print(json.dumps({ 'question_id': qa_item['question_id'], 'prompt': prompt, 'answer': qa_item['answer'], }), file=fout) fout.close() ================================================ FILE: eval_mm/vqa.py ================================================ """Copyright (c) 2022, salesforce.com, inc. All rights reserved. SPDX-License-Identifier: BSD-3-Clause For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause """ __author__ = 'aagrawal' __version__ = '0.9' # Interface for accessing the VQA dataset. # This code is based on the code written by Tsung-Yi Lin for MSCOCO Python API available at the following link: # (https://github.com/pdollar/coco/blob/master/PythonAPI/pycocotools/coco.py). # The following functions are defined: # VQA - VQA class that loads VQA annotation file and prepares data structures. # getQuesIds - Get question ids that satisfy given filter conditions. # getImgIds - Get image ids that satisfy given filter conditions. 
# loadQA - Load questions and answers with the specified question ids.
# showQA - Display the specified questions and answers.
# loadRes - Load result file and create result object.

# Help on each function can be accessed by: "help(VQA.function)"

import copy
import datetime
import json


class VQA:

    def __init__(self, annotation_file=None, question_file=None):
        """Constructor of VQA helper class for reading and visualizing
        questions and answers.

        :param annotation_file (str): location of VQA annotation file
        :param question_file (str): location of VQA question file
        :return:
        """
        # load dataset
        self.dataset = {}
        self.questions = {}
        self.qa = {}
        self.qqa = {}
        self.imgToQA = {}
        if annotation_file is not None and question_file is not None:
            print('loading VQA annotations and questions into memory...')
            time_t = datetime.datetime.utcnow()
            dataset = json.load(open(annotation_file, 'r'))
            questions = json.load(open(question_file, 'r'))
            self.dataset = dataset
            self.questions = questions
            self.createIndex()

    def createIndex(self):
        # create index
        print('creating index...')
        imgToQA = {ann['image_id']: [] for ann in self.dataset['annotations']}
        qa = {ann['question_id']: [] for ann in self.dataset['annotations']}
        qqa = {ann['question_id']: [] for ann in self.dataset['annotations']}
        for ann in self.dataset['annotations']:
            imgToQA[ann['image_id']] += [ann]
            qa[ann['question_id']] = ann
        for ques in self.questions['questions']:
            qqa[ques['question_id']] = ques
        print('index created!')

        # create class members
        self.qa = qa
        self.qqa = qqa
        self.imgToQA = imgToQA

    def info(self):
        """Print information about the VQA annotation file.

        :return:
        """
        for key, value in self.dataset['info'].items():
            print('%s: %s' % (key, value))

    def getQuesIds(self, imgIds=[], quesTypes=[], ansTypes=[]):
        """Get question ids that satisfy given filter conditions. default skips that filter.
:param imgIds (int array) : get question ids for given imgs quesTypes (str array) : get question ids for given question types ansTypes (str array) : get question ids for given answer types :return: ids (int array) : integer array of question ids """ imgIds = imgIds if type(imgIds) == list else [imgIds] quesTypes = quesTypes if type(quesTypes) == list else [quesTypes] ansTypes = ansTypes if type(ansTypes) == list else [ansTypes] if len(imgIds) == len(quesTypes) == len(ansTypes) == 0: anns = self.dataset['annotations'] else: if not len(imgIds) == 0: anns = sum( [ self.imgToQA[imgId] for imgId in imgIds if imgId in self.imgToQA ], [], ) else: anns = self.dataset['annotations'] anns = (anns if len(quesTypes) == 0 else [ann for ann in anns if ann['question_type'] in quesTypes]) anns = (anns if len(ansTypes) == 0 else [ann for ann in anns if ann['answer_type'] in ansTypes]) ids = [ann['question_id'] for ann in anns] return ids def getImgIds(self, quesIds=[], quesTypes=[], ansTypes=[]): """Get image ids that satisfy given filter conditions. default skips that filter. 
:param quesIds (int array) : get image ids for given question ids quesTypes (str array) : get image ids for given question types ansTypes (str array) : get image ids for given answer types :return: ids (int array) : integer array of image ids """ quesIds = quesIds if type(quesIds) == list else [quesIds] quesTypes = quesTypes if type(quesTypes) == list else [quesTypes] ansTypes = ansTypes if type(ansTypes) == list else [ansTypes] if len(quesIds) == len(quesTypes) == len(ansTypes) == 0: anns = self.dataset['annotations'] else: if not len(quesIds) == 0: anns = sum([ self.qa[quesId] for quesId in quesIds if quesId in self.qa ], []) else: anns = self.dataset['annotations'] anns = (anns if len(quesTypes) == 0 else [ann for ann in anns if ann['question_type'] in quesTypes]) anns = (anns if len(ansTypes) == 0 else [ann for ann in anns if ann['answer_type'] in ansTypes]) ids = [ann['image_id'] for ann in anns] return ids def loadQA(self, ids=[]): """Load questions and answers with the specified question ids. :param ids (int array) : integer ids specifying question ids :return: qa (object array) : loaded qa objects """ if type(ids) == list: return [self.qa[id] for id in ids] elif type(ids) == int: return [self.qa[ids]] def showQA(self, anns): """Display the specified annotations. :param anns (array of object): annotations to display :return: None """ if len(anns) == 0: return 0 for ann in anns: quesId = ann['question_id'] print('Question: %s' % (self.qqa[quesId]['question'])) for ans in ann['answers']: print('Answer %d: %s' % (ans['answer_id'], ans['answer'])) def loadRes(self, resFile, quesFile): """Load result file and return a result object. 
        :param resFile (str) : file name of result file
        :param quesFile (str) : file name of question file
        :return: res (obj) : result api object
        """
        res = VQA()
        res.questions = json.load(open(quesFile))
        res.dataset['info'] = copy.deepcopy(self.questions['info'])
        res.dataset['task_type'] = copy.deepcopy(self.questions['task_type'])
        res.dataset['data_type'] = copy.deepcopy(self.questions['data_type'])
        res.dataset['data_subtype'] = copy.deepcopy(
            self.questions['data_subtype'])
        res.dataset['license'] = copy.deepcopy(self.questions['license'])

        print('Loading and preparing results... ')
        time_t = datetime.datetime.utcnow()
        anns = json.load(open(resFile))
        assert type(anns) == list, 'results is not an array of objects'
        annsQuesIds = [ann['question_id'] for ann in anns]
        assert set(annsQuesIds) == set(
            self.getQuesIds()
        ), 'Results do not correspond to current VQA set. Either the results do not have predictions for all question ids in annotation file or there is at least one question id that does not belong to the question ids in the annotation file.'
        for ann in anns:
            quesId = ann['question_id']
            if res.dataset['task_type'] == 'Multiple Choice':
                assert (
                    ann['answer'] in self.qqa[quesId]['multiple_choices']
                ), 'predicted answer is not one of the multiple choices'
            qaAnn = self.qa[quesId]
            ann['image_id'] = qaAnn['image_id']
            ann['question_type'] = qaAnn['question_type']
            ann['answer_type'] = qaAnn['answer_type']
        print('DONE (t=%0.2fs)' %
              ((datetime.datetime.utcnow() - time_t).total_seconds()))

        res.dataset['annotations'] = anns
        res.createIndex()
        return res


================================================
FILE: eval_mm/vqa_eval.py
================================================
"""Copyright (c) 2022, salesforce.com, inc. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause """ # coding=utf-8 __author__ = 'aagrawal' import re # This code is based on the code written by Tsung-Yi Lin for MSCOCO Python API available at the following link: # (https://github.com/tylin/coco-caption/blob/master/pycocoevalcap/eval.py). import sys class VQAEval: def __init__(self, vqa=None, vqaRes=None, n=2): self.n = n self.accuracy = {} self.evalQA = {} self.evalQuesType = {} self.evalAnsType = {} self.vqa = vqa self.vqaRes = vqaRes if vqa is not None: self.params = {'question_id': vqa.getQuesIds()} self.contractions = { 'aint': "ain't", 'arent': "aren't", 'cant': "can't", 'couldve': "could've", 'couldnt': "couldn't", "couldn'tve": "couldn't've", "couldnt've": "couldn't've", 'didnt': "didn't", 'doesnt': "doesn't", 'dont': "don't", 'hadnt': "hadn't", "hadnt've": "hadn't've", "hadn'tve": "hadn't've", 'hasnt': "hasn't", 'havent': "haven't", 'hed': "he'd", "hed've": "he'd've", "he'dve": "he'd've", 'hes': "he's", 'howd': "how'd", 'howll': "how'll", 'hows': "how's", "Id've": "I'd've", "I'dve": "I'd've", 'Im': "I'm", 'Ive': "I've", 'isnt': "isn't", 'itd': "it'd", "itd've": "it'd've", "it'dve": "it'd've", 'itll': "it'll", "let's": "let's", 'maam': "ma'am", 'mightnt': "mightn't", "mightnt've": "mightn't've", "mightn'tve": "mightn't've", 'mightve': "might've", 'mustnt': "mustn't", 'mustve': "must've", 'neednt': "needn't", 'notve': "not've", 'oclock': "o'clock", 'oughtnt': "oughtn't", "ow's'at": "'ow's'at", "'ows'at": "'ow's'at", "'ow'sat": "'ow's'at", 'shant': "shan't", "shed've": "she'd've", "she'dve": "she'd've", "she's": "she's", 'shouldve': "should've", 'shouldnt': "shouldn't", "shouldnt've": "shouldn't've", "shouldn'tve": "shouldn't've", "somebody'd": 'somebodyd', "somebodyd've": "somebody'd've", "somebody'dve": "somebody'd've", 'somebodyll': "somebody'll", 'somebodys': "somebody's", 'someoned': "someone'd", 
"someoned've": "someone'd've", "someone'dve": "someone'd've", 'someonell': "someone'll", 'someones': "someone's", 'somethingd': "something'd", "somethingd've": "something'd've", "something'dve": "something'd've", 'somethingll': "something'll", 'thats': "that's", 'thered': "there'd", "thered've": "there'd've", "there'dve": "there'd've", 'therere': "there're", 'theres': "there's", 'theyd': "they'd", "theyd've": "they'd've", "they'dve": "they'd've", 'theyll': "they'll", 'theyre': "they're", 'theyve': "they've", 'twas': "'twas", 'wasnt': "wasn't", "wed've": "we'd've", "we'dve": "we'd've", 'weve': "we've", 'werent': "weren't", 'whatll': "what'll", 'whatre': "what're", 'whats': "what's", 'whatve': "what've", 'whens': "when's", 'whered': "where'd", 'wheres': "where's", 'whereve': "where've", 'whod': "who'd", "whod've": "who'd've", "who'dve": "who'd've", 'wholl': "who'll", 'whos': "who's", 'whove': "who've", 'whyll': "why'll", 'whyre': "why're", 'whys': "why's", 'wont': "won't", 'wouldve': "would've", 'wouldnt': "wouldn't", "wouldnt've": "wouldn't've", "wouldn'tve": "wouldn't've", 'yall': "y'all", "yall'll": "y'all'll", "y'allll": "y'all'll", "yall'd've": "y'all'd've", "y'alld've": "y'all'd've", "y'all'dve": "y'all'd've", 'youd': "you'd", "youd've": "you'd've", "you'dve": "you'd've", 'youll': "you'll", 'youre': "you're", 'youve': "you've", } self.manualMap = { 'none': '0', 'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'ten': '10', } self.articles = ['a', 'an', 'the'] self.periodStrip = re.compile('(?!<=\d)(\.)(?!\d)') self.commaStrip = re.compile('(\d)(,)(\d)') self.punct = [ ';', r'/', '[', ']', '"', '{', '}', '(', ')', '=', '+', '\\', '_', '-', '>', '<', '@', '`', ',', '?', '!', ] def evaluate(self, quesIds=None): if quesIds == None: quesIds = [quesId for quesId in self.params['question_id']] gts = {} res = {} for quesId in quesIds: gts[quesId] = self.vqa.qa[quesId] res[quesId] = 
self.vqaRes.qa[quesId] # ================================================= # Compute accuracy # ================================================= accQA = [] accQuesType = {} accAnsType = {} print('computing accuracy') step = 0 for quesId in quesIds: resAns = res[quesId]['answer'] resAns = resAns.replace('\n', ' ') resAns = resAns.replace('\t', ' ') resAns = resAns.strip() resAns = self.processPunctuation(resAns) resAns = self.processDigitArticle(resAns) gtAcc = [] gtAnswers = [ans['answer'] for ans in gts[quesId]['answers']] if len(set(gtAnswers)) > 1: for ansDic in gts[quesId]['answers']: ansDic['answer'] = self.processPunctuation( ansDic['answer']) for gtAnsDatum in gts[quesId]['answers']: otherGTAns = [ item for item in gts[quesId]['answers'] if item != gtAnsDatum ] matchingAns = [ item for item in otherGTAns if item['answer'] == resAns ] acc = min(1, float(len(matchingAns)) / 3) gtAcc.append(acc) quesType = gts[quesId]['question_type'] ansType = gts[quesId]['answer_type'] avgGTAcc = float(sum(gtAcc)) / len(gtAcc) accQA.append(avgGTAcc) if quesType not in accQuesType: accQuesType[quesType] = [] accQuesType[quesType].append(avgGTAcc) if ansType not in accAnsType: accAnsType[ansType] = [] accAnsType[ansType].append(avgGTAcc) self.setEvalQA(quesId, avgGTAcc) self.setEvalQuesType(quesId, quesType, avgGTAcc) self.setEvalAnsType(quesId, ansType, avgGTAcc) if step % 100 == 0: self.updateProgress(step / float(len(quesIds))) step = step + 1 self.setAccuracy(accQA, accQuesType, accAnsType) print('Done computing accuracy') def processPunctuation(self, inText): outText = inText for p in self.punct: if (p + ' ' in inText or ' ' + p in inText) or (re.search(self.commaStrip, inText) != None): outText = outText.replace(p, '') else: outText = outText.replace(p, ' ') outText = self.periodStrip.sub('', outText, re.UNICODE) return outText def processDigitArticle(self, inText): outText = [] tempText = inText.lower().split() for word in tempText: word = 
self.manualMap.setdefault(word, word)
            if word not in self.articles:
                outText.append(word)
            else:
                pass
        for wordId, word in enumerate(outText):
            if word in self.contractions:
                outText[wordId] = self.contractions[word]
        outText = ' '.join(outText)
        return outText

    def setAccuracy(self, accQA, accQuesType, accAnsType):
        self.accuracy['overall'] = round(100 * float(sum(accQA)) / len(accQA),
                                         self.n)
        self.accuracy['perQuestionType'] = {
            quesType: round(
                100 * float(sum(accQuesType[quesType])) /
                len(accQuesType[quesType]),
                self.n,
            )
            for quesType in accQuesType
        }
        self.accuracy['perAnswerType'] = {
            ansType: round(
                100 * float(sum(accAnsType[ansType])) /
                len(accAnsType[ansType]), self.n)
            for ansType in accAnsType
        }

    def setEvalQA(self, quesId, acc):
        self.evalQA[quesId] = round(100 * acc, self.n)

    def setEvalQuesType(self, quesId, quesType, acc):
        if quesType not in self.evalQuesType:
            self.evalQuesType[quesType] = {}
        self.evalQuesType[quesType][quesId] = round(100 * acc, self.n)

    def setEvalAnsType(self, quesId, ansType, acc):
        if ansType not in self.evalAnsType:
            self.evalAnsType[ansType] = {}
        self.evalAnsType[ansType][quesId] = round(100 * acc, self.n)

    def updateProgress(self, progress):
        barLength = 20
        status = ''
        if isinstance(progress, int):
            progress = float(progress)
        if not isinstance(progress, float):
            progress = 0
            status = 'error: progress var must be float\r\n'
        if progress < 0:
            progress = 0
            status = 'Halt...\r\n'
        if progress >= 1:
            progress = 1
            status = 'Done...\r\n'
        block = int(round(barLength * progress))
        text = '\rFinished Percent: [{0}] {1}% {2}'.format(
            '#' * block + '-' * (barLength - block), int(progress * 100),
            status)
        sys.stdout.write(text)
        sys.stdout.flush()


================================================
FILE: finetune/ds_config_zero2.json
================================================
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
"type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "none", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: finetune/ds_config_zero3.json ================================================ { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "pin_memory": true }, "offload_param": { "device": "none", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 100, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } ================================================ FILE: finetune/finetune_ds.sh 
================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` GPUS_PER_NODE=8 NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost MASTER_PORT=6001 MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. DATA="path_to_data" DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ --node_rank $NODE_RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT " torchrun $DISTRIBUTED_ARGS finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --bf16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 16 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --gradient_checkpointing True \ --lazy_preprocess True \ --deepspeed finetune/ds_config_zero3.json ================================================ FILE: finetune/finetune_lora_ds.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` GPUS_PER_NODE=8 NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost MASTER_PORT=6001 MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ --node_rank $NODE_RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT " torchrun $DISTRIBUTED_ARGS finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --bf16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 2 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --use_lora \ --gradient_checkpointing \ --deepspeed finetune/ds_config_zero2.json ================================================ FILE: finetune/finetune_lora_single_gpu.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` MODEL="Qwen/Qwen-VL-Chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" export CUDA_VISIBLE_DEVICES=0 python finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --bf16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --gradient_checkpointing \ --use_lora ================================================ FILE: finetune/finetune_qlora_ds.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` GPUS_PER_NODE=8 NNODES=1 NODE_RANK=0 MASTER_ADDR=localhost MASTER_PORT=6001 MODEL="Qwen/Qwen-VL-Chat-Int4" # Qwen/Qwen-VL-Chat-Int4 Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" DISTRIBUTED_ARGS=" --nproc_per_node $GPUS_PER_NODE \ --nnodes $NNODES \ --node_rank $NODE_RANK \ --master_addr $MASTER_ADDR \ --master_port $MASTER_PORT " # Remember to use --fp16 instead of --bf16 due to autogptq torchrun $DISTRIBUTED_ARGS finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --fp16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 2 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --use_lora \ --q_lora \ --gradient_checkpointing \ --deepspeed finetune/ds_config_zero2.json ================================================ FILE: finetune/finetune_qlora_single_gpu.sh ================================================ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 DIR=`pwd` MODEL="Qwen/Qwen-VL-Chat-Int4" # Qwen/Qwen-VL-Chat-Int4 Set the path if you do not want to load from huggingface directly # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # See the section for finetuning in README for more information. 
DATA="path_to_data" export CUDA_VISIBLE_DEVICES=0 # Remember to use --fp16 instead of --bf16 due to autogptq python finetune.py \ --model_name_or_path $MODEL \ --data_path $DATA \ --fp16 True \ --fix_vit True \ --output_dir output_qwen \ --num_train_epochs 5 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 1 \ --gradient_accumulation_steps 8 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1000 \ --save_total_limit 10 \ --learning_rate 1e-5 \ --weight_decay 0.1 \ --adam_beta2 0.95 \ --warmup_ratio 0.01 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --report_to "none" \ --model_max_length 2048 \ --lazy_preprocess True \ --gradient_checkpointing \ --use_lora \ --q_lora \ --deepspeed finetune/ds_config_zero2.json ================================================ FILE: finetune.py ================================================ # This code is based on the revised code from fastchat based on tatsu-lab/stanford_alpaca. from dataclasses import dataclass, field import json import math import logging import os from typing import Dict, Optional, List import torch from torch.utils.data import Dataset from deepspeed import zero from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus import transformers from transformers import Trainer, GPTQConfig, deepspeed from transformers.trainer_pt_utils import LabelSmoother from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training from accelerate.utils import DistributedType IGNORE_TOKEN_ID = LabelSmoother.ignore_index @dataclass class ModelArguments: model_name_or_path: Optional[str] = field(default="Qwen/Qwen-7B") @dataclass class DataArguments: data_path: str = field( default=None, metadata={"help": "Path to the training data."} ) eval_data_path: str = field( default=None, metadata={"help": "Path to the evaluation data."} ) lazy_preprocess: bool = False @dataclass class TrainingArguments(transformers.TrainingArguments): cache_dir: Optional[str] = 
field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=8192,
        metadata={
            "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
        },
    )
    use_lora: bool = False
    fix_vit: bool = True


@dataclass
class LoraArguments:
    lora_r: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_target_modules: List[str] = field(
        default_factory=lambda: ["c_attn", "attn.c_proj", "w1", "w2"] ##["in_proj","out_proj","c_fc"]
    )
    lora_weight_path: str = ""
    lora_bias: str = "none"
    q_lora: bool = False


def maybe_zero_3(param):
    if hasattr(param, "ds_id"):
        assert param.ds_status == ZeroParamStatus.NOT_AVAILABLE
        with zero.GatheredParameters([param]):
            param = param.data.detach().cpu().clone()
    else:
        param = param.detach().cpu().clone()
    return param


# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(named_params, bias):
    if bias == "none":
        to_return = {k: t for k, t in named_params if "lora_" in k}
    elif bias == "all":
        to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
    elif bias == "lora_only":
        to_return = {}
        maybe_lora_bias = {}
        lora_bias_names = set()
        for k, t in named_params:
            if "lora_" in k:
                to_return[k] = t
                bias_name = k.split("lora_")[0] + "bias"
                lora_bias_names.add(bias_name)
            elif "bias" in k:
                maybe_lora_bias[k] = t
        # Keep only the biases that belong to a LoRA-wrapped module.
        # Iterate over items(): iterating the dict directly yields keys only,
        # and each candidate must be checked under its own name.
        for k, t in maybe_lora_bias.items():
            if k in lora_bias_names:
                to_return[k] = t
    else:
        raise NotImplementedError
    to_return = {k: maybe_zero_3(v) for k, v in to_return.items()}
    return to_return


local_rank = None


def rank0_print(*args):
    if local_rank == 0:
        print(*args)


def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str, bias="none"):
    """Collects the state dict and dump to disk."""
    # check if zero3 mode enabled
    if deepspeed.is_deepspeed_zero3_enabled():
        state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    else:
        if trainer.args.use_lora:
            state_dict = get_peft_state_maybe_zero_3(
trainer.model.named_parameters(), bias ) else: state_dict = trainer.model.state_dict() if trainer.args.should_save and trainer.args.local_rank == 0: trainer._save(output_dir, state_dict=state_dict) def preprocess( sources, tokenizer: transformers.PreTrainedTokenizer, max_len: int, system_message: str = "You are a helpful assistant." ) -> Dict: roles = {"user": "<|im_start|>user", "assistant": "<|im_start|>assistant"} im_start = tokenizer.im_start_id im_end = tokenizer.im_end_id nl_tokens = tokenizer('\n').input_ids _system = tokenizer('system').input_ids + nl_tokens _user = tokenizer('user').input_ids + nl_tokens _assistant = tokenizer('assistant').input_ids + nl_tokens # Apply prompt templates input_ids, targets = [], [] for i, source in enumerate(sources): if roles[source[0]["from"]] != roles["user"]: source = source[1:] input_id, target = [], [] system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens input_id += system target += [im_start] + [IGNORE_TOKEN_ID] * (len(system)-3) + [im_end] + nl_tokens assert len(input_id) == len(target) for j, sentence in enumerate(source): role = roles[sentence["from"]] _input_id = tokenizer(role).input_ids + nl_tokens + \ tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens input_id += _input_id if role == '<|im_start|>user': _target = [im_start] + [IGNORE_TOKEN_ID] * (len(_input_id)-3) + [im_end] + nl_tokens elif role == '<|im_start|>assistant': _target = [im_start] + [IGNORE_TOKEN_ID] * len(tokenizer(role).input_ids) + \ _input_id[len(tokenizer(role).input_ids)+1:-2] + [im_end] + nl_tokens else: raise NotImplementedError target += _target assert len(input_id) == len(target) input_id += [tokenizer.pad_token_id] * (max_len - len(input_id)) target += [IGNORE_TOKEN_ID] * (max_len - len(target)) input_ids.append(input_id[:max_len]) targets.append(target[:max_len]) input_ids = torch.tensor(input_ids, dtype=torch.int) targets = torch.tensor(targets, dtype=torch.int) return dict( 
input_ids=input_ids, labels=targets, attention_mask=input_ids.ne(tokenizer.pad_token_id), ) class SupervisedDataset(Dataset): """Dataset for supervised fine-tuning.""" def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int): super(SupervisedDataset, self).__init__() rank0_print("Formatting inputs...") sources = [example["conversations"] for example in raw_data] data_dict = preprocess(sources, tokenizer, max_len) self.input_ids = data_dict["input_ids"] self.labels = data_dict["labels"] self.attention_mask = data_dict["attention_mask"] def __len__(self): return len(self.input_ids) def __getitem__(self, i) -> Dict[str, torch.Tensor]: return dict( input_ids=self.input_ids[i], labels=self.labels[i], attention_mask=self.attention_mask[i], ) class LazySupervisedDataset(Dataset): """Dataset for supervised fine-tuning.""" def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokenizer, max_len: int): super(LazySupervisedDataset, self).__init__() self.tokenizer = tokenizer self.max_len = max_len rank0_print("Formatting inputs...Skip in lazy mode") self.tokenizer = tokenizer self.raw_data = raw_data self.cached_data_dict = {} def __len__(self): return len(self.raw_data) def __getitem__(self, i) -> Dict[str, torch.Tensor]: if i in self.cached_data_dict: return self.cached_data_dict[i] ret = preprocess([self.raw_data[i]["conversations"]], self.tokenizer, self.max_len) ret = dict( input_ids=ret["input_ids"][0], labels=ret["labels"][0], attention_mask=ret["attention_mask"][0], ) self.cached_data_dict[i] = ret return ret def make_supervised_data_module( tokenizer: transformers.PreTrainedTokenizer, data_args, max_len, ) -> Dict: """Make dataset and collator for supervised fine-tuning.""" dataset_cls = ( LazySupervisedDataset if data_args.lazy_preprocess else SupervisedDataset ) rank0_print("Loading data...") train_json = json.load(open(data_args.data_path, "r")) train_dataset = dataset_cls(train_json, tokenizer=tokenizer, max_len=max_len) if 
data_args.eval_data_path:
        eval_json = json.load(open(data_args.eval_data_path, "r"))
        eval_dataset = dataset_cls(eval_json, tokenizer=tokenizer, max_len=max_len)
    else:
        eval_dataset = None

    return dict(train_dataset=train_dataset, eval_dataset=eval_dataset)


def train():
    global local_rank

    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments, LoraArguments)
    )
    (
        model_args,
        data_args,
        training_args,
        lora_args,
    ) = parser.parse_args_into_dataclasses()

    if getattr(training_args, 'deepspeed', None) and getattr(lora_args, 'q_lora', False):
        training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED

    compute_dtype = (
        torch.float16
        if training_args.fp16
        else (torch.bfloat16 if training_args.bf16 else torch.float32)
    )

    local_rank = training_args.local_rank

    device_map = None
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    ddp = world_size != 1
    if lora_args.q_lora:
        device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
        if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
            logging.warning(
                "FSDP and ZeRO3 are not compatible with QLoRA."
) # Set RoPE scaling factor config = transformers.AutoConfig.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, trust_remote_code=True, ) config.use_cache = False # Load model and tokenizer model = transformers.AutoModelForCausalLM.from_pretrained( model_args.model_name_or_path, config=config, cache_dir=training_args.cache_dir, device_map=device_map, trust_remote_code=True, quantization_config=GPTQConfig( bits=4, disable_exllama=True ) if training_args.use_lora and lora_args.q_lora else None, ) if not training_args.use_lora: if training_args.fix_vit and hasattr(model,'transformer') and hasattr(model.transformer,'visual'): model.transformer.visual.requires_grad_(False) if hasattr(model.transformer.visual,'attn_pool'): model.transformer.visual.attn_pool.requires_grad_(True) tokenizer = transformers.AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right", use_fast=False, trust_remote_code=True, ) tokenizer.pad_token_id = tokenizer.eod_id if training_args.use_lora: if lora_args.q_lora or "chat" in model_args.model_name_or_path.lower(): modules_to_save = None else: modules_to_save = ["wte", "lm_head"] lora_config = LoraConfig( r=lora_args.lora_r, lora_alpha=lora_args.lora_alpha, target_modules=lora_args.lora_target_modules, lora_dropout=lora_args.lora_dropout, bias=lora_args.lora_bias, task_type="CAUSAL_LM", modules_to_save=modules_to_save # This argument serves for adding new tokens. 
        )
        if lora_args.q_lora:
            model = prepare_model_for_kbit_training(
                model, use_gradient_checkpointing=training_args.gradient_checkpointing
            )

        model = get_peft_model(model, lora_config)

        if training_args.gradient_checkpointing:
            model.enable_input_require_grads()

    # Load data
    data_module = make_supervised_data_module(
        tokenizer=tokenizer, data_args=data_args, max_len=training_args.model_max_length
    )

    # Start trainer
    trainer = Trainer(
        model=model, tokenizer=tokenizer, args=training_args, **data_module
    )

    trainer.train()
    trainer.save_state()

    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir, bias=lora_args.lora_bias)


if __name__ == "__main__":
    train()


================================================
FILE: openai_api.py
================================================
# coding=utf-8
# Implements API for Qwen-7B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat)
# Usage: python openai_api.py
# Visit http://localhost:8000/docs for the API documentation.

import re
import copy
import json
import time
from argparse import ArgumentParser
from contextlib import asynccontextmanager
from typing import Dict, List, Literal, Optional, Union

import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig


@asynccontextmanager
async def lifespan(app: FastAPI):  # collects GPU memory
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class ModelCard(BaseModel):
    id: str
    object: str = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: str = "owner"
    root: Optional[str] = None
parent: Optional[str] = None permission: Optional[list] = None class ModelList(BaseModel): object: str = "list" data: List[ModelCard] = [] class ChatMessage(BaseModel): role: Literal["user", "assistant", "system", "function"] content: Optional[str] function_call: Optional[Dict] = None class DeltaMessage(BaseModel): role: Optional[Literal["user", "assistant", "system"]] = None content: Optional[str] = None class ChatCompletionRequest(BaseModel): model: str messages: List[ChatMessage] functions: Optional[List[Dict]] = None temperature: Optional[float] = None top_p: Optional[float] = None max_length: Optional[int] = None stream: Optional[bool] = False stop: Optional[List[str]] = None class ChatCompletionResponseChoice(BaseModel): index: int message: ChatMessage finish_reason: Literal["stop", "length", "function_call"] class ChatCompletionResponseStreamChoice(BaseModel): index: int delta: DeltaMessage finish_reason: Optional[Literal["stop", "length"]] class ChatCompletionResponse(BaseModel): model: str object: Literal["chat.completion", "chat.completion.chunk"] choices: List[ Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice] ] created: Optional[int] = Field(default_factory=lambda: int(time.time())) @app.get("/v1/models", response_model=ModelList) async def list_models(): global model_args model_card = ModelCard(id="gpt-3.5-turbo") return ModelList(data=[model_card]) # To work around that unpleasant leading-\n tokenization issue! def add_extra_stop_words(stop_words): if stop_words: _stop_words = [] _stop_words.extend(stop_words) for x in stop_words: s = x.lstrip("\n") if s and (s not in _stop_words): _stop_words.append(s) return _stop_words return stop_words def trim_stop_words(response, stop_words): if stop_words: for stop in stop_words: idx = response.find(stop) if idx != -1: response = response[:idx] return response TOOL_DESC = """{name_for_model}: Call this tool to interact with the {name_for_human} API. 
What is the {name_for_human} API useful for? {description_for_model} Parameters: {parameters}""" REACT_INSTRUCTION = """Answer the following questions as best you can. You have access to the following APIs: {tools_text} Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [{tools_name_text}] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can be repeated zero or more times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin!""" _TEXT_COMPLETION_CMD = object() # # Temporarily, the system role does not work as expected. # We advise that you write the setups for role-play in your query, # i.e., use the user role instead of the system role. # # TODO: Use real system role when the model is ready. # def parse_messages(messages, functions): if all(m.role != "user" for m in messages): raise HTTPException( status_code=400, detail=f"Invalid request: Expecting at least one user message.", ) messages = copy.deepcopy(messages) default_system = "You are a helpful assistant." system = "" if messages[0].role == "system": system = messages.pop(0).content.lstrip("\n").rstrip() if system == default_system: system = "" if functions: tools_text = [] tools_name_text = [] for func_info in functions: name = func_info.get("name", "") name_m = func_info.get("name_for_model", name) name_h = func_info.get("name_for_human", name) desc = func_info.get("description", "") desc_m = func_info.get("description_for_model", desc) tool = TOOL_DESC.format( name_for_model=name_m, name_for_human=name_h, # Hint: You can add the following format requirements in description: # "Format the arguments as a JSON object." # "Enclose the code within triple backticks (`) at the beginning and end of the code." 
description_for_model=desc_m, parameters=json.dumps(func_info["parameters"], ensure_ascii=False), ) tools_text.append(tool) tools_name_text.append(name_m) tools_text = "\n\n".join(tools_text) tools_name_text = ", ".join(tools_name_text) system += "\n\n" + REACT_INSTRUCTION.format( tools_text=tools_text, tools_name_text=tools_name_text, ) system = system.lstrip("\n").rstrip() dummy_thought = { "en": "\nThought: I now know the final answer.\nFinal answer: ", "zh": "\nThought: 我会作答了。\nFinal answer: ", } _messages = messages messages = [] for m_idx, m in enumerate(_messages): role, content, func_call = m.role, m.content, m.function_call if content: content = content.lstrip("\n").rstrip() if role == "function": if (len(messages) == 0) or (messages[-1].role != "assistant"): raise HTTPException( status_code=400, detail=f"Invalid request: Expecting role assistant before role function.", ) messages[-1].content += f"\nObservation: {content}" if m_idx == len(_messages) - 1: messages[-1].content += "\nThought:" elif role == "assistant": if len(messages) == 0: raise HTTPException( status_code=400, detail=f"Invalid request: Expecting role user before role assistant.", ) last_msg = messages[-1].content last_msg_has_zh = len(re.findall(r"[\u4e00-\u9fff]+", last_msg)) > 0 if func_call is None: if functions: content = dummy_thought["zh" if last_msg_has_zh else "en"] + content else: f_name, f_args = func_call["name"], func_call["arguments"] if not content: if last_msg_has_zh: content = f"Thought: 我可以使用 {f_name} API。" else: content = f"Thought: I can use {f_name}." content = f"\n{content}\nAction: {f_name}\nAction Input: {f_args}" if messages[-1].role == "user": messages.append( ChatMessage(role="assistant", content=content.lstrip("\n").rstrip()) ) else: messages[-1].content += content elif role == "user": messages.append( ChatMessage(role="user", content=content.lstrip("\n").rstrip()) ) else: raise HTTPException( status_code=400, detail=f"Invalid request: Incorrect role {role}." 
) query = _TEXT_COMPLETION_CMD if messages[-1].role == "user": query = messages[-1].content messages = messages[:-1] if len(messages) % 2 != 0: raise HTTPException(status_code=400, detail="Invalid request") history = [] # [(Q1, A1), (Q2, A2), ..., (Q_last_turn, A_last_turn)] for i in range(0, len(messages), 2): if messages[i].role == "user" and messages[i + 1].role == "assistant": usr_msg = messages[i].content.lstrip("\n").rstrip() bot_msg = messages[i + 1].content.lstrip("\n").rstrip() if system and (i == len(messages) - 2): usr_msg = f"{system}\n\nQuestion: {usr_msg}" system = "" for t in dummy_thought.values(): t = t.lstrip("\n") if bot_msg.startswith(t) and ("\nAction: " in bot_msg): bot_msg = bot_msg[len(t) :] history.append([usr_msg, bot_msg]) else: raise HTTPException( status_code=400, detail="Invalid request: Expecting exactly one user (or function) role before every assistant role.", ) if system: assert query is not _TEXT_COMPLETION_CMD query = f"{system}\n\nQuestion: {query}" return query, history def parse_response(response): func_name, func_args = "", "" i = response.rfind("\nAction:") j = response.rfind("\nAction Input:") k = response.rfind("\nObservation:") if 0 <= i < j: # If the text has `Action` and `Action input`, if k < j: # but does not contain `Observation`, # then it is likely that `Observation` is omitted by the LLM, # because the output text may have discarded the stop word. response = response.rstrip() + "\nObservation:" # Add it back. 
k = response.rfind("\nObservation:") func_name = response[i + len("\nAction:") : j].strip() func_args = response[j + len("\nAction Input:") : k].strip() if func_name: choice_data = ChatCompletionResponseChoice( index=0, message=ChatMessage( role="assistant", content=response[:i], function_call={"name": func_name, "arguments": func_args}, ), finish_reason="function_call", ) return choice_data z = response.rfind("\nFinal Answer: ") if z >= 0: response = response[z + len("\nFinal Answer: ") :] choice_data = ChatCompletionResponseChoice( index=0, message=ChatMessage(role="assistant", content=response), finish_reason="stop", ) return choice_data # completion mode, not chat mode def text_complete_last_message(history, stop_words_ids): im_start = "<|im_start|>" im_end = "<|im_end|>" prompt = f"{im_start}system\nYou are a helpful assistant.{im_end}" for i, (query, response) in enumerate(history): query = query.lstrip("\n").rstrip() response = response.lstrip("\n").rstrip() prompt += f"\n{im_start}user\n{query}{im_end}" prompt += f"\n{im_start}assistant\n{response}{im_end}" prompt = prompt[: -len(im_end)] _stop_words_ids = [tokenizer.encode(im_end)] if stop_words_ids: for s in stop_words_ids: _stop_words_ids.append(s) stop_words_ids = _stop_words_ids input_ids = torch.tensor([tokenizer.encode(prompt)]).to(model.device) output = model.generate(input_ids, stop_words_ids=stop_words_ids).tolist()[0] output = tokenizer.decode(output, errors="ignore") assert output.startswith(prompt) output = output[len(prompt) :] output = trim_stop_words(output, ["<|endoftext|>", im_end]) print(f"<completion>\n{prompt}\n\n{output}\n</completion>") return output @app.post("/v1/chat/completions", response_model=ChatCompletionResponse) async def create_chat_completion(request: ChatCompletionRequest): global model, tokenizer stop_words = add_extra_stop_words(request.stop) if request.functions: stop_words = stop_words or [] if "Observation:" not in stop_words: stop_words.append("Observation:") query, 
history = parse_messages(request.messages, request.functions) if request.stream: if request.functions: raise HTTPException( status_code=400, detail="Invalid request: Function calling is not yet implemented for stream mode.", ) # generate = predict(query, history, request.model, stop_words) # return EventSourceResponse(generate, media_type="text/event-stream") raise HTTPException(status_code=400, detail="Stream request is not supported currently.") stop_words_ids = [tokenizer.encode(s) for s in stop_words] if stop_words else None if query is _TEXT_COMPLETION_CMD: response = text_complete_last_message(history, stop_words_ids=stop_words_ids) else: response, _ = model.chat( tokenizer, query, history=history, stop_words_ids=stop_words_ids, append_history=False, top_p=request.top_p, temperature=request.temperature, ) print(f"<chat>\n{history}\n{query}\n\n{response}\n</chat>") response = trim_stop_words(response, stop_words) if request.functions: choice_data = parse_response(response) else: choice_data = ChatCompletionResponseChoice( index=0, message=ChatMessage(role="assistant", content=response), finish_reason="stop", ) return ChatCompletionResponse( model=request.model, choices=[choice_data], object="chat.completion" ) async def predict( query: str, history: List[List[str]], model_id: str, stop_words: List[str] ): global model, tokenizer choice_data = ChatCompletionResponseStreamChoice( index=0, delta=DeltaMessage(role="assistant"), finish_reason=None ) chunk = ChatCompletionResponse( model=model_id, choices=[choice_data], object="chat.completion.chunk" ) yield "{}".format(chunk.model_dump_json(exclude_unset=True)) current_length = 0 stop_words_ids = [tokenizer.encode(s) for s in stop_words] if stop_words else None if stop_words: # TODO: It's a little bit tricky to trim stop words in the stream mode. 
raise HTTPException( status_code=400, detail="Invalid request: custom stop words are not yet supported for stream mode.", ) response_generator = model.chat_stream( tokenizer, query, history=history, stop_words_ids=stop_words_ids ) for new_response in response_generator: if len(new_response) == current_length: continue new_text = new_response[current_length:] current_length = len(new_response) choice_data = ChatCompletionResponseStreamChoice( index=0, delta=DeltaMessage(content=new_text), finish_reason=None ) chunk = ChatCompletionResponse( model=model_id, choices=[choice_data], object="chat.completion.chunk" ) yield "{}".format(chunk.model_dump_json(exclude_unset=True)) choice_data = ChatCompletionResponseStreamChoice( index=0, delta=DeltaMessage(), finish_reason="stop" ) chunk = ChatCompletionResponse( model=model_id, choices=[choice_data], object="chat.completion.chunk" ) yield "{}".format(chunk.model_dump_json(exclude_unset=True)) yield "[DONE]" def _get_args(): parser = ArgumentParser() parser.add_argument( "-c", "--checkpoint-path", type=str, default="QWen/QWen-7B-Chat", help="Checkpoint name or path, default to %(default)r", ) parser.add_argument( "--cpu-only", action="store_true", help="Run demo with CPU only" ) parser.add_argument( "--server-port", type=int, default=8000, help="Demo server port." ) parser.add_argument( "--server-name", type=str, default="127.0.0.1", help="Demo server name. Default: 127.0.0.1, which is only visible from the local computer." 
" If you want other computers to access your server, use 0.0.0.0 instead.", ) args = parser.parse_args() return args if __name__ == "__main__": args = _get_args() tokenizer = AutoTokenizer.from_pretrained( args.checkpoint_path, trust_remote_code=True, resume_download=True, ) if args.cpu_only: device_map = "cpu" else: device_map = "auto" model = AutoModelForCausalLM.from_pretrained( args.checkpoint_path, device_map=device_map, trust_remote_code=True, resume_download=True, ).eval() model.generation_config = GenerationConfig.from_pretrained( args.checkpoint_path, trust_remote_code=True, resume_download=True, ) uvicorn.run(app, host=args.server_name, port=args.server_port, workers=1) ================================================ FILE: requirements.txt ================================================ transformers==4.32.0 accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib ================================================ FILE: requirements_openai_api.txt ================================================ fastapi uvicorn openai pydantic sse_starlette ================================================ FILE: requirements_web_demo.txt ================================================ gradio modelscope ================================================ FILE: touchstone/README.md ================================================ <img src="../assets/touchstone_logo.png" width="300"/> <a href="../touchstone/README_CN.md">中文</a> ｜ English ｜ <a href="../touchstone/README_JA.md">日本語</a>｜ <a href="../touchstone/README_KO.md">한국어</a> **TOUCHSTONE** is a comprehensive assessment of multimodal language models, encompassing not only basic recognition and comprehension but also extending to literary creation. 
By automating the evaluation process and converting multimodal information into text, TouchStone allows efficient and accurate assessment of dialogue quality, leveraging advanced language models without manual intervention.

## DATASET

To evaluate the abilities of large vision-language models (LVLMs), we construct a diverse and comprehensive dataset that covers five key dimensions: basic descriptive ability, visual recognition ability, visual comprehension ability, visual storytelling ability, and multi-image analysis ability.

- **Basic Descriptive Ability** Image description tests a model's ability to describe the information contained in an image, with both simple and detailed descriptions. Simple descriptions are typically short phrases that capture the main subject and action of the image, while detailed descriptions provide more in-depth information about the image scene, its attributes, and its relationships.
- **Visual Recognition Ability** Image recognition is the task of recognizing objects or scenes within an image and inferring relevant information. It can be divided into several sub-tasks: attribute QA, movie/TV recognition, art recognition, landmark recognition, celebrity recognition, emotion recognition, text recognition, object recognition, and structure content recognition.
- **Visual Comprehension Ability** Image understanding tests a model's ability to understand the meaning of an image and complete associated tasks. It encompasses sub-tasks such as style appreciation, abstract image understanding, meme understanding, image analysis, chart analysis, general problem-solving, and reasoning QA.
- **Visual Storytelling Ability** Visual storytelling is literary creation based on visual content, including writing emails, poetry, stories, and ads/commodity recommendations, as well as brainstorming.
- **Multi-Image Analysis Ability** Multi-image analysis is the task of analyzing and comparing multiple images. It includes tasks such as comparing two or more images, summarizing information across multiple images, comparing commodities, and step-by-step analysis of images.

<img src="../assets/touchstone_datasets.jpg" width="600"/>

We evaluate model ability comprehensively along these five dimensions. The figure above shows examples of the 27 subtasks. From perception to cognition to creativity, the demands on the model rise as the difficulty increases. Currently, LVLM capabilities are still at an early stage. Our dataset contains 800+ questions across 27 categories.

## Methods

We apply a powerful LLM as a judge to enable automated evaluation. So that the judge can comprehend the contents of an image, we manually replace the actual image input with fine-grained textual annotations. By feeding these annotations and the corresponding questions to a powerful LLM such as GPT4, we obtain reference answers. For the evaluation of the LVLMs, we provide the actual images and questions as input and collect their answers. Finally, we employ GPT4 to score the answers generated by the LVLMs based on the fine-grained annotations and questions. The scoring instructions require the judge to assess the usefulness, relevance, and accuracy of the answers, treating the annotations as the content of the images. To ensure fairness in the evaluation, each model's answer is compared against the same GPT4 reference answer. The model's average score over all questions is taken as its final score. To mitigate any bias introduced by answer position, we run a second scoring round with the positions of the two answers swapped and average the two scores obtained.

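
The position-swapped scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TouchStone implementation: `judge` is a hypothetical callable standing in for a request to the judging LLM, assumed to return one score per answer in the order the answers were presented. `debiased_score`, `final_score`, and `toy_judge` are illustrative names as well.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# (question, annotation, first_answer, second_answer)
#   -> (score_of_first_answer, score_of_second_answer)
Judge = Callable[[str, str, str, str], Tuple[float, float]]


def debiased_score(judge: Judge, question: str, annotation: str,
                   reference: str, candidate: str) -> float:
    """Score `candidate` twice with the answer positions swapped, then average."""
    # Round 1: reference answer first, candidate answer second.
    _, cand_as_second = judge(question, annotation, reference, candidate)
    # Round 2: candidate answer first, reference answer second.
    cand_as_first, _ = judge(question, annotation, candidate, reference)
    return (cand_as_first + cand_as_second) / 2


def final_score(judge: Judge,
                samples: Sequence[Tuple[str, str, str, str]]) -> float:
    """Average the debiased score over (question, annotation, reference, candidate) samples."""
    return mean(debiased_score(judge, q, a, ref, cand) for q, a, ref, cand in samples)


def toy_judge(question: str, annotation: str,
              first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in with a deliberate position bias: the first answer gets +1."""
    def base(answer: str) -> float:
        return 8.0 if annotation in answer else 4.0

    return base(first) + 1.0, base(second)


if __name__ == "__main__":
    samples = [
        ("What animal is shown?", "a cat", "It shows a cat.", "It shows a cat on a sofa."),
    ]
    print(final_score(toy_judge, samples))
```

With the toy judge, which adds one point to whichever answer comes first, the candidate scores 9.0 in one round and 8.0 in the other, so the script prints 8.5: the positional bonus is split evenly instead of depending on where the answer happened to be placed.
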
<img src="../assets/touchstone_eval.png" width="600"/>

### Evaluation

#### Evaluation in English-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### Evaluation in Chinese-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: touchstone/README_CN.md
================================================
<img src="../assets/touchstone_logo.png" width="300"/>

中文｜ <a href="../touchstone/README.md">English</a> ｜ <a href="../touchstone/README_JA.md">日本語</a>

**TOUCHSTONE** 是一种针对多模态语言模型（LVLM）的自动化综合评估方法，评估不仅包括基本的认知和理解，还延伸到文学创作。通过人类注解将多模态信息转换为文本，我们的 TouchStone 可以利用SOTA的语言模型来自动化地完成对LVLMs的多模态对话质量评估。

## 数据集

为了评估 LVLMs 的能力，我们构建了一个多样化且全面的数据集，涵盖五个关键维度：基本描述能力、视觉识别能力、视觉理解能力、视觉叙事能力和多图分析能力。

- **基本描述能力** 图像描述考验模型总结图片信息的能力，包括简单描述和详细描述。简单描述通常是描述图像的主要内容和关系的简短短语，而详细描述则提供有关图像场景、其属性和关系的更深入的信息。
- **视觉识别能力** 图像识别考察模型提取图像中内容的属性以及关联到知识库的能力。为了考察这方面能力，测试的问题包括属性QA、影视识别、艺术识别、地标识别、名人识别、情感识别、文本识别、物体识别和结构内容识别。
- **视觉理解能力** 图像理解需要模型理解图像内容并完成推理进行相关任务。这方面包含了例如风格欣赏、抽象图像理解、模因理解、图像分析、图表分析、一般问题解决和推理问答等任务。
- **视觉叙事能力** 视觉叙事能力是基于视觉内容的文学创作能力，包括撰写电子邮件、诗歌、故事、广告/商品推荐、头脑风暴等。
- **多图分析能力** 多图分析是分析和比较多幅图像的任务。该领域包括比较两个/多个图像、总结多个图像信息、比较商品以及逐步分析图像等任务。

<img src="../assets/touchstone_datasets.jpg" width="600"/>

我们从五个维度综合评估了模型的能力。如上图所示，给出了27个子任务的示例。从感知到认知，再到创造力，随着难度的增加，对模型的要求也越来越高。目前，LVLM的能力还处于早期阶段。我们的数据集包含800+道题目、27个类别。

## 测评方式

我们应用SOTA的LLM进行自动化评估。为了有效地理解图像的内容，我们人工用细粒度的文本注释替换实际的图像输入。通过将这些注释和相应的问题输入到像GPT4这样强LLM中，我们可以获得参考答案。对于待测评的LVLM，我们提供实际图像和问题作为输入并获得各自的答案。最后，我们使用GPT4根据细粒度注释和问题对LVLM生成的答案进行评分。评分指令要求模型评估答案的有用性、相关性和准确性，并将人工注解视为图像的内容。为了确保评估的公平性，每个模型的答案都会与 GPT4生成的参考答案进行比较。模型在所有问题上的平均得分作为最终得分。为了消除答案位置的影响，我们通过交换答案的位置来进行第二轮评分，然后计算获得的两次分数的平均值。

<img src="../assets/touchstone_eval.png" width="600"/>

## 测评结果

#### 英文版本测评

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### 中文版本测评

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: touchstone/README_JA.md
================================================
<img src="../assets/touchstone_logo.png" width="300"/>

<a href="../touchstone/README_CN.md">中文</a> ｜ <a href="../touchstone/README.md">English</a>｜日本語

**TOUCHSTONE** は、マルチモーダル言語モデルの包括的な評価であり、基本的な認識や理解だけでなく、文学的な創作にまで及びます。評価プロセスを自動化し、マルチモーダル情報をテキストに変換することで、私達の TouchStone は、人手を介することなく高度な言語モデルの力を活用し、対話の質を効率的かつ正確に評価することができます。

## DATASET

LVLMの能力を評価するために、基本的な記述能力、視覚認識能力、視覚理解能力、視覚ストーリーテリング能力、複数画像解析能力の5つの主要な次元をカバーする多様で包括的なデータセットを構築する。

- **基本的描写力** 画像記述には、単純な記述と詳細な記述を含め、画像に含まれる情報を記述するモデルの能力が含まれる。単純な記述は、通常、画像の主な主題とアクションを記述する短いフレーズであり、詳細な記述は、画像のシーン、それらの属性、および関係についてのより詳細な情報を提供します。
- **視覚認識能力** 画像認識とは、画像内のオブジェクトやシーンを認識し、関連情報を推論するタスクである。この分野はさらに、属性QA、映画/テレビ認識、アート認識、ランドマーク認識、有名人認識、感情認識、テキスト認識、オブジェクト認識、構造コンテンツ認識など、いくつかのサブタスクに分けることができる。
- **視覚理解能力** 画像理解とは、モデルが画像の意味や関連するタスクを理解する能力のことである。この分野には、スタイル理解、抽象画像理解、ミーム理解、画像分析、チャート分析、一般的な問題解決、推論QAなど、いくつかのサブタスクが含まれる。
- **視覚的ストーリーテリング能力** ビジュアルストーリーテリング能力とは、メール、詩、物語、広告／商品推薦、ブレーンストーミングの執筆など、ビジュアルコンテンツに基づいた文学創作のプロセスである。
- **マルチ画像解析能力** 複数画像解析とは、複数の画像を解析・比較する作業である。この分野には、2つまたは複数の画像を比較する、複数の画像情報を要約する、商品を比較する、画像を段階的に分析するなどのタスクが含まれます。

<img src="../assets/touchstone_datasets.jpg" width="600"/>

モデルの能力を 5 つの次元から総合的に評価する。上図のように、27 のサブタスクの例を示す。知覚から認知、創造性まで、難易度が上がるにつれて、モデルに求められる要件もどんどん高くなっている。現在、LVLM の機能は初期段階にある。我々のデータセットには 800 以上の質問と 27 のカテゴリーが含まれている。

## 方法

自動評価を可能にするために、強力な LLM を判定器として適用する。画像の内容を効果的に理解するために、実際の画像入力をきめ細かいテキスト注釈に手動で置き換える。これらの注釈と対応する質問を GPT4 のような強力な LLM に入力することで、参照解答を得る。
LVLMの評価には、実際の画像と質問を入力として与え、それぞれの回答を得る。最後に、GPT4を用いて、LVLMが生成した回答を、細かいアノテーションと質問に基づいてスコアリングする。スコアリングの指示は、注釈を画像の内容とみなして、回答の有用性、関連性、正確性を評価するようモデルに要求する。評価の公平性を確保するため、各モデルの回答はGPT4の一貫した参照回答と比較されます。全問題におけるモデルの平均スコアを最終スコアとする。解答位置の影響を排除するために、解答位置を入れ替えて2回目の採点ラウンドを行い、得られた2つのスコアの平均を計算します。このアプローチは、解答の配置によって生じるバイアスを軽減することを目的としています。

<img src="../assets/touchstone_eval.png" width="600"/>

### 評価

#### 英語ベースのマルチモーダル対話における評価

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### 中国語ベースのマルチモーダル対話における評価

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: touchstone/README_KO.md
================================================
<img src="../assets/touchstone_logo.png" width="300"/>

<a href="../touchstone/README_CN.md">中文</a> ｜ <a href="../touchstone/README.md">English</a> ｜ <a href="../touchstone/README_JA.md">日本語</a> ｜ 한국어

**터치스톤, TOUCHSTONE**은 기본적인 인식과 이해력뿐만 아니라 문학 창작까지 아우르는 종합적인 멀티모달 언어 모델 평가입니다. 평가 프로세스를 자동화하고 멀티모달 정보를 텍스트로 변환하는 터치스톤은 수동 개입 없이도 고급 언어 모델의 성능을 활용하여 대화 품질을 효율적이고 정확하게 평가할 수 있도록 지원합니다.

## DATASET

LVLM의 능력을 평가하기 위해 기본 설명 능력, 시각 인식 능력, 시각 이해 능력, 시각 스토리텔링 능력, 다중 이미지 분석 능력 등 5가지 주요 차원을 포괄하는 다양하고 광범위한 데이터 세트를 구축합니다.

- **기본 설명 능력, Basic Descriptive Ability** 이미지 설명에는 단순 설명과 상세 설명을 포함하여 이미지에 포함된 정보를 설명하는 모델의 능력이 포함됩니다. 단순 설명은 일반적으로 이미지의 주요 주제와 동작을 설명하는 짧은 문구이며, 상세 설명은 이미지 장면, 속성 및 관계에 대한 보다 심층적인 정보를 제공합니다.
- **시각적 인식 능력, Visual Recognition Ability** 이미지 인식은 이미지 내의 사물이나 장면을 인식하고 관련 정보를 추론하는 작업입니다. 이 영역은 속성 QA, 영화/TV 인식, 예술 인식, 랜드마크 인식, 유명인 인식, 감정 인식, 텍스트 인식, 사물 인식, 구조물 내용 인식 등 여러 하위 작업으로 세분화할 수 있습니다.
- **시각적 이해 능력, Visual Comprehension Ability** 이미지 이해에는 이미지의 의미와 관련 작업을 이해하는 모델의 능력이 포함됩니다. 이 영역에는 스타일 감상, 추상적 이미지 이해, 밈 이해, 이미지 분석, 차트 분석, 일반적인 문제 해결, 추론 QA와 같은 여러 하위 작업이 포함됩니다.
- **시각적 스토리텔링 능력, Visual Storytelling Ability** 시각적 스토리텔링 능력은 이메일, 시, 스토리, 광고/상품 추천, 브레인스토밍 등 시각적 콘텐츠를 기반으로 문학적 창작을 하는 과정입니다.
- **다중 이미지 분석 능력, Multi-Image Analysis Ability** 다중 이미지 분석은 여러 이미지를 분석하고 비교하는 작업입니다. 이 영역에는 두 개/여러 개의 이미지 비교, 여러 이미지 정보 요약, 상품 비교, 이미지의 단계별 분석 등의 작업이 포함됩니다.

<img src="../assets/touchstone_datasets.jpg" width="600"/>

5가지 측면에서 모델의 능력을 종합적으로 평가합니다. 위 그림과 같이 27개의 하위 과제를 예로 들었습니다. 지각부터 인지, 창의력까지 난이도가 높아질수록 모델에 대한 요구 사항도 점점 더 높아지고 있습니다. 현재 LVLM 기능은 초기 단계에 있습니다. 데이터 세트에는 800개 이상의 질문과 27개 카테고리가 포함되어 있습니다.

## Methods

당사는 자동화된 평가를 위해 강력한 LLM을 심사자로 적용합니다. 이미지의 내용을 효과적으로 이해하기 위해 실제 이미지 입력을 세분화된 텍스트 주석으로 수동으로 대체합니다. 이러한 주석과 해당 질문을 GPT4와 같은 강력한 LLM에 입력하면 참조 답변을 얻을 수 있습니다. LVLM의 평가를 위해 실제 이미지와 질문을 입력으로 제공하고 각각의 답변을 얻습니다. 마지막으로, 세분화된 주석과 질문을 기반으로 LVLM이 생성한 답변에 GPT4를 사용하여 점수를 매깁니다. 채점 지침에 따라 모델은 주석을 이미지의 콘텐츠로 간주하여 답변의 유용성, 관련성 및 정확성을 평가해야 합니다. 평가의 공정성을 보장하기 위해 각 모델의 답변은 GPT4의 일관된 참조 답변과 비교됩니다. 모든 문제에서 모델의 평균 점수가 최종 점수로 사용됩니다. 답안 위치의 영향을 제거하기 위해 답안 위치를 바꿔서 두 번째 채점 라운드를 수행한 다음 얻은 두 점수의 평균을 계산합니다. 이 접근 방식은 답안 배치로 인해 발생하는 편향을 완화하는 것을 목표로 합니다.

<img src="../assets/touchstone_eval.png" width="600"/>

### Evaluation

#### Evaluation in English-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

#### Evaluation in Chinese-based Multimodal Dialogue

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

================================================
FILE: web_demo_mm.py
================================================
# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""A simple web interactive chat demo based on gradio.""" from argparse import ArgumentParser from pathlib import Path import copy import gradio as gr import os import re import secrets import tempfile from modelscope import ( snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig ) DEFAULT_CKPT_PATH = 'qwen/Qwen-VL-Chat' BOX_TAG_PATTERN = r"<box>([\s\S]*?)</box>" PUNCTUATION = "！？。＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏." def _get_args(): parser = ArgumentParser() parser.add_argument("-c", "--checkpoint-path", type=str, default=DEFAULT_CKPT_PATH, help="Checkpoint name or path, default to %(default)r") parser.add_argument("--cpu-only", action="store_true", help="Run demo with CPU only") parser.add_argument("--share", action="store_true", default=False, help="Create a publicly shareable link for the interface.") parser.add_argument("--inbrowser", action="store_true", default=False, help="Automatically launch the interface in a new tab on the default browser.") parser.add_argument("--server-port", type=int, default=8000, help="Demo server port.") parser.add_argument("--server-name", type=str, default="127.0.0.1", help="Demo server name.") args = parser.parse_args() return args def _load_model_tokenizer(args): tokenizer = AutoTokenizer.from_pretrained( args.checkpoint_path, trust_remote_code=True, resume_download=True, revision='master', ) if args.cpu_only: device_map = "cpu" else: device_map = "cuda" model = AutoModelForCausalLM.from_pretrained( args.checkpoint_path, device_map=device_map, trust_remote_code=True, resume_download=True, revision='master', ).eval() model.generation_config = GenerationConfig.from_pretrained( args.checkpoint_path, trust_remote_code=True, resume_download=True, revision='master', ) return model, tokenizer def _parse_text(text): lines = text.split("\n") lines = [line for line in lines if line != ""] count = 0 for i, line in enumerate(lines): if "```" in line: count += 1 items = line.split("`") 
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = f"<br></code></pre>"
        else:
            if i > 0:
                if count % 2 == 1:
                    line = line.replace("`", r"\`")
                    line = line.replace("<", "&lt;")
                    line = line.replace(">", "&gt;")
                    line = line.replace(" ", "&nbsp;")
                    line = line.replace("*", "&ast;")
                    line = line.replace("_", "&lowbar;")
                    line = line.replace("-", "&#45;")
                    line = line.replace(".", "&#46;")
                    line = line.replace("!", "&#33;")
                    line = line.replace("(", "&#40;")
                    line = line.replace(")", "&#41;")
                    line = line.replace("$", "&#36;")
                lines[i] = "<br>" + line
    text = "".join(lines)
    return text


def _remove_image_special(text):
    text = text.replace('<ref>', '').replace('</ref>', '')
    return re.sub(r'<box>.*?(</box>|$)', '', text)


def _launch_demo(args, model, tokenizer):
    uploaded_file_dir = os.environ.get("GRADIO_TEMP_DIR") or str(
        Path(tempfile.gettempdir()) / "gradio"
    )

    def predict(_chatbot, task_history):
        chat_query = _chatbot[-1][0]
        query = task_history[-1][0]
        print("User: " + _parse_text(query))
        history_cp = copy.deepcopy(task_history)
        full_response = ""

        history_filter = []
        pic_idx = 1
        pre = ""
        for i, (q, a) in enumerate(history_cp):
            if isinstance(q, (tuple, list)):
                q = f'Picture {pic_idx}: <img>{q[0]}</img>'
                pre += q + '\n'
                pic_idx += 1
            else:
                pre += q
                history_filter.append((pre, a))
                pre = ""
        history, message = history_filter[:-1], history_filter[-1][0]
        # response, history = model.chat(tokenizer, message, history=history)
        for response in model.chat_stream(tokenizer, message, history=history):
            _chatbot[-1] = (_parse_text(chat_query), _remove_image_special(_parse_text(response)))

            yield _chatbot
            full_response = _parse_text(response)

        response = full_response
        history.append((message, response))
        image = tokenizer.draw_bbox_on_latest_picture(response, history)
        if image is not None:
            temp_dir = secrets.token_hex(20)
            temp_dir = Path(uploaded_file_dir) / temp_dir
            temp_dir.mkdir(exist_ok=True, parents=True)
            name = f"tmp{secrets.token_hex(5)}.jpg"
            filename = temp_dir / name
            image.save(str(filename))
_chatbot.append((None, (str(filename),))) else: _chatbot[-1] = (_parse_text(chat_query), response) # full_response = _parse_text(response) task_history[-1] = (query, full_response) print("Qwen-VL-Chat: " + _parse_text(full_response)) yield _chatbot def regenerate(_chatbot, task_history): if not task_history: return _chatbot item = task_history[-1] if item[1] is None: return _chatbot task_history[-1] = (item[0], None) chatbot_item = _chatbot.pop(-1) if chatbot_item[0] is None: _chatbot[-1] = (_chatbot[-1][0], None) else: _chatbot.append((chatbot_item[0], None)) return predict(_chatbot, task_history) def add_text(history, task_history, text): task_text = text if len(text) >= 2 and text[-1] in PUNCTUATION and text[-2] not in PUNCTUATION: task_text = text[:-1] history = history + [(_parse_text(text), None)] task_history = task_history + [(task_text, None)] return history, task_history, "" def add_file(history, task_history, file): history = history + [((file.name,), None)] task_history = task_history + [((file.name,), None)] return history, task_history def reset_user_input(): return gr.update(value="") def reset_state(task_history): task_history.clear() return [] with gr.Blocks() as demo: gr.Markdown("""\ <img src="https://modelscope.cn/api/v1/models/qwen/Qwen-7B-Chat/repo? Revision=master&FilePath=assets/logo.jpeg&View=true" style="height: 80px"/>""") gr.Markdown("""<center>Qwen-VL-Chat Bot</center>""") gr.Markdown( """\ <center>This WebUI is based on Qwen-VL-Chat, developed by Alibaba Cloud. 
\ (本WebUI基于Qwen-VL-Chat打造，实现聊天机器人功能。)</center>""") gr.Markdown("""\ <center>Qwen-VL <a href="https://modelscope.cn/models/qwen/Qwen-VL/summary">🤖 </a> | <a href="https://huggingface.co/Qwen/Qwen-VL">🤗</a> ｜ Qwen-VL-Chat <a href="https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary">🤖 </a> | <a href="https://huggingface.co/Qwen/Qwen-VL-Chat">🤗</a> ｜ <a href="https://github.com/QwenLM/Qwen-VL">Github</a></center>""") chatbot = gr.Chatbot(label='Qwen-VL-Chat', elem_classes="control-height", height=750) query = gr.Textbox(lines=2, label='Input') task_history = gr.State([]) with gr.Row(): empty_bin = gr.Button("🧹 Clear History (清除历史)") submit_btn = gr.Button("🚀 Submit (发送)") regen_btn = gr.Button("🤔️ Regenerate (重试)") addfile_btn = gr.UploadButton("📁 Upload (上传文件)", file_types=["image"]) submit_btn.click(add_text, [chatbot, task_history, query], [chatbot, task_history]).then( predict, [chatbot, task_history], [chatbot], show_progress=True ) submit_btn.click(reset_user_input, [], [query]) empty_bin.click(reset_state, [task_history], [chatbot], show_progress=True) regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True) addfile_btn.upload(add_file, [chatbot, task_history, addfile_btn], [chatbot, task_history], show_progress=True) gr.Markdown("""\ Note: This demo is governed by the original license of Qwen-VL. \ We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, \ including hate speech, violence, pornography, deception, etc. \ (注：本演示受Qwen-VL的许可协议限制。我们强烈建议，用户不应传播及不应允许他人传播以下内容，\ 包括但不限于仇恨言论、暴力、色情、欺诈相关的有害信息。)""") demo.queue().launch( share=args.share, inbrowser=args.inbrowser, server_port=args.server_port, server_name=args.server_name, ) def main(): args = _get_args() model, tokenizer = _load_model_tokenizer(args) _launch_demo(args, model, tokenizer) if __name__ == '__main__': main()