Repository: zai-org/CogVLM
Branch: main
Commit: f7283b2c8d26
Files: 50
Total size: 370.8 KB
Directory structure:
gitextract_7gke4omx/
├── .deepspeed_env
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.yaml
│ │ └── feature-request.yaml
│ └── PULL_REQUEST_TEMPLATE/
│ └── pr_template.md
├── .gitignore
├── LICENSE
├── MODEL_LICENSE
├── README.md
├── README_zh.md
├── assets/
│ └── WECHAT.md
├── basic_demo/
│ ├── cli_demo_hf.py
│ ├── cli_demo_sat.py
│ └── web_demo.py
├── composite_demo/
│ ├── client.py
│ ├── conversation.py
│ ├── demo_agent_cogagent.py
│ ├── demo_chat_cogagent.py
│ ├── demo_chat_cogvlm.py
│ ├── main.py
│ └── utils.py
├── dataset.md
├── dataset_zh.md
├── finetune_demo/
│ ├── evaluate_cogagent.sh
│ ├── evaluate_cogagent_demo.py
│ ├── evaluate_cogvlm.sh
│ ├── evaluate_cogvlm_demo.py
│ ├── finetune_cogagent_demo.py
│ ├── finetune_cogagent_lora.sh
│ ├── finetune_cogvlm_demo.py
│ ├── finetune_cogvlm_lora.sh
│ └── test_config_bf16.json
├── openai_demo/
│ ├── openai_api.py
│ └── openai_api_request.py
├── requirements.txt
└── utils/
├── __init__.py
├── merge_model.py
├── models/
│ ├── __init__.py
│ ├── cogagent_model.py
│ ├── cogvlm_model.py
│ ├── eva_clip_L_hf.py
│ ├── eva_clip_model.py
│ └── mixin.py
├── split_dataset.py
└── utils/
├── __init__.py
├── chat.py
├── dataset.py
├── grounding_parser.py
├── language.py
├── template.py
└── vision.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .deepspeed_env
================================================
SAT_HOME=~/.sat_models
LOCAL_WORLD_SIZE=8
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.yaml
================================================
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve ChatGLM3 / 提交一个 Bug 问题报告来帮助我们改进 ChatGLM3
body:
- type: textarea
id: system-info
attributes:
label: System Info / 系統信息
description: Your operating environment / 您的运行环境信息
placeholder: Includes Cuda version, Transformers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本,Transformers版本,Python版本,操作系统,硬件信息(如果您怀疑是硬件方面的问题)...
validations:
required: true
- type: textarea
id: who-can-help
attributes:
label: Who can help? / 谁可以帮助到您?
description: |
Your issue will be replied to more quickly if you can figure out the right person to tag with @
All issues are read by one of the maintainers, so if you don't know who to tag, just leave this blank and our maintainer will ping the right person.
Please tag fewer than 3 people.
如果您能找到合适的标签 @,您的问题会更快得到回复。
所有问题都会由我们的维护者阅读,如果您不知道该标记谁,只需留空,我们的维护人员会找到合适的开发组成员来解决问题。
标记的人数应该不超过 1 个人。
Related demo leader / 相关demo负责人 :
- finetune_demo: @1049451037
- composite_demo: @zR
- openai_demo: @zR
If it's not a bug in these three subsections, you may not specify the helper. Our maintainer will find the right person in the development group to solve the problem.
如果不是这三个子版块的bug,您可以不指明帮助者,我们的维护人员会找到合适的开发组成员来解决问题。
placeholder: "@Username ..."
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information / 问题信息
description: 'The problem arises when using: / 问题出现在'
options:
- label: "The official example scripts / 官方的示例脚本"
- label: "My own modified scripts / 我自己修改的脚本和任务"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction / 复现过程
description: |
Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
If you have code snippets, error messages, stack traces, please provide them here as well.
Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
如果您有代码片段、错误信息、堆栈跟踪,也请在此提供。
请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
请勿使用截图,因为截图难以阅读,而且(更重要的是)不允许他人复制粘贴您的代码。
placeholder: |
Steps to reproduce the behavior/复现Bug的步骤:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior / 期待表现
description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
================================================
FILE: .github/ISSUE_TEMPLATE/feature-request.yaml
================================================
name: "\U0001F680 Feature request"
description: Submit a request for a new ChatGLM3 feature / 提交一个新的 ChatGLM3 的功能建议
labels: [ "feature" ]
body:
- type: textarea
id: feature-request
validations:
required: true
attributes:
label: Feature request / 功能建议
description: |
A brief description of the functional proposal. Links to corresponding papers and code are desirable.
对功能建议的简述。最好提供对应的论文和代码链接
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation / 动机
description: |
Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
您提出建议的动机。如果该动机与另一个 GitHub 问题有关,请在此处提供对应的链接。
- type: textarea
id: contribution
validations:
required: true
attributes:
label: Your contribution / 您的贡献
description: |
Your PR link or any other link you can help with.
您的PR链接或者其他您能提供帮助的链接。
================================================
FILE: .github/PULL_REQUEST_TEMPLATE/pr_template.md
================================================
# Raise valuable PR / 提出有价值的PR
## Caution/ 注意事项:
Users should keep the following points in mind when submitting PRs:
1. The proposed PR should be about this project.
2. the proposed PR should be relevant, if there are multiple ideas and optimizations, they should be assigned to different PRs.
用户在提交PR时候应该注意以下几点:
1. 提出的PR应该是关于本项目的。
2. 提出的PR应该具有针对性,如果具有多个不同的想法和优化方案,应该分配到不同的PR中。
## 不应该提出的PR / PRs that should not be proposed
If a developer proposes a PR about any of the following, it may be closed or Rejected.
1. those that don't describe improvement options.
2. multiple issues of different types combined in one PR.
3. The proposed PR is highly duplicative of already existing PRs.
如果开发者提出关于以下方面的PR,则可能会被直接关闭或拒绝通过。
1. 没有说明改进方案的。
2. 多个不同类型的问题合并在一个PR中的。
3. 提出的PR与已经存在的PR高度重复的。
# 检查您的PR
- [ ] Have you read the Contributor Guidelines, Pull Request section? / 您是否阅读了贡献者指南、Pull Request 部分?
- [ ] Has this been discussed/approved via a Github issue or forum? If so, add a link. / 是否通过 Github 问题或论坛讨论/批准过?如果是,请添加链接。
- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips. /您是否确保根据您的更改更新了文档?这里是文档指南,这里是文档格式化技巧。
- [ ] Did you write new required tests? / 您是否编写了新的必要测试?
- [ ] Are your PRs for only one issue / 您的PR是否仅针对一个问题
================================================
FILE: .gitignore
================================================
.hypothesis/
__pycache__
output.png
fewshot-data/
checkpoints/
records.db
server.py
examples/*grounding.png
archive*
hostfile
runs/
*.idea/
.DS_Store
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2024 CogVLM team @ Zhipu AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MODEL_LICENSE
================================================
The CogVLM License
1. Definitions
“Licensor” means the CogVLM Model Team that distributes its Software.
“Software” means the CogVLM model parameters made available under this license.
2. License Grant
Under the terms and conditions of this license, the Licensor hereby grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license.
This license permits you to use all open-source models in this repository for academic research free. Users who wish to use the models for commercial purposes must register [here](https://open.bigmodel.cn/mla/form).
Registered users may use the models for commercial activities free of charge, but must comply with all terms and conditions of this license.
The license notice shall be included in all copies or substantial portions of the Software.
3. Restriction
You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military, or illegal purposes.
You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
4. Disclaimer
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
5. Limitation of Liability
EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
6. Dispute Resolution
This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.
7. Llama2 and EVA-CLIP2 License
For CogVLM-17B version, Llama2 license conditions (https://ai.meta.com/llama/license/) and EVA license conditions (MIT, https://github.com/baaivision/EVA/blob/master/LICENSE) Also applies to model weights.
1. 定义
“许可方”是指分发其软件的 CogVLM 模型团队。
“软件”是指根据本许可提供的 CogVLM 模型参数。
2. 许可授予
根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
本许可允许您免费使用本仓库中的所有开源模型进行学术研究,对于希望将模型用于商业目的的用户,需在[这里](https://open.bigmodel.cn/mla/form)完成登记。
经过登记的用户可以免费使用本模型进行商业活动,但必须遵守本许可的所有条款和条件。
上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
3.限制
您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
4.免责声明
本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
5. 责任限制
除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。
6.争议解决
本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 license@zhipuai.cn 与我们联系。
7. Llama2 和 EVA-CLIP2 许可
针对 CogVLM-17B 版本, Llama2 许可条件 (https://ai.meta.com/llama/license/) 和 EVA 许可条件 (MIT, https://github.com/baaivision/EVA/blob/master/LICENSE) 同时适用于模型权重。
================================================
FILE: README.md
================================================
# CogVLM & CogAgent
📗 [中文版README](./README_zh.md)
🌟 **Jump to detailed introduction: [Introduction to CogVLM](#introduction-to-cogvlm),
🆕 [Introduction to CogAgent](#introduction-to-cogagent)**
📔 For more detailed usage information, please refer to: [CogVLM & CogAgent's technical documentation (in Chinese)](https://zhipu-ai.feishu.cn/wiki/LXQIwqo1OiIVTykMh9Lc3w1Fn7g)
CogVLM
📖 Paper: CogVLM: Visual Expert for Pretrained Language Models
CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, supporting image understanding and multi-turn dialogue with a resolution of 490*490.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC.
|
CogAgent
📖 Paper: CogAgent: A Visual Language Model for GUI Agents
CogAgent is an open-source visual language model improved based on CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, supporting image understanding at a resolution of 1120*1120. On top of the capabilities of CogVLM, it further possesses GUI image Agent capabilities.
CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including VQAv2, OK-VQ, TextVQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets including AITW and Mind2Web.
|
|
🌐 Web Demo for both CogVLM2: this link
|
**Table of Contents**
- [CogVLM \& CogAgent](#cogvlm--cogagent)
- [Release](#release)
- [Get Started](#get-started)
- [Option 1: Inference Using Web Demo.](#option-1-inference-using-web-demo)
- [Option 2:Deploy CogVLM / CogAgent by yourself](#option-2deploy-cogvlm--cogagent-by-yourself)
- [Situation 2.1 CLI (SAT version)](#situation-21-cli-sat-version)
- [Situation 2.2 CLI (Huggingface version)](#situation-22-cli-huggingface-version)
- [Situation 2.3 Web Demo](#situation-23-web-demo)
- [Option 3:Finetuning CogAgent / CogVLM](#option-3finetuning-cogagent--cogvlm)
- [Option 4: OpenAI Vision format](#option-4-openai-vision-format)
- [Hardware requirement](#hardware-requirement)
- [Model checkpoints](#model-checkpoints)
- [Introduction to CogVLM](#introduction-to-cogvlm)
- [Examples](#examples)
- [Introduction to CogAgent](#introduction-to-cogagent)
- [GUI Agent Examples](#gui-agent-examples)
- [Cookbook](#cookbook)
- [Task Prompts](#task-prompts)
- [Which --version to use](#which---version-to-use)
- [FAQ](#faq)
- [License](#license)
- [Citation \& Acknowledgements](#citation--acknowledgements)
## Release
- 🔥🔥🔥 **News**: ```2024/5/20```: We released the **next generation of model, [CogVLM2](https://github.com/THUDM/CogVLM2)**, which is based on llama3-8b and on the par of (or better than) GPT-4V in most cases! DOWNLOAD and TRY!
- 🔥🔥 **News**: ```2024/4/5```: [CogAgent](https://arxiv.org/abs/2312.08914) was selected as a CVPR 2024 Highlights!
- 🔥 **News**: ```2023/12/26```: We have released the [CogVLM-SFT-311K](dataset.md) dataset,
which contains over 150,000 pieces of data that we used for **CogVLM v1.0 only** training. Welcome to follow and use.
- **News**: ```2023/12/18```: **New Web UI Launched!** We have launched a new web UI based on Streamlit,
users can painlessly talk to CogVLM, CogAgent in our UI. Have a better user experience.
- **News**: ```2023/12/15```: **CogAgent Officially Launched!** CogAgent is an image understanding model developed
based on CogVLM. It features **visual-based GUI Agent capabilities** and has further enhancements in image
understanding. It supports image input with a resolution of 1120*1120, and possesses multiple abilities including
multi-turn dialogue with images, GUI Agent, Grounding, and more.
- **News**: ```2023/12/8``` We have updated the checkpoint of cogvlm-grounding-generalist to
cogvlm-grounding-generalist-v1.1, with image augmentation during training, therefore more robust.
See [details](#introduction-to-cogvlm).
- **News**: ```2023/12/7``` CogVLM supports **4-bit quantization** now! You can inference with just **11GB** GPU memory!
- **News**: ```2023/11/20``` We have updated the checkpoint of cogvlm-chat to cogvlm-chat-v1.1, unified the versions of
chat and VQA, and refreshed the SOTA on various datasets. See [details](#introduction-to-cogvlm)
- **News**: ```2023/11/20``` We release **[cogvlm-chat](https://huggingface.co/THUDM/cogvlm-chat-hf)**, **[cogvlm-grounding-generalist](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)/[base](https://huggingface.co/THUDM/cogvlm-grounding-base-hf)**, **[cogvlm-base-490](https://huggingface.co/THUDM/cogvlm-base-490-hf)/[224](https://huggingface.co/THUDM/cogvlm-base-224-hf)** on 🤗Huggingface. you can infer with transformers in [a few lines of code](#situation-22-cli-huggingface-version)now!
- ```2023/10/27``` CogVLM bilingual version is available [online](https://chatglm.cn/)! Welcome to try it out!
- ```2023/10/5``` CogVLM-17B released。
## Get Started
### Option 1: Inference Using Web Demo.
* Click here to enter [CogVLM2 Demo](http://36.103.203.44:7861/)。
If you need to use Agent and Grounding functions, please refer to [Cookbook - Task Prompts](#task-prompts)
### Option 2:Deploy CogVLM / CogAgent by yourself
We support two GUIs for model inference, **CLI** and **web demo** . If you want to use it in your python code, it is
easy to modify the CLI scripts for your case.
First, we need to install the dependencies.
```bash
# CUDA >= 11.8
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
**All code for inference is located under the ``basic_demo/`` directory. Please switch to this directory first before
proceeding with further operations.**
#### Situation 2.1 CLI (SAT version)
Run CLI demo via:
```bash
# CogAgent
python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogagent-vqa --version chat_old --bf16 --stream_chat
# CogVLM
python cli_demo_sat.py --from_pretrained cogvlm-chat --version chat_old --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-grounding-generalist --version base --bf16 --stream_chat
```
The program will automatically download the sat model and interact in the command line. You can generate replies by
entering instructions and pressing enter.
Enter `clear` to clear the conversation history and `stop` to stop the program.
We also support model parallel inference, which splits model to multiple (2/4/8) GPUs. `--nproc-per-node=[n]` in the
following command controls the number of used GPUs.
```
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16
```
- If you want to manually download the weights, you can replace the path after ``--from_pretrained`` with the model
path.
- Our model supports SAT's **4-bit quantization** and **8-bit quantization**.
You can change ``--bf16`` to ``--fp16``, or ``--fp16 --quant 4``, or ``--fp16 --quant 8``.
For example
```bash
python cli_demo_sat.py --from_pretrained cogagent-chat --fp16 --quant 8 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-chat-v1.1 --fp16 --quant 4 --stream_chat
# In SAT version,--quant should be used with --fp16
```
- The program provides the following hyperparameters to control the generation process:
```
usage: cli_demo_sat.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K] [--temperature TEMPERATURE]
optional arguments:
-h, --help show this help message and exit
--max_length MAX_LENGTH
max length of the total sequence
--top_p TOP_P top p for nucleus sampling
--top_k TOP_K top k for top k sampling
--temperature TEMPERATURE
temperature for sampling
```
- Click [here](#which---version-to-use) to view the correspondence between different models and the ``--version``
parameter.
#### Situation 2.2 CLI (Huggingface version)
Run CLI demo via:
```bash
# CogAgent
python cli_demo_hf.py --from_pretrained THUDM/cogagent-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogagent-vqa-hf --bf16
# CogVLM
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-grounding-generalist-hf --bf16
```
- If you want to manually download the weights, you can replace the path after ``--from_pretrained`` with the model
path.
- You can change ``--bf16`` to ``--fp16``, or ``--quant 4``. For example, our model supports Huggingface's **4-bit
quantization**:
```bash
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --quant 4
```
#### Situation 2.3 Web Demo
We also offer a local web demo based on Gradio. First, install Gradio by running: `pip install gradio`. Then download
and enter this repository and run `web_demo.py`. See the next section for detailed usage:
```bash
python web_demo.py --from_pretrained cogagent-chat --version chat --bf16
python web_demo.py --from_pretrained cogagent-vqa --version chat_old --bf16
python web_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat_old --bf16
python web_demo.py --from_pretrained cogvlm-grounding-generalist --version base --bf16
```
The GUI of the web demo looks like:
### Option 3:Finetuning CogAgent / CogVLM
You may want to use CogVLM in your own task, which needs a **different output style or domain knowledge**. **All code
for finetuning is located under the ``finetune_demo/`` directory.**
We here provide a finetuning example for **Captcha Recognition** using lora.
1. Start by downloading the [Captcha Images dataset](https://www.kaggle.com/datasets/aadhavvignesh/captcha-images). Once
downloaded, extract the contents of the ZIP file.
2. To create a train/validation/test split in the ratio of 80/5/15, execute the following:
```bash
python utils/split_dataset.py
```
3. Start the fine-tuning process with this command:
```bash
bash finetune_demo/finetune_(cogagent/cogvlm)_lora.sh
```
4. Merge the model to `model_parallel_size=1`: (replace the 4 below with your training `MP_SIZE`)
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=4 utils/merge_model.py --version base --bf16 --from_pretrained ./checkpoints/merged_lora_(cogagent/cogvlm490/cogvlm224)
```
5. Evaluate the performance of your model.
```bash
bash finetune_demo/evaluate_(cogagent/cogvlm).sh
```
### Option 4: OpenAI Vision format
We provide the same API examples as `GPT-4V`, which you can view in `openai_demo`.
1. First, start the node
```
python openai_demo/openai_api.py
```
2. Next, run the request example node, which is an example of a continuous dialogue
```
python openai_demo/openai_api_request.py
```
3. You will get output similar to the following
```
This image showcases a tranquil natural scene with a wooden pathway leading through a field of lush green grass. In the distance, there are trees and some scattered structures, possibly houses or small buildings. The sky is clear with a few scattered clouds, suggesting a bright and sunny day.
```
### Hardware requirement
* Model Inference:
For INT4 quantization: 1 * RTX 3090(24G) (CogAgent takes ~ 12.6GB, CogVLM takes ~ 11GB)
For FP16: 1 * A100(80G) or 2 * RTX 3090(24G)
* Finetuning:
For FP16: 4 * A100(80G) *[Recommend]* or 8* RTX 3090(24G).
### Model checkpoints
If you run the `basic_demo/cli_demo*.py` from the code repository, it will automatically download SAT or Hugging Face
weights. Alternatively, you can choose to manually download the necessary weights.
- CogAgent
| Model name | Input resolution | Introduction | Huggingface model | SAT model |
| :-----------: | :----: | :----------------------------------------------------------: | :------: | :-------: |
| cogagent-chat | 1120 | Chat version of CogAgent. Supports GUI Agent, multiple-round chat and visual grounding. | [HF link](https://huggingface.co/THUDM/cogagent-chat-hf)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/cogagent-chat-hf) | [HF link](https://huggingface.co/THUDM/CogAgent/tree/main)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/CogAgent) |
| cogagent-vqa | 1120 | VQA version of CogAgent. Has stronger capabilities in single-turn visual dialogue. Recommended for VQA benchmarks. | [HF link](https://huggingface.co/THUDM/cogagent-vqa-hf)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/cogagent-vqa-hf) | [HF link](https://huggingface.co/THUDM/CogAgent/tree/main)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/CogAgent) |
c
- CogVLM
| Model name | Input resolution | Introduction | Huggingface model | SAT model |
| :-------------------------: | :----: | :-------------------------------------------------------: | :------: | :-------: |
| cogvlm-chat-v1.1 | 490 | Supports multiple rounds of chat and vqa simultaneously, with different prompts. | [HF link](https://huggingface.co/THUDM/cogvlm-chat-hf)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/cogvlm-chat-hf) | [HF link](https://huggingface.co/THUDM/CogVLM/tree/main)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/CogVLM) |
| cogvlm-base-224 | 224 | The original checkpoint after text-image pretraining. | [HF link](https://huggingface.co/THUDM/cogvlm-base-224-hf)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/cogvlm-base-224-hf) | [HF link](https://huggingface.co/THUDM/CogVLM/tree/main)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/CogVLM) |
| cogvlm-base-490 | 490 | Amplify the resolution to 490 through position encoding interpolation from `cogvlm-base-224`. | [HF link](https://huggingface.co/THUDM/cogvlm-base-490-hf)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/cogvlm-base-490-hf) | [HF link](https://huggingface.co/THUDM/CogVLM/tree/main)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/CogVLM) |
| cogvlm-grounding-generalist | 490 | This checkpoint supports different visual grounding tasks, e.g. REC, Grounding Captioning, etc. | [HF link](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/cogvlm-grounding-generalist-hf) | [HF link](https://huggingface.co/THUDM/CogVLM/tree/main)
[OpenXLab link](https://openxlab.org.cn/models/detail/THUDM/CogVLM) |
## Introduction to CogVLM
- CogVLM is a powerful **open-source visual language model** (**VLM**). CogVLM-17B has 10 billion vision parameters and
7 billion language parameters.
- CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k
captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and rank the 2nd on VQAv2,
OKVQA, TextVQA, COCO captioning, etc., **surpassing or matching PaLI-X 55B**. CogVLM can
also [chat with you](http://36.103.203.44:7861) about images.
Click to view results on MM-VET, POPE, TouchStone.
| Method |
LLM |
MM-VET |
POPE(adversarial) |
TouchStone |
| BLIP-2 |
Vicuna-13B |
22.4 |
- |
- |
| Otter |
MPT-7B |
24.7 |
- |
- |
| MiniGPT4 |
Vicuna-13B |
24.4 |
70.4 |
531.7 |
| InstructBLIP |
Vicuna-13B |
25.6 |
77.3 |
552.4 |
| LLaMA-Adapter v2 |
LLaMA-7B |
31.4 |
- |
590.1 |
| LLaVA |
LLaMA2-7B |
28.1 |
66.3 |
602.7 |
| mPLUG-Owl |
LLaMA-7B |
- |
66.8 |
605.4 |
| LLaVA-1.5 |
Vicuna-13B |
36.3 |
84.5 |
- |
| Emu |
LLaMA-13B |
36.3 |
- |
- |
| Qwen-VL-Chat |
- |
- |
- |
645.2 |
| DreamLLM |
Vicuna-7B |
35.9 |
76.5 |
- |
| CogVLM |
Vicuna-7B |
52.8 |
87.6 |
742.0 |
Click to view results of cogvlm-grounding-generalist-v1.1.
|
RefCOCO |
|
|
RefCOCO+ |
|
|
RefCOCOg |
|
Visual7W |
|
val |
testA |
testB |
val |
testA |
testB |
val |
test |
test |
| cogvim-grounding-generalist |
92.51 |
93.95 |
88.73 |
87.52 |
91.81 |
81.43 |
89.46 |
90.09 |
90.96 |
| cogvim-grounding-generalist-v1.1 |
**92.76** |
**94.75** |
**88.99** |
**88.68** |
**92.91** |
**83.39** |
**89.75** |
**90.79** |
**91.05** |
### Examples
* CogVLM can accurately describe images in details with **very few hallucinations**.
Click for comparison with LLAVA-1.5 and MiniGPT-4.
* CogVLM can understand and answer various types of questions, and has a **visual grounding** version.
* CogVLM sometimes captures more detailed content than GPT-4V(ision).
Click to expand more examples.

## Introduction to CogAgent
CogAgent is an open-source visual language model improved based on CogVLM. CogAgent-18B has 11 billion visual parameters
and 7 billion language parameters
CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including VQAv2,
OK-VQ, TextVQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI
operation datasets such as AITW and Mind2Web.
In addition to all the features already present in CogVLM (visual multi-round dialogue, visual grounding), CogAgent:
1. Supports higher resolution visual input and dialogue question-answering. **It supports ultra-high-resolution image
inputs of 1120x1120.**
2. **Possesses the capabilities of a visual Agent**, being able to return a plan, next action, and specific operations
with coordinates for any given task on any GUI screenshot.
3. **Enhanced GUI-related question-answering capabilities**, allowing it to handle questions about any GUI screenshot,
such as web pages, PC apps, mobile applications, etc.
4. Enhanced capabilities in OCR-related tasks through improved pre-training and fine-tuning.
### GUI Agent Examples
## Cookbook
### Task Prompts
1. **General Multi-Round Dialogue**: Say whatever you want.
2. **GUI Agent Task**: Use the [Agent template](https://github.com/THUDM/CogVLM/blob/main/utils/utils/template.py#L761)
and replace \ with the task instruction enclosed in double quotes. This query can make CogAgent infer Plan and
Next Action. If adding ``(with grounding)`` at the end of the query, the model will return a formalized action
representation with coordinates.
For example, to ask the model how to complete the task "Search for CogVLM" on a current GUI screenshot, follow these
steps:
1. Randomly select a template from
the [Agent template](https://github.com/THUDM/CogVLM/blob/main/utils/utils/template.py#L761). Here, we
choose ``What steps do I need to take to ?``.
2. Replace with the task instruction enclosed in double quotes, for
example, ``What steps do I need to take to "Search for CogVLM"?`` . Inputting this to the model yields:
> Plan: 1. Type 'CogVLM' into the Google search bar. 2. Review the search results that appear. 3. Click on a relevant
> result to read more about CogVLM or access further resources.
>
> Next Action: Move the cursor to the Google search bar, and type 'CogVLM' into it.
3. If adding ``(with grounding)`` at the end, i.e. changing the input
to ``What steps do I need to take to "Search for CogVLM"?(with grounding)``, the output of CogAgent would be:
> Plan: 1. Type 'CogVLM' into the Google search bar. 2. Review the search results that appear. 3. Click on a relevant
> result to read more about CogVLM or access further resources.
>
> Next Action: Move the cursor to the Google search bar, and type 'CogVLM' into it.
> Grounded Operation:[combobox] Search -> TYPE: CogVLM at the box [[212,498,787,564]]
Tip: For GUI Agent tasks, it is recommended to conduct only single-round dialogues for each image for better results.
3. **Visual Grounding**. Three modes of grounding are supported:
- Image description with grounding coordinates (bounding box). Use any template
from [caption_with_box template](https://github.com/THUDM/CogVLM/blob/main/utils/utils/template.py#L537) as model
input. For example:
> Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?
- Returning grounding coordinates (bounding box) based on the description of objects. Use any template
from [caption2box template](https://github.com/THUDM/CogVLM/blob/main/utils/utils/template.py#L345),
replacing ```` with the object's description. For example:
> Can you point out *children in blue T-shirts* in the image and provide the bounding boxes of their location?
- Providing a description based on bounding box coordinates. Use a template
from [box2caption template](https://github.com/THUDM/CogVLM/blob/main/utils/utils/template.py#L400),
replacing ```` with the position coordinates. For example:
> Tell me what you see within the designated area *[[086,540,400,760]]* in the picture.
**Format of coordination:** The bounding box coordinates in the model's input and output use the
format ``[[x1, y1, x2, y2]]``, with the origin at the top left corner, the x-axis to the right, and the y-axis
downward. (x1, y1) and (x2, y2) are the top-left and bottom-right corners, respectively, with values as relative
coordinates multiplied by 1000 (prefixed with zeros to three digits).
### Which --version to use
Due to differences in model functionalities, different model versions may have distinct ``--version`` specifications for
the text processor, meaning the format of the prompts used varies.
| model name | --version |
|:---------------------------:|:---------:|
| cogagent-chat | chat |
| cogagent-vqa | chat_old |
| cogvlm-chat | chat_old |
| cogvlm-chat-v1.1 | chat_old |
| cogvlm-grounding-generalist | base |
| cogvlm-base-224 | base |
| cogvlm-base-490 | base |
### FAQ
* If you have trouble in accessing huggingface.co, you can add `--local_tokenizer /path/to/vicuna-7b-v1.5` to load the
tokenizer.
* If you have trouble in automatically downloading model with 🔨[SAT](https://github.com/THUDM/SwissArmyTransformer), try
downloading from 🤖[modelscope](https://www.modelscope.cn/models/ZhipuAI/CogVLM/summary) or
🤗[huggingface](https://huggingface.co/THUDM/CogVLM) or 💡[wisemodel](https://www.wisemodel.cn/models/ZhipuAI/CogVLM)
manually.
* Download model using 🔨[SAT](https://github.com/THUDM/SwissArmyTransformer), the model will be saved to the default
location `~/.sat_models`. Change the default location by setting the environment variable `SAT_HOME`. For example, if
you want to save the model to `/path/to/my/models`, you can run `export SAT_HOME=/path/to/my/models` before running
the python command.
## License
The code in this repository is open source under the [Apache-2.0 license](./LICENSE), while the use of the CogVLM model
weights must comply with the [Model License](./MODEL_LICENSE).
## Citation & Acknowledgements
If you find our work helpful, please consider citing the following papers
```
@misc{wang2023cogvlm,
title={CogVLM: Visual Expert for Pretrained Language Models},
author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2311.03079},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{hong2023cogagent,
title={CogAgent: A Visual Language Model for GUI Agents},
author={Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2312.08914},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
In the instruction fine-tuning phase of the CogVLM, there are some English image-text data from
the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR)
and [Shikra](https://github.com/shikras/shikra) projects, as well as many classic cross-modal work datasets. We
sincerely thank them for their contributions.
================================================
FILE: README_zh.md
================================================
# CogVLM & CogAgent
📗 [README in English](./README.md)
🌟 **跳转到详细介绍: [CogVLM介绍](#introduction-to-cogvlm),
🆕 [CogAgent的介绍](#introduction-to-cogagent)**
📔 如需获取更详细的使用信息,请参阅: [CogVLM&CogAgent技术文档](https://zhipu-ai.feishu.cn/wiki/LXQIwqo1OiIVTykMh9Lc3w1Fn7g)
CogVLM
📖 Paper: CogVLM: Visual Expert for Pretrained Language Models
CogVLM 是一个强大的开源视觉语言模型(VLM)。CogVLM-17B拥有100亿的视觉参数和70亿的语言参数,支持490*490分辨率的图像理解和多轮对话。
CogVLM-17B 17B在10个经典的跨模态基准测试中取得了最先进的性能包括NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA 和 TDIUC 基准测试。
|
CogAgent
📖 Paper: CogAgent: A Visual Language Model for GUI Agents
CogAgent 是一个基于CogVLM改进的开源视觉语言模型。CogAgent-18B拥有110亿的视觉参数和70亿的语言参数, 支持1120*1120分辨率的图像理解。在CogVLM的能力之上,它进一步拥有了GUI图像Agent的能力。
CogAgent-18B 在9个经典的跨模态基准测试中实现了最先进的通用性能,包括 VQAv2, OK-VQ, TextVQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, 和 POPE 测试基准。它在包括AITW和Mind2Web在内的GUI操作数据集上显著超越了现有的模型。
|
|
🌐 CogVLM2 在线体验: this link
|
**目录**
- [CogVLM \& CogAgent](#cogvlm--cogagent)
- [Release](#发布)
- [开始使用](#开始使用)
- [选项1:使用网页演示进行推理](#选项1使用网页演示进行推理)
- [选项2:自行部署CogVLM / CogAgent](#选项2自行部署cogvlm--cogagent)
- [Situation 2.1 CLI (SAT version)](#situation-21-cli-sat-version)
- [Situation 2.2 CLI (Huggingface version)](#situation-22-cli-huggingface-version)
- [Situation 2.3 Web Demo](#situation-23-web-demo)
- [选项3:微调 CogAgent / CogVLM](#选项3微调-cogagent--cogvlm)
- [选项4:OpenAI格式](#选项4OpenAI格式)
- [硬件需求](#硬件需求)
- [Model checkpoints](#model-checkpoints)
- [Introduction to CogVLM](#introduction-to-cogvlm)
- [示例](#示例)
- [Introduction to CogAgent](#introduction-to-cogagent)
- [GUI Agent Examples](#gui-agent-examples)
- [Cookbook](#cookbook)
- [Task Prompts](#task-prompts)
- [选择适合的模型](#选择适合的模型)
- [License](#license)
- [Citation \& Acknowledgements](#citation--acknowledgements)
## 发布
- 🔥🔥🔥 **News**: ```2024/4/5```: [CogAgent](https://arxiv.org/abs/2312.08914) 成功被评选为CVPR 2024 Highlights!
- 🔥🔥 **News**: ```2023/12/26```:我们公开了 [CogVLM-SFT-311K](dataset_zh.md) 数据集,它包含了超过15万条我们用于训练 **CogVLM v1.0(仅该模型)** 的数据。欢迎关注和使用。
- 🔥 **News**: ```2023/12/18```: **新的Streamlit用户界面**已经上线!我们已经基于Streamlit推出了新的网页用户界面,用户可以在我们的界面上轻松与CogVLM,CogAgent交谈。带来更好的用户体验。
- 🔥 **News**: ```2023/12/15```: **CogAgent 正式发布!** CogAgent是基于CogVLM开发的图像理解模型。它具有基于视觉的GUI
Agent功能,并在图像理解方面进行了进一步的增强。它支持分辨率为1120*1120的图像输入,并具有包括与图像进行多轮对话、GUI
Agent、Grounding等多种能力。
- **News**: ```2023/12/8```:
我们已将cogvlm-grounding-generalist的检查点更新为cogvlm-grounding-generalist-v1.1,训练过程中增加了图像增强,因此更加稳健。查看[详情](#introduction-to-cogvlm)。
- **News**: ```2023/12/7``` CogVLM现在支持**4-bit**量化!您只需要11GB的GPU内存就可以进行推理!
- **News**: ```2023/11/20```我们已将cogvlm-chat的检查点更新为cogvlm-chat-v1.1,统一了聊天和VQA的版本,并刷新了各种数据集上的SOTA,查看[详情](#introduction-to-cogvlm)。
- **News**: ```2023/11/20``` 我们在🤗Huggingface上发布了 **[cogvlm-chat](https://huggingface.co/THUDM/cogvlm-chat-hf)**, **[cogvlm-grounding-generalist](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)/[base](https://huggingface.co/THUDM/cogvlm-grounding-base-hf)**, **[cogvlm-base-490](https://huggingface.co/THUDM/cogvlm-base-490-hf)/[224](https://huggingface.co/THUDM/cogvlm-base-224-hf)**,使用transformers 快速 [推理](#situation-22-cli-huggingface-version)。
- ```2023/10/27``` CogVLM双语版本已经在线上可用!欢迎[试用](https://chatglm.cn/)。
- ```2023/10/5``` CogVLM-17B v1.0 发布。
## 开始使用
### 选项1:使用网页演示进行推理
* 点击此处进入 [CogVLM2 Web Demo](http://36.103.203.44:7861/)。
如果您需要使用代理和接地功能,请参考[Cookbook - Task Prompts](#task-prompts)。
### 选项2:自行部署CogVLM / CogAgent
我们支持两种模型推理的图形用户界面,命令行界面和网络演示。如果你想在你的Python代码中使用它,修改命令行脚本以适应你的情况。
首先,我们需要安装依赖项。
```bash
# CUDA >= 11.8
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
**所有的推理代码都位于 `basic_demo/` 目录下。请在进行进一步操作之前,先切换到这个目录。**
#### Situation 2.1 CLI (SAT version)
通过以下方式运行CLI演示:
```bash
# CogAgent
python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogagent-vqa --version chat_old --bf16 --stream_chat
# CogVLM
python cli_demo_sat.py --from_pretrained cogvlm-chat --version chat_old --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-grounding-generalist --version base --bf16 --stream_chat
```
该程序将自动下载卫星模型并在命令行中进行交互。您可以通过输入指令并按回车来生成回复。输入`clear` 以清除对话历史,输入`stop` 以停止程序。
我们也支持模型并行推理,该推理将模型分割到多个(2/4/8)GPU上。使用 `--nproc-per-node=[n]` 控制使用的GPU数量。
```
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16
```
- 如果你想手动下载权重,你可以用模型路径替换 ``--from_pretrained`` 后的路径。
- 我们的模型支持SAT的4位量化和8位量化。你可以将 ``--bf16`` 更改为 ``--fp16``, 或 ``--fp16 --quant 4``, 或 ``--fp16 --quant 8``.
例如
```bash
python cli_demo_sat.py --from_pretrained cogagent-chat --fp16 --quant 8 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-chat-v1.1 --fp16 --quant 4 --stream_chat
# In SAT version,--quant should be used with --fp16
```
- 该程序提供以下超参数来控制生成过程:
```
usage: cli_demo_sat.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K] [--temperature TEMPERATURE]
optional arguments:
-h, --help show this help message and exit
--max_length MAX_LENGTH
max length of the total sequence
--top_p TOP_P top p for nucleus sampling
--top_k TOP_K top k for top k sampling
--temperature TEMPERATURE
temperature for sampling
```
- 点击 [这里](#which---version-to-use) 查看不同模型与 ``--version`` 参数之间的对应关系的对应关系。
#### Situation 2.2 CLI (Huggingface version)
通过以下方式运行CLI演示:
```bash
# CogAgent
python cli_demo_hf.py --from_pretrained THUDM/cogagent-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogagent-vqa-hf --bf16
# CogVLM
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-grounding-generalist --bf16
```
- 如果你想手动下载权重,你可以将 ``--from_pretrained`` 后的路径替换为模型路径。
- 你可以将 ``--bf16`` 更改为 ``--fp16``, 或者 ``--quant 4``。例如,我们的模型支持Huggingface的**4-bit quantization**:
```bash
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --quant 4
```
#### Situation 2.3 Web Demo
我们还提供了一个基于Gradio的本地网络演示。首先,通过运行 `pip install gradio` 来安装Gradio。然后下载并进入这个仓库,运行 `web_demo.py`。
详细的使用方法请参见下一节:
```bash
python web_demo.py --from_pretrained cogagent-chat --version chat --bf16
python web_demo.py --from_pretrained cogagent-vqa --version chat_old --bf16
python web_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat_old --bf16
python web_demo.py --from_pretrained cogvlm-grounding-generalist --version base --bf16
```
网页演示的图形用户界面如下:
### 选项3:微调 CogAgent / CogVLM
你可能想在你自己的任务中使用CogVLM,这需要 **不同的输出风格或领域知识**. **所有用于微调的代码都位于 ``finetune_demo/`` 目录中。**
我们在这里提供了一个使用lora进行 **验证码识别** 的微调示例。
1. 首先下载 [Captcha Images](https://www.kaggle.com/datasets/aadhavvignesh/captcha-images)数据集。下载完成后,解压ZIP文件的内容。
2. 要创建一个以80/5/15的比例进行训练/验证/测试划分,请执行以下操作:
```bash
python utils/split_dataset.py
```
3. 使用此命令开始微调:
```bash
bash finetune_demo/finetune_(cogagent/cogvlm)_lora.sh
```
4. 将模型合并到 `model_parallel_size=1`: (用你的训练 `MP_SIZE` 替换下面的4)
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=4 utils/merge_model.py --version base --bf16 --from_pretrained ./checkpoints/merged_lora_(cogagent/cogvlm490/cogvlm224)
```
5. 估你的模型的性能。
```bash
bash finetune_demo/evaluate_(cogagent/cogvlm).sh
```
### 选项4:OpenAI格式
We provide the same API examples as `GPT-4V`, which you can view in `openai_demo`.
1. 首先,启动节点
```
python openai_demo/openai_api.py
```
2. 接下来,运行请求示例节点,这是一个连续对话的例子
```
python openai_demo/openai_api_request.py
```
3. 你将得到类似于以下的输出
```
This image showcases a tranquil natural scene with a wooden pathway leading through a field of lush green grass. In the distance, there are trees and some scattered structures, possibly houses or small buildings. The sky is clear with a few scattered clouds, suggesting a bright and sunny day.
```
### 硬件需求
* 模型推理:
For INT4 quantization: 1 * RTX 3090(24G) (CogAgent takes ~ 12.6GB, CogVLM takes ~ 11GB)
For FP16: 1 * A100(80G) or 2 * RTX 3090(24G)
* 微调:
For FP16: 4 * A100(80G) *[Recommend]* or 8* RTX 3090(24G).
### Model checkpoints
如果你从代码仓库运行 `basic_demo/cli_demo*.py`,它将自动下载SAT或Hugging Face的权重。或者,你也可以选择手动下载必要的权重。
- CogAgent
| 模型名称 | 输入分辨率 | 介绍 | Huggingface model | SAT model |
| :-----------: | :----: | :----------------------------------------------------------: | :------: | :-------: |
| cogagent-chat | 1120 | CogAgent的聊天版本。支持GUI代理,多轮聊天和视觉定位。 | [link](https://huggingface.co/THUDM/cogagent-chat-hf) | [link](https://huggingface.co/THUDM/CogAgent/tree/main) |
| cogagent-vqa | 1120 | CogAgent的VQA版本。在单轮视觉对话中具有更强的能力。推荐用于VQA基准测试。 | [link](https://huggingface.co/THUDM/cogagent-vqa-hf) | [link](https://huggingface.co/THUDM/CogAgent/tree/main) |
- CogVLM
| 模型名称 | 输入分辨率 | 介绍 | Huggingface model | SAT model |
| :-------------------------: | :----: |:-----------------------------------------------------------------------------------------------:| :------: | :-------: |
| cogvlm-chat-v1.1 | 490 | 支持同时进行多轮聊天和视觉问答,支持自由的提示词。 | [link](https://huggingface.co/THUDM/cogvlm-chat-hf) | [link](https://huggingface.co/THUDM/CogVLM/tree/main) |
| cogvlm-base-224 | 224 | 文本-图像预训练后的原始检查点。 | [link](https://huggingface.co/THUDM/cogvlm-base-224-hf) | [link](https://huggingface.co/THUDM/CogVLM/tree/main) |
| cogvlm-base-490 | 490 | 通过从 cogvlm-base-224 进行位置编码插值,将分辨率提升到490。 | [link](https://huggingface.co/THUDM/cogvlm-base-490-hf) | [link](https://huggingface.co/THUDM/CogVLM/tree/main) |
| cogvlm-grounding-generalist | 490 | 此检查点支持不同的视觉定位任务,例如REC,定位字幕等。 | [link](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf) | [link](https://huggingface.co/THUDM/CogVLM/tree/main) |
## Introduction to CogVLM
- CogVLM是一个强大的开源视觉语言模型(VLM)。CogVLM-17B拥有100亿的视觉参数和70亿的语言参数。
- CogVLM-17B在10个经典的跨模态基准测试中取得了最佳性能,包括 NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, 并在 VQAv2, OKVQA, TextVQA, COCO 字幕等方面排名第二., **超越或匹敌 PaLI-X 55B**. CogVLM还可以和你聊关于图片的话题。
点击查看MM-VET,POPE,TouchStone的结果。
| Method |
LLM |
MM-VET |
POPE(adversarial) |
TouchStone |
| BLIP-2 |
Vicuna-13B |
22.4 |
- |
- |
| Otter |
MPT-7B |
24.7 |
- |
- |
| MiniGPT4 |
Vicuna-13B |
24.4 |
70.4 |
531.7 |
| InstructBLIP |
Vicuna-13B |
25.6 |
77.3 |
552.4 |
| LLaMA-Adapter v2 |
LLaMA-7B |
31.4 |
- |
590.1 |
| LLaVA |
LLaMA2-7B |
28.1 |
66.3 |
602.7 |
| mPLUG-Owl |
LLaMA-7B |
- |
66.8 |
605.4 |
| LLaVA-1.5 |
Vicuna-13B |
36.3 |
84.5 |
- |
| Emu |
LLaMA-13B |
36.3 |
- |
- |
| Qwen-VL-Chat |
- |
- |
- |
645.2 |
| DreamLLM |
Vicuna-7B |
35.9 |
76.5 |
- |
| CogVLM |
Vicuna-7B |
52.8 |
87.6 |
742.0 |
点击查看cogvlm-grounding-generalist-v1.1的结果。
|
RefCOCO |
|
|
RefCOCO+ |
|
|
RefCOCOg |
|
Visual7W |
|
val |
testA |
testB |
val |
testA |
testB |
val |
test |
test |
| cogvim-grounding-generalist |
92.51 |
93.95 |
88.73 |
87.52 |
91.81 |
81.43 |
89.46 |
90.09 |
90.96 |
| cogvim-grounding-generalist-v1.1 |
**92.76** |
**94.75** |
**88.99** |
**88.68** |
**92.91** |
**83.39** |
**89.75** |
**90.79** |
**91.05** |
### 示例
* CogVLM能够准确地详细描述图像,几乎不会产生幻觉。
点击以与LLAVA-1.5和MiniGPT-4进行比较。.
* CogVLM能理解并回答各种类型的问题,并且有一个视觉基础版本。
* CogVLM有时比GPT-4V(ision)捕获更详细的内容。
点击以展开更多示例。

## Introduction to CogAgent
CogAgent是一个基于CogVLM改进的开源视觉语言模型。CogAgent-18B拥有110亿的视觉参数和70亿的语言参数。
CogAgent-18B在9个经典的跨模态基准测试中实现了最先进的全能性能,包括VQAv2、OK-VQ、TextVQA、ST-VQA、ChartQA、infoVQA、DocVQA、MM-Vet和POPE。它在如AITW和Mind2Web等GUI操作数据集上显著超越了现有的模型。
除了CogVLM已有的所有功能(视觉多轮对话,视觉定位)之外,CogAgent:
1. 支持**更高分辨率**的视觉输入和对话式问答。它支持超高分辨率的图像输入,达到**1120x1120**。
2. **拥有视觉Agent的能力**,能够在任何图形用户界面截图上,为任何给定任务返回一个计划,下一步行动,以及带有坐标的特定操作。
3. **增强了与图形用户界面相关的问答能力**,使其能够处理关于任何图形用户界面截图的问题,例如网页、PC应用、移动应用等。
4. 通过改进预训练和微调,提高了OCR相关任务的能力。
### GUI Agent Examples
## Cookbook
### Task Prompts
1. **通用多轮对话**: 随便你说什么.
2. **GUI代理任务**: 使用 [代理模板](https://github.com/THUDM/CogVLM/blob/main/utils/utils/template.py#L761)并用双引号括起来的任务指令替换 `\`。这个查询可以让CogAgent推断出计划和下一步行动。如果在查询的末尾添加`(with grounding)` 模型将返回一个带有坐标的正式化动作表示。
例如,要询问模型如何完成"在当前GUI截图上搜索CogVLM"的任务,请按照以下步骤操作:
1. 从代理模板中随机选择一个模板。这里,我们选择了``What steps do I need to take to ?``.
2. 请用双引号中的任务指令替换,例如, ``What steps do I need to take to "Search for CogVLM"?``。将此输入到模型会产生:
> Plan: 1. Type 'CogVLM' into the Google search bar. 2. Review the search results that appear. 3. Click on a relevant
> result to read more about CogVLM or access further resources.
>
> Next Action: Move the cursor to the Google search bar, and type 'CogVLM' into it.
3. 如果在末尾添加 ``(with grounding)`` 即将输入改为``What steps do I need to take to "Search for CogVLM"?(with grounding)``,那么CogAgent的输出将会是:
> Plan: 1. Type 'CogVLM' into the Google search bar. 2. Review the search results that appear. 3. Click on a relevant
> result to read more about CogVLM or access further resources.
>
> Next Action: Move the cursor to the Google search bar, and type 'CogVLM' into it.
> Grounded Operation:[combobox] Search -> TYPE: CogVLM at the box [[212,498,787,564]]
提示:对于GUI代理任务,建议每个图像只进行一轮对话以获得更好的结果。
3. **视觉定位**. T支持三种定位模式:
- 带有定位坐标(边界框)的图像描述。使用caption_with_box模板中的任何模板作为模型输入。例如:
> Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?
- 根据物体的描述返回接地坐标(边界框)。使用caption2box模板中的任何模板,将 替换为物体的描述。例如:
> Can you point out *children in blue T-shirts* in the image and provide the bounding boxes of their location?
- 根据边界框坐标提供描述。使用box2caption模板中的模板,将 替换为位置坐标。例如:
> Tell me what you see within the designated area *[[086,540,400,760]]* in the picture.
**坐标格式:** 模型的输入和输出中的边界框坐标使用 `[[x1, y1, x2, y2]]` 格式,原点位于左上角,x轴向右,y轴向下。 (x1, y1) 和 (x2, y2) 分别是左上角和右下角,其值为相对坐标乘以1000(前缀为零,三位数)。
### 选择适合的模型
由于模型功能的差异,不同的模型版本可能会有不同的文本处理器 `--version`,这意味着使用的提示格式会有所不同。
| model name | --version |
|:---------------------------:|:---------:|
| cogagent-chat | chat |
| cogagent-vqa | chat_old |
| cogvlm-chat | chat_old |
| cogvlm-chat-v1.1 | chat_old |
| cogvlm-grounding-generalist | base |
| cogvlm-base-224 | base |
| cogvlm-base-490 | base |
### 常见问题
* 如果你在访问huggingface.co时遇到问题,你可以添加 `--local_tokenizer /path/to/vicuna-7b-v1.5` 来加载分词器。
* 如果你在使用🔨 [SAT](https://github.com/THUDM/SwissArmyTransformer)自动下载模型时遇到问题 , 尝试从 🤖[modelscope](https://www.modelscope.cn/models/ZhipuAI/CogVLM/summary) 或
🤗[huggingface](https://huggingface.co/THUDM/CogVLM) or 💡[wisemodel](https://www.wisemodel.cn/models/ZhipuAI/CogVLM) 手动下载。
* 使用🔨 SAT下载模型,模型将被保存到默认位置 `~/.sat_models` 。通过设置环境变量 `SAT_HOME` 来更改默认位置。例如,如果你想将模型保存到 `/path/to/my/models` ,你可以在运行python命令之前运行 `export SAT_HOME=/path/to/my/models`。
## License
此仓库中的代码是在[Apache-2.0 license](./LICENSE)的开源代码,而使用CogVLM模型权重必须遵守[模型许可](./MODEL_LICENSE).
## Citation & Acknowledgements
如果你发现我们的工作对你有所帮助,请引用以下论文
```
@misc{wang2023cogvlm,
title={CogVLM: Visual Expert for Pretrained Language Models},
author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2311.03079},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{hong2023cogagent,
title={CogAgent: A Visual Language Model for GUI Agents},
author={Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2312.08914},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
在CogVLM的指令微调阶段,我们使用了来自 [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR) 和 [Shikra](https://github.com/shikras/shikra)项目的一些英文图像-文本数据,以及许多经典的跨模态工作数据集。我们衷心感谢他们的贡献。
================================================
FILE: assets/WECHAT.md
================================================
扫码关注公众号,加入「ChatGLM交流群」
Scan the QR code to follow the official account and join the "ChatGLM Discussion Group"
================================================
FILE: basic_demo/cli_demo_hf.py
================================================
"""
This is a demo for using CogAgent and CogVLM in CLI
Make sure you have installed vicuna-7b-v1.5 tokenizer model (https://huggingface.co/lmsys/vicuna-7b-v1.5), full checkpoint of vicuna-7b-v1.5 LLM is not required.
In this demo, We us chat template, you can use others to replace such as 'vqa'.
Strongly suggest to use GPU with bfloat16 support, otherwise, it will be slow.
Mention that only one picture can be processed at one conversation, which means you can not replace or insert another picture during the conversation.
"""
import argparse
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
parser = argparse.ArgumentParser()
parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits')
parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")
args = parser.parse_args()
MODEL_PATH = args.from_pretrained
TOKENIZER_PATH = args.local_tokenizer
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH)
if args.bf16:
torch_type = torch.bfloat16
else:
torch_type = torch.float16
print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE))
if args.quant:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=torch_type,
low_cpu_mem_usage=True,
load_in_4bit=True,
trust_remote_code=True
).eval()
else:
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
torch_dtype=torch_type,
low_cpu_mem_usage=True,
load_in_4bit=args.quant is not None,
trust_remote_code=True
).to(DEVICE).eval()
text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
while True:
image_path = input("image path >>>>> ")
if image_path == '':
print('You did not enter image path, the following will be a plain text conversation.')
image = None
text_only_first_query = True
else:
image = Image.open(image_path).convert('RGB')
history = []
while True:
query = input("Human:")
if query == "clear":
break
if image is None:
if text_only_first_query:
query = text_only_template.format(query)
text_only_first_query = False
else:
old_prompt = ''
for _, (old_query, response) in enumerate(history):
old_prompt += old_query + " " + response + "\n"
query = old_prompt + "USER: {} ASSISTANT:".format(query)
if image is None:
input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, template_version='base')
else:
input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image])
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]] if image is not None else None,
}
if 'cross_images' in input_by_model and input_by_model['cross_images']:
inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]]
# add any transformers params here.
gen_kwargs = {"max_length": 2048,
"do_sample": False} # "temperature": 0.9
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
response = tokenizer.decode(outputs[0])
response = response.split("")[0]
print("\nCog:", response)
history.append((query, response))
================================================
FILE: basic_demo/cli_demo_sat.py
================================================
# -*- encoding: utf-8 -*-
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import torch
import argparse
from sat.model.mixins import CachedAutoregressiveMixin
from sat.quantization.kernels import quantize
from sat.model import AutoModel
from utils.utils import chat, llama2_tokenizer, llama2_text_processor_inference, get_image_processor
from utils.models import CogAgentModel, CogVLMModel
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--max_length", type=int, default=2048, help='max length of the total sequence')
parser.add_argument("--top_p", type=float, default=0.4, help='top p for nucleus sampling')
parser.add_argument("--top_k", type=int, default=1, help='top k for top k sampling')
parser.add_argument("--temperature", type=float, default=.8, help='temperature for sampling')
parser.add_argument("--chinese", action='store_true', help='Chinese interface')
parser.add_argument("--version", type=str, default="chat", choices=['chat', 'vqa', 'chat_old', 'base'], help='version of language process. if there is \"text_processor_version\" in model_config.json, this option will be overwritten')
parser.add_argument("--quant", choices=[8, 4], type=int, default=None, help='quantization bits')
parser.add_argument("--from_pretrained", type=str, default="cogagent-chat", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")
parser.add_argument("--stream_chat", action="store_true")
args = parser.parse_args()
rank = int(os.environ.get('RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))
args = parser.parse_args()
# load model
model, model_args = AutoModel.from_pretrained(
args.from_pretrained,
args=argparse.Namespace(
deepspeed=None,
local_rank=rank,
rank=rank,
world_size=world_size,
model_parallel_size=world_size,
mode='inference',
skip_init=True,
use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
device='cpu' if args.quant else 'cuda',
**vars(args)
), overwrite_args={'model_parallel_size': world_size} if world_size != 1 else {})
model = model.eval()
from sat.mpu import get_model_parallel_world_size
assert world_size == get_model_parallel_world_size(), "world size must equal to model parallel size for cli_demo!"
language_processor_version = model_args.text_processor_version if 'text_processor_version' in model_args else args.version
print("[Language processor version]:", language_processor_version)
tokenizer = llama2_tokenizer(args.local_tokenizer, signal_type=language_processor_version)
image_processor = get_image_processor(model_args.eva_args["image_size"][0])
cross_image_processor = get_image_processor(model_args.cross_image_pix) if "cross_image_pix" in model_args else None
if args.quant:
quantize(model, args.quant)
if torch.cuda.is_available():
model = model.cuda()
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
text_processor_infer = llama2_text_processor_inference(tokenizer, args.max_length, model.image_length)
if args.chinese:
if rank == 0:
print('欢迎使用 CogAgent-CLI ,输入图像URL或本地路径读图,继续输入内容对话,clear 重新开始,stop 终止程序')
else:
if rank == 0:
print('Welcome to CogAgent-CLI. Enter an image URL or local file path to load an image. Continue inputting text to engage in a conversation. Type "clear" to start over, or "stop" to end the program.')
with torch.no_grad():
while True:
history = None
cache_image = None
if args.chinese:
if rank == 0:
image_path = [input("请输入图像路径或URL: ")]
else:
image_path = [None]
else:
if rank == 0:
image_path = [input("Please enter the image path or URL: ")]
else:
image_path = [None]
if world_size > 1:
torch.distributed.broadcast_object_list(image_path, 0)
image_path = image_path[0]
assert image_path is not None
if image_path == 'stop':
break
if args.chinese:
if rank == 0:
query = [input("用户:")]
else:
query = [None]
else:
if rank == 0:
query = [input("User: ")]
else:
query = [None]
if world_size > 1:
torch.distributed.broadcast_object_list(query, 0)
query = query[0]
assert query is not None
while True:
if query == "clear":
break
if query == "stop":
sys.exit(0)
try:
response, history, cache_image = chat(
image_path,
model,
text_processor_infer,
image_processor,
query,
history=history,
cross_img_processor=cross_image_processor,
image=cache_image,
max_length=args.max_length,
top_p=args.top_p,
temperature=args.temperature,
top_k=args.top_k,
invalid_slices=text_processor_infer.invalid_slices,
args=args
)
except Exception as e:
print(e)
break
if rank == 0 and not args.stream_chat:
if args.chinese:
print("模型:"+response)
else:
print("Model: "+response)
image_path = None
if args.chinese:
if rank == 0:
query = [input("用户:")]
else:
query = [None]
else:
if rank == 0:
query = [input("User: ")]
else:
query = [None]
if world_size > 1:
torch.distributed.broadcast_object_list(query, 0)
query = query[0]
assert query is not None
if __name__ == "__main__":
main()
================================================
FILE: basic_demo/web_demo.py
================================================
"""
This script is a simple web demo of the CogVLM and CogAgent models, designed for easy and quick demonstrations.
For a more sophisticated user interface, users are encouraged to refer to the 'composite_demo',
which is built with a more aesthetically pleasing Streamlit framework.
Usage:
- Use the interface to upload images and enter text prompts to interact with the models.
Requirements:
- Gradio (only 3.x,4.x is not support) and other necessary Python dependencies must be installed.
- Proper model checkpoints should be accessible as specified in the script.
Note: This demo is ideal for a quick showcase of the CogVLM and CogAgent models. For a more comprehensive and interactive
experience, refer to the 'composite_demo'.
"""
import gradio as gr
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from PIL import Image
import torch
import time
from sat.model.mixins import CachedAutoregressiveMixin
from sat.mpu import get_model_parallel_world_size
from sat.model import AutoModel
from utils.utils import chat, llama2_tokenizer, llama2_text_processor_inference, get_image_processor, parse_response
from utils.models import CogAgentModel, CogVLMModel
DESCRIPTION = ''''''
NOTES = ' This app is adapted from https://github.com/THUDM/CogVLM. It would be recommended to check out the repo if you want to see the detail of our model, CogVLM & CogAgent.
'
MAINTENANCE_NOTICE1 = 'Hint 1: If the app report "Something went wrong, connection error out", please turn off your proxy and retry.
Hint 2: If you upload a large size of image like 10MB, it may take some time to upload and process. Please be patient and wait.'
AGENT_NOTICE = 'Hint 1: To use Agent function, please use the prompts for agents.'
GROUNDING_NOTICE = 'Hint 2: To use Grounding function, please use the prompts for grounding.'
default_chatbox = [("", "Hi, What do you want to know about this image?")]
model = image_processor = text_processor_infer = None
is_grounding = False
def process_image_without_resize(image_prompt):
image = Image.open(image_prompt)
# print(f"height:{image.height}, width:{image.width}")
timestamp = int(time.time())
file_ext = os.path.splitext(image_prompt)[1]
filename_grounding = f"examples/{timestamp}_grounding{file_ext}"
return image, filename_grounding
from sat.quantization.kernels import quantize
def load_model(args):
model, model_args = AutoModel.from_pretrained(
args.from_pretrained,
args=argparse.Namespace(
deepspeed=None,
local_rank=0,
rank=0,
world_size=world_size,
model_parallel_size=world_size,
mode='inference',
fp16=args.fp16,
bf16=args.bf16,
skip_init=True,
use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
device='cpu' if args.quant else 'cuda'),
overwrite_args={'model_parallel_size': world_size} if world_size != 1 else {}
)
model = model.eval()
assert world_size == get_model_parallel_world_size(), "world size must equal to model parallel size for cli_demo!"
language_processor_version = model_args.text_processor_version if 'text_processor_version' in model_args else args.version
tokenizer = llama2_tokenizer(args.local_tokenizer, signal_type=language_processor_version)
image_processor = get_image_processor(model_args.eva_args["image_size"][0])
cross_image_processor = get_image_processor(model_args.cross_image_pix) if "cross_image_pix" in model_args else None
if args.quant:
quantize(model, args.quant)
if torch.cuda.is_available():
model = model.cuda()
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
text_processor_infer = llama2_text_processor_inference(tokenizer, args.max_length, model.image_length)
return model, image_processor, cross_image_processor, text_processor_infer
def post(
input_text,
temperature,
top_p,
top_k,
image_prompt,
result_previous,
hidden_image,
state
):
result_text = [(ele[0], ele[1]) for ele in result_previous]
for i in range(len(result_text)-1, -1, -1):
if result_text[i][0] == "" or result_text[i][0] == None:
del result_text[i]
print(f"history {result_text}")
global model, image_processor, cross_image_processor, text_processor_infer, is_grounding
try:
with torch.no_grad():
pil_img, image_path_grounding = process_image_without_resize(image_prompt)
response, _, cache_image = chat(
image_path="",
model=model,
text_processor=text_processor_infer,
img_processor=image_processor,
query=input_text,
history=result_text,
cross_img_processor=cross_image_processor,
image=pil_img,
max_length=2048,
top_p=top_p,
temperature=temperature,
top_k=top_k,
invalid_slices=text_processor_infer.invalid_slices if hasattr(text_processor_infer, "invalid_slices") else [],
no_prompt=False,
args=state['args']
)
except Exception as e:
print("error message", e)
result_text.append((input_text, 'Timeout! Please wait a few minutes and retry.'))
return "", result_text, hidden_image
answer = response
if is_grounding:
parse_response(pil_img, answer, image_path_grounding)
new_answer = answer.replace(input_text, "")
result_text.append((input_text, new_answer))
result_text.append((None, (image_path_grounding,)))
else:
result_text.append((input_text, answer))
print(result_text)
print('finished')
return "", result_text, hidden_image
def clear_fn(value):
return "", default_chatbox, None
def clear_fn2(value):
return default_chatbox
def main(args):
global model, image_processor, cross_image_processor, text_processor_infer, is_grounding
model, image_processor, cross_image_processor, text_processor_infer = load_model(args)
is_grounding = 'grounding' in args.from_pretrained
gr.close_all()
with gr.Blocks(css='style.css') as demo:
state = gr.State({'args': args})
gr.Markdown(DESCRIPTION)
gr.Markdown(NOTES)
with gr.Row():
with gr.Column(scale=5):
with gr.Group():
gr.Markdown(AGENT_NOTICE)
gr.Markdown(GROUNDING_NOTICE)
input_text = gr.Textbox(label='Input Text', placeholder='Please enter text prompt below and press ENTER.')
with gr.Row():
run_button = gr.Button('Generate')
clear_button = gr.Button('Clear')
image_prompt = gr.Image(type="filepath", label="Image Prompt", value=None)
with gr.Row():
temperature = gr.Slider(maximum=1, value=0.8, minimum=0, label='Temperature')
top_p = gr.Slider(maximum=1, value=0.4, minimum=0, label='Top P')
top_k = gr.Slider(maximum=100, value=10, minimum=1, step=1, label='Top K')
with gr.Column(scale=5):
result_text = gr.components.Chatbot(label='Multi-round conversation History', value=[("", "Hi, What do you want to know about this image?")], height=600)
hidden_image_hash = gr.Textbox(visible=False)
gr.Markdown(MAINTENANCE_NOTICE1)
print(gr.__version__)
run_button.click(fn=post,inputs=[input_text, temperature, top_p, top_k, image_prompt, result_text, hidden_image_hash, state],
outputs=[input_text, result_text, hidden_image_hash])
input_text.submit(fn=post,inputs=[input_text, temperature, top_p, top_k, image_prompt, result_text, hidden_image_hash, state],
outputs=[input_text, result_text, hidden_image_hash])
clear_button.click(fn=clear_fn, inputs=clear_button, outputs=[input_text, result_text, image_prompt])
image_prompt.upload(fn=clear_fn2, inputs=clear_button, outputs=[result_text])
image_prompt.clear(fn=clear_fn2, inputs=clear_button, outputs=[result_text])
# demo.queue(concurrency_count=10)
demo.launch()
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--max_length", type=int, default=2048, help='max length of the total sequence')
parser.add_argument("--top_p", type=float, default=0.4, help='top p for nucleus sampling')
parser.add_argument("--top_k", type=int, default=1, help='top k for top k sampling')
parser.add_argument("--temperature", type=float, default=.8, help='temperature for sampling')
parser.add_argument("--version", type=str, default="chat", choices=['chat', 'vqa', 'chat_old', 'base'], help='version of language process. if there is \"text_processor_version\" in model_config.json, this option will be overwritten')
parser.add_argument("--quant", choices=[8, 4], type=int, default=None, help='quantization bits')
parser.add_argument("--from_pretrained", type=str, default="cogagent-chat", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")
parser.add_argument("--stream_chat", action="store_true")
args = parser.parse_args()
rank = int(os.environ.get('RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))
args = parser.parse_args()
main(args)
================================================
FILE: composite_demo/client.py
================================================
from __future__ import annotations
from threading import Thread
import streamlit as st
import torch
import warnings
import os
from typing import Any, Protocol
from collections.abc import Iterable
from huggingface_hub.inference._text_generation import TextGenerationStreamResponse, Token
from transformers import AutoTokenizer, TextIteratorStreamer, AutoModelForCausalLM
from conversation import Conversation
# Check if GPU supports bfloat16
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
torch_type = torch.bfloat16
else:
torch_type = torch.float16
warnings.warn("Your GPU does not support bfloat16 type, use fp16 instead")
# if you use all of Our model, include cogagent-chat cogvlm-chat cogvlm-grounding and put it in different devices, you can do like this.
models_info = {
'tokenizer': {
'path': os.environ.get('TOKENIZER_PATH', 'lmsys/vicuna-7b-v1.5'),
},
'agent_chat': {
'path': os.environ.get('MODEL_PATH_AGENT_CHAT', 'THUDM/cogagent-chat-hf'),
'device': ['cuda:0']
},
'vlm_chat': {
'path': os.environ.get('MODEL_PATH_VLM_CHAT', 'THUDM/cogvlm-chat-hf'),
'device': ['cuda:3']
},
'vlm_grounding': {
'path': os.environ.get('MODEL_PATH_VLM_GROUNDING','THUDM/cogvlm-grounding-generalist-hf'),
'device': ['cuda:6']
}
}
# if you just use one model, use like this
# models_info = {
# 'tokenizer': {
# 'path': os.environ.get('TOKENIZER_PATH', 'lmsys/vicuna-7b-v1.5'),
# },
# 'agent_chat': {
# 'path': os.environ.get('MODEL_PATH_AGENT_CHAT', 'THUDM/cogagent-chat-hf'),
# 'device': ['cuda:0']
# },
@st.cache_resource
def get_client() -> Client:
client = HFClient(models_info)
return client
def process_history(history: list[Conversation]):
"""
Process the input history to extract the query and the history pairs.
Args:
History(list[Conversation]): A list of Conversation objects representing all conversations.
Returns:
query(str): The current user input string.
history_pairs(list[(str,str)]): A list of (user, assistant) pairs.
last_user_image(Image): The last user image. Only the latest image.
"""
history_pairs = []
query = ""
last_user_image = None
user_text = None
for i, conversation in enumerate(history):
if conversation.role == conversation.role.USER:
user_text = conversation.content
if conversation.image:
last_user_image = conversation.image
if i == len(history) - 1:
query = conversation.content
else:
if user_text is not None:
history_pairs.append((user_text, conversation.content))
user_text = None
return query, history_pairs, last_user_image
class Client(Protocol):
def generate_stream(self,
history: list[Conversation],
grounding: bool = False,
model_use: str = 'agent_chat',
**parameters: Any
) -> Iterable[TextGenerationStreamResponse]:
...
class HFClient(Client):
"""
The HFClient class manages the interaction with various large language models
for text generation tasks. It supports handling multiple models, each designated
for a specific task like chatting or grounding.
Args:
models_info (dict): A dictionary containing the configuration for each model.
The dictionary format is:
- 'tokenizer': Path and settings for the tokenizer.
- 'agent_chat': Path and settings for the CogAgent-chat-18B model.
- 'vlm_chat': Path and settings for the CogVLM-chat-17B model.
- 'vlm_grounding': Path and settings for the CogVLM-grounding-17B model.
The class loads each model based on the provided information and assigns it to the
specified CUDA device. It also handles the tokenizer used across all models.
"""
def __init__(self, models_info):
self.models = {}
self.tokenizer = AutoTokenizer.from_pretrained(models_info['tokenizer']['path'], trust_remote_code=True)
for model_name, model_info in models_info.items():
if model_name != 'tokenizer':
self.models[model_name] = []
for device in model_info['device']:
model = AutoModelForCausalLM.from_pretrained(
model_info['path'],
torch_dtype=torch_type,
low_cpu_mem_usage=True,
trust_remote_code=True,
).to(device).eval()
self.models[model_name].append(model)
def select_best_gpu(self, model_name):
min_memory_used = None
selected_model = None
for model in self.models[model_name]:
device = next(model.parameters()).device
mem_used = torch.cuda.memory_allocated(device=device)
if min_memory_used is None or mem_used < min_memory_used:
min_memory_used = mem_used
selected_model = model
return selected_model
def generate_stream(self,
history: list,
grounding: bool = False,
model_use: str = 'agent_chat',
**parameters: Any
) -> Iterable[TextGenerationStreamResponse]:
"""
Generates a stream of text responses based on the input history and selected model.
This method facilitates a chat-like interaction with the models. Depending on the
model selected and whether grounding is enabled, it alters the behavior of the text
generation process.
Args:
history (list[Conversation]): A list of Conversation objects representing the
dialogue history.
grounding (bool, optional): A flag to indicate whether grounding should be used
in the generation process. Defaults to False.
model_use (str, optional): The key name of the model to be used for the generation.
Defaults to 'agent_chat'.
**parameters (Any): Additional parameters that may be required for the generation
process.
Yields:
Iterable[TextGenerationStreamResponse]: A stream of text generation responses, each
encapsulating a generated piece of text.
The method selects the appropriate model based on `model_use`, processes the input
history, and feeds it into the model to generate text. It uses threading to handle
the generation process efficiently.
"""
query, history, image = process_history(history)
if grounding:
query += "(with grounding)"
model = self.select_best_gpu(model_use)
device = next(model.parameters()).device
# Print user input info
print("\n== Input ==\n", query)
print("\n==History==\n", history)
print("\n== Model ==\n\n", model.config.name_or_path)
print("\n== Device ==\n\n", device)
input_by_model = model.build_conversation_input_ids(
self.tokenizer,
query=query,
history=history,
images=[image]
)
inputs = {
'input_ids': input_by_model['input_ids'].unsqueeze(0).to(device),
'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(device),
'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(device),
'images': [[input_by_model['images'][0].to(device).to(torch_type)]],
}
# CogVLM model do not have param 'cross_images', Only CogAgent have.
if 'cross_images' in input_by_model and input_by_model['cross_images']:
inputs['cross_images'] = [[input_by_model['cross_images'][0].to(device).to(torch_type)]]
# Use TextIteratorStreamer for streaming generation like huggingface.
streamer = TextIteratorStreamer(self.tokenizer, timeout=20.0, skip_prompt=True, skip_special_tokens=True)
parameters['streamer'] = streamer
gen_kwargs = {**parameters, **inputs}
with torch.no_grad():
thread = Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()
for next_text in streamer:
yield TextGenerationStreamResponse(
token=Token(
id=0,
logprob=0,
text=next_text,
special=False,
)
)
================================================
FILE: composite_demo/conversation.py
================================================
import requests
import re
import streamlit as st
from dataclasses import dataclass
from enum import auto, Enum
from PIL.Image import Image
from PIL import ImageDraw
from streamlit.delta_generator import DeltaGenerator
class Role(Enum):
"""
CogVLM | CogAgent Only have 2 roles: USER, ASSISTANT
Represents the roles in a conversation, specifically for CogVLM and CogAgent applications.
There are two roles available:
- USER: The user of the system, typically the one asking questions or initiating conversation.
- ASSISTANT: The system or AI assistant responding to the user's queries.
Methods:
get_message(self):
Retrieves a Streamlit chat message component based on the role. For the USER role, it
returns a chat message with the name "user" and user avatar. For the ASSISTANT role,
it returns a chat message with the name "assistant" and assistant avatar.
"""
USER = auto()
ASSISTANT = auto()
def get_message(self):
match self.value:
case Role.USER.value:
return st.chat_message(name="user", avatar="user")
case Role.ASSISTANT.value:
return st.chat_message(name="assistant", avatar="assistant")
case _:
st.error(f'Unexpected role: {self}')
@dataclass
class Conversation:
"""
Represents a single conversation turn within a dialogue.
Attributes:
role (Role): The role of the speaker in the conversation (USER or ASSISTANT).
content (str): The textual content of the conversation turn.
image (Image, optional): An optional image associated with the conversation turn.
content_show (str, optional): The content to be displayed in the WebUI. This may differ
from `content` if translation or other processing is applied.
translate (bool, optional): Whether to translate the content of the conversation turn.
Methods:
__str__(self) -> str:
Returns a string representation of the conversation turn, including the role and content.
show(self, placeholder: DeltaGenerator | None = None) -> str:
Displays the conversation turn in the WebUI. If `placeholder` is provided, the content
is shown in the specified Streamlit container. Otherwise, it uses the message style
determined by the role.
"""
role: Role = Role.USER
content: str = ""
image: Image | None = None
content_show: str | None = None
translate: bool = False
def __str__(self) -> str:
print(self.role, self.content)
match self.role:
case Role.USER | Role.ASSISTANT:
return f'{self.role}\n{self.content}'
def show(self, placeholder: DeltaGenerator | None = None) -> str:
"""
show in markdown formate
"""
if placeholder:
message = placeholder
else:
message = self.role.get_message()
# for Chinese WebUI show
if self.role == Role.USER:
if self.translate:
self.content = translate_baidu(self.content_show, source_lan="zh", target_lan="en")
if self.content == "error":
self.content_show = "Please Enter your Baidu Translation API Key in function translate_baidu()"
else:
self.content = self.content_show
if self.role == Role.ASSISTANT:
if self.translate:
self.content_show = translate_baidu(self.content, source_lan="en", target_lan="zh")
else:
self.content_show = self.content
self.content_show = self.content_show.replace('\n', ' \n')
message.markdown(self.content_show)
if self.image:
message.image(self.image)
def preprocess_text(history: list[Conversation], ) -> str:
"""
Prepares the conversation history for processing by concatenating the content of each turn.
Args:
history (list[Conversation]): The conversation history, a list of Conversation objects.
Returns:
str: A single string that concatenates the content of each conversation turn, followed by
the ASSISTANT role indicator. This string is suitable for use as input to a text generation model.
"""
prompt = ""
for conversation in history:
prompt += f'{conversation}'
prompt += f'{Role.ASSISTANT}\n'
return prompt
def postprocess_text(template: str, text: str) -> str:
"""
Post-processes the generated text by incorporating it into a given template.
Args:
template (str): A template string containing a placeholder for the generated text.
text (str): The generated text to be incorporated into the template.
Returns:
str: The template with the generated text replacing the placeholder.
"""
quoted_text = f'"{text.strip()}"'
return template.replace("", quoted_text).strip() if template != "" else text.strip()
def postprocess_image(text: str, img: Image) -> (str, Image):
"""
Processes the given text to identify and draw bounding boxes on the provided image.
This function searches for patterns in the text that represent coordinates for bounding
boxes and draws rectangles on the image at these coordinates. Each box is drawn in a
different color for distinction.
Args:
text (str): The text containing bounding box coordinates in a specific pattern.
img (Image): The image on which to draw the bounding boxes.
Returns:
tuple[str, Image]: The processed text with additional annotations for each bounding
box, and the image with the drawn bounding boxes.
"""
colors = ["red", "green", "blue", "yellow", "purple", "orange"]
# Updated pattern to match single or multiple coordinate groups
pattern = r"\[\[([\d,]+(?:;[\d,]+)*)\]\]"
matches = re.findall(pattern, text)
draw = ImageDraw.Draw(img)
if not matches:
return text, None
for i, match in enumerate(matches):
# Splitting the matched string into individual coordinate groups
coords_groups = match.split(';')
# Determining the color for the current match
color = colors[i % len(colors)]
for coords_str in coords_groups:
coords = coords_str.split(',')
if len(coords) == 4: # Rectangle
scaled_coords = (
int(float(coords[0]) * 0.001 * img.width),
int(float(coords[1]) * 0.001 * img.height),
int(float(coords[2]) * 0.001 * img.width),
int(float(coords[3]) * 0.001 * img.height)
)
draw.rectangle(scaled_coords, outline=color, width=3)
elif len(coords) == 2: # Point
scaled_coords = (
int(float(coords[0]) * 0.001 * img.width),
int(float(coords[1]) * 0.001 * img.height)
)
radius = 5
draw.ellipse([scaled_coords[0] - radius, scaled_coords[1] - radius,
scaled_coords[0] + radius, scaled_coords[1] + radius],
fill=color)
return text, img
def translate_baidu(translate_text, source_lan, target_lan):
"""
Translates text using Baidu's translation service. (if you are not use English)
This function sends a request to the Baidu translation API to translate the provided text
from the source language to the target language.
Args:
translate_text (str): The text to be translated.
source_lan (str): The source language code (e.g., "en" for English).
target_lan (str): The target language code (e.g., "zh" for Chinese).
Returns:
str: The translated text. Returns "error" in case of an exception.
"""
url = "https://aip.baidubce.com/rpc/2.0/mt/texttrans/v1?access_token="
headers = {'Content-Type': 'application/json'}
payload = {
'q': translate_text,
'from': source_lan,
'to': target_lan
}
try:
r = requests.post(url, json=payload, headers=headers)
result = r.json()
final_translation = ''
for item in result['result']['trans_result']:
final_translation += item['dst'] + '\n'
except Exception as e:
print(e)
return "error"
return final_translation
================================================
FILE: composite_demo/demo_agent_cogagent.py
================================================
from io import BytesIO
import base64
import streamlit as st
import re
from streamlit.delta_generator import DeltaGenerator
from client import get_client
from conversation import postprocess_text, Conversation, Role, postprocess_image
from PIL import Image
from utils import images_are_same
client = get_client()
def append_conversation(
conversation: Conversation,
history: list[Conversation],
placeholder: DeltaGenerator | None = None,
) -> None:
history.append(conversation)
conversation.show(placeholder)
def main(
top_p: float = 0.8,
temperature: float = 0.95,
prompt_text: str = "",
metadata: str = "",
top_k: int = 2,
max_new_tokens: int = 2048,
grounding: bool = False,
retry: bool = False,
template: str = ""
):
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
if prompt_text == "" and retry == False:
print("\n== Clean ==\n")
st.session_state.chat_history = []
return
history: list[Conversation] = st.session_state.chat_history
for conversation in history:
conversation.show()
if retry:
print("\n== Retry ==\n")
last_user_conversation_idx = None
for idx, conversation in enumerate(history):
if conversation.role == Role.USER:
last_user_conversation_idx = idx
if last_user_conversation_idx is not None:
prompt_text = history[last_user_conversation_idx].content_show
del history[last_user_conversation_idx:]
if prompt_text:
image = Image.open(BytesIO(base64.b64decode(metadata))).convert('RGB') if metadata else None
image.thumbnail((1120, 1120))
image_input = image
if history and image:
last_user_image = next(
(conv.image for conv in reversed(history) if conv.role == Role.USER and conv.image), None)
if last_user_image and images_are_same(image, last_user_image):
image_input = None
# Not necessary to clear history
# else:
# # new picture means new conversation
# st.session_state.chat_history = []
# history = []
# Set conversation
if re.search('[\u4e00-\u9fff]', prompt_text):
translate = True
else:
translate = False
user_conversation = Conversation(
role=Role.USER,
translate=translate,
content_show=prompt_text.strip() if retry else postprocess_text(template=template,
text=prompt_text.strip()),
image=image_input
)
append_conversation(user_conversation, history)
placeholder = st.empty()
assistant_conversation = placeholder.chat_message(name="assistant", avatar="assistant")
assistant_conversation = assistant_conversation.empty()
# steam Answer
output_text = ''
for response in client.generate_stream(
model_use='agent_chat',
grounding=grounding,
history=history,
do_sample=True,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
top_k=top_k,
):
output_text += response.token.text
assistant_conversation.markdown(output_text.strip() + '▌')
## Final Answer with image.
print("\n==Output:==\n", output_text)
content_output, image_output = postprocess_image(output_text, image)
assistant_conversation = Conversation(
role=Role.ASSISTANT,
content=content_output,
image=image_output,
translate=translate,
)
append_conversation(
conversation=assistant_conversation,
history=history,
placeholder=placeholder.chat_message(name="assistant", avatar="assistant"),
)
================================================
FILE: composite_demo/demo_chat_cogagent.py
================================================
import streamlit as st
import base64
import re
from PIL import Image
from io import BytesIO
from streamlit.delta_generator import DeltaGenerator
from client import get_client
from utils import images_are_same
from conversation import Conversation, Role, postprocess_image, postprocess_text
client = get_client()
def append_conversation(
conversation: Conversation,
history: list[Conversation],
placeholder: DeltaGenerator | None = None,
) -> None:
history.append(conversation)
conversation.show(placeholder)
def main(
top_p: float = 0.8,
temperature: float = 0.95,
prompt_text: str = "",
metadata: str = "",
top_k: int = 2,
max_new_tokens: int = 2048,
grounding: bool = False,
retry: bool = False,
template: str = "",
):
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
if prompt_text == "" and retry == False:
print("\n== Clean ==\n")
st.session_state.chat_history = []
return
history: list[Conversation] = st.session_state.chat_history
for conversation in history:
conversation.show()
if retry:
last_user_conversation_idx = None
for idx, conversation in enumerate(history):
if conversation.role == Role.USER:
last_user_conversation_idx = idx
if last_user_conversation_idx is not None:
prompt_text = history[last_user_conversation_idx].content_show
del history[last_user_conversation_idx:]
if prompt_text:
image = Image.open(BytesIO(base64.b64decode(metadata))).convert('RGB') if metadata else None
image.thumbnail((1120, 1120))
image_input = image
if history and image:
last_user_image = next(
(conv.image for conv in reversed(history) if conv.role == Role.USER and conv.image), None)
if last_user_image and images_are_same(image, last_user_image):
image_input = None
else:
st.session_state.chat_history = []
history = []
# Set conversation
if re.search('[\u4e00-\u9fff]', prompt_text):
translate = True
else:
translate = False
user_conversation = Conversation(
role=Role.USER,
translate=translate,
content_show=prompt_text.strip() if retry else postprocess_text(template=template,
text=prompt_text.strip()),
image=image_input
)
append_conversation(user_conversation, history)
placeholder = st.empty()
assistant_conversation = placeholder.chat_message(name="assistant", avatar="assistant")
assistant_conversation = assistant_conversation.empty()
# steam Answer
output_text = ''
for response in client.generate_stream(
model_use='agent_chat',
grounding=grounding,
history=history,
do_sample=True,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
top_k=top_k,
):
output_text += response.token.text
assistant_conversation.markdown(output_text.strip() + '▌')
print("\n==Output:==\n", output_text)
content_output, image_output = postprocess_image(output_text, image)
assistant_conversation = Conversation(
role=Role.ASSISTANT,
content=content_output,
image=image_output,
translate=translate
)
append_conversation(
conversation=assistant_conversation,
history=history,
placeholder=placeholder.chat_message(name="assistant", avatar="assistant")
)
================================================
FILE: composite_demo/demo_chat_cogvlm.py
================================================
import streamlit as st
import base64
import re
from PIL import Image
from io import BytesIO
from streamlit.delta_generator import DeltaGenerator
from client import get_client
from utils import images_are_same
from conversation import Conversation, Role, postprocess_image, postprocess_text
client = get_client()
def append_conversation(
conversation: Conversation,
history: list[Conversation],
placeholder: DeltaGenerator | None = None,
) -> None:
history.append(conversation)
conversation.show(placeholder)
def main(
top_p: float = 0.8,
temperature: float = 0.95,
prompt_text: str = "",
metadata: str = "",
top_k: int = 2,
max_new_tokens: int = 2048,
grounding: bool = False,
retry: bool = False,
template: str = "",
):
if 'chat_history' not in st.session_state:
st.session_state.chat_history = []
if prompt_text == "" and retry == False:
print("\n== Clean ==\n")
st.session_state.chat_history = []
return
history: list[Conversation] = st.session_state.chat_history
for conversation in history:
conversation.show()
if retry:
last_user_conversation_idx = None
for idx, conversation in enumerate(history):
if conversation.role == Role.USER:
last_user_conversation_idx = idx
if last_user_conversation_idx is not None:
prompt_text = history[last_user_conversation_idx].content_show
del history[last_user_conversation_idx:]
if prompt_text:
image = Image.open(BytesIO(base64.b64decode(metadata))).convert('RGB') if metadata else None
image.thumbnail((1120, 1120))
image_input = image
if history and image:
last_user_image = next(
(conv.image for conv in reversed(history) if conv.role == Role.USER and conv.image), None)
if last_user_image and images_are_same(image, last_user_image):
image_input = None
else:
st.session_state.chat_history = []
history = []
# Set conversation
if re.search('[\u4e00-\u9fff]', prompt_text):
translate = True
else:
translate = False
user_conversation = Conversation(
role=Role.USER,
translate=translate,
content_show=prompt_text.strip() if retry else postprocess_text(template=template,
text=prompt_text.strip()),
image=image_input
)
append_conversation(user_conversation, history)
placeholder = st.empty()
assistant_conversation = placeholder.chat_message(name="assistant", avatar="assistant")
assistant_conversation = assistant_conversation.empty()
# steam Answer
output_text = ''
for response in client.generate_stream(
model_use='vlm_grounding' if grounding else 'vlm_chat',
grounding=False,
history=history,
do_sample=True,
max_new_tokens=max_new_tokens,
temperature=temperature,
top_p=top_p,
top_k=top_k,
):
output_text += response.token.text
assistant_conversation.markdown(output_text.strip() + '▌')
print("\n==Output:==\n", output_text)
content_output, image_output = postprocess_image(output_text, image)
assistant_conversation = Conversation(
role=Role.ASSISTANT,
content=content_output,
image=image_output,
translate=translate
)
append_conversation(
conversation=assistant_conversation,
history=history,
placeholder=placeholder.chat_message(name="assistant", avatar="assistant")
)
================================================
FILE: composite_demo/main.py
================================================
"""
This is a demo using the chat version about CogAgent and CogVLM in WebDEMO
Make sure you have installed the vicuna-7b-v1.5 tokenizer model (https://huggingface.co/lmsys/vicuna-7b-v1.5),
and a full checkpoint of vicuna-7b-v1.5 LLM is not required.
Mention that only one image can be processed in a conversation, which means you cannot replace or insert another image
during the conversation.
The models_info parameter is explained as follows
tokenizer: tokenizer model using vicuna-7b-v1.5 model
agent_chat: Use the CogAgent-chat-18B model to complete the conversation task
vlm_chat: Use the CogVLM-chat-17B model to complete the conversation task
vlm_grounding: Use CogVLM-grounding-17B model to complete the Grounding task
Web Demo user operation logic is as follows:
CogVLM-Chat -> grounding? - yes -> Choose a template -> CogVLM-grounding-17B
- no -> CogVLM-chat-17B (without grounding)
CogAgent-Chat -> CogAgent-chat-18B (Only QA,without Grounding)
CogAgent-Agent -> CogAgent-chat-18B
-> Choose a template -> grounding? - yes -> prompt + (with grounding)
- no -> prompt
CogAgent-vqa-hf are not included in this demo, but you can use it in the same way as CogAgent-chat-18B
and used it in CogAgent-Chat
"""
import streamlit as st
st.set_page_config(
page_title="CogVLM & CogAgent Demo",
page_icon=":robot:",
layout='centered',
initial_sidebar_state='expanded',
)
from enum import Enum
from utils import encode_file_to_base64, templates_agent_cogagent, template_grounding_cogvlm
import demo_chat_cogvlm, demo_agent_cogagent, demo_chat_cogagent
st.markdown("CogAgent & CogVLM Chat Demo
", unsafe_allow_html=True)
st.markdown(
"更多使用方法请参考文档: https://lslfd0slxc.feishu.cn/wiki/WvQbwIJ9tiPAxGk8ywDck6yfnof \n\n 请根据文档的引导说明来尝试demo,以便理解demo的布局设计 \n",
unsafe_allow_html=True)
class Mode(str, Enum):
CogVLM_Chat, CogAgent_Chat, CogAgent_Agent = '💬CogVLM-Chat', '🧑💻 CogAgent-Chat', '💡 CogAgent-Agent'
with st.sidebar:
top_p = st.slider(
'top_p', 0.0, 1.0, 0.8, step=0.01
)
temperature = st.slider(
'temperature', 0.01, 1.0, 0.90, step=0.01
)
top_k = st.slider(
'top_k', 1, 20, 5, step=1
)
max_new_token = st.slider(
'Output length', 1, 2048, 2048, step=1
)
uploaded_file = st.file_uploader("Choose an image...", type=['.jpg', '.png', '.jpeg'], accept_multiple_files=False)
cols = st.columns(2)
export_btn = cols[0]
clear_history = cols[1].button("Clear History", use_container_width=True)
retry = export_btn.button("Retry", use_container_width=True)
prompt_text = st.chat_input(
'Chat with CogAgent | CogVLM',
key='chat_input',
)
tab = st.radio(
'Mode',
[mode.value for mode in Mode],
horizontal=True,
label_visibility='hidden',
)
selected_template_grounding_cogvlm = ""
with st.sidebar:
grounding = st.checkbox("Grounding")
if tab == Mode.CogVLM_Chat or tab == Mode.CogAgent_Chat:
if grounding:
selected_template_grounding_cogvlm = st.selectbox("Template For Grounding", template_grounding_cogvlm)
if tab == Mode.CogAgent_Agent:
with st.sidebar:
selected_template_agent_cogagent = st.selectbox("Template For Agent", templates_agent_cogagent)
if clear_history or retry:
prompt_text = ""
match tab:
case Mode.CogVLM_Chat:
st.info("This option uses cogvlm-chat and cogvlm-grounding model.")
if uploaded_file is not None:
demo_chat_cogvlm.main(
retry=retry,
top_p=top_p,
top_k=top_k,
temperature=temperature,
prompt_text=prompt_text,
metadata=encode_file_to_base64(uploaded_file),
max_new_tokens=max_new_token,
grounding=grounding,
template=selected_template_grounding_cogvlm
)
else:
st.error(f'Please upload an image to start')
case Mode.CogAgent_Chat:
st.info("This option uses cogagent-chat model.")
if uploaded_file is not None:
demo_chat_cogagent.main(
retry=retry,
top_p=top_p,
top_k=top_k,
temperature=temperature,
prompt_text=prompt_text,
metadata=encode_file_to_base64(uploaded_file),
max_new_tokens=max_new_token,
grounding=grounding,
template=selected_template_grounding_cogvlm
)
else:
st.error(f'Please upload an image to start')
case Mode.CogAgent_Agent:
st.info("This option uses cogagent-chat model with agent template.")
if uploaded_file is not None:
demo_agent_cogagent.main(
retry=retry,
top_p=top_p,
top_k=top_k,
temperature=temperature,
prompt_text=prompt_text,
metadata=encode_file_to_base64(uploaded_file),
max_new_tokens=max_new_token,
grounding=grounding,
template=selected_template_agent_cogagent
)
else:
st.error(f'Please upload an image to start')
case _:
st.error(f'Unexpected tab: {tab}')
================================================
FILE: composite_demo/utils.py
================================================
import base64
from io import BytesIO
from PIL import Image
def images_are_same(img1: Image, img2: Image) -> bool:
"""
Compare two PIL images.
"""
if img1.size != img2.size or img1.mode != img2.mode:
return False
return list(img1.getdata()) == list(img2.getdata())
def encode_file_to_base64(file):
"""
Convert a file to base64.
"""
buffer = BytesIO()
buffer.write(file.read())
return base64.b64encode(buffer.getvalue()).decode()
# The templates is for CogAgent_Agent Template
templates_agent_cogagent = [
"Can you advise me on how to ?",
"I'm looking for guidance on how to .",
"What steps do I need to take to ?",
"Could you provide instructions for ?",
"I'm wondering what the process is for .",
"How can I go about ?",
"I need assistance with planning to .",
"Do you have any recommendations for ?",
"Please share some tips for .",
"I'd like to know the best way to .",
"What's the most effective way to ?",
"I'm seeking advice on accomplishing .",
"Could you guide me through the steps to ?",
"I'm unsure how to start with .",
"Is there a strategy for successfully ?",
"What's the proper procedure for ?",
"How should I prepare for ?",
"I'm not sure where to begin with .",
"I need some insights on .",
"Can you explain how to tackle ?",
"I'm interested in the process of .",
"Could you enlighten me on ?",
"What are the recommended steps for ?",
"Is there a preferred method for ?",
"I'd appreciate your advice on .",
"Can you shed light on ?",
"What would be the best approach to ?",
"How do I get started with ?",
"I'm inquiring about the procedure for .",
"Could you share your expertise on ?",
"I'd like some guidance on .",
"What's your recommendation for ?",
"I'm seeking your input on how to .",
"Can you provide some insights into ?",
"How can I successfully accomplish ?",
"What steps are involved in ?",
"I'm curious about the best way to .",
"Could you show me the ropes for ?",
"I need to know how to go about .",
"What are the essential steps for ?",
"Is there a specific method for ?",
"I'd like to get some advice on .",
"Can you explain the process of ?",
"I'm looking for guidance on how to approach .",
"What's the proper way to handle ?",
"How should I proceed with ?",
"I'm interested in your expertise on .",
"Could you walk me through the steps for ?",
"I'm not sure where to begin when it comes to .",
"What should I prioritize when doing ?",
"How can I ensure success with ?",
"I'd appreciate some tips on .",
"Can you provide a roadmap for ?",
"What's the recommended course of action for ?",
"I'm seeking your guidance on .",
"Could you offer some suggestions for ?",
"I'd like to know the steps to take for .",
"What's the most effective way to achieve ?",
"How can I make the most of ?",
"I'm wondering about the best approach to .",
"Can you share your insights on ?",
"What steps should I follow to complete ?",
"I'm looking for advice on .",
"What's the strategy for successfully completing ?",
"How should I prepare myself for ?",
"I'm not sure where to start with .",
"What's the procedure for ?",
"Could you provide some guidance on ?",
"I'd like to get some tips on how to .",
"Can you explain how to tackle step by step?",
"I'm interested in understanding the process of .",
"What are the key steps to ?",
"Is there a specific method that works for ?",
"I'd appreciate your advice on successfully completing .",
"Can you shed light on the best way to ?",
"What would you recommend as the first step to ?",
"How do I initiate ?",
"I'm inquiring about the recommended steps for .",
"Could you share some insights into ?",
"I'm seeking your expertise on .",
"What's your recommended approach for ?",
"I'd like some guidance on where to start with .",
"Can you provide recommendations for ?",
"What's your advice for someone looking to ?",
"I'm seeking your input on the process of .",
"How can I achieve success with ?",
"What's the best way to navigate ?",
"I'm curious about the steps required for .",
"Could you show me the proper way to ?",
"I need to know the necessary steps for .",
"What's the most efficient method for ?",
"I'd appreciate your guidance on .",
"Can you explain the steps involved in ?",
"I'm looking for recommendations on how to approach .",
"What's the right way to handle ?",
"How should I manage ?",
"I'm interested in your insights on .",
"Could you provide a step-by-step guide for ?",
"I'm not sure how to start when it comes to .",
"What are the key factors to consider for ?",
"How can I ensure a successful outcome with ?",
"I'd like some tips and tricks for .",
"Can you offer a roadmap for accomplishing ?",
"What's the preferred course of action for ?",
"I'm seeking your expert advice on .",
"Could you suggest some best practices for ?",
"I'd like to understand the necessary steps to complete .",
"What's the most effective strategy for ?",
]
template_grounding_cogvlm = [
"Where is ?",
"Where is in the image?",
"Where is ? answer in [[x0,y0,x1,y1]] format.",
"Can you point out in the image and provide the bounding boxes of its location?",
"Help me to locate in and give me its bounding boxes, please.",
"In the given, could you find and tell me the bounding boxes of ?",
"Guide me to the location of within the image by providing its bounding boxes.",
"I'd like to know the exact bounding boxes of in the photo.",
"Would you kindly provide the bounding boxes of located in the picture?",
"Can you find in and give me the bounding boxes of where it is located?",
"I'm trying to locate in. Can you determine its bounding boxes for me?",
"What are the bounding boxes of in the image?",
"Can you disclose the position of in the photograph by stating its bounding boxes?",
"In, could you let me know the location of in the form of bounding boxes?",
"I need the bounding boxes of in, can you please assist me with that?",
"Where in is located? Provide me with its bounding boxes, please.",
"May I have the bounding boxes of ?",
"In the photograph, could you pinpoint the location of and tell me its bounding boxes?",
"Can you please search and find in, then let me know its bounding boxes?",
"Please, point out the position of in the image by giving its bounding boxes.",
"What are the exact bounding boxes of in the provided picture?",
"Detect the location of in and share the bounding boxes with me, please.",
"In the picture, I'd like you to locate and provide its coordinates.",
"Please indicate the location of in the photo by giving bounding boxes.",
"Find in and share its coordinates with me.",
"Could you please help me find the bounding boxes of in the image?",
"I am looking for the position of in. Can you provide its bounding boxes?",
"In the image, can you locate and let me know its coordinates?",
"I'd appreciate if you could find and tell me the bounding boxes of .",
"In, I need the bounding box bounding boxes of .",
"Point me to the location of in the picture by providing its bounding boxes.",
"Could you trace in and tell me its bounding boxes?",
"Can you assist me in locating in, and then provide its bounding boxes?",
"I'm curious, what are the bounding boxes of in the photo?",
"Kindly share the bounding boxes of located in the image.",
"I would like to find in. Can you give me its bounding boxes?",
"Can you spot in and disclose its bounding boxes to me?",
"Please, reveal the location of in the provided photograph as coordinates.",
"Help me locate and determine the bounding boxes of .",
"I request the bounding boxes of in the image.",
"In the given, can you find and tell me its bounding boxes?",
"I need to know the position of in as bounding boxes.",
"Locate in and provide its bounding boxes, please.",
"Assist me in finding in the photo and provide the bounding box bounding boxes.",
"In, can you guide me to the location of by providing bounding boxes?",
"I'd like the bounding boxes of as it appears in the image.",
"What location does hold in the picture? Inform me of its bounding boxes.",
"Identify the position of in and share its bounding boxes.",
"I'd like to request the bounding boxes of within the photo.",
"How can I locate in the image? Please provide the bounding boxes.",
"I am interested in knowing the bounding boxes of in the picture.",
"Assist me in locating the position of in the photograph and its bounding box bounding boxes.",
"In the image, I need to find and know its bounding boxes. Can you please help?"
"Can you give me a description of the region in image?",
"In the provided image, would you mind describing the selected area ?",
"I need details about the area located within image.",
"Could you please share some information on the region in this photograph?",
"Describe what's happening within the coordinates of the given image.",
"What can you tell me about the selected region in the photo?",
"Please, can you help me understand what's inside the region in image?",
"Give me a comprehensive description of the specified area in the picture.",
"I'm curious about the area in the following image. Can you describe it?",
"Please elaborate on the area with the coordinates in the visual.",
"In the displayed image, help me understand the region defined by .",
"Regarding the image, what's going on in the section ?",
"In the given photograph, can you explain the area with coordinates ?",
"Kindly describe what I should be seeing in the area of image.",
"Within the input image, what can be found in the region defined by ?",
"Tell me what you see within the designated area in the picture.",
"Please detail the contents of the chosen region in the visual input.",
"What's inside the area of the provided graphic?",
"I'd like some information about the specific region in the image.",
"Help me understand the details within the area in photograph.",
"Can you break down the region in the image for me?",
"What is taking place within the specified area in this capture?",
"Care to elaborate on the targeted area in the visual illustration?",
"What insights can you provide about the area in the selected picture?",
"What does the area within the given visual contain?",
"Analyze and describe the region in the included photo.",
"Please provide details for the area marked as in this photographic.",
"For the image, can you assess and describe what's happening at ?",
"Fill me in about the selected portion within the presented image.",
"In the image, elaborate on the details found within the section .",
"Please interpret and describe the area inside the given picture.",
"What information can you give me about the coordinates in image?",
"Regarding the coordinates in image, can you provide a description?",
"In the photo, can you delve into the details of the region ?",
"Please provide insights on the specified area within the graphic.",
"Detail the chosen region in the depicted scene.",
"Can you discuss the entities within the region of image?",
"I'd appreciate a breakdown of the area in the displayed image.",
"What's the story in the section of the included visual?",
"Please enlighten me about the region in the given photo.",
"Offer a thorough description of the area within the illustration.",
"What can you share about the area in the presented image?",
"Help me grasp the context of the region within image.",
"Kindly give an overview of the section in photo.",
"What details can you provide about the region in the snapshot?",
"Can you divulge the contents of the area within the given image?",
"In the submitted image, please give a synopsis of the area .",
"In the image, please describe the bounding box .",
"Please describe the region in the picture.",
"Describe the bbox in the provided photo.",
"What can you tell me about the area within the image?",
"Could you give me a description of the rectangular region found in?",
"In, what elements can be found within the coordinates ?",
"Please provide details for the area within the bounding box in.",
"Can you generate a description for the selected region in the image?",
"Kindly describe the objects or scenery in the bounding box within.",
"What details can you provide for the rectangle defined by the coordinates in?",
"In relation to the picture, please describe the content of the area marked by .",
"I'd like to know more about the area in the given image. Can you describe it?",
"Can you help me by describing the part of that lies within the bounding box ?",
"What's happening in the section of the photo enclosed by the coordinates ?",
"Describe the image content present in the specified rectangular area of.",
"Please provide information about the area within the bounding box in the picture.",
"Could you offer a description of the contents in the selected area of the image?",
"I'm curious about the area in. Can you provide a description of it?",
"What can be observed in the rectangular region in the photograph?",
"Please explain what is contained in the portion of defined by the box .",
"In the photograph, can you describe the objects or scenery enclosed by ?",
"Can you give a brief explanation of the specified area in the image?",
"What does the area look like in the context of the image?",
"Could you please describe the contents of the bounding box in the given image?",
"I would like to know more about the rectangular region within the picture. Can you describe it?",
"Please tell me about the area in the image. What does it contain?",
"Help me understand what's happening in the selected bounding box within.",
"Can you provide a description of the area in the image?",
"What sort of things can be seen in the region of the photo?",
"Describe what can be found within the bounds of in the image.",
"In, can you paint a picture of the area enclosed by coordinates ?",
"Please provide a detailed account of the area covered by the bounding box in.",
"Give me a vivid description of what's happening in the area within the snapshot.",
"In the image, what do you observe within the rectangular box defined by the coordinates ?",
"Could you give me a breakdown of the content in the specified area of the picture?",
"Please elucidate the area