Repository: OpenBMB/VisCPM
Branch: main
Commit: d21f3ebda1d5
Files: 58
Total size: 811.8 KB

Directory structure:
gitextract_qaxjy5oq/
├── README.md
├── README_en.md
├── VisCPM/
│   ├── __init__.py
│   ├── cpm_tokenizers/
│   │   ├── __init__.py
│   │   ├── bee.py
│   │   └── vocabs/
│   │       └── bee.txt
│   ├── generation/
│   │   ├── __init__.py
│   │   ├── generation_utils.py
│   │   └── vllm_bee.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── beit3.py
│   │   ├── cpmbee.py
│   │   ├── modeling_utils.py
│   │   ├── modules/
│   │   │   ├── __init__.py
│   │   │   ├── attention.py
│   │   │   ├── blocks.py
│   │   │   ├── config.py
│   │   │   ├── embedding.py
│   │   │   ├── feedforward.py
│   │   │   ├── layernorm.py
│   │   │   ├── linear.py
│   │   │   ├── position_embedding.py
│   │   │   └── transformer.py
│   │   ├── stablediffusion.py
│   │   ├── vlg_cpmbee.py
│   │   └── vlu_cpmbee.py
│   ├── utils/
│   │   ├── __init__.py
│   │   └── utils.py
│   ├── viscpm_chat.py
│   └── viscpm_paint.py
├── config/
│   └── cpm-bee-10b.json
├── data/
│   ├── prompts.txt
│   └── translated_LLaVA_qa90/
│       ├── caps_boxes_coco2014_val_80_zh.jsonl
│       ├── qa90_gpt4_answer_zh.jsonl
│       ├── qa90_questions_zh.jsonl
│       └── rule_zh.json
├── demo_chat.py
├── demo_paint.py
├── figures/
│   └── wechat.md
├── finetune/
│   ├── README.md
│   ├── __init__.py
│   ├── dataset/
│   │   ├── __init__.py
│   │   ├── bee.py
│   │   ├── itembuilder.py
│   │   └── transformeddataset.py
│   ├── exporter.py
│   ├── ft_viscpm_chat/
│   │   ├── __init__.py
│   │   ├── config/
│   │   │   └── deepspeed/
│   │   │       └── viscpm_chat_ft.json
│   │   ├── get_llava150k_zh.sh
│   │   ├── run_viscpm_chat_ft.sh
│   │   └── train_viscpm_chat.py
│   ├── initializer.py
│   └── utils/
│       ├── __init__.py
│       ├── logger.py
│       ├── prompts.py
│       └── utils.py
├── requirements.txt
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# VisCPM

**基于CPM基础模型的中英双语多模态大模型系列**

多模态对话模型 VisCPM-Chat | 文生图模型 VisCPM-Paint | 使用 | 论文

VisCPM-Chat Demo | VisCPM-Paint Demo | VisCPM-Chat 🤗 | VisCPM-Paint 🤗

简体中文 | English

**`VisCPM`** is a family of open-source large multimodal models, which support multimodal conversational capabilities (the `VisCPM-Chat` model) and text-to-image generation capabilities (the `VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is trained based on the large language model [CPM-Bee](https://github.com/OpenBMB/CPM-Bee) with 10B parameters, fusing the visual encoder (Muffin) and the visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the good bilingual capability of CPM-Bee, `VisCPM` can be pretrained on English multimodal data only and still generalize to achieve promising Chinese multimodal capabilities.

**`VisCPM`** 是一个开源的多模态大模型系列,支持中英双语的多模态对话能力(`VisCPM-Chat`模型)和文到图生成能力(`VisCPM-Paint`模型),在中文多模态开源模型中达到最佳水平。VisCPM基于百亿参数量语言大模型[CPM-Bee](https://github.com/OpenBMB/CPM-Bee)(10B)训练,融合视觉编码器[Muffin](https://github.com/thunlp/Muffin)和视觉解码器[Diffusion-UNet](https://github.com/CompVis/stable-diffusion)以支持视觉信号的输入和输出。得益于CPM-Bee基座优秀的双语能力,`VisCPM`可以仅通过英文多模态数据预训练,泛化实现优秀的中文多模态能力。

- **👐 开源使用**:VisCPM可以自由被用于个人和研究用途。我们希望通过开源VisCPM模型系列,推动多模态大模型开源社区和相关研究的发展。
- **🌟 涵盖图文双向生成**:VisCPM模型系列较为全面地支持了图文多模态能力,涵盖多模态对话(图到文生成)能力和文到图生成能力。
- **💫 中英双语性能优异**:得益于语言模型基座CPM-Bee优秀的双语能力,VisCPM在中英双语的多模态对话和文到图生成均取得亮眼的效果。
## 📰 更新信息

`VisCPM`在持续升级中,我们支持了低资源推理、网页版部署等功能,并提供了能力升级的更高版本的模型[OmniLMM](https://github.com/OpenBMB/OmniLMM),欢迎大家持续关注!

- **[2024/04/17]** 🚀 欢迎关注我们最新发布的[MiniCPM-V 2.0](https://github.com/OpenBMB/MiniCPM-V)端侧多模态大模型,具备领先的光学字符识别(OCR)和多模态理解能力,在综合性OCR能力评测基准OCRBench上达到开源模型最佳水平,甚至在场景文字理解方面实现接近Gemini Pro的性能。
- **[2024/02/02]** 🚀 欢迎关注我们最新发布的[OmniLMM](https://github.com/OpenBMB/OmniLMM)多模态大模型!其中[OmniLMM-3B](https://huggingface.co/openbmb/MiniCPM-V)为中英双语多模态对话模型,基于中英双语大模型[MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/)和SigLip-400M视觉编码器训练,采用与VisCPM-Chat相同的训练流程训练,可在终端设备上部署并具备先进的多模态对话能力;[OmniLMM-13B](https://huggingface.co/openbmb/OmniLMM-12B/)为英文多模态模型,基于EVA02-5B和Zephyr-7B-β初始化训练,相比同规模其他模型在多个基准测试中具有领先性能。
- **[2024/01/16]** 🎉 [VisCPM论文](https://arxiv.org/pdf/2308.12038.pdf)被**ICLR 2024**接收,并被选为**spotlight(top 5%)**!
- **[2023/09/06]** 🔌 VisCPM-Chat API发布!现在您可以直接通过API轻松地使用VisCPM-Chat模型了。查看[API使用指南](#API使用指南)以了解更多详情。
- **[2023/08/23]** 📑 VisCPM论文发布:[Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages](https://arxiv.org/pdf/2308.12038.pdf),论文提供了更详细的实现细节和实验结果。
- **[2023/08/18]** ⤴️ [VisCPM-Chat-v1.1](#模型下载)版本发布,带来更强的细节理解和复杂推理能力!
- **[2023/08/18]** 🛠️ 支持[微调](#模型微调),让VisCPM更适配你的应用场景!
- **[2023/07/20]** 🌐 发布[VisCPM-Chat](http://120.92.209.146/)和[VisCPM-Paint](https://huggingface.co/spaces/openbmb/viscpm-paint)的在线Demo,欢迎尝试!
- **[2023/07/20]** 🎢 支持一键[部署本地网页版Demo](#demo部署)。
- **[2023/07/20]** ⚡️ 支持[低资源推理](#低资源推理),最低5G显存运行多模态对话模型!
- **[2023/07/18]** 🤗 [VisCPM-Chat](https://huggingface.co/openbmb/VisCPM-Chat)和[VisCPM-Paint](https://huggingface.co/openbmb/VisCPM-Paint)已整合到Huggingface框架中。

## VisCPM-Chat

`VisCPM-Chat`支持面向图像进行中英双语多模态对话。该模型使用`Muffin`视觉编码架构,使用CPM-Bee(10B)作为语言基座模型,并通过语言建模训练目标融合视觉和语言模型。模型训练包括预训练和指令精调两阶段:

* 预训练:我们使用约100M高质量英文图文对数据对`VisCPM-Chat`进行了预训练,数据包括CC3M、CC12M、COCO、Visual Genome、Laion等。在预训练阶段,语言模型参数保持固定,仅更新视觉编码器的参数,以支持大规模视觉-语言表示的高效对齐。
* 指令精调:我们采用[LLaVA-150K](https://llava-vl.github.io/)英文指令精调数据,并混合相应[翻译后的中文数据](https://huggingface.co/datasets/openbmb/llava_zh)对模型进行指令精调,以对齐模型多模态基础能力和用户使用意图。在指令精调阶段,我们更新全部模型参数,以提升指令精调数据的利用效率。有趣的是,我们发现即使仅采用英文指令数据进行指令精调,模型也可以理解中文问题,但仅能用英文回答。这表明模型的多语言多模态能力已经得到良好的泛化。在指令精调阶段进一步加入少量中文翻译数据,可以将模型回复语言和用户问题语言对齐。

我们在[LLaVA标准英文测试集](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)和翻译的[中文测试集](data/translated_LLaVA_qa90)对模型进行了评测,该评测基准考察模型在开放域对话、图像细节描述、复杂推理方面的表现,并使用GPT-4进行打分。可以观察到,`VisCPM-Chat`在中文多模态能力方面取得了最佳的平均性能,在通用域对话和复杂推理表现出色,同时也表现出了不错的英文多模态能力。我们提供了两个模型版本,分别为`VisCPM-Chat-balance`和`VisCPM-Chat-zhplus`,前者在英文和中文两种语言上的能力较为平衡,后者在中文能力上更加突出。两个模型在指令精调阶段使用的数据相同,`VisCPM-Chat-zhplus`在预训练阶段额外加入了20M清洗后的原生中文图文对数据和120M翻译到中文的图文对数据。`VisCPM-Chat-v1.1`在指令精调阶段额外加入了[UniMM-Chat](https://huggingface.co/datasets/Yirany/UniMM-Chat)多模态指令精调数据集。
| 类别 | 模型 | 语言模型基座 | 对话(英) | 精细描述(英) | 复杂推理(英) | 平均(英) | 对话(中) | 精细描述(中) | 复杂推理(中) | 平均(中) |
|------|------|--------------|----------|--------------|--------------|----------|----------|--------------|--------------|----------|
| 英文模型 | MiniGPT4 | Vicuna-13B | 65.0 | 67.3 | 76.6 | 69.7 | - | - | - | - |
| | InstructBLIP | Vicuna-13B | 81.9 | 68.0 | 91.2 | 80.5 | - | - | - | - |
| | LLaVA | Vicuna-13B | 89.5 | 70.4 | 96.2 | 85.6 | - | - | - | - |
| 中英双语模型 | mPLUG-Owl | LLaMA-7B | 64.6 | 47.7 | 80.1 | 64.2 | 76.3 | 61.2 | 77.8 | 72.0 |
| | VisualGLM | ChatGLM-6B | 62.4 | 63.0 | 80.6 | 68.7 | 76.6 | 87.8 | 83.6 | 82.7 |
| | Ziya-Visual | Ziya-LLaMA-13B-v1 | 82.7 | 69.9 | 92.1 | 81.7 | 85.0 | 74.7 | 82.4 | 80.8 |
| | Qwen-VL | Qwen-7B | 82.4 | 72.6 | 91.9 | 83.8 | 82.3 | 93.4 | 89.5 | 88.2 |
| | VisCPM-Chat-balance | CPMBee-10B | 83.3 | 68.9 | 90.5 | 81.1 | 92.7 | 76.1 | 89.2 | 86.3 |
| | VisCPM-Chat-zhplus | CPMBee-10B | 80.1 | 65.7 | 92.5 | 79.6 | 90.3 | 81.4 | 92.1 | 88.2 |
| | VisCPM-Chat-v1.1 | CPMBee-10B | 80.1 | 67.1 | 97.1 | 81.5 | 91.3 | 90.7 | 95.4 | 92.5 |
![图片](figures/vlu_case4.png)

## VisCPM-Paint

`VisCPM-Paint`支持中英双语的文到图生成。该模型使用CPM-Bee(10B)作为文本编码器,使用`UNet`作为图像解码器,并通过扩散模型训练目标融合语言和视觉模型。在训练过程中,语言模型参数始终保持固定。我们使用[Stable Diffusion 2.1](https://github.com/Stability-AI/stablediffusion)的UNet参数初始化视觉解码器,并通过逐步解冻其中关键的桥接参数将其与语言模型融合。该模型在[LAION 2B](https://laion.ai/)英文图文对数据上进行了训练。

与`VisCPM-Chat`类似,我们发现得益于CPM-Bee的双语能力,`VisCPM-Paint`可以仅通过英文图文对训练,泛化实现良好的中文文到图生成能力,达到中文开源模型的最佳效果。通过进一步加入20M清洗后的原生中文图文对数据,以及120M翻译到中文的图文对数据,模型的中文文到图生成能力可以获得进一步提升。我们在标准图像生成测试集MSCOCO上采样了3万张图片,计算了常用评估图像生成指标FID(Fréchet Inception Distance)评估生成图片的质量。我们同样提供了两个模型版本,分别为`VisCPM-Paint-balance`和`VisCPM-Paint-zhplus`,前者在英文和中文两种语言上的能力较为平衡,后者在中文能力上更加突出。`VisCPM-Paint-balance`只使用了英文图文对进行训练,`VisCPM-Paint-zhplus`在`VisCPM-Paint-balance`基础上增加了20M原生中文图文对数据和120M翻译到中文的图文对数据进行训练。
| 模型 | Zero-shot FID↓(英文) | Zero-shot FID↓(中文) |
|------|----------------------|----------------------|
| GLIDE | 12.2 | - |
| Make-A-Scene | 11.8 | - |
| DALL·E-2 | 10.4 | - |
| Unidiffuser | 9.7 | - |
| Cogview2 | - | 24.0 |
| Stable Diffusion | 8.6 | - |
| AltDiffusion | 17.2 | 16.1 |
| TaiyiDiffusion | - | 15.6 |
| VisCPM-Paint-balance | 9.5 | 10.9 |
| VisCPM-Paint-zhplus | 9.9 | 9.6 |
## ⚙️ 安装

1. 克隆仓库并进入源码目录

   ```bash
   git clone https://github.com/OpenBMB/VisCPM.git
   cd VisCPM
   ```

2. 创建conda环境

   ```Shell
   conda create -n viscpm python=3.10 -y
   conda activate viscpm
   ```

3. 安装依赖

   ```shell
   pip install "torch>=1.10"
   pip install -r requirements.txt
   ```

## 💡 使用

### 模型下载

| 模型 | 描述 | 下载链接 |
|----------------------|-------------------|------|
| VisCPM-Chat-v1.1 | 新版本多模态对话模型,强化了细节理解和复杂推理能力 | [链接](https://huggingface.co/openbmb/VisCPM-Chat/blob/main/pytorch_model.v1.bin) |
| VisCPM-Chat-balance | 中英文能力较为平衡的多模态对话模型 | [链接](https://huggingface.co/openbmb/VisCPM-Chat/resolve/main/pytorch_model.bin) |
| VisCPM-Chat-zhplus | 中文能力突出的多模态对话模型 | [链接](https://huggingface.co/openbmb/VisCPM-Chat/resolve/main/pytorch_model.zhplus.bin) |
| VisCPM-Paint-balance | 中英文能力较为平衡的文生图模型 | [链接](https://huggingface.co/openbmb/VisCPM-Paint/resolve/main/pytorch_model.balance.bin) |
| VisCPM-Paint-zhplus | 中文能力突出的文生图模型 | [链接](https://huggingface.co/openbmb/VisCPM-Paint/resolve/main/pytorch_model.bin) |

### VisCPM-Chat

在下载模型权重后,可以使用如下代码运行VisCPM-Chat(将`'/path/to/checkpoint'`改为模型存放路径)。

#### 单轮对话
VisCPM-Chat可以通过几行代码实现多模态对话,我们在代码中默认开启了对输入图片的安全检查。

```shell
# 如果您单卡显存不足40G,可以引入如下环境变量并将安全模块开关关闭。引入后显存占用约为5G,但推理所需时间会变长。此选项依赖BMInf,需要安装BMInf依赖库。
export CUDA_MEMORY_CPMBEE_MAX=1g
```

```python
from VisCPM import VisCPMChat
from PIL import Image

model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# 默认开启对输入图片的安全检查
image_path = 'figures/vlu_case1.png'
image = Image.open(image_path).convert("RGB")
question = '如果用一句中国唐代的著名诗人"李白"的古诗来描述这幅图像,你能想到什么?'
answer, _, _ = viscpm_chat.chat(image, question)
print(answer)
```

可得到如下结果:

```
“黄河之水天上来,奔流到海不复回。” 李白的这句诗可以用来形容这幅图片中汹涌澎湃、波涛汹涌的景象:一条湍急的河流从山上奔腾而下,形成了一幅令人叹为观止的画面,展示出大自然的力量和雄伟壮丽。
```

#### 多轮对话
```python
from VisCPM import VisCPMChat
from PIL import Image

model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# 默认开启对输入图片的安全检查
image_path = 'figures/vlu_case2.jpeg'
image = Image.open(image_path).convert("RGB")
question = '这幅图像是在哪个节日拍摄的?'
answer, context, vision_hidden_states = viscpm_chat.chat(image, question)

# 多轮对话传入历史 context
question = '你能用什么古诗描述这幅画?'
answer, context, _ = viscpm_chat.chat(image, question, context, vision_hidden_states=vision_hidden_states)
print(context)
```

可得到如下结果:

```
User: 这幅图像是在哪个节日拍摄的?
AI: 这幅图像是在中秋节拍摄的, 也就是中国传统节日中的月圆之夜。
User: 你能用什么古诗描述这幅画?
AI: “明月几时有,把酒问青天。” 这是苏轼的《水调歌头》中的一句诗,用来形容这幅图片再贴切不过了:在中秋之夜,月亮高高地挂在天空中,一座古老的建筑沐浴着月光,营造出一种宁静祥和的气氛。
```

#### API使用指南

我们提供了API接口,可以通过如下代码轻松体验VisCPM-Chat。API接口支持的输入格式和使用方式如下:

```python
import requests
import base64

url = "http://34.143.180.202:3389/viscpm"
resp = requests.post(url, json={
    # need to modify
    "image": base64.b64encode(open("path/to/image", "rb").read()).decode(),
    "question": "描述一下这张图片",
})
resp = resp.json()
print(resp)
```

### VisCPM-Paint

在下载模型权重后,可以使用如下代码运行VisCPM-Paint(将`'/path/to/checkpoint'`改为模型存放路径)。

![图片](figures/vlg_case1.png)
![图片](figures/vlg_case2.png)

生成上面图片的文本输入可参考[prompts.txt](data/prompts.txt)。

```shell
# 如果您单卡显存不足40G,可以引入如下环境变量并将安全模块开关关闭。引入后显存占用约为17G,但推理所需时间会变长。此选项依赖BMInf,需要安装BMInf依赖库。
export CUDA_MEMORY_CPMBEE_MAX=1g
```

```python
from VisCPM import VisCPMPaint

painter = VisCPMPaint('/path/to/checkpoint', image_safety_checker=True, prompt_safety_checker=True, add_ranker=True)
# 默认对输入的文本和输出的图片进行安全检查,默认开启重排序
image = painter.generate('人闲桂花落,月静春山空')  # 对应上图第一行第二张图片
image.save('/data/test.png')
```

我们在代码中默认开启了对输入文本和输出图片的安全检查。

同时,我们默认对生成的图像使用重排序,即对同一个输入,同时生成4张图片,返回与输入相关性最高的1张图片,相关性通过[Chinese-Clip](https://github.com/OFA-Sys/Chinese-CLIP)进行打分。重排序可以提升生成图片质量的稳定性,但也会降低模型的生成速度,如希望快速得到生成结果,可以关闭重排序机制。

VisCPM-Paint目前使用中文模型进行重排序打分,如果输入英文生成图片,请关闭重排序机制和输入文本检查模块。

### 低资源推理

为了支持更加高效的低资源推理场景,我们借助[BMInf](https://github.com/OpenBMB/BMInf)工具支持更低的显存需求。首先安装BMInf依赖`pip install bminf`,然后在命令行中指定`export CUDA_MEMORY_CPMBEE_MAX=1g`(具体数值可以根据个人需求设定),然后按照上述步骤进行推理,VisCPM-Chat最低显存占用可以降至5G,VisCPM-Paint最低显存占用可以降至17G。

### Demo部署

我们提供简易的基于gradio的网页版Demo,首先安装gradio:`pip install gradio`,然后执行如下命令:

```shell
git clone https://github.com/OpenBMB/VisCPM.git
cd VisCPM
python demo_chat.py   # viscpm_chat demo, or
python demo_paint.py  # viscpm_paint demo
```

### 模型微调

为适应特定场景下的需求,我们提供了VisCPM-Chat模型的微调代码,用户可以在私有数据上进行微调。微调代码位于`./finetune/ft_viscpm_chat`目录下,具体的微调代码使用方法如下:

```shell
# 获取数据集
bash ./finetune/ft_viscpm_chat/get_llava150k_zh.sh
# 模型微调, 注意修改其中的数据集与模型checkpoint路径
bash ./finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh
# node: 8
# batch_size: 8 * 1
# 其他配置可参考'./finetune/ft_viscpm_chat/config/deepspeed/viscpm_chat_ft.json'与'./finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh'
```

注:
- 微调代码中使用了deepspeed-0.9.1配置训练环境,配置方法可以参考此[链接](https://www.deepspeed.ai/getting-started/)。
- 目前微调代码仅在Linux系统下测试,如果您在其他系统配置下进行微调,可能需要修改部分代码。

## 🛡 安全

### 安全声明

作为多模态模型,`VisCPM`通过学习大量的公开图文数据来生成内容,但它无法理解、表达个人观点或价值判断,它所输出的任何内容都不代表模型开发者的观点和立场。因此用户在使用VisCPM生成的内容时,应自行负责对其进行评估和验证。

### 安全模块

为了帮助用户防止模型处理或生成不符合普遍社会价值观的内容,我们在`VisCPM`中加入了内容安全保障模块。当安全模块检测到模型处理或生成的图像文本内容不符合安全规范时,会对相应内容进行拦截。我们对`VisCPM-Chat`接受的图片输入以及`VisCPM-Paint`接受的文字输入和图片输出进行了安全检查。VisCPM的安全模块仍然不完美,可能会出现漏判和误判的情况。我们会在未来进一步提升安全模块的性能。

## 📝 开源协议
VisCPM系列模型采用协议为["通用模型许可协议-来源说明-宣传限制-非商业化"](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E9%9D%9E%E5%95%86%E4%B8%9A%E5%8C%96.md),允许个人使用和研究用途。如需将模型用于商业用途,请联系cpm@modelbest.cn来洽谈商业授权事宜。

CPM-Bee基座采用协议为["通用模型许可协议-来源说明-宣传限制-商业授权"](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md),允许商用,如需将模型用于商业用途,请联系cpm@modelbest.cn来获取书面授权。

## ✅ TODO

- [ ] 支持模型量化功能,降低推理成本

## 🏫 机构

此项目由以下机构联合研发:

- [清华自然语言处理实验室](https://nlp.csai.tsinghua.edu.cn/)
- [面壁智能](https://modelbest.cn/)
- [知乎](https://www.zhihu.com/)

## 引用

如果我们的工作对你有帮助的话,请考虑引用以下论文:

```bibtex
@article{VisCPM,
  title={Large multilingual models pivot zero-shot multimodal learning across languages},
  author={Hu, Jinyi and Yao, Yuan and Wang, Chongyi and Wang, Shan and Pan, Yinxu and Chen, Qianyu and Yu, Tianyu and Wu, Hanghao and Zhao, Yue and Zhang, Haoye and others},
  journal={arXiv preprint arXiv:2308.12038},
  year={2023}
}

@article{muffin,
  title={Reformulating vision-language foundation models and datasets towards universal multimodal assistants},
  author={Yu, Tianyu and Hu, Jinyi and Yao, Yuan and Zhang, Haoye and Zhao, Yue and Wang, Chongyi and Wang, Shan and Pan, Yinxv and Xue, Jiao and Li, Dahai and others},
  journal={arXiv preprint arXiv:2310.00653},
  year={2023}
}
```

================================================
FILE: README_en.md
================================================
# VisCPM

**Chinese-English bilingual large multimodal model series**

Multimodal Conversation Model VisCPM-Chat | Text-to-image Model VisCPM-Paint | Inference | Paper

VisCPM-Chat Demo | VisCPM-Paint Demo | VisCPM-Chat 🤗 | VisCPM-Paint 🤗

简体中文 | English

`VisCPM` is a family of open-source large multimodal models, which support multimodal conversational capabilities (the `VisCPM-Chat` model) and text-to-image generation capabilities (the `VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is trained based on the large language model [CPM-Bee](https://github.com/OpenBMB/CPM-Bee) with 10B parameters, fusing the visual encoder [Muffin](https://github.com/thunlp/Muffin) and the visual decoder [Diffusion-UNet](https://github.com/CompVis/stable-diffusion) to support visual inputs and outputs. Thanks to the good bilingual capability of CPM-Bee, `VisCPM` can be pretrained on English multimodal data only and still generalize to achieve promising Chinese multimodal capabilities.

- **👐 Open-source Usage**: VisCPM is free to use for personal and research purposes. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community of large multimodal models and related research.
- **🌟 Image and Text Generation Coverage**: VisCPM models provide relatively comprehensive support for image-text multimodal capabilities, covering both multimodal conversation (image-to-text generation) and text-to-image generation.
- **💫 Excellent Bilingual Performance**: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.
## 📰 Update News

`VisCPM` is continuously updating. We have added features such as low-resource inference and easy-to-use web deployment, and released [OmniLMM](https://github.com/OpenBMB/OmniLMM), a new model series with upgraded capabilities. Stay tuned!

- **[2024/04/22]** 🚀 Welcome to follow our latest release, the [MiniCPM-V 2.0](https://github.com/OpenBMB/MiniCPM-V) edge-side large multimodal model, which has leading Optical Character Recognition (OCR) and multimodal understanding capabilities. It achieves the best level among open-source models on the comprehensive OCR capability benchmark OCRBench, and even approaches the performance of Gemini Pro in scene-text understanding.
- **[2024/02/02]** 🚀 Welcome to follow our latest release, the [OmniLMM](https://github.com/OpenBMB/OmniLMM) large multimodal models! [OmniLMM-3B](https://huggingface.co/openbmb/MiniCPM-V) is a Chinese-English bilingual multimodal dialogue model, trained from the bilingual large model [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/) and the SigLip-400M visual encoder with the same training process as VisCPM-Chat; it can be deployed on terminal devices and possesses advanced multimodal dialogue capabilities. [OmniLMM-13B](https://huggingface.co/openbmb/OmniLMM-12B/) is an English multimodal model, initialized from EVA02-5B and Zephyr-7B-β, which demonstrates leading performance on multiple benchmarks compared to other models of the same scale.
- **[2024/01/16]** 🎉 [The VisCPM paper](https://arxiv.org/pdf/2308.12038.pdf) is accepted by **ICLR 2024** as a **spotlight (top 5%)**!
- **[2023/09/06]** 🔌 VisCPM-Chat API released! Now you can easily use the VisCPM-Chat model directly through the API. Check out the [API Usage Guide](#API-Usage-Guide) for more details.
- **[2023/08/23]** 📑 We release the VisCPM paper: [Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages](https://arxiv.org/pdf/2308.12038.pdf). More implementation details and experimental results are presented in the paper.
- **[2023/08/18]** ⤴️ We upgrade to [VisCPM-Chat-v1.1](#model-zoo), with stronger detail understanding and complex reasoning ability!
- **[2023/08/18]** 🛠️ We support [fine-tuning](#Fine-Tuning) to make VisCPM more suitable for your application scenarios!
- **[2023/07/20]** 🌐 We release the [VisCPM-Chat](http://120.92.209.146/) and [VisCPM-Paint](https://huggingface.co/spaces/openbmb/viscpm-paint) online demos!
- **[2023/07/20]** 🎢 We provide one-click [deployment of a local web demo](#demo-deployment)!
- **[2023/07/20]** ⚡️ We support [low-resource inference](#low-resource-inference), running VisCPM-Chat with as little as 5G of GPU memory!
- **[2023/07/18]** 🤗 [VisCPM-Chat](https://huggingface.co/openbmb/VisCPM-Chat) and [VisCPM-Paint](https://huggingface.co/openbmb/VisCPM-Paint) have been integrated into the Hugging Face framework!

## VisCPM-Chat

`VisCPM-Chat` supports bilingual multimodal conversations involving images in both Chinese and English. The model utilizes the `Muffin` visual encoding architecture and `CPM-Bee` (10B) as the base LLM. It combines the visual and language models and is optimized with the language modeling training objective. Model training consists of two stages: multimodal pretraining and instruction tuning.

* Multimodal Pretraining: `VisCPM-Chat` is pretrained using approximately 150M high-quality English text-image pairs. The data sources include CC3M, CC12M, COCO, Visual Genome, Laion, etc.
In this stage, the language model parameters remain fixed, and only the parameters of the visual modules are updated to enable efficient alignment of vision and language representations.
* Instruction Tuning: We utilize the [LLaVA-150K](https://llava-vl.github.io/) dataset of English multimodal instruction-following data, mixed with the corresponding [translated Chinese data](https://huggingface.co/datasets/openbmb/llava_zh), to fine-tune the model and align its multimodal capabilities with user intents. In this stage, we update all model parameters to improve the data efficiency of instruction tuning. Interestingly, we observe that even when using only English instruction data for fine-tuning, the model comprehends Chinese questions well but can only respond in English. This indicates that the model's multilingual and multimodal capabilities generalize well. By incorporating a small amount of translated Chinese data during the instruction tuning stage, we can align the model's response language with the user's question language.

We evaluate the model on the standard [LLaVA English benchmark](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and a [Chinese benchmark](data/translated_LLaVA_qa90) translated from it. The benchmark examines the model's performance in conversation, detailed description, and complex reasoning, using GPT-4 for scoring. It can be observed that `VisCPM-Chat` achieves the best average performance in Chinese multimodal capabilities, excelling in conversation and complex reasoning, while also demonstrating good English multimodal capabilities. We provide two versions of the model, `VisCPM-Chat-balance` and `VisCPM-Chat-zhplus`: the former has balanced ability in English and Chinese, while the latter places a stronger emphasis on Chinese proficiency. Both models use the same data during the instruction tuning stage. `VisCPM-Chat-zhplus` additionally incorporates 20M cleaned native Chinese text-image pairs and 120M text-image pairs translated into Chinese during the pretraining stage. `VisCPM-Chat-v1.1` additionally utilizes the [UniMM-Chat](https://huggingface.co/datasets/Yirany/UniMM-Chat) multimodal instruction tuning dataset.
| Category | Model | LLM Backbone | Conversation (En) | Detailed Description (En) | Complex Reasoning (En) | Avg (En) | Conversation (Zh) | Detailed Description (Zh) | Complex Reasoning (Zh) | Avg (Zh) |
|----------|-------|--------------|-------------------|---------------------------|------------------------|----------|-------------------|---------------------------|------------------------|----------|
| English Model | MiniGPT4 | Vicuna-13B | 65.0 | 67.3 | 76.6 | 69.7 | - | - | - | - |
| | InstructBLIP | Vicuna-13B | 81.9 | 68.0 | 91.2 | 80.5 | - | - | - | - |
| | LLaVA | Vicuna-13B | 89.5 | 70.4 | 96.2 | 85.6 | - | - | - | - |
| En-Zh Bilingual Model | mPLUG-Owl | LLaMA-7B | 64.6 | 47.7 | 80.1 | 64.2 | 76.3 | 61.2 | 77.8 | 72.0 |
| | VisualGLM | ChatGLM-6B | 62.4 | 63.0 | 80.6 | 68.7 | 76.6 | 87.8 | 83.6 | 82.7 |
| | Ziya-Visual | Ziya-LLaMA-13B-v1 | 82.7 | 69.9 | 92.1 | 81.7 | 85.0 | 74.7 | 82.4 | 80.8 |
| | Qwen-VL | Qwen-7B | 82.4 | 72.6 | 91.9 | 83.8 | 82.3 | 93.4 | 89.5 | 88.2 |
| | VisCPM-Chat-balance | CPMBee-10B | 83.3 | 68.9 | 90.5 | 81.1 | 92.7 | 76.1 | 89.2 | 86.3 |
| | VisCPM-Chat-zhplus | CPMBee-10B | 80.1 | 65.7 | 92.5 | 79.6 | 90.3 | 81.4 | 92.1 | 88.2 |
| | VisCPM-Chat-v1.1 | CPMBee-10B | 80.1 | 67.1 | 97.1 | 81.5 | 91.3 | 90.7 | 95.4 | 92.5 |
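The scores above are relative to GPT-4's own reference answers. As a rough illustration of how such numbers are typically derived under the LLaVA GPT-4 evaluation protocol (an assumption about the standard recipe, not code from this repository; the `relative_score` helper is hypothetical): GPT-4 rates the reference answer and the model's answer for each question on a 1-10 scale, and the reported value is the ratio of the average scores, in percent.

```python
def relative_score(pairs: list[tuple[float, float]]) -> float:
    """pairs: one (reference_score, model_score) tuple per question, each on a 1-10 scale."""
    ref_avg = sum(r for r, _ in pairs) / len(pairs)
    model_avg = sum(m for _, m in pairs) / len(pairs)
    return 100.0 * model_avg / ref_avg

# e.g. relative_score([(9.0, 8.5), (8.0, 8.0)]) -> ~97.1
```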
## VisCPM-Paint

`VisCPM-Paint` supports bilingual text-to-image generation. The model uses `CPM-Bee` (10B) as the text encoder and a `UNet` as the image decoder, and fuses the language and vision models with the diffusion model training objective. The language model parameters remain fixed throughout training. The visual decoder is initialized with the parameters of [Stable Diffusion 2.1](https://github.com/Stability-AI/stablediffusion) and fused with the language model by gradually unfreezing key bridging parameters. The model is trained on the [LAION 2B](https://laion.ai/) English text-image pair dataset.

Similar to `VisCPM-Chat`, we find that thanks to the bilingual capability of `CPM-Bee`, `VisCPM-Paint` achieves good Chinese text-to-image generation by training only on English text-image pairs, surpassing Chinese open-source models. Incorporating an additional 20M cleaned native Chinese text-image pairs and 120M text-image pairs translated into Chinese further improves the model's Chinese text-to-image generation ability. We sample 30,000 images from the standard image generation test set MSCOCO and compute the commonly used FID (Fréchet Inception Distance) metric to assess the quality of the generated images. We likewise provide two versions of the model, `VisCPM-Paint-balance` and `VisCPM-Paint-zhplus`: the former has balanced ability in English and Chinese, while the latter emphasizes Chinese proficiency. `VisCPM-Paint-balance` is trained only on English text-image pairs, while `VisCPM-Paint-zhplus` adds the 20M native Chinese text-image pairs and 120M translated Chinese text-image pairs on top of `VisCPM-Paint-balance`.
| Model | Zero-shot FID↓ (English) | Zero-shot FID↓ (Chinese) |
|-------|--------------------------|--------------------------|
| GLIDE | 12.2 | - |
| Make-A-Scene | 11.8 | - |
| DALL·E-2 | 10.4 | - |
| Unidiffuser | 9.7 | - |
| Cogview2 | - | 24.0 |
| Stable Diffusion | 8.6 | - |
| AltDiffusion | 17.2 | 16.1 |
| TaiyiDiffusion | - | 15.6 |
| VisCPM-Paint-balance | 9.5 | 10.9 |
| VisCPM-Paint-zhplus | 9.9 | 9.6 |
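For reference, zero-shot FID is the standard Fréchet distance between Gaussian fits to Inception-v3 features of the real and generated image sets (lower is better):

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of the real and generated images, respectively.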
## ⚙️ Install

1. Clone this repository and navigate to the source folder

   ```bash
   git clone https://github.com/OpenBMB/VisCPM.git
   cd VisCPM
   ```

2. Create a conda environment

   ```Shell
   conda create -n viscpm python=3.10 -y
   conda activate viscpm
   ```

3. Install dependencies

   ```shell
   pip install "torch>=1.10"
   pip install -r requirements.txt
   ```

## 💡 Inference

### Model Zoo

| Model | Description | Download Link |
|----------------------|-------------------|------|
| VisCPM-Chat-v1.1 | Latest multimodal conversation model with stronger detail understanding and complex reasoning ability | [download](https://huggingface.co/openbmb/VisCPM-Chat/blob/main/pytorch_model.v1.bin) |
| VisCPM-Chat-balance | Multimodal conversation model with balanced proficiency in both Chinese and English | [download](https://huggingface.co/openbmb/VisCPM-Chat/resolve/main/viscpm_chat_balance_checkpoint.pt) |
| VisCPM-Chat-zhplus | Multimodal conversation model with a strong emphasis on Chinese proficiency | [download](https://huggingface.co/openbmb/VisCPM-Chat/resolve/main/viscpm_chat_zhplus_checkpoint.pt) |
| VisCPM-Paint-balance | Text-to-image model with balanced proficiency in both Chinese and English | [download](https://huggingface.co/openbmb/VisCPM-Paint/resolve/main/viscpm_paint_balance_checkpoint.pt) |
| VisCPM-Paint-zhplus | Text-to-image model with a strong emphasis on Chinese proficiency | [download](https://huggingface.co/openbmb/VisCPM-Paint/resolve/main/viscpm_paint_zhplus_checkpoint.pt) |

### VisCPM-Chat

After downloading the checkpoints, please refer to the following code to run `VisCPM-Chat` (replace `'/path/to/checkpoint'` with the actual path of the downloaded checkpoint).

#### Single-turn Conversation
We can have a multimodal conversation with VisCPM-Chat in a few lines of code.

```shell
# If your GPU has less than 40G of memory, you can set the following environment variable (and disable the safety module). Memory usage then drops to about 5G, but inference takes longer. This feature relies on the BMInf package.
export CUDA_MEMORY_CPMBEE_MAX=1g
```

```python
from VisCPM import VisCPMChat
from PIL import Image

model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# We perform security checks on the input images by default.
image_path = 'figures/vlu_case1.png'
image = Image.open(image_path).convert("RGB")
question = '如果用一句中国唐代的著名诗人"李白"的古诗来描述这幅图像,你能想到什么?'
# If you were to describe this image with a line from the famous Tang Dynasty poet "Li Bai", what would come to mind?
answer, _, _ = viscpm_chat.chat(image, question)
print(answer)
```

We can obtain the following result:

```
“黄河之水天上来,奔流到海不复回。” 李白的这句诗可以用来形容这幅图片中汹涌澎湃、波涛汹涌的景象:一条湍急的河流从山上奔腾而下,形成了一幅令人叹为观止的画面,展示出大自然的力量和雄伟壮丽。
"The water of the Yellow River comes up from the sky, rushing to the sea, never to return." This line of Li Bai's poem can be used to describe the surging, turbulent scene in this picture: a rushing river pours down from the mountain, forming a breathtaking view that shows the power and majesty of nature.
```

#### Multi-turn Conversation
The code can be easily extended to support multi-turn multimodal conversations.

```python
from VisCPM import VisCPMChat
from PIL import Image

model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# We perform security checks on the input images by default.
image_path = 'figures/vlu_case2.jpeg'
image = Image.open(image_path).convert("RGB")
question = '这幅图像是在哪个节日拍摄的?'  # At which festival was this image taken?
answer, context, vision_hidden_states = viscpm_chat.chat(image, question)

# pass the history context of the multi-turn conversation
question = '你能用什么古诗描述这幅画?'  # What ancient poem could you use to describe this painting?
answer, context, _ = viscpm_chat.chat(image, question, context, vision_hidden_states=vision_hidden_states)
print(context)
```

We can obtain the following results:

```
User: 这幅图像是在哪个节日拍摄的?
AI: 这幅图像是在中秋节拍摄的, 也就是中国传统节日中的月圆之夜。
User: 你能用什么古诗描述这幅画?
AI: “明月几时有,把酒问青天。” 这是苏轼的《水调歌头》中的一句诗,用来形容这幅图片再贴切不过了:在中秋之夜,月亮高高地挂在天空中,一座古老的建筑沐浴着月光,营造出一种宁静祥和的气氛。

User: On which festival was this image taken?
AI: This image was taken during the Mid-Autumn Festival, the night of the full moon among traditional Chinese festivals.
User: What ancient poem can you use to describe this painting?
AI: "When did the bright moon first appear? Raising my cup of wine, I ask the blue sky." This line from Su Shi's "Shui Diao Ge Tou" could not be more fitting for this picture: on the night of the Mid-Autumn Festival, the moon hangs high in the sky, and an ancient building is bathed in moonlight, creating a peaceful atmosphere.
```

#### API Usage Guide

We offer an API so you can easily experience VisCPM-Chat with the following code. The supported input format and usage are as follows:

```python
import requests
import base64

url = "http://34.143.180.202:3389/viscpm"
resp = requests.post(url, json={
    # need to modify
    "image": base64.b64encode(open("path/to/image", "rb").read()).decode(),
    "question": "Describe this image",
})
resp = resp.json()
print(resp)
```

### VisCPM-Paint

After downloading the checkpoints, please refer to the following code to run `VisCPM-Paint` (replace `'/path/to/checkpoint'` with the actual path of the downloaded checkpoint).

![图片](figures/vlg_case1.png)
![图片](figures/vlg_case2.png)

The input prompts for the images above can be found in [prompts.txt](data/prompts.txt).

```shell
# If your GPU has less than 40G of memory, you can set the following environment variable (and disable the safety module). Memory usage then drops to about 17G, but inference takes longer. This feature relies on the BMInf package.
export CUDA_MEMORY_CPMBEE_MAX=1g
```

```python
from VisCPM import VisCPMPaint

painter = VisCPMPaint('/path/to/checkpoint', image_safety_checker=True, prompt_safety_checker=True, add_ranker=True)
# We perform security checks on the input text and output images by default; image reranking is also enabled by default.
image = painter.generate('人闲桂花落,月静春山空')
# "Sweet osmanthus falls amid idle people; the moon is quiet, the spring mountain empty."
# Corresponds to the second picture in the first row above.
image.save('/data/test.png')
```

In our code, security checks for both input text and output images are enabled by default, as is reranking of the generated images. This means that for a given input, we generate four images simultaneously and return the one with the highest relevance score to the input, evaluated with [Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP). Reranking improves the stability of the generated image quality but also slows generation; if you prefer to obtain results quickly, you can disable the reranking mechanism. If you provide English text as input, it is advisable to disable both the reranking mechanism and the input text checker, since the scoring model used for reranking and the safety checker for the input prompt are trained specifically for Chinese text.
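Below is a minimal sketch of this generate-several-and-pick-best reranking using the `cn_clip` package from the Chinese-CLIP repository. It is our own illustration under assumptions (the `rerank` helper and the `ViT-B-16` checkpoint choice are ours), not the repository's internal implementation:

```python
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
# Any Chinese-CLIP checkpoint works here; ViT-B-16 is an arbitrary choice.
model, preprocess = load_from_name("ViT-B-16", device=device, download_root="./")
model.eval()

def rerank(images: list[Image.Image], prompt: str) -> Image.Image:
    """Return the candidate image Chinese-CLIP scores as most relevant to `prompt`."""
    image_batch = torch.stack([preprocess(im) for im in images]).to(device)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_batch)
        text_features = model.encode_text(text)
        # normalize so the dot product below is cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(-1)
    return images[int(scores.argmax())]

# e.g., with a painter created with add_ranker=False:
# candidates = [painter.generate('人闲桂花落,月静春山空') for _ in range(4)]
# best = rerank(candidates, '人闲桂花落,月静春山空')
```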
### Low-Resource Inference

We use [BMInf](https://github.com/OpenBMB/BMInf) to reduce GPU memory costs. First install BMInf with `pip install bminf`, then set the environment variable `export CUDA_MEMORY_CPMBEE_MAX=1g` in your shell (the exact value can be adjusted to your needs), and then follow the steps above for inference. The minimum GPU memory usage of VisCPM-Chat can be reduced to 5G, and that of VisCPM-Paint to 17G.

### Demo Deployment

We provide a simple web demo based on gradio. First install gradio with `pip install gradio`, then execute the following commands:

```shell
git clone https://github.com/OpenBMB/VisCPM.git
cd VisCPM
python demo_chat.py   # viscpm_chat demo, or
python demo_paint.py  # viscpm_paint demo
```

### Fine-Tuning

We provide fine-tuning code for VisCPM-Chat so users can fine-tune it on their own private data. The fine-tuning code is located in the `finetune/ft_viscpm_chat` directory and is used as follows:

```shell
# Get the dataset
bash ./finetune/ft_viscpm_chat/get_llava150k_zh.sh
# Fine-tune the model; remember to modify the dataset and model checkpoint paths within
bash ./finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh
# node: 8
# batch_size: 8 * 1
# More details can be found in './finetune/ft_viscpm_chat/config/deepspeed/viscpm_chat_ft.json' and './finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh'
```

Note:
- `deepspeed-0.9.1` is used in the fine-tuning code; installation instructions can be found [here](https://www.deepspeed.ai/getting-started/).
- Currently, the fine-tuning code has only been tested on `Linux`. If you fine-tune under other system configurations, you may need to modify some of the code.

## 🛡 Safety

### Safety Statement

As a multimodal model, `VisCPM` generates content by learning from a vast amount of public image and text data, but it cannot comprehend or express personal opinions or value judgments. Any content generated by VisCPM does not represent the viewpoints or positions of the model developers. Therefore, when using content generated by `VisCPM`, users should take full responsibility for evaluating and verifying it on their own.

### Safety Module

To prevent the model from being misused to process or generate content that violates widely accepted societal values, we have incorporated a content safety module in `VisCPM`. When the safety module detects image or text content that does not comply with safety regulations during model processing or generation, it intercepts the corresponding content. We perform security checks on the input images accepted by `VisCPM-Chat` and on the input text and output images of `VisCPM-Paint`. The safety module in VisCPM is still imperfect, and there may be instances of both false positives and false negatives. We will continue to enhance the performance of the safety module in future updates.
## 📝 License

VisCPM is governed by the [GML License](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E9%9D%9E%E5%95%86%E4%B8%9A%E5%8C%96.md) and permits individual and research use. If you intend to use the model for commercial purposes, please reach out to cpm@modelbest.cn to negotiate commercial licensing.

The CPM-Bee base, governed by the [General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md), permits commercial use. If you intend to use the model for commercial purposes, please reach out to cpm@modelbest.cn to obtain the certificate of authorization.

## ✅ Todo

`VisCPM` is still undergoing continuous improvement, and we will further optimize it in the following aspects:

- [ ] Enabling model quantization

## 🏫 Institutions

This project is developed by the following institutions:

- [THUNLP](https://nlp.csai.tsinghua.edu.cn/)
- [ModelBest](https://modelbest.cn/)
- [Zhihu](https://www.zhihu.com/)

## Citation

Please consider citing the following papers if our work is helpful to you:

```bibtex
@article{VisCPM,
  title={Large multilingual models pivot zero-shot multimodal learning across languages},
  author={Hu, Jinyi and Yao, Yuan and Wang, Chongyi and Wang, Shan and Pan, Yinxu and Chen, Qianyu and Yu, Tianyu and Wu, Hanghao and Zhao, Yue and Zhang, Haoye and others},
  journal={arXiv preprint arXiv:2308.12038},
  year={2023}
}

@article{muffin,
  title={Reformulating vision-language foundation models and datasets towards universal multimodal assistants},
  author={Yu, Tianyu and Hu, Jinyi and Yao, Yuan and Zhang, Haoye and Zhao, Yue and Wang, Chongyi and Wang, Shan and Pan, Yinxv and Xue, Jiao and Li, Dahai and others},
  journal={arXiv preprint arXiv:2310.00653},
  year={2023}
}
```

================================================
FILE: VisCPM/__init__.py
================================================
from VisCPM.viscpm_chat import VisCPMChat
from VisCPM.viscpm_paint import VisCPMPaint

================================================
FILE: VisCPM/cpm_tokenizers/__init__.py
================================================
from .bee import CPMBeeTokenizer

================================================
FILE: VisCPM/cpm_tokenizers/bee.py
================================================
# coding=utf-8
# Copyright 2022 The OpenBMB team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pkg_resources
import io
import os
from typing import IO, Dict, List, Optional, Tuple

file_path = os.path.dirname(__file__)


def load_vocab(fp: IO[bytes]) -> Dict[str, int]:
    """Loads a vocabulary file into a dictionary."""
    vocab: Dict[str, int] = {}
    for token in fp.readlines():
        if token[-1] == "\n":
            token = token[:-1]
        if len(token) == 0:
            continue
        vocab[token] = len(vocab)
    return vocab


class Token(object):
    def __init__(self, token: str, start: int, is_unk: bool, is_special: bool):
        self.token = token
        self.start = start
        self.is_unk = is_unk
        self.is_special = is_special

    def __str__(self):
        return "Token(token={}, start={}, is_unk={}, is_special={})".format(
            self.token, self.start, self.is_unk, self.is_special
        )

    def __repr__(self):
        return self.__str__()


class CPMBeeTokenizer(object):
    def __init__(self, old_version=False):
        self.unk_token = "<unk>"
        self.mask_token = "<mask>"
        self.bos_token = "<s>"
        self.eos_token = "</s>"
        self.line_token = "\n"
        self.space_token = " "
        if old_version:
            self.encoder = load_vocab(open(os.path.join(file_path, "vocabs/bee-old.txt"), encoding="utf-8"))
        else:
            self.encoder = load_vocab(open(os.path.join(file_path, "vocabs/bee.txt"), encoding="utf-8"))
        # "\n" and " " take over the ids of the "</n>" and "</_>" vocabulary entries.
        self.encoder[self.line_token] = self.encoder["</n>"]
        self.encoder[self.space_token] = self.encoder["</_>"]
        del self.encoder["</n>"]
        del self.encoder["</_>"]
        self.decoder = {v: k for k, v in self.encoder.items()}
        self._special_tokens = {
            k: v for k, v in self.encoder.items() if k.startswith("<") and k.endswith(">")
        }
        self._max_word_len = max([len(x) for x in self.encoder.keys()])

    def get_piece(self, text: str) -> str:
        # Greedy longest-match: return the longest prefix of `text` that is a
        # non-special vocabulary entry, falling back to the first character.
        text = text[: self._max_word_len]
        len_text = len(text)
        for i in range(len(text)):
            sub = text[: len_text - i]
            if (sub in self.encoder) and (sub not in self._special_tokens):
                return sub
        return text[0]

    @property
    def vocab_size(self):
        return len(self.encoder)

    @property
    def eos_id(self):
        return self.encoder[self.eos_token]

    @property
    def bos_id(self):
        return self.encoder[self.bos_token]

    @property
    def unk_id(self):
        return self.encoder[self.unk_token]

    @property
    def mask_id(self):
        return self.encoder[self.mask_token]

    def __len__(self):
        return len(self.encoder)

    def tokenize(self, text: str) -> List[Token]:
        output_tokens: List[Token] = []
        sentence_split = [""]
        is_escape = False
        is_special_token = False
        for i, c in enumerate(text):
            if is_special_token:
                if c == "<":
                    raise ValueError("Invalid special token at pos {}".format(i))
                elif c == ">":
                    # end of special token
                    sentence_split[-1] += c
                    is_special_token = False
                    sentence_split.append("")
                else:
                    sentence_split[-1] += c
            else:
                if c == "<":
                    if is_escape:
                        # case: <<
                        sentence_split[-1] += c
                        is_escape = False
                    else:
                        # case: x<
                        is_escape = True
                else:
                    if is_escape:
                        # case: <x, start of a special token
                        is_escape = False
                        is_special_token = True
                        sentence_split.append("<" + c)
                    else:
                        # case: xx, ordinary character
                        sentence_split[-1] += c
        # NOTE: the remainder of this method was garbled in the source dump.
        # The segmentation below is a reconstruction (an assumption): greedy
        # longest-match over the split parts via `get_piece`, emitting flags
        # consistent with how `encode` consumes Token.is_unk/is_special.
        start = 0
        for part in sentence_split:
            if len(part) == 0:
                continue
            if part.startswith("<") and part.endswith(">"):
                # a special token, possibly outside the vocabulary
                output_tokens.append(Token(part, start, part not in self.encoder, True))
            else:
                st = 0
                while st < len(part):
                    piece = self.get_piece(part[st:])
                    output_tokens.append(
                        Token(piece, start + st, piece not in self.encoder, False)
                    )
                    st += len(piece)
            start += len(part)
        return output_tokens

    @staticmethod
    def escape(text: str) -> str:
        return text.replace("<", "<<")

    @staticmethod
    def unescape(text: str) -> str:
        return text.replace("<<", "<")

    def encode(
        self, text: str, past_table: Dict[int, str] = {}
    ) -> Tuple[List[int], Dict[int, str]]:
        ext_table_rev: Dict[str, int] = {}
        ext_table: Dict[int, str] = {}
        for idx, val in past_table.items():
            ext_table[idx] = val
            ext_table_rev[val] = idx
        ret = []
        for x in self.tokenize(text):
            if x.is_unk or (x.is_special and (x.token not in self.encoder)):
                # out-of-vocabulary (or unknown special) tokens are assigned
                # ids beyond the vocabulary, recorded in the extension table
                if x.token not in ext_table_rev:
                    ext_table_rev[x.token] = len(ext_table_rev) + self.vocab_size
                    ext_table[ext_table_rev[x.token]] = x.token
                ret.append(ext_table_rev[x.token])
            elif x.token in self.encoder:
                ret.append(self.encoder[x.token])
            else:
                raise ValueError("Unknown token `{}` at pos {}".format(x.token, x.start))
        return ret, ext_table
    def decode(self, tokens: List[int], ext_table: Optional[Dict[int, str]] = None):
        """Decode ids into a string."""
        if ext_table is None:
            ext_table = {}
        ret = []
        for token in tokens:
            if token in ext_table:
                ret.append(ext_table[token])
            else:
                if token >= 0:
                    w = self.decoder[token]
                    if w in self._special_tokens:
                        ret.append(w)
                    else:
                        ret.append(self.escape(w))
        return "".join(ret)

================================================
FILE: VisCPM/cpm_tokenizers/vocabs/bee.txt
================================================