Showing preview only (1,000K chars total). Download the full file or copy to clipboard to get everything.
Repository: QwenLM/Qwen
Branch: main
Commit: 2df8e8ac450f
Files: 114
Total size: 957.2 KB
Directory structure:
gitextract_zv37hsfa/
├── .dockerignore
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.yaml
│ │ ├── config.yaml
│ │ └── feature_request.yaml
│ └── workflows/
│ └── stale.yml
├── .gitignore
├── FAQ.md
├── FAQ_ja.md
├── FAQ_zh.md
├── LICENSE
├── NOTICE
├── README.md
├── README_CN.md
├── README_ES.md
├── README_FR.md
├── README_JA.md
├── Tongyi Qianwen LICENSE AGREEMENT
├── Tongyi Qianwen RESEARCH LICENSE AGREEMENT
├── ascend-support/
│ ├── README.md
│ └── docker_qwen.sh
├── cli_demo.py
├── dcu-support/
│ ├── README.md
│ ├── cli_demo.py
│ ├── cli_demo_batch.py
│ ├── model.properties
│ ├── package/
│ │ ├── fastllm_pytools/
│ │ │ ├── __init__.py
│ │ │ ├── hf_model.py
│ │ │ ├── llm.py
│ │ │ └── torch2flm.py
│ │ └── setup.py
│ ├── qwen2flm.py
│ ├── requirements.txt
│ └── web_demo.py
├── docker/
│ ├── Dockerfile
│ ├── Dockerfile-cu114
│ ├── Dockerfile-cu121
│ ├── docker_cli_demo.sh
│ ├── docker_openai_api.sh
│ └── docker_web_demo.sh
├── eval/
│ ├── EVALUATION.md
│ ├── evaluate_ceval.py
│ ├── evaluate_chat_ceval.py
│ ├── evaluate_chat_gsm8k.py
│ ├── evaluate_chat_humaneval.py
│ ├── evaluate_chat_mmlu.py
│ ├── evaluate_cmmlu.py
│ ├── evaluate_gsm8k.py
│ ├── evaluate_humaneval.py
│ ├── evaluate_mmlu.py
│ ├── evaluate_plugin.py
│ └── gsm8k_prompt.txt
├── examples/
│ ├── add_merges.py
│ ├── auto_comments.md
│ ├── auto_comments.py
│ ├── function_call_examples.py
│ ├── function_call_finetune_examples.py
│ ├── langchain_tooluse.ipynb
│ ├── qwen_extra.tiktoken
│ ├── qwen_extra_vocab.txt
│ ├── react_demo.py
│ ├── react_prompt.md
│ ├── system_prompt.md
│ ├── tokenizer_showcase.ipynb
│ ├── transformers_agent.md
│ └── vllm_wrapper.py
├── finetune/
│ ├── ds_config_zero2.json
│ ├── ds_config_zero3.json
│ ├── finetune_ds.sh
│ ├── finetune_lora_ds.sh
│ ├── finetune_lora_single_gpu.sh
│ ├── finetune_qlora_ds.sh
│ └── finetune_qlora_single_gpu.sh
├── finetune.py
├── openai_api.py
├── recipes/
│ ├── applications/
│ │ ├── chatbot/
│ │ │ └── qwen_chatbot.ipynb
│ │ ├── domain_finetune/
│ │ │ └── qwen_domain_finetune.ipynb
│ │ └── retrieval/
│ │ └── retrieval.ipynb
│ ├── finetune/
│ │ ├── ascend/
│ │ │ └── README.md
│ │ ├── deepspeed/
│ │ │ ├── finetune_fullparameter_multi_gpu.ipynb
│ │ │ ├── finetune_fullparameter_single_gpu.ipynb
│ │ │ ├── finetune_lora_multi_gpu.ipynb
│ │ │ ├── finetune_lora_single_gpu.ipynb
│ │ │ ├── finetune_qlora_multi_gpu.ipynb
│ │ │ ├── finetune_qlora_single_gpu.ipynb
│ │ │ ├── readme.md
│ │ │ └── requirements.txt
│ │ └── swift/
│ │ ├── README.md
│ │ └── README_CN.md
│ ├── inference/
│ │ ├── dashscope/
│ │ │ └── README.md
│ │ ├── hf_modelscope/
│ │ │ └── README.md
│ │ ├── quantization/
│ │ │ └── README.md
│ │ ├── tensorrt/
│ │ │ ├── README.md
│ │ │ └── docker/
│ │ │ └── Dockerfile
│ │ └── vllm/
│ │ ├── README.md
│ │ ├── template_chatml.jinja
│ │ └── vllm_wrapper.py
│ ├── quickstart/
│ │ └── qwen.ipynb
│ └── tests/
│ ├── README.md
│ ├── __init__.py
│ ├── assets/
│ │ └── test_sampled_qwen.json
│ ├── test_finetune/
│ │ └── test_finetune_ds.py
│ ├── test_inference/
│ │ ├── test_inference_api.py
│ │ └── test_inference_vllm_fschat.py
│ ├── ut_config.py
│ └── utils.py
├── requirements.txt
├── requirements_web_demo.txt
├── run_gptq.py
├── tech_memo.md
├── tokenization_note.md
├── tokenization_note_ja.md
├── tokenization_note_zh.md
├── utils.py
└── web_demo.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .dockerignore
================================================
__pycache__
*.so
build
.coverage_*
*.egg-info
*~
.vscode/
.idea/
.git/
.github/
.DS_Store
/private/
/README-docker.md
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.yaml
================================================
name: 🐞 Bug
description: 提交错误报告 | File a bug/issue
title: "[BUG] <title>"
labels: []
body:
- type: checkboxes
attributes:
label: 是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
description: |
请先搜索您遇到的错误是否在已有的issues或讨论中提到过。
Please search to see if an issue / discussion already exists for the bug you encountered.
[Issues](https://github.com/QwenLM/Qwen-7B/issues)
[Discussions](https://github.com/QwenLM/Qwen-7B/discussions)
options:
- label: 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
required: true
- type: checkboxes
attributes:
label: 该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
description: |
请先搜索您遇到的错误是否已在FAQ中有相关解答。
Please search to see if an answer already exists in FAQ for the bug you encountered.
[FAQ-en](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ.md)
[FAQ-zh](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ_zh.md)
options:
- label: 我已经搜索过FAQ | I have searched FAQ
required: true
- type: textarea
attributes:
label: 当前行为 | Current Behavior
description: |
准确描述遇到的行为。
A concise description of what you're experiencing.
validations:
required: false
- type: textarea
attributes:
label: 期望行为 | Expected Behavior
description: |
准确描述预期的行为。
A concise description of what you expected to happen.
validations:
required: false
- type: textarea
attributes:
label: 复现方法 | Steps To Reproduce
description: |
复现当前行为的详细步骤。
Steps to reproduce the behavior.
placeholder: |
1. In this environment...
2. With this config...
3. Run '...'
4. See error...
validations:
required: false
- type: textarea
attributes:
label: 运行环境 | Environment
description: |
examples:
- **OS**: Ubuntu 20.04
- **Python**: 3.8
- **Transformers**: 4.31.0
- **PyTorch**: 2.0.1
- **CUDA**: 11.4
value: |
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
render: Markdown
validations:
required: false
- type: textarea
attributes:
label: 备注 | Anything else?
description: |
您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。
您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。
Links? References? Anything that will give us more context about the issue you are encountering!
Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
validations:
required: false
================================================
FILE: .github/ISSUE_TEMPLATE/config.yaml
================================================
blank_issues_enabled: true
================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.yaml
================================================
name: "💡 Feature Request"
description: 创建新功能请求 | Create a new ticket for a new feature request
title: "💡 [REQUEST] - <title>"
labels: [
"question"
]
body:
- type: input
id: start_date
attributes:
label: "起始日期 | Start Date"
description: |
起始开发日期
Start of development
placeholder: "month/day/year"
validations:
required: false
- type: textarea
id: implementation_pr
attributes:
label: "实现PR | Implementation PR"
description: |
实现该功能的Pull request
Pull request used
placeholder: "#Pull Request ID"
validations:
required: false
- type: textarea
id: reference_issues
attributes:
label: "相关Issues | Reference Issues"
description: |
与该功能相关的issues
Common issues
placeholder: "#Issues IDs"
validations:
required: false
- type: textarea
id: summary
attributes:
label: "摘要 | Summary"
description: |
简要描述新功能的特点
Provide a brief explanation of the feature
placeholder: |
Describe in a few lines your feature request
validations:
required: true
- type: textarea
id: basic_example
attributes:
label: "基本示例 | Basic Example"
description: Indicate here some basic examples of your feature.
placeholder: A few specific words about your feature request.
validations:
required: true
- type: textarea
id: drawbacks
attributes:
label: "缺陷 | Drawbacks"
description: |
该新功能有哪些缺陷/可能造成哪些影响?
What are the drawbacks/impacts of your feature request ?
placeholder: |
Identify the drawbacks and impacts while being neutral on your feature request
validations:
required: true
- type: textarea
id: unresolved_question
attributes:
label: "未解决问题 | Unresolved questions"
description: |
有哪些尚未解决的问题?
What questions still remain unresolved ?
placeholder: |
Identify any unresolved issues.
validations:
required: false
================================================
FILE: .github/workflows/stale.yml
================================================
name: Close stale issues
on:
schedule:
- cron: "0 8 * * *"
jobs:
close-issues:
runs-on: ubuntu-latest
permissions:
actions: write
issues: write
steps:
- uses: actions/stale@v9
with:
days-before-issue-stale: 30
days-before-issue-close: 7
stale-issue-label: inactive
stale-issue-message: This issue has been automatically marked as inactive due to lack of recent activity.
Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
此问题由于长期未有新进展而被系统自动标记为不活跃。如果您认为它仍有待解决,请在此帖下方留言以补充信息。
days-before-pr-stale: -1
days-before-pr-close: -1
operations-per-run: 128
repo-token: ${{ secrets.GITHUB_TOKEN }}
================================================
FILE: .gitignore
================================================
__pycache__
*.so
build
.coverage_*
*.egg-info
*~
.vscode/
.idea/
.DS_Store
/private/
================================================
FILE: FAQ.md
================================================
# FAQ
## Installation & Environment
#### Failure in installing flash attention
Flash attention is an option for accelerating training and inference. Only NVIDIA GPUs of Turing, Ampere, Ada, and Hopper architecture, e.g., H100, A100, RTX 3090, T4, RTX 2080, can support flash attention. **You can use our models without installing it.**
#### Which version of transformers should I use?
4.32.0 is preferred.
#### I downloaded the codes and checkpoints but I can't load the model locally. What should I do?
Please check if you have updated the code to the latest, and correctly downloaded all the sharded checkpoint files.
#### `qwen.tiktoken` is not found. What is it?
This is the merge file of the tokenizer. You have to download it. Note that if you just git clone the repo without [git-lfs](https://git-lfs.com), you cannot download this file.
#### transformers_stream_generator/tiktoken/accelerate not found
Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt).
<br><br>
## Demo & Inference
#### Is there any demo? CLI demo and Web UI demo?
Yes, see `web_demo.py` for web demo and `cli_demo.py` for CLI demo. See README for more information.
#### Can I use CPU only?
Yes, run `python cli_demo.py --cpu-only` will load the model and inference on CPU only.
#### Can Qwen support streaming?
Yes. See the function `chat_stream` in `modeling_qwen.py`.
#### Gibberish in result when using chat_stream().
This is because tokens represent bytes and a single token may be a meaningless string. We have updated the default setting of our tokenizer to avoid such decoding results. Please update the code to the latest version.
#### It seems that the generation is not related to the instruction...
Please check if you are loading Qwen-Chat instead of Qwen. Qwen is the base model without alignment, which behaves differently from the SFT/Chat model.
#### Is quantization supported?
Yes, the quantization is supported by AutoGPTQ.
#### Slow when processing long sequences
Updating the code to the latest version can help.
#### Unsatisfactory performance in processing long sequences
Please ensure that NTK is applied. `use_dynamc_ntk` and `use_logn_attn` in `config.json` should be set to `true` (`true` by default).
<br><br>
## Finetuning
#### Can Qwen support SFT or even RLHF?
Yes, we now support SFT, including full-parameter finetuning, LoRA, and Q-LoRA. Also you can check other projects like [FastChat](**[https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat)), [Firefly]([https://github.com/yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly)), [**LLaMA Efficient Tuning**]([https://github.com/hiyouga/LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)), etc.
However, temporarily we do not support RLHF. We will provide the code in the near future.
<br><br>
## Tokenizer
#### bos_id/eos_id/pad_id not found
In our training, we only use `<|endoftext|>` as the separator and padding token. You can set bos_id, eos_id, and pad_id to tokenizer.eod_id. Learn more about our tokenizer from our documents about the tokenizer.
## Docker
#### Download official docker image is very slow
When downloading our official docker image, you may have a slow download speed due to some network issues. You can refer to [Alibaba Cloud Container Image Service](https://help.aliyun.com/zh/acr/user-guide/accelerate-the-pulls-of-docker-official-images) to accelerate the download of official images.
================================================
FILE: FAQ_ja.md
================================================
# FAQ
## インストールと環境
#### Flash attention 導入の失敗例
Flash attention は、トレーニングと推論を加速するオプションです。H100、A100、RTX 3090、T4、RTX 2080 などの Turing、Ampere、Ada、および Hopper アーキテクチャの NVIDIA GPU だけが、flash attention をサポートできます。それをインストールせずに私たちのモデルを使用することができます。
#### transformers のバージョンは?
4.32.0 が望ましいです。
#### コードとチェックポイントをダウンロードしましたが、モデルをローカルにロードできません。どうすればよいでしょうか?
コードを最新のものに更新し、すべてのシャードされたチェックポイントファイルを正しくダウンロードしたかどうか確認してください。
#### `qwen.tiktoken` が見つかりません。これは何ですか?
これはトークナイザーのマージファイルです。ダウンロードする必要があります。[git-lfs](https://git-lfs.com) を使わずにリポジトリを git clone しただけでは、このファイルをダウンロードできないことに注意してください。
#### transformers_stream_generator/tiktoken/accelerate が見つかりません。
コマンド `pip install -r requirements.txt` を実行してください。このファイルは [https://github.com/QwenLM/Qwen/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt) にあります。
<br><br>
## デモと推論
#### デモはありますか?CLI と Web UI のデモはありますか?
はい、Web デモは `web_demo.py` を、CLI デモは `cli_demo.py` を参照してください。詳しくは README を参照してください。
#### CPU のみを使うことはできますか?
はい、`python cli_demo.py --cpu-only` を実行すると、CPU のみでモデルと推論をロードします。
#### Qwen はストリーミングに対応していますか?
`modeling_qwen.py` の `chat_stream` 関数を参照してください。
#### chat_stream() を使用すると、結果に文字化けが発生します。
これは、トークンがバイトを表し、単一のトークンが無意味な文字列である可能性があるためです。このようなデコード結果を避けるため、トークナイザのデフォルト設定を更新しました。コードを最新版に更新してください。
#### インストラクションとは関係ないようですが...
Qwen ではなく Qwen-Chat を読み込んでいないか確認してください。Qwen はアライメントなしのベースモデルで、SFT/Chat モデルとは挙動が異なります。
#### 量子化はサポートされていますか?
はい、量子化は AutoGPTQ でサポートされています。
#### 長いシーケンスの処理に時間がかかる
コードを最新版に更新することで解決します。
#### 長いシーケンスの処理で不満足なパフォーマンス
NTK が適用されていることを確認してください。`config.json` の `use_dynamc_ntk` と `use_logn_attn` を `true` に設定する必要があります(デフォルトでは `true`)。
<br><br>
## ファインチューニング
#### Qwen は SFT、あるいは RLHF に対応できますか?
SFTのコードは提供します。[FastChat](**[https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat))、[Firefly]([https://github.com/yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly))、[**LLaMA Efficient Tuning**]([https://github.com/hiyouga/LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning))など、いくつかのプロジェクトではファインチューニングをサポートしています。近日中に関連コードを更新する予定です。
<br><br>
## トークナイザー
#### bos_id/eos_id/pad_id が見つかりません。
私たちのトレーニングでは、セパレータとパディングトークンとして `<|endoftext|>` のみを使用しています。bos_id、eos_id、pad_id は tokenizer.eod_id に設定できます。私たちのトークナイザーについて詳しくは、トークナイザーについてのドキュメントをご覧ください。
================================================
FILE: FAQ_zh.md
================================================
# FAQ
## 安装&环境
#### flash attention 安装失败
flash attention是一个用于加速模型训练推理的可选项,且仅适用于Turing、Ampere、Ada、Hopper架构的Nvidia GPU显卡(如H100、A100、RTX 3090、T4、RTX 2080),您可以在不安装flash attention的情况下正常使用模型进行推理。
#### 我应该用哪个transformers版本?
建议使用4.32.0。
#### 我把模型和代码下到本地,按照教程无法使用,该怎么办?
答:别着急,先检查你的代码是不是更新到最新版本,然后确认你是否完整地将模型checkpoint下到本地。
#### `qwen.tiktoken`这个文件找不到,怎么办?
这个是我们的tokenizer的merge文件,你必须下载它才能使用我们的tokenizer。注意,如果你使用git clone却没有使用git-lfs,这个文件不会被下载。如果你不了解git-lfs,可点击[官网](https://git-lfs.com/)了解。
#### transformers_stream_generator/tiktoken/accelerate,这几个库提示找不到,怎么办?
运行如下命令:`pip install -r requirements.txt`。相关依赖库在[https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen/blob/main/requirements.txt) 可以找到。
<br><br>
## Demo & 推理
#### 是否提供Demo?CLI Demo及Web UI Demo?
`web_demo.py`和`cli_demo.py`分别提供了Web UI以及CLI的Demo。请查看README相关内容了解更多。
#### 我没有GPU,只用CPU运行CLI demo可以吗?
可以的,运行`python cli_demo.py --cpu-only`命令即可将模型读取到CPU并使用CPU进行推理。
#### Qwen支持流式推理吗?
Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数。
#### 使用`chat_stream()`生成混乱的内容及乱码,为什么?
这是由于模型生成过程中输出的部分token需要与后续token一起解码才能输出正常文本,单个token解码结果是无意义字符串,我们已经更新了tokenizer解码时的默认设置,避免这些字符串在生成结果中出现,如果仍有类似问题请更新模型至最新版本。
#### 模型的输出看起来与输入无关/没有遵循指令/看起来呆呆的
请检查是否加载的是Qwen-Chat模型进行推理,Qwen模型是未经align的预训练基模型,不期望具备响应用户指令的能力。我们在模型最新版本已经对`chat`及`chat_stream`接口内进行了检查,避免您误将预训练模型作为SFT/Chat模型使用。
#### 是否有量化版本模型
目前Qwen支持基于AutoGPTQ的4-bit的量化推理。
#### 生成序列较长后速度显著变慢
请更新到最新代码。
#### 处理长序列时效果有问题
请确认是否开启ntk。若要启用这些技巧,请将`config.json`里的`use_dynamc_ntk`和`use_logn_attn`设置为`true`。最新代码默认为`true`。
<br><br>
## 微调
#### 当前是否支持SFT和RLHF?
我们目前提供了SFT的代码,支持全参数微调、LoRA和Q-LoRA。此外,当前有多个外部项目也已实现支持,如[FastChat](**[https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat))、[Firefly]([https://github.com/yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly))、[**LLaMA Efficient Tuning**]([https://github.com/hiyouga/LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning))等。我们会尽快更新这部分代码和说明。
我们还没提供对RLHF训练的支持,敬请期待。
<br><br>
## Tokenizer
#### bos_id/eos_id/pad_id,这些token id不存在,为什么?
在训练过程中,我们仅使用<|endoftext|>这一token作为sample/document之间的分隔符及padding位置占位符,你可以将bos_id, eos_id, pad_id均指向tokenizer.eod_id。请阅读我们关于tokenizer的文档,了解如何设置这些id。
## Docker
#### 下载官方Docker镜像速度很慢
在下载官方镜像时,您可能由于某些网络原因导致下载速度变慢。可以参考[阿里云容器镜像服务](https://help.aliyun.com/zh/acr/user-guide/accelerate-the-pulls-of-docker-official-images)加速官方镜像的下载。
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2023 Alibaba Cloud
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: NOTICE
================================================
------------- LICENSE FOR NVIDIA Megatron-LM code --------------
Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
------------- LICENSE FOR OpenAI tiktoken code --------------
MIT License
Copyright (c) 2022 OpenAI, Shantanu Jain
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
------------- LICENSE FOR stanford_alpaca code --------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
------------- LICENSE FOR PanQiWei AutoGPTQ code --------------
MIT License
Copyright (c) 2023 潘其威(William)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
<p align="left">
<a href="README_CN.md">中文</a>  |  English  |  <a href="README_JA.md">日本語</a> |  <a href="README_FR.md">Français</a> |  <a href="README_ES.md">Español</a>
</p>
<br><br>
<p align="center">
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" width="400"/>
<p>
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>   |    📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a>    |   🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-72B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">WeChat (微信)</a>   |   <a href="https://discord.gg/CV4E9rpNSD">Discord</a>   |   <a href="https://dashscope.aliyun.com">API</a>
</p>
<br><br>
> [!Important]
> Qwen2 is here! You are welcome to follow [QwenLM/Qwen2](https://github.com/QwenLM/Qwen2) and share your experience there.
>
> This repo ([QwenLM/Qwen](https://github.com/QwenLM/Qwen)) is no longer actively maintained, due to substantial codebase differences.
<br>
| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 1.8B | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B">🤗</a> |
| 7B | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |
| 72B | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B">🤗</a> |
We opensource our **Qwen** series, now including **Qwen**, the base language models, namely **Qwen-1.8B**, **Qwen-7B**, **Qwen-14B**, and **Qwen-72B**, as well as **Qwen-Chat**, the chat models, namely **Qwen-1.8B-Chat**, **Qwen-7B-Chat**, **Qwen-14B-Chat**, and **Qwen-72B-Chat**. Links are on the above table. Click them and check the model cards. Also, we release the **[technical report](https://arxiv.org/abs/2309.16609)**. Please click the paper link and check it out!
In brief, we have strong base language models, which have been stably pretrained for up to 3 trillion tokens of multilingual data with a wide coverage of domains, languages (with a focus on Chinese and English), etc. They are able to achieve competitive performance on benchmark datasets. Additionally, we have chat models that are aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and are able to use tools, play as agents, or even play as code interpreters, etc.
| Model | Release Date | Max Length | System Prompt Enhancement | # of Pretrained Tokens | Minimum GPU Memory Usage of Finetuning (Q-Lora) | Minimum GPU Usage of Generating 2048 Tokens (Int4) | Tool Usage |
|:----------|:------------:|:----------:|:-------------------------:|:----------------------:|:-----------------------------------------------:|:--------------------------------------------------:|:----------:|
| Qwen-1.8B | 23.11.30 | 32K | ✅ | 2.2T | 5.8GB | 2.9GB | ✅ |
| Qwen-7B | 23.08.03 | 32K | ❎ | 2.4T | 11.5GB | 8.2GB | ✅ |
| Qwen-14B | 23.09.25 | 8K | ❎ | 3.0T | 18.7GB | 13.0GB | ✅ |
| Qwen-72B | 23.11.30 | 32K | ✅ | 3.0T | 61.4GB | 48.9GB | ✅ |
In this repo, you can figure out:
* Quickstart with Qwen, and enjoy the simple inference.
* Details about the quantization models, including GPTQ and KV cache quantization.
* Statistics of inference performance, including speed and memory.
* Tutorials on finetuning, including full-parameter tuning, LoRA, and Q-LoRA.
* Instructions on deployment, with the example of vLLM and FastChat.
* Instructions on building demos, including WebUI, CLI demo, etc.
* Introduction to DashScope API service, as well as the instructions on building an OpenAI-style API for your model.
* Information about Qwen for tool use, agent, and code interpreter
* Statistics of long-context understanding evaluation
* License agreement
* ...
Also, if you meet problems, turn to [FAQ](FAQ.md) for help first. Still feeling struggled? Feel free to shoot us issues (better in English so that more people can understand you)! If you would like to help us, send us pull requests with no hesitation! We are always excited about PR!
Would like to chat with us or date us coffee time? Welcome to our Discord or WeChat!
<br><br>
## News and Updates
* 2023.11.30 🔥 We release **Qwen-72B** and **Qwen-72B-Chat**, which are trained on 3T tokens and support 32k context, along with **Qwen-1.8B**, and **Qwen-1.8B-Chat**, on ModelScope and Hugging Face. We have also strengthened the System Prompt capabilities of the Qwen-72B-Chat and Qwen-1.8B-Chat, see [example documentation](examples/system_prompt.md). Additionally, support the inference on **Ascend 910** and **Hygon DCU**. Check `ascend-support` and `dcu-support` for more details.
* 2023.10.17 We release the Int8 quantized model **Qwen-7B-Chat-Int8** and **Qwen-14B-Chat-Int8**.
* 2023.9.25 🔥 We release **Qwen-14B** and **Qwen-14B-Chat** on ModelScope and Hugging Face, along with [qwen.cpp](https://github.com/QwenLM/qwen.cpp) and [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). Codes and checkpoints of **Qwen-7B** and **Qwen-7B-Chat** are also updated. **PLEASE PULL THE LATEST VERSION!**
- Compared to **Qwen-7B** (original), **Qwen-7B** uses more training tokens, increasing from 2.2T tokens to 2.4T tokens, while the context length extends from 2048 to 8192. The Chinese knowledge and coding ability of **Qwen-7B** have been further improved.
* 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA and Q-LoRA.
* 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which requires low memory costs but achieves improved inference speed. Besides, there is no significant performance degradation on the benchmark evaluation.
* 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
<br>
## Performance
Qwen models outperform the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models’ capabilities on natural language understanding, mathematic problem solving, coding, etc. Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.
<p align="left">
<img src="assets/radar_72b.jpg" width=600px/>
<p>
<br>
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** |
For all compared models, we report the best scores between their official reported results and [OpenCompass](https://opencompass.org.cn/leaderboard-llm).
For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical report by clicking [here](https://qianwen-res.oss-cn-beijing.aliyuncs.com/QWEN_TECHNICAL_REPORT.pdf).
<br><br>
## Requirements
* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above are recommended
* transformers 4.32 and above
* CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
<br>
## Quickstart
Below, we provide simple examples to show how to use Qwen-Chat with 🤖 ModelScope and 🤗 Transformers.
You can use our pre-built docker images to skip most of the environment setup steps, see Section ["Using Pre-built Docker Images"](#-docker) for more details.
If not using docker, please make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.
```bash
pip install -r requirements.txt
```
If your device supports fp16 or bf16, we recommend installing [flash-attention](https://github.com/Dao-AILab/flash-attention) (**we support flash attention 2 now.**) for higher efficiency and lower memory usage. (**flash-attention is optional and the project can run normally without installing it**)
```bash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# If the version of flash-attn is higher than 2.1.1, the following is not needed.
# pip install csrc/rotary
```
Now you can start with ModelScope or Transformers.
### 🤗 Transformers
To use Qwen-Chat for the inference, all you need to do is to input a few lines of codes as demonstrated below. Remember to pass in the correct model names or paths, such as "Qwen/Qwen-7B-Chat" and "Qwen/Qwen-14B-Chat". However, **please make sure that you are using the latest code.**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat",
device_map="auto",
trust_remote_code=True
).eval()
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好!很高兴为你提供帮助。
# 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
# 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
# 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
# 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业:一个年轻人的成功之路》
```
Running Qwen, the base language model, is also simple.
<details>
<summary>Running Qwen</summary>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B",
device_map="auto",
trust_remote_code=True
).eval()
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
```
</details>
<p id="DownloadModel">
In the event of a network issue while attempting to download model checkpoints and codes from HuggingFace, an alternative approach is to initially fetch the checkpoint from ModelScope and then load it from the local directory as outlined below:
</p>
```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-7B')
# model_dir = snapshot_download('qwen/Qwen-7B-Chat')
# model_dir = snapshot_download('qwen/Qwen-14B')
model_dir = snapshot_download('qwen/Qwen-14B-Chat')
# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_dir,
device_map="auto",
trust_remote_code=True
).eval()
```
### 🤖 ModelScope
ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig
# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "浙江的省会在哪里?", history=history)
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
print(response)
```
### Batch Inference
Qwen supports batch inference. With flash attention enabled, using batch inference can bring a 40% speedup. The example code is shown below:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig
from qwen_generation_utils import make_context, decode_tokens, get_stop_words_ids
# To generate attention masks automatically, it is necessary to assign distinct
# token_ids to pad_token and eos_token, and set pad_token_id in the generation_config.
tokenizer = AutoTokenizer.from_pretrained(
'./',
pad_token='<|extra_0|>',
eos_token='<|endoftext|>',
padding_side='left',
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
'./',
pad_token_id=tokenizer.pad_token_id,
device_map="auto",
trust_remote_code=True
).eval()
model.generation_config = GenerationConfig.from_pretrained('./', pad_token_id=tokenizer.pad_token_id)
all_raw_text = ["我想听你说爱我。", "今天我想吃点啥,甜甜的,推荐下", "我马上迟到了,怎么做才能不迟到"]
batch_raw_text = []
for q in all_raw_text:
raw_text, _ = make_context(
tokenizer,
q,
system="You are a helpful assistant.",
max_window_size=model.generation_config.max_window_size,
chat_format=model.generation_config.chat_format,
)
batch_raw_text.append(raw_text)
batch_input_ids = tokenizer(batch_raw_text, padding='longest')
batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)
batch_out_ids = model.generate(
batch_input_ids,
return_dict_in_generate=False,
generation_config=model.generation_config
)
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
batch_response = [
decode_tokens(
batch_out_ids[i][padding_lens[i]:],
tokenizer,
raw_text_len=len(batch_raw_text[i]),
context_length=(batch_input_ids[i].size(0)-padding_lens[i]),
chat_format="chatml",
verbose=False,
errors='replace'
) for i in range(len(all_raw_text))
]
print(batch_response)
response, _ = model.chat(tokenizer, "我想听你说爱我。", history=None)
print(response)
response, _ = model.chat(tokenizer, "今天我想吃点啥,甜甜的,推荐下", history=None)
print(response)
response, _ = model.chat(tokenizer, "我马上迟到了,怎么做才能不迟到", history=None)
print(response)
```
### CPU
To deploy our models on CPU, we strongly advise you to use [qwen.cpp](https://github.com/QwenLM/qwen.cpp), which is a pure C++ implementation of Qwen and tiktoken. Check the repo for more details!
Also, it is also simple to directly run the model on CPU, which requires your specification of device:
```python
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
```
However, it is likely that you suffer from extremely low inference efficiency.
### Multiple GPUs
If you suffer from lack of GPU memory and you would like to run the model on more than 1 GPU, you can directly use the default loading method, which is now supported by Transformers. The previous method based on `utils.py` is deprecated.
However, though this method is simple, the efficiency of the native pipeline parallelism is low. We advise you to use vLLM with FastChat and please read the section for deployment.
### x86 Platforms
When deploy on Core™/Xeon® Scalable Processors or with Arc™ GPU, [OpenVINO™ Toolkit](https://docs.openvino.ai/2023.3/gen_ai_guide.html) is recommended. You can install and run this [example notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/254-llm-chatbot). For related issues, you are welcome to file an issue at [OpenVINO repo](https://github.com/openvinotoolkit/openvino_notebooks/issues).
### DashScope
The most simple way to use Qwen through APIs is DashScope API service through Alibaba Cloud. We give an introduction to the usage. Additionally, we provide a script for you to deploy an OpenAI-style API on your own servers.
DashScope is the large language model API service provided by Alibaba Cloud, which now supports Qwen. Note that the models behind DashScope are in-house versions temporarily without details provided. The services include `qwen-turbo` and `qwen-plus`, where the former one runs faster and the latter achieves better performance. For more information, visit the documentation [here](https://dashscope.aliyun.com).
Please head to the official website [link](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) to create a DashScope account and obtain the API key (AK). We recommend setting the AK with an environment variable:
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
Then please install the packages and click [here](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) for the documentation. If you use Python, you can install DashScope with pip:
```bash
pip install dashscope
```
If you use JAVA SDK, you can install it in this way:
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>the-latest-version</version>
</dependency>
```
The simplest way to use DashScope is the usage with messages, which is similar to OpenAI API. The example is demonstrated below:
```python
import random
from http import HTTPStatus
from dashscope import Generation
def call_with_messages():
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
gen = Generation()
response = gen.call(
Generation.Models.qwen_turbo,
messages=messages,
seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set
result_format='message', # set the result to be "message" format.
)
return response
if __name__ == '__main__':
response = call_with_messages()
if response.status_code == HTTPStatus.OK:
print(response)
else:
print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
```
For more usages, please visit the official website for more details.
<br><br>
## Quantization
### GPTQ
We provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release the Int4 and Int8 quantized models, which achieve nearly lossless model effects but improved performance on both memory costs and inference speed.
Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:
```bash
pip install auto-gptq optimum
```
If you meet problems installing `auto-gptq`, we advise you to check out the official [repo](https://github.com/PanQiWei/AutoGPTQ) to find a wheel.
> Note: The pre-compiled `auto-gptq` packages strongly depend on the version of `torch` and its CUDA version. Moreover, due to recent update,
> you may also encounter unsupported version errors from `transformers`, `optimum`, or `peft`.
> We recommend using the latest versions meeting the following requirements:
> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
Then you can load the quantized model easily and run inference as same as usual:
```python
# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat-Int4",
device_map="auto",
trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
```
We illustrate the model performance of both BF16, Int8 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
### Quantization of KV cache
> NOTE: Please be aware that due to the internal mechanism of Hugging Face, the support files for this functionality
> (i.e., `cache_autogptq_cuda_256.cpp` and `cache_autogptq_cuda_kernel_256.cu`) may be missing. Please manually download
> them from the Hugging Face Hub and place them into the same folder as the other module files.
The attention KV cache can be quantized and compressed for storage, to get a higher sample throughput. The arguments `use_cache_quantization` and `use_cache_kernel` in `config.json` are provided to enable KV cache quantization. The specific use method is as follows:
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat",
device_map="auto",
trust_remote_code=True,
use_cache_quantization=True,
use_cache_kernel=True,
use_flash_attn=False
)
```
Attention: Currently, KV cache quantization and flash attention cannot be used at the same time.
If you enable KV cache quantization and flash attention at the same time (`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`), `use_flash_attn` is disabled by default (`use_flash_attn=false`).
We have verified that the use of the quantized Int8-KV-Cache model does not suffer from significant performance degradation in downstream evaluation. In the following, we focus on profiling its memory footprint in different conditions.
The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.
We use BF16 models to generate 1024 tokens by default, and "OOM" indicates out-of-memory error.
With KV cache quantization, the model can infer with a larger batch size (bs).
| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
|--------------|:------:|:------:|:------:|:------:|:------:|:------:|
| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | OOM | OOM |
| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
With KV cache quantization the model can save more memory when generating longer sequence (`sl`, sequence length, referring to the number of tokens generated) at the stage of inference.
| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|--------------|:------:|:-------:|:-------:|:-------:|:-------:|
| No | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| Yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
The model with KV cache quantization will convert the format of `layer_past` from float to int8, and meanwhile the quantized `layer-past` will also store the quantization parameters.
Specific steps are as follows:
1. Quantize key/value
```
qv,scale,zero_point=quantize_cache_v(v)
```
2. Store into layer_past
The following is the format of quantized `layer_past`:
```
layer_past=((q_key,key_scale,key_zero_point),
(q_value,value_scale,value_zero_point))
```
The original format of `layer_past` is shown below:
```
layer_past=(key,value)
```
If you want to use the attention KV which is quantized, you can use the dequantization operation to convert the Int8 key/value back to the float format as follows:
```
v=dequantize_cache_torch(qv,scale,zero_point)
```
<br>
## Inference Performance
This section provides the statistics of speed and memory of models in different precisions. The speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
We measured the average inference speed (tokens/s) and GPU memory usage of generating 2048 with the models in BF16, Int8, and Int4.
<table>
<tr>
<td>Model Size</td>
<td>Quantization</td>
<td>Speed (Tokens/s)</td>
<td>GPU Memory Usage</td>
</tr>
<tr>
<td rowspan="3">1.8B</td>
<td>BF16</td>
<td>54.09</td>
<td>4.23GB</td>
</tr>
<tr>
<td>Int8</td>
<td>55.56</td>
<td>3.48GB</td>
</tr>
<tr>
<td>Int4</td>
<td>71.07</td>
<td>2.91GB</td>
</tr>
<tr>
<td rowspan="3">7B</td>
<td>BF16</td>
<td>40.93</td>
<td>16.99GB</td>
</tr>
<tr>
<td>Int8</td>
<td>37.47</td>
<td>11.20GB</td>
</tr>
<tr>
<td>Int4</td>
<td>50.09</td>
<td>8.21GB</td>
</tr>
<tr>
<td rowspan="3">14B</td>
<td>BF16</td>
<td>32.22</td>
<td>30.15GB</td>
</tr>
<tr>
<td>Int8</td>
<td>29.28</td>
<td>18.81GB</td>
</tr>
<tr>
<td>Int4</td>
<td>38.72</td>
<td>13.01GB</td>
</tr>
<tr>
<td rowspan="3">72B</td>
<td>BF16</td>
<td>8.48</td>
<td>144.69GB (2xA100)</td>
</tr>
<tr>
<td>Int8</td>
<td>9.05</td>
<td>81.27GB (2xA100)</td>
</tr>
<tr>
<td>Int4</td>
<td>11.32</td>
<td>48.86GB</td>
</tr>
<tr>
<td>72B + vLLM</td>
<td>BF16</td>
<td>17.60</td>
<td>2xA100</td>
</tr>
</table>
The profiling runs on a single A100-SXM4-80G GPU (except 2xA100 is mentioned) with PyTorch 2.0.1, CUDA 11.8, and Flash-Attention 2. (72B + vLLM uses PyTorch 2.1.0 and Cuda 11.8.) The inference speed is averaged over the encoded and generated tokens.
Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of the model loaded using ``AutoModelForCausalLM.from_pretrained`` will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available.
We also measure the inference speed and GPU memory usage with different settings of context and generation lengths, Flash-Attention version. You can find the results in the according modelcards on Hugging Face or ModelScope.
## Finetuning
### Usage
Now we provide the official training script, `finetune.py`, for users to finetune the pretrained model for downstream applications in a simple fashion. Additionally, we provide shell scripts to launch finetuning with no worries. This script supports the training with [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). The shell scripts that we provide use DeepSpeed (Note: this may have conflicts with the latest version of pydantic and you should use make sure `pydantic<2.0`) and Peft. You can install them by:
```bash
pip install "peft<0.8.0" deepspeed
```
To prepare your training data, you need to put all the samples into a list and save it to a json file. Each sample is a dictionary consisting of an id and a list for conversation. Below is a simple example list with 1 sample:
```json
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是一个语言模型,我叫通义千问。"
}
]
}
]
```
After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, `$DATA`.
The finetuning scripts allow you to perform:
- Full-parameter finetuning
- LoRA
- Q-LoRA
Full-parameter finetuning requires updating all parameters in the whole training process. To launch your training, run the following script:
```bash
# Distributed training. We do not provide single-GPU training script as the insufficient GPU memory will break down the training.
bash finetune/finetune_ds.sh
```
Remember to specify the correct model name or path, the data path, as well as the output directory in the shell scripts. Another thing to notice is that we use DeepSpeed ZeRO 3 in this script. If you want to make changes, just remove the argument `--deepspeed` or make changes in the DeepSpeed configuration json file based on your requirements. Additionally, this script supports mixed-precision training, and thus you can use `--bf16 True` or `--fp16 True`. Remember to use DeepSpeed when you use fp16 due to mixed precision training. Empirically we advise you to use bf16 to make your training consistent with our pretraining and alignment if your machine supports bf16, and thus we use it by default.
Similarly, to run LoRA, use another script to run as shown below. Before you start, make sure that you have installed `peft`. Also, you need to specify your paths to your model, data, and output. We advise you to use absolute path for your pretrained model. This is because LoRA only saves the adapter and the absolute path in the adapter configuration json file is used for finding out the pretrained model to load. Also, this script support both bf16 and fp16.
```bash
# Single GPU training
bash finetune/finetune_lora_single_gpu.sh
# Distributed training
bash finetune/finetune_lora_ds.sh
```
In comparison with full-parameter finetuning, LoRA ([paper](https://arxiv.org/abs/2106.09685)) only updates the parameters of adapter layers but keeps the original large language model layers frozen. This allows much fewer memory costs and thus fewer computation costs.
Note that if you use LoRA to finetune the base language model, e.g., Qwen-7B, instead of chat models, e.g., Qwen-7B-Chat, the script automatically switches the embedding and output layer as trainable parameters. This is because the base language model has no knowledge of special tokens brought by ChatML format. Thus these layers should be updated for the model to understand and predict the tokens. Or in another word, if your training brings in special tokens in LoRA, you should set the layers to trainable parameters by setting `modules_to_save` inside the code. Also, if we have these parameters trainable, it is not available to use ZeRO 3, and this is why we use ZeRO 2 in the script by default. If you do not have new trainable parameters, you can switch to ZeRO 3 by changing the DeepSpeed configuration file. Additionally, we find that there is a significant gap between the memory footprint of LoRA with and without these trainable parameters. Therefore, if you have trouble with memory, we advise you to LoRA finetune the chat models. Check the profile below for more information.
If you still suffer from insufficient memory, you can consider Q-LoRA ([paper](https://arxiv.org/abs/2305.14314)), which uses the quantized large language model and other techniques such as paged attention to allow even fewer memory costs.
Note: to run single-GPU Q-LoRA training, you may need to install `mpi4py` through `pip` or `conda`.
To run Q-LoRA, directly run the following script:
```bash
# Single GPU training
bash finetune/finetune_qlora_single_gpu.sh
# Distributed training
bash finetune/finetune_qlora_ds.sh
```
For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. You **SHOULD NOT** use the bf16 models. Different from full-parameter finetuning and LoRA, only fp16 is supported for Q-LoRA. For single-GPU training, we have to use DeepSpeed for mixed-precision training due to our observation of errors caused by torch amp. Besides, for Q-LoRA, the troubles with the special tokens in LoRA still exist. However, as we only provide the Int4 models for chat models, which means the language model has learned the special tokens of ChatML format, you have no worry about the layers. Note that the layers of the Int4 model should not be trainable, and thus if you introduce special tokens in your training, Q-LoRA might not work.
> NOTE: Please be aware that due to the internal mechanisms of Hugging Face, certain non-Python files (e.g., `*.cpp` and `*.cu`)
> may be missing from the saved checkpoint. You may need to manually copy them to the directory containing other files.
Different from full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Suppose your training starts from Qwen-7B, you can load the finetuned model for inference as shown below:
```python
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
path_to_adapter, # path to the output directory
device_map="auto",
trust_remote_code=True
).eval()
```
> NOTE: If `peft>=0.8.0`, it will try to load the tokenizer as well, however, initialized without `trust_remote_code=True`, leading to `ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported.` Currently, you could downgrade `peft<0.8.0` or move tokenizer files elsewhere to workaround this issue.
If you want to merge the adapters and save the finetuned model as a standalone model (you can only do this with LoRA, and you CANNOT merge the parameters from Q-LoRA), you can run the following codes:
```python
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
path_to_adapter, # path to the output directory
device_map="auto",
trust_remote_code=True
).eval()
merged_model = model.merge_and_unload()
# max_shard_size and safe serialization are not necessary.
# They respectively work for sharding checkpoint and save the model to safetensors
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```
The `new_model_directory` directory will contain the merged model weights and module files. Please note that `*.cu` and `*.cpp` files may be missing in the saved files. If you wish to use the KV cache functionality, please manually copy them. Besides, the tokenizer files are not saved in the new directory in this step. You can copy the tokenizer files or use the following code
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
path_to_adapter, # path to the output directory
trust_remote_code=True
)
tokenizer.save_pretrained(new_model_directory)
```
Note: For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine. Besides, we advise you to specify your maximum sequence length with the argument `--model_max_length`, based on your consideration of data, memory footprint, and training speed.
### Quantize Fine-tuned Models
This section applies to full-parameter/LoRA fine-tuned models. (Note: You do not need to quantize the Q-LoRA fine-tuned model because it is already quantized.)
If you use LoRA, please follow the above instructions to merge your model before quantization.
We recommend using [auto_gptq](https://github.com/PanQiWei/AutoGPTQ) to quantize the finetuned model.
```bash
pip install auto-gptq optimum
```
Note: Currently AutoGPTQ has a bug referred in [this issue](https://github.com/PanQiWei/AutoGPTQ/issues/370). Here is a [workaround PR](https://github.com/PanQiWei/AutoGPTQ/pull/495), and you can pull this branch and install from the source.
First, prepare the calibration data. You can reuse the fine-tuning data, or use other data following the same format.
Second, run the following script:
```bash
python run_gptq.py \
--model_name_or_path $YOUR_LORA_MODEL_PATH \
--data_path $DATA \
--out_path $OUTPUT_PATH \
--bits 4 # 4 for int4; 8 for int8
```
This step requires GPUs and may costs a few hours according to your data size and model size.
Then, copy all `*.py`, `*.cu`, `*.cpp` files and `generation_config.json` to the output path. And we recommend you to overwrite `config.json` by copying the file from the coresponding official quantized model
(for example, if you are fine-tuning `Qwen-7B-Chat` and use `--bits 4`, you can find the `config.json` from [Qwen-7B-Chat-Int4](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4/blob/main/config.json)).
You should also rename the ``gptq.safetensors`` into ``model.safetensors``.
Finally, test the model by the same method to load the official quantized model. For example,
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("/path/to/your/model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"/path/to/your/model",
device_map="auto",
trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
```
### Multinode Finetuning
Our provided scripts support multinode finetuning. You can refer to the comments in [script](./finetune/finetune_lora_ds.sh) to correctly set corresponding arguments and launch the script on each node. For more information about multinode distributed training, please refer to [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
Note: DeepSpeed ZeRO 3 requires much greater inter-node communication rate than ZeRO 2, which will significantly reduce the training speed in the case of multinode finetuning. Therefore, we do not recommend using DeepSpeed ZeRO 3 configurations in multinode finetuning scripts.
### Profiling of Memory and Speed
We profile the GPU memory and training speed of both LoRA (LoRA (emb) refers to training the embedding and output layer, while LoRA has no trainable embedding and output layer) and Q-LoRA in the setup of single-GPU training. In this test, we experiment on a single A100-SXM4-80G GPU, and we use CUDA 11.8 and Pytorch 2.0. Flash attention 2 is applied. We uniformly use a batch size of 1 and gradient accumulation of 8. We profile the memory (GB) and speed (s/iter) of inputs of different lengths, namely 256, 512, 1024, 2048, 4096, and 8192. We also report the statistics of full-parameter finetuning with Qwen-7B on 2 A100 GPUs. We only report the statistics of 256, 512, and 1024 tokens due to the limitation of GPU memory.
For Qwen-7B, we also test the performance of multinode finetuning. We experiment using two servers, each containing two A100-SXM4-80G GPUs, and the rest of configurations are the same as other Qwen-7B experiments. The results of multinode finetuning are marked as LoRA (multinode) in the table.
For Qwen-72B, we experiment in two ways: 1) Lora fintuning + DeepSpeed ZeRO 3 on 4 A100-SXM4-80G GPUs and 2) QLora (int4) fine-tuning on a single A100-SXM4-80G GPU. Note that OOM occurs on 4 A100-SXM4-80G GPUs both with LoRA (emb) fine-tuning and LoRA fine-tuning without Deepspeed ZeRO 3 (you can pass `--deepspeed finetune/ds_config_zero3.json` to [`finetune/finetune_lora_ds.sh`](finetune/finetune_lora_ds.sh) to enable DeepSpeed ZeRO 3).
The statistics are listed below:
<table>
<tr>
<th rowspan="2">Model Size</th><th rowspan="2">Method</th><th rowspan="2">#Nodes</th><th rowspan="2">#GPUs per node</th><th colspan="6" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">256</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th>
</tr>
<tr>
<th rowspan="4">1.8B</th><td>LoRA</td>
<td>1</td><td>1</td>
<td align="center">6.7G / 1.0s/it</td><td align="center">7.4G / 1.0s/it</td><td align="center">8.4G / 1.1s/it</td><td align="center">11.0G / 1.7s/it</td><td align="center">16.2G / 3.3s/it</td><td align="center">21.8G / 6.8s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td>
<td>1</td><td>1</td>
<td align="center">13.7G / 1.0s/it</td><td align="center">14.0G / 1.0s/it</td><td align="center">14.0G / 1.1s/it</td><td align="center">15.1G / 1.8s/it</td><td align="center">19.7G / 3.4s/it</td><td align="center">27.7G / 7.0s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">5.8G / 1.4s/it</td><td align="center">6.0G / 1.4s/it</td><td align="center">6.6G / 1.4s/it</td><td align="center">7.8G / 2.0s/it</td><td align="center">10.2G / 3.4s/it</td><td align="center">15.8G / 6.5s/it</td>
</tr>
<tr>
<td>Full-parameter</td>
<td>1</td><td>1</td>
<td align="center">43.5G / 2.1s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.3s/it</td><td align="center">47.1G / 2.8s/it</td><td align="center">48.3G / 5.6s/it</td>
</tr>
<tr>
<th rowspan="5">7B</th>
<td>LoRA</td>
<td>1</td><td>1</td>
<td align="center">20.1G / 1.2s/it</td><td align="center">20.4G / 1.5s/it</td><td align="center">21.5G / 2.8s/it</td><td align="center">23.8G / 5.2s/it</td><td align="center">29.7G / 10.1s/it</td><td align="center">36.6G / 21.3s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td>
<td>1</td><td>1</td>
<td align="center">33.7G / 1.4s/it</td><td align="center">34.1G / 1.6s/it</td><td align="center">35.2G / 2.9s/it</td><td align="center">35.1G / 5.3s/it</td><td align="center">39.2G / 10.3s/it</td><td align="center">48.5G / 21.7s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">11.5G / 3.0s/it</td><td align="center">11.5G / 3.0s/it</td><td align="center">12.3G / 3.5s/it</td><td align="center">13.9G / 7.0s/it</td><td align="center">16.9G / 11.6s/it</td><td align="center">23.5G / 22.3s/it</td>
</tr>
<tr>
<td>Full-parameter</td>
<td>1</td><td>2</td>
<td align="center">139.2G / 4.0s/it</td><td align="center">148.0G / 4.0s/it</td><td align="center">162.0G / 4.5s/it</td><td align="center">-</td><td align="center">-</td><td align="center">-</td>
</tr>
<tr>
<td>LoRA (multinode)</td>
<td>2</td><td>2</td>
<td align="center">74.7G / 2.09s/it</td><td align="center">77.6G / 3.16s/it</td><td align="center">84.9G / 5.17s/it</td><td align="center">95.1G / 9.25s/it</td><td align="center">121.1G / 18.1s/it</td><td align="center">155.5G / 37.4s/it</td>
</tr>
<tr>
<th rowspan="3">14B</th>
<td>LoRA</td>
<td>1</td><td>1</td>
<td align="center">34.6G / 1.6s/it</td><td align="center">35.1G / 2.4s/it</td><td align="center">35.3G / 4.4s/it</td><td align="center">37.4G / 8.4s/it</td><td align="center">42.5G / 17.0s/it</td><td align="center">55.2G / 36.0s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td>
<td>1</td><td>1</td>
<td align="center">51.2 / 1.7s/it</td><td align="center">51.1G / 2.6s/it</td><td align="center">51.5G / 4.6s/it</td><td align="center">54.1G / 8.6s/it</td><td align="center">56.8G / 17.2s/it</td><td align="center">67.7G / 36.3s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">18.7G / 5.3s/it</td><td align="center">18.4G / 6.3s/it</td><td align="center">18.9G / 8.2s/it</td><td align="center">19.9G / 11.8s/it</td><td align="center">23.0G / 20.1s/it</td><td align="center">27.9G / 38.3s/it</td>
</tr>
<tr>
<th rowspan="2">72B</th>
<td>LoRA + Deepspeed Zero3</td>
<td>1</td><td>4</td>
<td align="center">215.4G / 17.6s/it</td><td align="center">217.7G / 20.5s/it</td><td align="center">222.6G / 29.4s/it</td><td align="center">228.8G / 45.7s/it</td><td align="center">249.0G / 83.4s/it</td><td align="center">289.2G / 161.5s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">61.4G / 27.4s/it</td><td align="center">61.4G / 31.5s/it</td><td align="center">62.9G / 41.4s/it</td><td align="center">64.1G / 59.5s/it</td><td align="center">68.0G / 97.7s/it</td><td align="center">75.6G / 179.8s/it</td>
</tr>
</table>
<br>
## Deployment
### vLLM
For deployment and fast inference, we suggest using vLLM.
If you use **CUDA 12.1 and PyTorch 2.1**, you can directly use the following command to install vLLM.
```bash
pip install vllm
```
Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
#### vLLM + Transformer-like Wrapper
You can download the [wrapper codes](examples/vllm_wrapper.py) and execute the following commands for multiple rounds of dialogue interaction. (Note: It currently only supports the ``model.chat()`` method.)
```python
from vllm_wrapper import vLLMWrapper
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4', tensor_parallel_size=1, dtype="float16")
response, history = model.chat(query="你好", history=None)
print(response)
response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
response, history = model.chat(query="给这个故事起一个标题", history=history)
print(response)
```
#### vLLM + Web Demo / OpenAI-like API
You can use FastChat to lauch a web demo or an OpenAI API server. First, install FastChat:
```bash
pip install "fschat[model_worker,webui]"
```
To run Qwen with vLLM and FastChat, you need launch a controller by:
```bash
python -m fastchat.serve.controller
```
Then you can launch the model worker, which means loading your model for inference. For single GPU inference, you can directly run:
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype float16 # run int4 model
```
However, if you hope to run the model on multiple GPUs for faster inference or larger memory, you can use tensor parallelism supported by vLLM. Suppose you run the model on 4 GPUs, the command is shown below:
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype float16 # run int4 model
```
After launching your model worker, you can launch a:
* Web UI Demo
```bash
python -m fastchat.serve.gradio_web_server
```
* OpenAI API
```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```
However, if you find it difficult to use vLLM and FastChat, you can try our provided simplest methods to deploy a web demo, CLI demo, and API.
### Web UI
We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:
```
pip install -r requirements_web_demo.txt
```
Then run the command below and click on the generated link:
```bash
python web_demo.py
```
<p align="center">
<br>
<img src="assets/web_demo.gif" width="600" />
<br>
<p>
### CLI Demo
We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns model outputs in the streaming mode. Run the command below:
```bash
python cli_demo.py
```
<p align="center">
<br>
<img src="assets/cli_demo.gif" width="600" />
<br>
<p>
<br>
### API
We provide methods to deploy local API based on OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:
```bash
pip install fastapi uvicorn "openai<1.0" pydantic sse_starlette
```
Then run the command to deploy your API:
```bash
python openai_api.py
```
You can change your arguments, e.g., `-c` for checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you meet problems launching your API deployment, updating the packages to the latest version can probably solve them.
Using the API is also simple. See the example below:
```python
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"
# create a request activating streaming response
for chunk in openai.ChatCompletion.create(
model="Qwen",
messages=[
{"role": "user", "content": "你好"}
],
stream=True
# Specifying stop words in streaming output format is not yet supported and is under development.
):
if hasattr(chunk.choices[0].delta, "content"):
print(chunk.choices[0].delta.content, end="", flush=True)
# create a request not activating streaming response
response = openai.ChatCompletion.create(
model="Qwen",
messages=[
{"role": "user", "content": "你好"}
],
stream=False,
stop=[] # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
)
print(response.choices[0].message.content)
```
<p align="center">
<br>
<img src="assets/openai_api.gif" width="600" />
<br>
<p>
**Function calling** is also supported (but only when `stream=False` for the moment). See the [example usage](examples/function_call_examples.py) here.
<br><br>
## 🐳 Docker
To simplify the deployment process, we provide docker images with pre-built environments: [qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen). You only need to install the driver and download model files to launch demos, deploy OpenAI API, and finetune the model.
### Preparation
1. Install the correct version of Nvidia driver depending on the image to use:
- `qwenllm/qwen:cu117` (**recommend**): `>= 515.48.07`
- `qwenllm/qwen:cu114` (w/o flash-attention): `>= 470.82.01`
- `qwenllm/qwen:cu121`: `>= 530.30.02`
- `qwenllm/qwen:latest`: same as `qwenllm/qwen:cu117`
2. Install and configure [docker](https://docs.docker.com/engine/install/) and [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html):
```bash
# configure docker
sudo systemctl start docker
# test if docker is correctly installed
sudo docker run hello-world
# configure nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# test if nvidia-container-toolkit is correctly installed
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
3. Download model checkpoints and codes to your environment (see [here](#DownloadModel)).
### Deployment
Here we use Qwen-7B-Chat as an example. Before launching a web demo or API, you can setup the configuration as shown below:
```bash
IMAGE_NAME=qwenllm/qwen:cu117
PORT=8901
CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # Path to downloaded model checkpoints and codes
```
The following scripts can help you build:
* OpenAI API
```bash
bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* Web UI
```bash
bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* CLI Demo
```bash
bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH}
```
The commands above will automatically download the required image and launch a Web UI demo in background (the service will auto-restart). You can open `http://localhost:${PORT}` on the host to use the demo.
The demo is successfully launched if you see the following output:
```text
Successfully started web demo. Open '...' to try!
Run `docker logs ...` to check demo status.
Run `docker rm -f ...` to stop and remove the demo.
```
If you want to check the status of the demo, you can use `docker logs qwen` to display outputs.
You can use `docker rm -f qwen` to stop the service and remove the container.
### Finetuning
The method of finetuning using the pre-built Docker image is basically the same as [the above chapter](#Finetuning) (we have already installed dependencies in the image):
The following is an example of single-GPU LoRA:
```bash
IMAGE_NAME=qwenllm/qwen:cu117
CHECKPOINT_PATH=/path/to/Qwen-7B # Path to downloaded model checkpoints and codes
#CHECKPOINT_PATH=/path/to/Qwen-7B-Chat-Int4 # Path to downloaded model checkpoints and codes (Q-LoRA)
DATA_PATH=/path/to/data/root # Prepare finetune data at ${DATA_PATH}/example.json
OUTPUT_PATH=/path/to/output/checkpoint # Path to finetune outputs
# Use all host devices by default
DEVICE=all
# If you need to specify GPUs for training, set device as follow (NOTE: internal quotation marks cannot be omitted)
#DEVICE='"device=0,1,2,3"'
mkdir -p ${OUTPUT_PATH}
# Single-GPU LoRA finetuning
docker run --gpus ${DEVICE} --rm --name qwen \
--mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B \
--mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \
--mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \
--shm-size=2gb \
-it ${IMAGE_NAME} \
bash finetune/finetune_lora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B/ -d /data/shared/Qwen/data/example.json
```
To make a change to single-GPU Q-LoRA for example, you just need to modify the bash command inside `docker run`:
```bash
bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json
```
<br>
## 🔥 System Prompt
Qwen-1.8-Chat and Qwen-72B-Chat have been fully trained on diverse system prompts with multiple rounds of complex interactions, so that they can follow a variety of system prompts and realize model customization in context, further improving the scalability of Qwen-chat.
With System Prompt, Qwen-Chat can realize **roly playing**, **language style transfer**, **task setting**, and **behavior setting**.


For more information, please refer to the [example documentation](examples/system_prompt.md).
## Tool Usage
Qwen-Chat has been optimized for tool usage and function calling capabilities. Users can develop agents, LangChain applications, and even augment Qwen with a Python Code Interpreter.
We provide documentation on how to implement tool calls based on the principle of ReAct Prompting, please refer to [the ReAct example](examples/react_prompt.md). Based on this principle, we provide support for function calling in [openai_api.py](openai_api.py).
We have tested the model's tool calling capabilities on our open-source Chinese evaluation benchmark and found that Qwen-Chat consistently performs well:
<table>
<tr>
<th colspan="4" align="center">Chinese Tool-Use Benchmark (Version 20231206)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection (Acc.↑)</th><th align="center">Tool Input (Rouge-L↑)</th><th align="center">False Positive Error↓</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">98.0%</td><td align="center">0.953</td><td align="center">23.9%</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">74.5%</td><td align="center">0.807</td><td align="center">80.6%</td>
</tr>
<tr>
<td>Qwen-1_8B-Chat</td><td align="center">85.0%</td><td align="center">0.839</td><td align="center">27.6%</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">95.5%</td><td align="center">0.900</td><td align="center">11.6%</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">96.9%</td><td align="center">0.917</td><td align="center">5.6%</td>
</tr>
<tr>
<td>Qwen-72B-Chat</td><td align="center">98.2%</td><td align="center">0.927</td><td align="center">1.1%</td>
</tr>
</table>
To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this [link](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark).
We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:
<table>
<tr>
<th colspan="5" align="center">Code Interpreter Benchmark (Version 20231206)</th>
</tr>
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="3" align="center">Accuracy of Code Execution Results (%)</th>
<th colspan="1" align="center">Executable Rate of Code (%)</th>
</tr>
<tr>
<th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th><th align="center">General↑</th>
</tr>
<tr>
<td>GPT-4</td>
<td align="center">82.8</td>
<td align="center">66.7</td>
<td align="center">60.8</td>
<td align="center">82.8</td>
</tr>
<tr>
<td>GPT-3.5</td>
<td align="center">47.3</td>
<td align="center">33.3</td>
<td align="center">55.7</td>
<td align="center">74.1</td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">8.3</td>
<td align="center">1.2</td>
<td align="center">15.2</td>
<td align="center">48.3</td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">28.2</td>
<td align="center">15.5</td>
<td align="center">21.5</td>
<td align="center">74.1</td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">34.6</td>
<td align="center">10.7</td>
<td align="center">25.1</td>
<td align="center">65.5</td>
</tr>
<tr>
<td>ChatGLM3-6B</td>
<td align="center">54.2</td>
<td align="center">4.8</td>
<td align="center">15.2</td>
<td align="center">67.1</td>
</tr>
<tr>
<td>Qwen-1.8B-Chat</td>
<td align="center">25.6</td>
<td align="center">21.4</td>
<td align="center">22.8</td>
<td align="center">65.5</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">23.8</td>
<td align="center">38.0</td>
<td align="center">67.2</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">58.4</td>
<td align="center">31.0</td>
<td align="center">45.6</td>
<td align="center">65.5</td>
</tr>
<tr>
<td>Qwen-72B-Chat</td>
<td align="center">72.7</td>
<td align="center">41.7</td>
<td align="center">43.0</td>
<td align="center">82.8</td>
</tr>
</table>
<p align="center">
<br>
<img src="assets/code_interpreter_showcase_001.jpg" />
<br>
<p>
<br>
## Long-Context Understanding
To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length of Qwen-14B from 2K to over 8K tokens, and Qwen-1.8B/7B from 8K to 32K tokens.
For Qwen-72B, we adapt RoPE to longer contexts with a larger rotary base. Qwen-72B supports the max context length of 32K tokens.
We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen can reach outstanding performance in the scenario of long context. Results are demonstrated below:
<table>
<tr>
<th rowspan="2">Model</th><th colspan="6" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th><th align="center">32768</th>
</tr>
<tr>
<td>Qwen-7B (original)</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.56</td><td align="center">4.62</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.58</td><td align="center">3.49</td><td align="center">4.32</td><td align="center">-</td>
</tr>
<tr>
<tr>
<td>Qwen-1.8B</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.13</b></td><td align="center"><b>3.89</b></td><td align="center">17.42</td><td align="center">433.85</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.14</b></td><td align="center"><b>3.93</b></td><td align="center"><b>3.82</b></td><td align="center"><b>3.83</b></td>
</tr>
<tr>
<td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
</tr>
<tr>
<td>Qwen-14B</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center">22.79</td><td align="center">334.65</td><td align="center">3168.35</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-72B</td><td align="center"><b>-</b></td><td align="center"><b>-</b></td><td align="center">-</td><td align="center"><b>2.83</b></td><td align="center"><b>2.73</b></td><td align="center"><b>2.72</b></td>
</tr>
</tr>
</table>
Furthermore, to verify the ability of Qwen-72B-Chat on long text understanding, we tested it on [L-Eval](https://arxiv.org/abs/2307.11088) (closed-ended tasks). The results are as follows:
| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFcition |
|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 |
| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** |
We conducted the "needle in a haystack" experiment (the idea came from [@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393)) to test whether the model can retrieve information at different positions in the inputs of different lengths, the result is as follows:

The above results show that Qwen-72B-Chat can accurately retrieve information placed in various positions within an input length of 32k, proving its excellent long text understanding capabilities.
## Tokenizer
Our tokenizer based on tiktoken is different from other tokenizers, e.g., sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the [documentation](tokenization_note.md).
<br><br>
## Reproduction
For your reproduction of the model performance on benchmark datasets, we provide scripts for you to reproduce the results. Check [eval/EVALUATION.md](eval/EVALUATION.md) for more information. Note that the reproduction may lead to slight differences from our reported results.
<br><br>
## FAQ
If you meet problems, please refer to [FAQ](FAQ.md) and the issues first to search a solution before you launch a new issue.
<br><br>
## Citation
If you find our work helpful, feel free to give us a cite.
```
@article{qwen,
title={Qwen Technical Report},
author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
journal={arXiv preprint arXiv:2309.16609},
year={2023}
}
```
<br>
## License Agreement
The source code provided at <https://github.com/QwenLM/Qwen> is licensed under the [Apache 2.0 License](./LICENSE) that can be found at the root directory.
Researchers and developers are free to use the codes and model weights of both Qwen and Qwen-Chat. For their commercial use, please check the License Agreement accompanying each model.
- Qwen-72B, Qwen-14B, and Qwen-7B are licensed under the [Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT) that can be found at the corresponding HuggingFace and ModelScope repository. For commercial use, please fill out the form ([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat), [14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat), and [7B](https://dashscope.console.aliyun.com/openModelApply/qianwen)) to apply.
- Qwen-1.8B is licensed under the [Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT) that can be found at the corresponding HuggingFace and ModelScope repository. For commercial use, please contact us.
<br><br>
## Contact Us
If you are interested to leave a message to either our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to qianwen_opensource@alibabacloud.com.
================================================
FILE: README_CN.md
================================================
<p align="left">
中文</a>  |  <a href="README.md">English</a>  |  <a href="README_JA.md">日本語</a> |  <a href="README_FR.md">Français</a> |  <a href="README_ES.md">Español</a>
</p>
<br><br>
<p align="center">
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" width="400"/>
<p>
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>   |    📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a>    |   🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-72B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">WeChat (微信)</a>   |   <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>   |   <a href="https://dashscope.aliyun.com">API</a>   |   <a href="https://qianwen.aliyun.com">Web</a>   |   <a href="https://apps.apple.com/cn/app/%E9%80%9A%E4%B9%89%E5%8D%83%E9%97%AE/id6466733523">APP</a>
</p>
<br><br>
> [!Important]
> Qwen2已开,欢迎关注!看这里:[QwenLM/Qwen2](https://github.com/QwenLM/Qwen2)
>
> Qwen2模型代码和用法相比此前版本有较大不同,因此我们使用新的repo进行维护。此repo ([QwenLM/Qwen](https://github.com/QwenLM/Qwen)) 已停止主要更新维护。
> [!Warning]
> 请勿混用[Qwen](https://github.com/QwenLM/Qwen)和[Qwen2](https://github.com/QwenLM/Qwen2)代码,两者并不兼容。
<br>
| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 1.8B | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B">🤗</a> |
| 7B | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |
| 72B | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B">🤗</a> |
我们开源了**Qwen**(通义千问)系列工作,当前开源模型的参数规模为18亿(1.8B)、70亿(7B)、140亿(14B)和720亿(72B)。本次开源包括基础模型**Qwen**,即**Qwen-1.8B**、**Qwen-7B**、**Qwen-14B**、**Qwen-72B**,以及对话模型**Qwen-Chat**,即**Qwen-1.8B-Chat**、**Qwen-7B-Chat**、**Qwen-14B-Chat**和**Qwen-72B-Chat**。模型链接在表格中,请点击了解详情。同时,我们公开了我们的<b><a href="https://arxiv.org/abs/2309.16609">技术报告</a></b>,请点击上方论文链接查看。
当前基础模型已经稳定训练了大规模高质量且多样化的数据,覆盖多语言(当前以中文和英文为主),总量高达3万亿token。在相关基准评测中,Qwen系列模型拿出非常有竞争力的表现,显著超出同规模模型并紧追一系列最强的闭源模型。此外,我们利用SFT和RLHF技术实现对齐,从基座模型训练得到对话模型。Qwen-Chat具备聊天、文字创作、摘要、信息抽取、翻译等能力,同时还具备一定的代码生成和简单数学推理的能力。在此基础上,我们针对LLM对接外部系统等方面针对性地做了优化,当前具备较强的工具调用能力,以及最近备受关注的Code Interpreter的能力和扮演Agent的能力。我们将各个大小模型的特点列到了下表。
| 模型 | 开源日期 | 最大上下文长度 | System Prompt强化 | 预训练token数 | 微调(Q-Lora)最小GPU用量 | 生成2048个token的最小显存占用(Int4) | 工具调用 |
|:----------|:--------:|:-------:|:---------------:|:---------:|:-----------------:|:-------------------:|:----:|
| Qwen-1.8B | 23.11.30 | 32K | ✅ | 2.2T | 5.8GB | 2.9GB | ✅ |
| Qwen-7B | 23.08.03 | 32K | ❎ | 2.4T | 11.5GB | 8.2GB | ✅ |
| Qwen-14B | 23.09.25 | 8K | ❎ | 3.0T | 18.7GB | 13.0GB | ✅ |
| Qwen-72B | 23.11.30 | 32K | ✅ | 3.0T | 61.4GB | 48.9GB | ✅ |
在这个项目中,你可以了解到以下内容
* 快速上手Qwen-Chat教程,玩转大模型推理
* 量化模型相关细节,包括GPTQ和KV cache量化
* 推理性能数据,包括推理速度和显存占用
* 微调的教程,帮你实现全参数微调、LoRA以及Q-LoRA
* 部署教程,以vLLM和FastChat为例
* 搭建Demo的方法,包括WebUI和CLI Demo
* 搭建API的方法,我们提供的示例为OpenAI风格的API
* 更多关于Qwen在工具调用、Code Interpreter、Agent方面的内容
* 长序列理解能力及评测
* 使用协议
* ...
如果遇到问题,请优先考虑查询[FAQ](FAQ.md)。如仍未解决,随时提出issue(但建议使用英语或提供翻译,有助于帮助更多用户)。如果想帮助我们提升,欢迎提交Pull Requests!
想和我们一起讨论和聊天的话,赶紧加入我们的微信群和Discord server(入口见文档开头部分)!
<br><br>
## 新闻
* 2023.11.30 🔥 我们推出 **Qwen-72B** 和 **Qwen-72B-Chat**,它们在 3T tokens上进行训练,并支持 32k 上下文。同时也发布了 **Qwen-1.8B** 和 **Qwen-1.8B-Chat**。我们还增强了 Qwen-72B-Chat 和 Qwen-1.8B-Chat 的系统指令(System Prompt)功能,请参阅[示例文档](examples/system_prompt.md)。此外,我们还对**昇腾910**以及**海光DCU**实现了推理的支持,详情请查看`ascend-support`及`dcu-support`文件夹。
* 2023年10月17日 我们推出了Int8量化模型**Qwen-7B-Chat-Int8**和**Qwen-14B-Chat-Int8**。
* 2023年9月25日 在魔搭社区(ModelScope)和Hugging Face推出**Qwen-14B**和**Qwen-14B-Chat**模型,并开源 [qwen.cpp](https://github.com/QwenLM/qwen.cpp) 和 [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)。**Qwen-7B**和**Qwen-7B-Chat**的代码和模型也同步得到更新。**请使用最新的代码和模型!**
- 相比原版Qwen-7B,新版用了更多训练数据(从2.2T增加到2.4T tokens),序列长度从2048扩展至8192。整体中文能力以及代码能力均有所提升。
* 2023年9月12日 支持Qwen-7B和Qwen-7B-Chat的微调,其中包括全参数微调、LoRA以及Q-LoRA。
* 2023年8月21日 发布Qwen-7B-Chat的Int4量化模型,Qwen-7B-Chat-Int4。该模型显存占用低,推理速度相比半精度模型显著提升,在基准评测上效果损失较小。
* 2023年8月3日 在魔搭社区(ModelScope)和Hugging Face同步推出Qwen-7B和Qwen-7B-Chat模型。同时,我们发布了技术备忘录,介绍了相关的训练细节和模型表现。
<br>
## 评测表现
Qwen系列模型相比同规模模型均实现了效果的显著提升。我们评测的数据集包括MMLU、C-Eval、 GSM8K、 MATH、HumanEval、MBPP、BBH等数据集,考察的能力包括自然语言理解、知识、数学计算和推理、代码生成、逻辑推理等。Qwen-72B在所有任务上均超越了LLaMA2-70B的性能,同时在10项任务中的7项任务中超越GPT-3.5.
<p align="left">
<img src="assets/radar_72b.jpg" width="600"/>
<p>
<br>
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** |
对于以上所有对比模型,我们列出了其官方汇报结果与[OpenCompass](https://opencompass.org.cn/leaderboard-llm)结果之间的最佳分数。
更多的实验结果和细节请查看我们的技术备忘录。点击[这里](https://qianwen-res.oss-cn-beijing.aliyuncs.com/QWEN_TECHNICAL_REPORT.pdf)。
<br><br>
## 要求
* python 3.8及以上版本
* pytorch 1.12及以上版本,推荐2.0及以上版本
* transformers 4.32及以上版本
* 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
<br>
## 快速使用
我们提供简单的示例来说明如何利用🤖 ModelScope和🤗 Transformers快速使用Qwen-7B和Qwen-7B-Chat。
你可以使用我们预构建好的Docker镜像,省去大部分配置环境的操作,详情见[“使用预构建的docker镜像”](#-使用预构建的docker镜像)一节。
如不使用Docker,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库。
```bash
pip install -r requirements.txt
```
如果你的显卡支持fp16或bf16精度,我们还推荐安装[flash-attention](https://github.com/Dao-AILab/flash-attention)(**当前已支持flash attention 2**)来提高你的运行效率以及降低显存占用。(**flash-attention只是可选项,不安装也可正常运行该项目**)
```bash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# 下方安装可选,安装可能比较缓慢。
# pip install csrc/layer_norm
# 如果flash-attn版本高于2.1.1,下方无需安装。
# pip install csrc/rotary
```
接下来你可以开始使用Transformers或者ModelScope来使用我们的模型。
### 🤗 Transformers
如希望使用Qwen-chat进行推理,所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码,并指定正确的模型名称和路径,如`Qwen/Qwen-7B-Chat`和`Qwen/Qwen-14B-Chat`**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# 可选的模型包括: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# 使用CPU进行推理,需要约32GB内存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# 默认使用自动模式,根据设备自动选择精度
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
# 可指定不同的生成长度、top_p等相关超参
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 第一轮对话
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好!很高兴为你提供帮助。
# 第二轮对话
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
# 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
# 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
# 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
# 第三轮对话
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业:一个年轻人的成功之路》
```
运行Qwen同样非常简单。
<details>
<summary>运行Qwen</summary>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# 可选的模型包括: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# 使用CPU进行推理,需要约32GB内存
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# 默认使用自动模式,根据设备自动选择精度
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
# 可指定不同的生成长度、top_p等相关超参
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
```
</details>
<p id="DownloadModel">
若在使用上述代码时由于各种原因无法从 HuggingFace 拉取模型和代码,可以先从 ModelScope 下载模型及代码至本地,再从本地加载模型:
</p>
```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-7B')
# model_dir = snapshot_download('qwen/Qwen-7B-Chat')
# model_dir = snapshot_download('qwen/Qwen-14B')
model_dir = snapshot_download('qwen/Qwen-14B-Chat')
# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_dir,
device_map="auto",
trust_remote_code=True
).eval()
```
### 🤖 ModelScope
魔搭(ModelScope)是开源的模型即服务共享平台,为泛AI开发者提供灵活、易用、低成本的一站式模型服务产品。使用ModelScope同样非常简单,代码如下所示:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig
# 可选的模型包括: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "浙江的省会在哪里?", history=history)
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
print(response)
```
### Batch推理
千问支持batch批量推理。在开启flash-attention的状态下,使用batch推理可以约40%的提速。示例代码如下所示:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig
from qwen_generation_utils import make_context, decode_tokens, get_stop_words_ids
tokenizer = AutoTokenizer.from_pretrained(
'./',
pad_token='<|extra_0|>',
eos_token='<|endoftext|>',
padding_side='left',
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
'./',
pad_token_id=tokenizer.pad_token_id,
device_map="auto",
trust_remote_code=True
).eval()
model.generation_config = GenerationConfig.from_pretrained('./', pad_token_id=tokenizer.pad_token_id)
all_raw_text = ["我想听你说爱我。", "今天我想吃点啥,甜甜的,推荐下", "我马上迟到了,怎么做才能不迟到"]
batch_raw_text = []
for q in all_raw_text:
raw_text, _ = make_context(
tokenizer,
q,
system="You are a helpful assistant.",
max_window_size=model.generation_config.max_window_size,
chat_format=model.generation_config.chat_format,
)
batch_raw_text.append(raw_text)
batch_input_ids = tokenizer(batch_raw_text, padding='longest')
batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)
batch_out_ids = model.generate(
batch_input_ids,
return_dict_in_generate=False,
generation_config=model.generation_config
)
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
batch_response = [
decode_tokens(
batch_out_ids[i][padding_lens[i]:],
tokenizer,
raw_text_len=len(batch_raw_text[i]),
context_length=(batch_input_ids[i].size(0)-padding_lens[i]),
chat_format="chatml",
verbose=False,
errors='replace'
) for i in range(len(all_raw_text))
]
print(batch_response)
response, _ = model.chat(tokenizer, "我想听你说爱我。", history=None)
print(response)
response, _ = model.chat(tokenizer, "今天我想吃点啥,甜甜的,推荐下", history=None)
print(response)
response, _ = model.chat(tokenizer, "我马上迟到了,怎么做才能不迟到", history=None)
print(response)
```
### CPU
我们推荐你使用 [qwen.cpp](https://github.com/QwenLM/qwen.cpp) 来实现CPU部署和推理。qwen.cpp是Qwen和tiktoken的C++实现。你可以点击链接进入repo了解详情。
当然,直接在CPU上运行模型也是可以的,示例如下:
```python
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
```
但是,这样的推理效率大概率会非常低。
### 多GPU
如果你遇到显存不足的问题而希望使用多张GPU进行推理,可以使用上述的默认的使用方法读取模型。此前提供的脚本`utils.py`已停止维护。
尽管这个方法很简单,但它的效率相对较低。我们建议使用vLLM和FastChat并请阅读部署章节。
### x86 平台
在 酷睿™/至强® 可扩展处理器或 Arc™ GPU 上部署量化模型时,建议使用 [OpenVINO™ Toolkit](https://docs.openvino.ai/2023.3/gen_ai_guide.html) 以充分利用硬件,实现更好的推理性能。您可以安装并运行此[example notebook](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/254-llm-chatbot)。相关问题,您可在 [OpenVINO repo](https://github.com/openvinotoolkit/openvino_notebooks/issues)中提交。
### 阿里云灵积(DashScope)API服务
最简单的使用Qwen模型API服务的方法就是通过DashScope(阿里云灵积API模型服务)。我们提供了简单介绍说明使用方法。同时,我们还提供了自己部署OpenAI格式的API的方法。
DashScope是阿里云提供的大语言模型的API服务,目前支持Qwen。但请注意,目前提供服务的Qwen模型为内部模型,暂无更多具体细节对外透露。模型服务包括`qwen-turbo`、`qwen-plus`和`qwen-max`,`qwen-turbo`速度更快,`qwen-plus`效果更优,`qwen-max`是最新发布的千亿级通义千问2.0模型。详情请查看[文档](https://dashscope.aliyun.com)。
请首先前往[官网](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn)开通DashScope,获得API Key(AK)。建议通过环境变量设置AK:
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
随后安装相关代码包,点击[此处](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk)查看安装文档。如使用python,则直接通过pip安装:
```bash
pip install dashscope
```
如安装JAVA SDK,则通过如下命令安装:
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>the-latest-version</version>
</dependency>
```
最简单的使用方法就是通过messages调用,用法类似OpenAI API。示例如下:
```python
import random
from http import HTTPStatus
from dashscope import Generation
def call_with_messages():
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
gen = Generation()
response = gen.call(
Generation.Models.qwen_turbo,
messages=messages,
seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set
result_format='message', # set the result to be "message" format.
)
return response
if __name__ == '__main__':
response = call_with_messages()
if response.status_code == HTTPStatus.OK:
print(response)
else:
print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
```
更多用法请查看官方文档了解详情。
<br><br>
## 量化
### GPTQ
我们提供了基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化方案,并开源了Int4和Int8量化模型。量化模型的效果损失很小,但能显著降低显存占用并提升推理速度。
以下我们提供示例说明如何使用Int4量化模型。在开始使用前,请先保证满足要求(如torch 2.0及以上,transformers版本为4.32.0及以上,等等),并安装所需安装包:
```bash
pip install auto-gptq optimum
```
如安装`auto-gptq`遇到问题,我们建议您到官方[repo](https://github.com/PanQiWei/AutoGPTQ)搜索合适的wheel。
> 注意:预编译的`auto-gptq`版本对`torch`版本及其CUDA版本要求严格。同时,由于
> 其近期更新,你可能会遇到`transformers`、`optimum`或`peft`抛出的版本错误。
> 我们建议使用符合以下要求的最新版本:
> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
> - torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
随后即可使用和上述一致的用法调用量化模型:
```python
# 可选模型包括:"Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat-Int4",
device_map="auto",
trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
```
我们对BF16,Int8和Int4模型在基准评测上做了测试,发现量化模型效果损失较小,结果如下所示:
| Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
|----------------------|:----:|:-----------:|:-----:|:---------:|
| Qwen-1.8B-Chat (BF16)| 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8)| 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4)| 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
<br>
### KV cache量化
> 注意:由于Hugging Face的内部实现,本功能的支持文件`cache_autogptq_cuda_256.cpp`与`cache_autogptq_cuda_kernel_256.cu`可能没被下载。如需开启使用,请手动从相关位置下载,并放置到相应文件中。
在模型推理时,我们可以将中间结果key以及value的值量化后压缩存储,这样便可以在相同的卡上存储更多的key以及value,增加样本吞吐。
我们在`config.json`里提供了`use_cache_quantization`和`use_cache_kernel`两个参数来控制是否启用KV cache量化,具体使用方法如下:
```python
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat",
device_map="auto",
trust_remote_code=True,
use_cache_quantization=True,
use_cache_kernel=True,
use_flash_attn=False
)
```
注意:当前该功能不支持与flash attention同时开启,如果你开了KV cache量化的同时又开了flash attention(`use_flash_attn=True`, `use_cache_quantization=True`, `use_cache_kernel=True`),程序默认将关闭`use_flash_attn`。
效果方面,我们验证过Int8 KV Cache的使用对模型整体的精度指标基本无损。我们做了针对显存占用的性能测试。评测运行于单张A100-SXM4-80G GPU,模型默认使用BF16格式,默认生成1024个token,其中OOM表示内存不足。
开启了KV cache量化之后,模型在推理的时候可以开启更大的batch size (bs)。
| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
|--------------|:------:|:------:|:------:|:------:|:------:|:------:|
| No | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom |
| Yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |
开启了KV cache量化之后,模型在推理时可在生成更长的序列(sl,生成的token数)时,节约更多的显存。
| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|--------------|:------:|:-------:|:-------:|:-------:|:-------:|
| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |
开启KV cache量化后,模型在推理时会将原始存进`layer-past`的float格式的key/value转换成int8格式,同时存储量化部分的参数。
具体操作如下:
1. 将key/value进行量化操作
```
qv,scale,zero_point=quantize_cache_v(v)
```
2. 存入`layer_past`中:
量化格式的`layer-past`:
```
layer_past=((q_key,key_scale,key_zero_point),
(q_value,value_scale,value_zero_point))
```
原始格式的`layer-past`:
```
layer_past=(key,value)
```
如果需要将`layer-past`中存好的key,value直接取出使用,可以使用反量化操作将Int8格式的key/value转回float格式:
```
v=dequantize_cache_torch(qv,scale,zero_point)
```
<br>
### 推理性能
这一部分将介绍模型推理的速度和显存占用的相关数据。下文的性能测算使用 [此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py) 完成。
我们测算了BF16、Int8和Int4模型在生成2048个token时的平均推理速度(tokens/s)和显存使用。结果如下所示:
<table>
<tr>
<td>Model Size</td>
<td>Quantization</td>
<td>Speed (Tokens/s)</td>
<td>GPU Memory Usage</td>
</tr>
<tr>
<td rowspan="3">1.8B</td>
<td>BF16</td>
<td>54.09</td>
<td>4.23GB</td>
</tr>
<tr>
<td>Int8</td>
<td>55.56</td>
<td>3.48GB</td>
</tr>
<tr>
<td>Int4</td>
<td>71.07</td>
<td>2.91GB</td>
</tr>
<tr>
<td rowspan="3">7B</td>
<td>BF16</td>
<td>40.93</td>
<td>16.99GB</td>
</tr>
<tr>
<td>Int8</td>
<td>37.47</td>
<td>11.20GB</td>
</tr>
<tr>
<td>Int4</td>
<td>50.09</td>
<td>8.21GB</td>
</tr>
<tr>
<td rowspan="3">14B</td>
<td>BF16</td>
<td>32.22</td>
<td>30.15GB</td>
</tr>
<tr>
<td>Int8</td>
<td>29.28</td>
<td>18.81GB</td>
</tr>
<tr>
<td>Int4</td>
<td>38.72</td>
<td>13.01GB</td>
</tr>
<tr>
<td rowspan="3">72B</td>
<td>BF16</td>
<td>8.48</td>
<td>144.69GB (2xA100)</td>
</tr>
<tr>
<td>Int8</td>
<td>9.05</td>
<td>81.27GB (2xA100)</td>
</tr>
<tr>
<td>Int4</td>
<td>11.32</td>
<td>48.86GB</td>
</tr>
<tr>
<td>72B + vLLM</td>
<td>BF16</td>
<td>17.60</td>
<td>2xA100</td>
</tr>
</table>
评测运行于单张A100-SXM4-80G GPU(除非提到使用2xA100),使用PyTorch 2.0.1、CUDA 11.8和Flash-Attention2。(72B + vLLM 使用 PyTorch 2.1.0和Cuda 11.8.)推理速度是生成2048个token的速度均值。
注意:以上Int4/Int8模型生成速度使用autogptq库给出,当前``AutoModelForCausalLM.from_pretrained``载入的模型生成速度会慢大约20%。我们已经将该问题汇报给HuggingFace团队,若有解决方案将即时更新。
我们还测量了不同上下文长度、生成长度、Flash-Attention版本的推理速度和 GPU 内存使用情况。可以在 Hugging Face 或 ModelScope 上的相应的模型介绍页面找到结果。
## 微调
### 使用方法
我们提供了`finetune.py`这个脚本供用户实现在自己的数据上进行微调的功能,以接入下游任务。此外,我们还提供了shell脚本减少用户的工作量。这个脚本支持 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/) 。我们提供的shell脚本使用了DeepSpeed,因此建议您确保已经安装DeepSpeed和Peft(注意:DeepSpeed可能不兼容最新的pydantic版本,请确保`pydantic<2.0`)。你可以使用如下命令安装:
```bash
pip install "peft<0.8.0" deepspeed
```
首先,你需要准备你的训练数据。你需要将所有样本放到一个列表中并存入json文件中。每个样本对应一个字典,包含id和conversation,其中后者为一个列表。示例如下所示:
```json
[
{
"id": "identity_0",
"conversations": [
{
"from": "user",
"value": "你好"
},
{
"from": "assistant",
"value": "我是一个语言模型,我叫通义千问。"
}
]
}
]
```
准备好数据后,你可以使用我们提供的shell脚本实现微调。注意,你需要在脚本中指定你的数据的路径。
微调脚本能够帮你实现:
- 全参数微调
- LoRA
- Q-LoRA
全参数微调在训练过程中更新所有参数。你可以运行这个脚本开始训练:
```bash
# 分布式训练。由于显存限制将导致单卡训练失败,我们不提供单卡训练脚本。
bash finetune/finetune_ds.sh
```
尤其注意,你需要在脚本中指定正确的模型名称或路径、数据路径、以及模型输出的文件夹路径。在这个脚本中我们使用了DeepSpeed ZeRO 3。如果你想修改这个配置,可以删除掉`--deepspeed`这个输入或者自行根据需求修改DeepSpeed配置json文件。此外,我们支持混合精度训练,因此你可以设置`--bf16 True`或者`--fp16 True`。在使用fp16时,请使用DeepSpeed支持混合精度训练。经验上,如果你的机器支持bf16,我们建议使用bf16,这样可以和我们的预训练和对齐训练保持一致,这也是为什么我们把默认配置设为它的原因。
运行LoRA的方法类似全参数微调。但在开始前,请确保已经安装`peft`代码库。另外,记住要设置正确的模型、数据和输出路径。我们建议你为模型路径使用绝对路径。这是因为LoRA仅存储adapter部分参数,而adapter配置json文件记录了预训练模型的路径,用于读取预训练模型权重。同样,你可以设置bf16或者fp16。
```bash
# 单卡训练
bash finetune/finetune_lora_single_gpu.sh
# 分布式训练
bash finetune/finetune_lora_ds.sh
```
与全参数微调不同,LoRA ([论文](https://arxiv.org/abs/2106.09685)) 只更新adapter层的参数而无需更新原有语言模型的参数。这种方法允许用户用更低的显存开销来训练模型,也意味着更小的计算开销。
注意,如果你使用预训练模型进行LoRA微调,而非chat模型,模型的embedding和输出层的参数将被设为可训练的参数。这是因为预训练模型没有学习过ChatML格式中的特殊token,因此需要将这部分参数设为可训练才能让模型学会理解和预测这些token。这也意味着,假如你的训练引入新的特殊token,你需要通过代码中的`modules_to_save`将这些参数设为可训练的参数。此外,这部分训练参数的引入会影响ZeRO 3的使用,因此我们默认推荐使用ZeRO 2。当然,如果你不需要引入这部分训练参数,你可以通过替换DeepSpeed的配置文件来使用ZeRO 3。如果你想节省显存占用,可以考虑使用chat模型进行LoRA微调,显存占用将大幅度降低。下文的显存占用和训练速度的记录将详细介绍这部分细节。
如果你依然遇到显存不足的问题,可以考虑使用Q-LoRA ([论文](https://arxiv.org/abs/2305.14314)) 。该方法使用4比特量化模型以及paged attention等技术实现更小的显存开销。
注意:如你使用单卡Q-LoRA,你可能需要安装`mpi4py`。你可以通过`pip`或者`conda`来安装。
运行Q-LoRA你只需运行如下脚本:
```bash
# 单卡训练
bash finetune/finetune_qlora_single_gpu.sh
# 分布式训练
bash finetune/finetune_qlora_ds.sh
```
我们建议你使用我们提供的Int4量化模型进行训练,即Qwen-7B-Chat-Int4。请**不要使用**非量化模型!与全参数微调以及LoRA不同,Q-LoRA仅支持fp16。注意,由于我们发现torch amp支持的fp16混合精度训练存在问题,因此当前的单卡训练Q-LoRA必须使用DeepSpeed。此外,上述LoRA关于特殊token的问题在Q-LoRA依然存在。并且,Int4模型的参数无法被设为可训练的参数。所幸的是,我们只提供了Chat模型的Int4模型,因此你不用担心这个问题。但是,如果你执意要在Q-LoRA中引入新的特殊token,很抱歉,我们无法保证你能成功训练。
> 注意:由于Hugging Face的内部实现,模型在保存时,一些非Python文件未保存(例如`*.cpp`与`*.cu`),如需要支持相关功能,请手动复制有关文件。
与全参数微调不同,LoRA和Q-LoRA的训练只需存储adapter部分的参数。假如你需要使用LoRA训练后的模型,你需要使用如下方法。假设你使用Qwen-7B训练模型,你可以用如下代码读取模型:
```python
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
path_to_adapter, # path to the output directory
device_map="auto",
trust_remote_code=True
).eval()
```
> 注意: 如果`peft>=0.8.0`,加载模型同时会尝试加载tokenizer,但peft内部未相应设置`trust_remote_code=True`,导致`ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported.`要避过这一问题,你可以降级`peft<0.8.0`或将tokenizer相关文件移到其它文件夹。
如果你觉得这样一步到位的方式让你很不安心或者影响你接入下游应用,你可以选择先合并并存储模型(LoRA支持合并,Q-LoRA不支持),再用常规方式读取你的新模型,示例如下:
```python
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
path_to_adapter, # path to the output directory
device_map="auto",
trust_remote_code=True
).eval()
merged_model = model.merge_and_unload()
# max_shard_size and safe serialization are not necessary.
# They respectively work for sharding checkpoint and save the model to safetensors
merged_model.save_pretrained(new_model_directory, max_shard_size="2048MB", safe_serialization=True)
```
`new_model_directory`目录将包含合并后的模型参数与相关模型代码。请注意`*.cu`和`*.cpp`文件可能没被保存,请手动复制。另外,`merge_and_unload`仅保存模型,并未保存tokenizer,如有需要,请复制相关文件或使用以以下代码保存
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
path_to_adapter, # path to the output directory
trust_remote_code=True
)
tokenizer.save_pretrained(new_model_directory)
```
注意:分布式训练需要根据你的需求和机器指定正确的分布式训练超参数。此外,你需要根据你的数据、显存情况和训练速度预期,使用`--model_max_length`设定你的数据长度。
### 量化微调后模型
这一小节用于量化全参/LoRA微调后的模型。(注意:你不需要量化Q-LoRA模型因为它本身就是量化过的。)
如果你需要量化LoRA微调后的模型,请先根据上方说明去合并你的模型权重。
我们推荐使用[auto_gptq](https://github.com/PanQiWei/AutoGPTQ)去量化你的模型。
```bash
pip install auto-gptq optimum
```
注意: 当前AutoGPTQ有个bug,可以在该[issue](https://github.com/PanQiWei/AutoGPTQ/issues/370)查看。这里有个[修改PR](https://github.com/PanQiWei/AutoGPTQ/pull/495),你可以使用该分支从代码进行安装。
首先,准备校准集。你可以重用微调你的数据,或者按照微调相同的方式准备其他数据。
第二步,运行以下命令:
```bash
python run_gptq.py \
--model_name_or_path $YOUR_LORA_MODEL_PATH \
--data_path $DATA \
--out_path $OUTPUT_PATH \
--bits 4 # 4 for int4; 8 for int8
```
这一步需要使用GPU,根据你的校准集大小和模型大小,可能会消耗数个小时。
接下来, 将原模型中所有 `*.py`, `*.cu`, `*.cpp` 文件和 `generation_config.json` 文件复制到输出模型目录下。同时,使用官方对应版本的量化模型的 `config.json` 文件覆盖输出模型目录下的文件
(例如, 如果你微调了 `Qwen-7B-Chat`和`--bits 4`, 那么你可以从 [Qwen-7B-Chat-Int4](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4/blob/main/config.json) 仓库中找到对应的`config.json` )。
并且,你需要将 ``gptq.safetensors`` 重命名为 ``model.safetensors``。
最后,像官方量化模型一样测试你的模型。例如:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("/path/to/your/model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"/path/to/your/model",
device_map="auto",
trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
```
### 多机微调
我们提供的脚本支持多机微调,可以参考[脚本](./finetune/finetune_lora_ds.sh)中的注释,在每个节点上正确设置相应的参数并启动训练脚本。关于多机分布式训练的更多信息,请参考[torchrun](https://pytorch.org/docs/stable/elastic/run.html)。
注意: DeepSpeed ZeRO 3 对节点间通信速率的要求远大于 ZeRO 2,在多机微调的情况下会大幅降低训练速度。因此,我们不建议在多机微调的情况下使用 DeepSpeed ZeRO 3 配置。
### 显存占用及训练速度
下面记录7B和14B模型在单GPU使用LoRA(LoRA (emb)指的是embedding和输出层参与训练,而LoRA则不优化这部分参数)和QLoRA时处理不同长度输入的显存占用和训练速度的情况。本次评测运行于单张A100-SXM4-80G GPU,使用CUDA 11.8和Pytorch 2.0,并使用了flash attention 2。我们统一使用batch size为1,gradient accumulation为8的训练配置,记录输入长度分别为256、512、1024、2048、4096和8192的显存占用(GB)和训练速度(s/iter)。我们还使用2张A100测了Qwen-7B的全参数微调。受限于显存大小,我们仅测试了256、512和1024token的性能。
对于 Qwen-7B,我们额外测试了多机微调的性能。我们在两台服务器上运行评测,每台服务器包含两张A100-SXM4-80G GPU,其余配置与Qwen-7B的其他评测相同。多机微调的结果在表中以 LoRA (multinode) 标示。
对于 Qwen-72B,我们测试了两种方案:1)使用4个 A100-SXM4-80G GPUs,通过 Lora + DeepSpeed ZeRO 3 微调和2)使用单张A100-SXM4-80G GPU,通过 QLora (int4) 微调。请注意,使用 LoRA (emb) 微调和不带 DeepSpeed ZeRO 3 的 LoRA 微调在4个A100-SXM4-80G GPUs 上都会出现OOM(你可以通过将`--deepspeed finetune/ds_config_zero3.json`参数传给[`finetune/finetune_lora_ds.sh`](finetune/finetune_lora_ds.sh)来打开 DeepSpeed ZeRO 3 配置)。
具体数值如下所示:
<table>
<tr>
<th rowspan="2">Model Size</th><th rowspan="2">Method</th><th rowspan="2">#Nodes</th><th rowspan="2">#GPUs per node</th><th colspan="6" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">256</th><th align="center">512</th><th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th>
</tr>
<tr>
<th rowspan="4">1.8B</th><td>LoRA</td>
<td>1</td><td>1</td>
<td align="center">6.7G / 1.0s/it</td><td align="center">7.4G / 1.0s/it</td><td align="center">8.4G / 1.1s/it</td><td align="center">11.0G / 1.7s/it</td><td align="center">16.2G / 3.3s/it</td><td align="center">21.8G / 6.8s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td>
<td>1</td><td>1</td>
<td align="center">13.7G / 1.0s/it</td><td align="center">14.0G / 1.0s/it</td><td align="center">14.0G / 1.1s/it</td><td align="center">15.1G / 1.8s/it</td><td align="center">19.7G / 3.4s/it</td><td align="center">27.7G / 7.0s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">5.8G / 1.4s/it</td><td align="center">6.0G / 1.4s/it</td><td align="center">6.6G / 1.4s/it</td><td align="center">7.8G / 2.0s/it</td><td align="center">10.2G / 3.4s/it</td><td align="center">15.8G / 6.5s/it</td>
</tr>
<tr>
<td>Full-parameter</td>
<td>1</td><td>1</td>
<td align="center">43.5G / 2.1s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.2s/it</td><td align="center">43.5G / 2.3s/it</td><td align="center">47.1G / 2.8s/it</td><td align="center">48.3G / 5.6s/it</td>
</tr>
<tr>
<th rowspan="5">7B</th>
<td>LoRA</td>
<td>1</td><td>1</td>
<td align="center">20.1G / 1.2s/it</td><td align="center">20.4G / 1.5s/it</td><td align="center">21.5G / 2.8s/it</td><td align="center">23.8G / 5.2s/it</td><td align="center">29.7G / 10.1s/it</td><td align="center">36.6G / 21.3s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td>
<td>1</td><td>1</td>
<td align="center">33.7G / 1.4s/it</td><td align="center">34.1G / 1.6s/it</td><td align="center">35.2G / 2.9s/it</td><td align="center">35.1G / 5.3s/it</td><td align="center">39.2G / 10.3s/it</td><td align="center">48.5G / 21.7s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">11.5G / 3.0s/it</td><td align="center">11.5G / 3.0s/it</td><td align="center">12.3G / 3.5s/it</td><td align="center">13.9G / 7.0s/it</td><td align="center">16.9G / 11.6s/it</td><td align="center">23.5G / 22.3s/it</td>
</tr>
<tr>
<td>Full-parameter</td>
<td>1</td><td>2</td>
<td align="center">139.2G / 4.0s/it</td><td align="center">148.0G / 4.0s/it</td><td align="center">162.0G / 4.5s/it</td><td align="center">-</td><td align="center">-</td><td align="center">-</td>
</tr>
<tr>
<td>LoRA (multinode)</td>
<td>2</td><td>2</td>
<td align="center">74.7G / 2.09s/it</td><td align="center">77.6G / 3.16s/it</td><td align="center">84.9G / 5.17s/it</td><td align="center">95.1G / 9.25s/it</td><td align="center">121.1G / 18.1s/it</td><td align="center">155.5G / 37.4s/it</td>
</tr>
<tr>
<th rowspan="3">14B</th>
<td>LoRA</td>
<td>1</td><td>1</td>
<td align="center">34.6G / 1.6s/it</td><td align="center">35.1G / 2.4s/it</td><td align="center">35.3G / 4.4s/it</td><td align="center">37.4G / 8.4s/it</td><td align="center">42.5G / 17.0s/it</td><td align="center">55.2G / 36.0s/it</td>
</tr>
<tr>
<td>LoRA (emb)</td>
<td>1</td><td>1</td>
<td align="center">51.2 / 1.7s/it</td><td align="center">51.1G / 2.6s/it</td><td align="center">51.5G / 4.6s/it</td><td align="center">54.1G / 8.6s/it</td><td align="center">56.8G / 17.2s/it</td><td align="center">67.7G / 36.3s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">18.7G / 5.3s/it</td><td align="center">18.4G / 6.3s/it</td><td align="center">18.9G / 8.2s/it</td><td align="center">19.9G / 11.8s/it</td><td align="center">23.0G / 20.1s/it</td><td align="center">27.9G / 38.3s/it</td>
</tr>
<tr>
<th rowspan="2">72B</th>
<td>LoRA + Deepspeed Zero3</td>
<td>1</td><td>4</td>
<td align="center">215.4G / 17.6s/it</td><td align="center">217.7G / 20.5s/it</td><td align="center">222.6G / 29.4s/it</td><td align="center">228.8G / 45.7s/it</td><td align="center">249.0G / 83.4s/it</td><td align="center">289.2G / 161.5s/it</td>
</tr>
<tr>
<td>Q-LoRA</td>
<td>1</td><td>1</td>
<td align="center">61.4G / 27.4s/it</td><td align="center">61.4G / 31.5s/it</td><td align="center">62.9G / 41.4s/it</td><td align="center">64.1G / 59.5s/it</td><td align="center">68.0G / 97.7s/it</td><td align="center">75.6G / 179.8s/it</td>
</tr>
</table>
<br>
## 部署
### vLLM
如希望部署及加速推理,我们建议你使用vLLM。
如果你使用**CUDA 12.1和PyTorch 2.1**,可以直接使用以下命令安装vLLM。
```bash
pip install vllm
```
否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
#### vLLM + 类Transformer接口
请下载[接口封装代码](examples/vllm_wrapper.py)到当前文件夹,并执行以下命令进行多轮对话交互。(注意:该方法当前只支持``model.chat()``接口。)
```python
from vllm_wrapper import vLLMWrapper
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)
response, history = model.chat(query="你好", history=None)
print(response)
response, history = model.chat(query="给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
response, history = model.chat(query="给这个故事起一个标题", history=history)
print(response)
```
#### vLLM + 网页Demo / 类OpenAI API
你可以使用FastChat去搭建一个网页Demo或类OpenAI API服务器。首先,请安装FastChat:
```bash
pip install "fschat[model_worker,webui]"
```
使用vLLM和FastChat运行Qwen之前,首先启动一个controller:
```bash
python -m fastchat.serve.controller
```
然后启动model worker读取模型。如使用单卡推理,运行如下命令:
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype bfloat16
# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --dtype float16 # 运行int4模型
```
然而,如果你希望使用多GPU加速推理或者增大显存,你可以使用vLLM支持的模型并行机制。假设你需要在4张GPU上运行你的模型,命令如下所示:
```bash
python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype bfloat16
# python -m fastchat.serve.vllm_worker --model-path $model_path --trust-remote-code --tensor-parallel-size 4 --dtype float16 # 运行int4模型
```
启动model worker后,你可以启动一个:
* Web UI Demo
```bash
python -m fastchat.serve.gradio_web_server
```
* OpenAI API
使用OpenAI API前,请阅读我们的API章节配置好环境,然后运行如下命令:
```bash
python -m fastchat.serve.openai_api_server --host localhost --port 8000
```
然而,如果你觉得使用vLLM和FastChat比较困难,你也可以尝试以下我们提供的最简单的方式部署Web Demo、CLI Demo和OpenAI API。
<br>
### Web UI
我们提供了Web UI的demo供用户使用 (感谢 @wysaid 支持)。在开始前,确保已经安装如下代码库:
```bash
pip install -r requirements_web_demo.txt
```
随后运行如下命令,并点击生成链接:
```bash
python web_demo.py
```
<p align="center">
<br>
<img src="assets/web_demo.gif" width="600" />
<br>
<p>
### 交互式Demo
我们提供了一个简单的交互式Demo示例,请查看`cli_demo.py`。当前模型已经支持流式输出,用户可通过输入文字的方式和Qwen-7B-Chat交互,模型将流式输出返回结果。运行如下命令:
```bash
python cli_demo.py
```
<p align="center">
<br>
<img src="assets/cli_demo.gif" width="600" />
<br>
<p>
<br>
### API
我们提供了OpenAI API格式的本地API部署方法(感谢@hanpenggit)。在开始之前先安装必要的代码库:
```bash
pip install fastapi uvicorn "openai<1.0" pydantic sse_starlette
```
随后即可运行以下命令部署你的本地API:
```bash
python openai_api.py
```
你也可以修改参数,比如`-c`来修改模型名称或路径, `--cpu-only`改为CPU部署等等。如果部署出现问题,更新上述代码库往往可以解决大多数问题。
使用API同样非常简单,示例如下:
```python
import openai
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"
# 使用流式回复的请求
for chunk in openai.ChatCompletion.create(
model="Qwen",
messages=[
{"role": "user", "content": "你好"}
],
stream=True
# 流式输出的自定义stopwords功能尚未支持,正在开发中
):
if hasattr(chunk.choices[0].delta, "content"):
print(chunk.choices[0].delta.content, end="", flush=True)
# 不使用流式回复的请求
response = openai.ChatCompletion.create(
model="Qwen",
messages=[
{"role": "user", "content": "你好"}
],
stream=False,
stop=[] # 在此处添加自定义的stop words 例如ReAct prompting时需要增加: stop=["Observation:"]。
)
print(response.choices[0].message.content)
```
<p align="center">
<br>
<img src="assets/openai_api.gif" width="600" />
<br>
<p>
该接口也支持函数调用(**Function Calling**),但暂时仅限 `stream=False` 时能生效。用法见[函数调用示例](examples/function_call_examples.py)。
<br><br>
## 🐳 使用预构建的Docker镜像
为简化部署流程,我们提供了预配置好相应环境的Docker镜像:[qwenllm/qwen](https://hub.docker.com/r/qwenllm/qwen),只需安装驱动、下载模型文件即可启动Demo、部署OpenAI API以及进行微调。
### 准备操作
1. 根据需要使用的镜像版本,安装相应版本的Nvidia驱动:
- `qwenllm/qwen:cu117`(**推荐**):`>= 515.48.07`
- `qwenllm/qwen:cu114`(不支持flash-attention):`>= 470.82.01`
- `qwenllm/qwen:cu121`:`>= 530.30.02`
- `qwenllm/qwen:latest`:与`qwenllm/qwen:cu117`相同
2. 安装并配置[docker](https://docs.docker.com/engine/install/)和[nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html):
```bash
# 配置docker
sudo systemctl start docker
# 测试docker是否安装正确
sudo docker run hello-world
# 配置nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 测试nvidia-container-toolkit是否安装正确
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
3. 下载模型及代码至本地(参考[此处说明](#DownloadModel))
### 部署
下面我们以Qwen-7B-Chat为例。在启动Web Demo或者部署API前,请先参照下方代码完成配置工作:
```bash
IMAGE_NAME=qwenllm/qwen:cu117
PORT=8901
CHECKPOINT_PATH=/path/to/Qwen-7B-Chat # 下载到本地的模型及代码路径
```
如下脚本可以帮你部署:
* OpenAI API
```bash
bash docker/docker_openai_api.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* Web UI
```bash
bash docker/docker_web_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH} --port ${PORT}
```
* 交互式Demo
```bash
bash docker/docker_cli_demo.sh -i ${IMAGE_NAME} -c ${CHECKPOINT_PATH}
```
这些命令将自动下载所需镜像以及后台启动Web UI Demo。你可以打开`http://localhost:${PORT}` 来使用该Demo。
如果输出如下内容,则说明Demo启动成功:
```text
Successfully started web demo. Open '...' to try!
Run `docker logs ...` to check demo status.
Run `docker rm -f ...` to stop and remove the demo.
```
如果你想查看Demo的状态,你可以使用这个命令来展示输出结果:`docker logs qwen`。
你可以使用这个命令`docker rm -f qwen`来停止服务并删除容器。
### 微调
使用预配置好的Docker镜像进行微调的方法与[上一章](#微调)基本一致(我们已经在镜像中安装了相关依赖):
以下是一个单卡LoRA微调的示例:
```bash
IMAGE_NAME=qwenllm/qwen:cu117
CHECKPOINT_PATH=/path/to/Qwen-7B # 下载的模型和代码路径
#CHECKPOINT_PATH=/path/to/Qwen-7B-Chat-Int4 # 下载的模型和代码路径 (Q-LoRA)
DATA_PATH=/path/to/data/root # 准备微调数据放在 ${DATA_PATH}/example.json
OUTPUT_PATH=/path/to/output/checkpoint # 微调输出路径
# 默认使用主机所有GPU
DEVICE=all
# 如果需要指定用于训练的GPU,按照以下方式设置device(注意:内层的引号不可省略)
#DEVICE='"device=0,1,2,3"'
mkdir -p ${OUTPUT_PATH}
# 单卡LoRA微调
docker run --gpus ${DEVICE} --rm --name qwen \
--mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B \
--mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \
--mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \
--shm-size=2gb \
-it ${IMAGE_NAME} \
bash finetune/finetune_lora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B/ -d /data/shared/Qwen/data/example.json
```
如需修改为单卡Q-LoRA微调示例,只要修改`docker run`中的bash命令:
```bash
bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json
```
<br>
## 🔥 系统指令 (System Prompt)
Qwen-1.8-Chat 和 Qwen-72B-Chat 通义千问在多样且存在多轮复杂交互的系统指令上进行了充分训练,使模型可以跟随多样的系统指令,实现上下文(in-context)中的模型定制化,进一步提升了通义千问的可扩展性。
通过系统指令,Qwen-Chat能够实现**角色扮演**,**语言风格迁移**,**任务设定**,和**行为设定**等能力。


更多关于系统指令的介绍信息可以参考[示例文档](examples/system_prompt.md).
## 工具调用
Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以开发基于Qwen的Agent、LangChain应用、甚至Code Interpreter。
我们提供了文档说明如何根据ReAct Prompting的原理实现工具调用,请参见[ReAct示例](examples/react_prompt.md)。基于该原理,我们在 [openai_api.py](openai_api.py) 里提供了函数调用(Function Calling)的支持。
我们在已开源的中文[评测数据集](eval/EVALUATION.md)上测试模型的工具调用能力,并发现Qwen-Chat能够取得稳定的表现:
<table>
<tr>
<th colspan="4" align="center">中文工具调用评测基准(版本 20231206)</th>
</tr>
<tr>
<th align="center">Model</th><th align="center">Tool Selection (Acc.↑)</th><th align="center">Tool Input (Rouge-L↑)</th><th align="center">False Positive Error↓</th>
</tr>
<tr>
<td>GPT-4</td><td align="center">98.0%</td><td align="center">0.953</td><td align="center">23.9%</td>
</tr>
<tr>
<td>GPT-3.5</td><td align="center">74.5%</td><td align="center">0.807</td><td align="center">80.6%</td>
</tr>
<tr>
<td>Qwen-1_8B-Chat</td><td align="center">85.0%</td><td align="center">0.839</td><td align="center">27.6%</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td><td align="center">95.5%</td><td align="center">0.900</td><td align="center">11.6%</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td><td align="center">96.9%</td><td align="center">0.917</td><td align="center">5.6%</td>
</tr>
<tr>
<td>Qwen-72B-Chat</td><td align="center">98.2%</td><td align="center">0.927</td><td align="center">1.1%</td>
</tr>
</table>
为了考察Qwen使用Python Code Interpreter完成数学解题、数据可视化、及文件处理与爬虫等任务的能力,我们专门建设并开源了一个评测这方面能力的[评测基准](https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark)。
我们发现Qwen在生成代码的可执行率、结果正确性上均表现较好:
<table>
<tr>
<th colspan="5" align="center">Code Interpreter Benchmark (Version 20231206)</th>
</tr>
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="3" align="center">代码执行结果正确性 (%)</th>
<th colspan="1" align="center">生成代码的可执行率 (%)</th>
</tr>
<tr>
<th align="center">Math↑</th><th align="center">Visualization-Hard↑</th><th align="center">Visualization-Easy↑</th><th align="center">General↑</th>
</tr>
<tr>
<td>GPT-4</td>
<td align="center">82.8</td>
<td align="center">66.7</td>
<td align="center">60.8</td>
<td align="center">82.8</td>
</tr>
<tr>
<td>GPT-3.5</td>
<td align="center">47.3</td>
<td align="center">33.3</td>
<td align="center">55.7</td>
<td align="center">74.1</td>
</tr>
<tr>
<td>LLaMA2-13B-Chat</td>
<td align="center">8.3</td>
<td align="center">1.2</td>
<td align="center">15.2</td>
<td align="center">48.3</td>
</tr>
<tr>
<td>CodeLLaMA-13B-Instruct</td>
<td align="center">28.2</td>
<td align="center">15.5</td>
<td align="center">21.5</td>
<td align="center">74.1</td>
</tr>
<tr>
<td>InternLM-20B-Chat</td>
<td align="center">34.6</td>
<td align="center">10.7</td>
<td align="center">25.1</td>
<td align="center">65.5</td>
</tr>
<tr>
<td>ChatGLM3-6B</td>
<td align="center">54.2</td>
<td align="center">4.8</td>
<td align="center">15.2</td>
<td align="center">67.1</td>
</tr>
<tr>
<td>Qwen-1.8B-Chat</td>
<td align="center">25.6</td>
<td align="center">21.4</td>
<td align="center">22.8</td>
<td align="center">65.5</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td align="center">41.9</td>
<td align="center">23.8</td>
<td align="center">38.0</td>
<td align="center">67.2</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td align="center">58.4</td>
<td align="center">31.0</td>
<td align="center">45.6</td>
<td align="center">65.5</td>
</tr>
<tr>
<td>Qwen-72B-Chat</td>
<td align="center">72.7</td>
<td align="center">41.7</td>
<td align="center">43.0</td>
<td align="center">82.8</td>
</tr>
</table>
<p align="center">
<br>
<img src="assets/code_interpreter_showcase_001.jpg" />
<br>
<p>
<br>
## 长文本理解
我们引入了NTK插值、窗口注意力、LogN注意力缩放等技术来提升模型的上下文长度并突破训练序列长度的限制,原生长度为2K的Qwen-14B可以扩展到8K的序列长度,而原生长度8K的Qwen-1.8B/7B能够在32K长序列的设置下取得不错的表现。
对于Qwen-72B,我们基于RoPE采用更大的旋转Base来适应更长的上下文。Qwen-72B支持32K的上下文长度。
通过arXiv数据集上的语言模型实验,发现 Qwen 在长上下文场景下可以达到出色的性能。结果如下:
<table>
<tr>
<th rowspan="2">Model</th><th colspan="6" align="center">Sequence Length</th>
</tr>
<tr>
<th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th><th align="center">32768</th>
</tr>
<tr>
<td>Qwen-7B (original)</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk</td><td align="center">4.23</td><td align="center">3.78</td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-1.8B</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.13</b></td><td align="center"><b>3.89</b></td><td align="center">17.42</td><td align="center">433.85</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>5.00</b></td><td align="center"><b>4.48</b></td><td align="center"><b>4.14</b></td><td align="center"><b>3.93</b></td><td align="center"><b>3.82</b></td><td align="center"><b>3.83</b></td>
</tr>
<tr>
<td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center">7.27</td><td align="center">181.49</td>
</tr>
<tr>
<td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.31</b></td><td align="center"><b>3.23</b></td><td align="center">3.33</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.81</b></td><td align="center"><b>3.52</b></td><td align="center"><b>3.33</b></td><td align="center"><b>3.22</b></td><td align="center"><b>3.17</b></td>
</tr>
<tr>
<td>Qwen-14B</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center">22.79</td><td align="center">334.65</td><td align="center">3168.35</td><td align="center">-</td>
</tr>
<tr>
<td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>-</b></td><td align="center"><b>3.46</b></td><td align="center"><b>3.29</b></td><td align="center"><b>3.18</b></td><td align="center">3.42</td><td align="center">-</td>
</tr>
<tr>
<td>Qwen-72B</td><td align="center"><b>-</b></td><td align="center"><b>-</b></td><td align="center">-</td><td align="center"><b>2.83</b></td><td align="center"><b>2.73</b></td><td align="center"><b>2.72</b></td>
</tr>
</table>
进一步,我们为了验证Qwen-72B-Chat在长文本任务上的能力,在[L-Eval](https://arxiv.org/abs/2307.11088)客观题上进行了测试,评分结果如下:
| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFcition |
|:------------------|:------------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| ChatGPT-3.5-16k | 16K | 60.73 | **63.51** | **84.00** | 61.38 | 78.43 | **12.22** | 64.84 |
| **Qwen-72B-Chat** | 32K | **62.30** | 58.13 | 76.00 | **77.22** | **86.24** | 6.66 | **69.53** |
我们进一步进行了“大海捞针”实验(想法来自于[@Greg Kamradt](https://twitter.com/GregKamradt/status/1727018183608193393)),测试模型在不同长度的输入下,是否能检索到文章不同位置的信息,结果如下:

以上结果说明,Qwen-72B-Chat可以能准确检索到32K以内的输入长度中放在各种位置的信息,证明了其具有优秀的长文本处理能力。
## Tokenizer
> 注:作为术语的“tokenizer”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
基于tiktoken的tokenizer有别于其他分词器,比如sentencepiece tokenizer。尤其在微调阶段,需要特别注意特殊token的使用。关于tokenizer的更多信息,以及微调时涉及的相关使用,请参阅[文档](tokenization_note_zh.md)。
<br><br>
## 复现
我们提供了评测脚本以供复现我们的实验结果。注意,由于内部代码和开源代码存在少许差异,评测结果可能与汇报结果存在细微的结果不一致。请阅读[eval/EVALUATION.md](eval/EVALUATION.md)了解更多信息。
<br><br>
## FAQ
如遇到问题,敬请查阅[FAQ](FAQ_zh.md)以及issue区,如仍无法解决再提交issue。
<br><br>
## 引用
如果你觉得我们的工作对你有帮助,欢迎引用!
```
@article{qwen,
title={Qwen Technical Report},
author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
journal={arXiv preprint arXiv:2309.16609},
year={2023}
}
```
<br>
## 使用协议
<https://github.com/QwenLM/Qwen>中的源代码采用[Apache 2.0协议](./LICENSE)授权,您可在该仓库根目录找到协议全文。
研究人员与开发者可使用Qwen和Qwen-Chat或进行二次开发。对于商业使用,请查看模型各自的LICENSE。
- Qwen-72B、Qwen-14B和Qwen-7B采用[Tongyi Qianwen LICENSE AGREEMENT](./Tongyi%20Qianwen%20LICENSE%20AGREEMENT)授权,您可在相应模型的HuggingFace或ModelScope仓库找到协议原文。如需商用,您只需遵循使用协议进行商用即可,我们欢迎您填写问卷([72B](https://dashscope.console.aliyun.com/openModelApply/Qwen-72B-Chat)、[14B](https://dashscope.console.aliyun.com/openModelApply/Qwen-14B-Chat)、[7B](https://dashscope.console.aliyun.com/openModelApply/qianwen))。
- Qwen-1.8B采用[Tongyi Qianwen RESEARCH LICENSE AGREEMENT](./Tongyi%20Qianwen%20RESEARCH%20LICENSE%20AGREEMENT)授权,您可在相应模型的HuggingFace或ModelScope仓库找到协议原文。如需商用,请联系我们。
<br><br>
## 联系我们
如果你想给我们的研发团队和产品团队留言,欢迎加入我们的微信群和Discord server。当然也可以通过邮件(qianwen_opensource@alibabacloud.com)联系我们。
================================================
FILE: README_ES.md
================================================
<p align="left">
<a href="README_CN.md">中文</a>  |  <a href="README.md">English</a>  |  <a href="README_JA.md">日本語</a> |  <a href="README_FR.md">Français</a> |  Español
</p>
<br><br>
<p align="center">
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" width="400"/>
<p>
<br>
<p align="center">
🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>   |    📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a>    |   🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-72B-Chat-Demo/summary">Demo</a>
<br>
<a href="assets/wechat.png">WeChat (微信)</a>   |   <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>   |   <a href="https://dashscope.aliyun.com">API</a>
</p>
<br><br>
> [!Important]
> ¡Qwen2 está aquí! Estás invitado a seguir [QwenLM/Qwen2](https://github.com/QwenLM/Qwen2) y compartir tu experiencia allí.
>
> Este repositorio ([QwenLM/Qwen](https://github.com/QwenLM/Qwen)) ya no se mantiene activamente, debido a diferencias sustanciales en la base de código.
<br>
| | Qwen-Chat | Qwen-Chat (Int4) | Qwen-Chat (Int8) | Qwen |
|-----|:------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
| 1.8B | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-1_8B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-1_8B">🤗</a> |
| 7B | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
| 14B | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-14B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-14B">🤗</a> |
| 72B | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int4">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B-Chat-Int8">🤗</a> | <a href="https://modelscope.cn/models/qwen/Qwen-72B/summary">🤖</a> <a href="https://huggingface.co/Qwen/Qwen-72B">🤗</a> |
Abrimos nuestra serie **Qwen**, que ahora incluye **Qwen**, los modelos de lenguaje, es decir **Qwen-7B** y **Qwen-14B**, así como **Qwen-Chat**, los modelos de chat, es decir **Qwen-7B-Chat** y **Qwen-14B-Chat**. Los enlaces se encuentran en la tabla anterior. Haz clic en ellos y comprueba las fichas de los modelos. Además, publicamos el **[informe técnico](https://arxiv.org/abs/2309.16609)**. Haz clic en el enlace y compruébalo.
En resumen, disponemos de modelos lingüísticos sólidos, que han sido preentrenados de forma estable para hasta 3 billones de tokens de datos multilingües con una amplia cobertura de dominios, idiomas (con especial atención al chino y al inglés), etc. Son capaces de lograr un rendimiento competitivo en conjuntos de datos de referencia. Además, disponemos de modelos de chat alineados con las preferencias humanas basados en SFT y RLHF (aún no publicados), que son capaces de chatear, crear contenidos, extraer información, resumir, traducir, codificar, resolver problemas matemáticos, etc., y son capaces de utilizar herramientas, jugar como agentes o incluso jugar como intérpretes de código, etc.
| Modelo | Fecha de Publicación | Longitud Máx. | Mejora del Sistema de Avisos | # de Fichas Preentrenadas | Uso Mínimo de Memoria GPU de Finetuning (Q-Lora) | Uso Mínimo de la GPU para Generar 2048 Tokens (Int4) | Uso de Herramientas |
|:----------|:--------------------:|:-------------:|:----------------------------:|:-------------------------:|:------------------------------------------------:|:----------------------------------------------------:|:-------------------:|
| Qwen-1.8B | 23.11.30 | 32K | ✅ | 2.2T | 5.8GB | 2.9GB | ✅ |
| Qwen-7B | 23.08.03 | 32K | ❎ | 2.4T | 11.5GB | 8.2GB | ✅ |
| Qwen-14B | 23.09.25 | 8K | ❎ | 3.0T | 18.7GB | 13.0GB | ✅ |
| Qwen-72B | 23.11.30 | 32K | ✅ | 3.0T | 61.4GB | 48.9GB | ✅ |
En este repo, usted puede averiguar:
* Inicio rápido con Qwen, y disfrute de la simple inferencia.
* Detalles sobre los modelos de cuantificación, incluyendo GPTQ y cuantización de caché KV.
* Estadísticas de rendimiento de la inferencia, incluyendo velocidad y memoria.
* Tutoriales sobre ajuste fino, incluyendo ajuste de parámetros completos, LoRA y Q-LoRA.
* Instrucciones de despliegue, con el ejemplo de vLLM y FastChat.
* Instrucciones para construir demos, incluyendo WebUI, CLI demo, etc.
* Introducción al servicio API de DashScope, así como instrucciones para crear una API de estilo OpenAI para tu modelo.
* Información sobre Qwen para el uso de herramientas, agente e intérprete de código.
* Estadísticas de la evaluación de la comprensión del contexto largo
* Acuerdo de licencia
* ...
Además, si tienes problemas, consulta primero [FAQ](FAQ.md) para obtener ayuda. ¿Sigues teniendo problemas? No dudes en plantearnos tus problemas (mejor en inglés para que te entienda más gente). Si quieres ayudarnos, ¡envíanos pull requests sin dudarlo! ¡Siempre nos entusiasman los PR!
¿Quieres charlar con nosotros o quedar para tomar un café? ¡Bienvenido a nuestro Discord o WeChat!
<br><br>
## Noticias y Actualizaciones
* 2023.11.30 🔥 Lanzamos **Qwen-72B** y **Qwen-72B-Chat**, que están entrenados en tokens 3T y soportan 32k contextos, junto con **Qwen-1.8B**, y **Qwen-1.8B-Chat**, en ModelScope y Hugging Face. También hemos reforzado las capacidades de System Prompt de Qwen-72B-Chat y Qwen-1.8B-Chat, ver [documentación de ejemplo](examples/system_prompt.md). Adicionalmente, soporta la inferencia en **Ascend 910** y **Hygon DCU**. Consulta `ascend-support` y `dcu-support` para más detalles.
* 2023.10.17 Publicamos el modelo cuantizado Int8 **Qwen-7B-Chat-Int8** y **Qwen-14B-Chat-Int8**.
* 2023.9.25 Publicamos **Qwen-14B** y **Qwen-14B-Chat** en ModelScope y Hugging Face, junto con [qwen.cpp](https://github.com/QwenLM/qwen.cpp) y [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). También se actualizan los códigos y pesos de **Qwen-7B** y **Qwen-7B-Chat**. **POR FAVOR, DESCARGA LA ÚLTIMA VERSIÓN!**
- En comparación con **Qwen-7B** (original), **Qwen-7B** utiliza más tokens de entrenamiento, pasando de 2,2T tokens a 2,4T tokens, mientras que la longitud del contexto se amplía de 2048 a 8192. El conocimiento del chino y la capacidad de codificación de **Qwen-7B** se han mejorado aún más.
* 2023.9.12 Ahora es posible el ajuste fino de los modelos Qwen-7B, incluido el ajuste fino de parámetros completos, LoRA y Q-LoRA.
* 2023.8.21 Publicamos el modelo cuantizado Int4 para Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, que requiere bajos costes de memoria pero consigue mejorar la velocidad de inferencia. Además, no se produce una degradación significativa del rendimiento en la evaluación comparativa.
* 2023.8.3 Publicamos **Qwen-7B** y **Qwen-7B-Chat** en ModelScope y Hugging Face. También proporcionamos una nota técnica para más detalles sobre el modelo, incluidos los detalles de entrenamiento y el rendimiento del modelo.
<br>
## Rendimiento
Los modelos Qwen superan a los modelos de referencia de tamaños de modelo similares en una serie de conjuntos de datos de referencia, como MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., que evalúan las capacidades de los modelos en comprensión del lenguaje natural, resolución de problemas matemáticos, codificación, etc. Qwen-72B obtiene mejores resultados que LLaMA2-70B en todas las tareas y supera a GPT-3.5 en 7 de cada 10 tareas.
<p align="left">
<img src="assets/radar_72b.jpg" width=600px/>
<p>
<br>
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| **Qwen-1.8B** | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| **Qwen-7B** | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| **Qwen-14B** | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| **Qwen-72B** | **77.4** | **83.3** | **78.9** | **35.2** | **35.4** | **52.2** | **67.7** | **83.6** |
Para todos los modelos comparados, presentamos las mejores puntuaciones entre sus resultados oficiales y [OpenCompass](https://opencompass.org.cn/leaderboard-llm).
Para más resultados experimentales (rendimiento detallado del modelo en más conjuntos de datos de referencia) y detalles, consulte nuestro informe técnico haciendo clic [aquí](https://qianwen-res.oss-cn-beijing.aliyuncs.com/QWEN_TECHNICAL_REPORT.pdf).
<br><br>
## Requisitos
* python 3.8 y superior
* pytorch 1.12 y superior, se recomienda 2.0 y superior
* transformers 4.32 y superiores
* Se recomienda CUDA 11.4 y superior (esto es para usuarios de GPU, usuarios de flash-attention, etc.)
<br>
## Inicio rápido
A continuación, proporcionamos ejemplos sencillos para mostrar cómo utilizar Qwen-Chat con 🤖 ModelScope y 🤗 Transformers.
Puedes usar nuestras imágenes docker pre-construidas para saltarte la mayoría de los pasos de configuración del entorno, mira la Sección ["Usando Imágenes Docker Pre-construidas"](#-docker) para más detalles.
Si no utiliza Docker, asegúrese de haber configurado el entorno e instalado los paquetes necesarios. Asegúrese de que cumple los requisitos anteriores y, a continuación, instale las bibliotecas dependientes.
```bash
pip install -r requirements.txt
```
Si tu dispositivo soporta fp16 o bf16, te recomendamos instalar [flash-attention](https://github.com/Dao-AILab/flash-attention) (**ahora soportamos flash attention 2.**) para una mayor eficiencia y un menor uso de memoria. (**flash-attention es opcional y el proyecto puede ejecutarse normalmente sin instalarlo**)
```bash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# pip install csrc/rotary
```
Ahora puedes empezar con ModelScope o Transformers.
### 🤗 Transformers
Para utilizar Qwen-Chat para la inferencia, todo lo que tienes que hacer es introducir unas pocas líneas de código como se demuestra a continuación. Recuerda introducir los nombres o rutas correctos de los modelos, como "Qwen/Qwen-7B-Chat" y "Qwen/Qwen-14B-Chat". Sin embargo, **por favor, asegúrese de que está utilizando el código más reciente.**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat",
device_map="auto",
trust_remote_code=True
).eval()
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好!很高兴为你提供帮助。
# 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
# 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
# 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
# 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业:一个年轻人的成功之路》
```
Ejecutar Qwen, el modelo lingüístico base, también es sencillo.
<details>
<summary>Ejecutar Qwen</summary>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B",
device_map="auto",
trust_remote_code=True
).eval()
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
```
</details>
En caso de que se produzca un problema de red al intentar descargar puntos de control y códigos de modelos desde Hugging Face, un método alternativo consiste en obtener inicialmente el punto de control desde ModelScope y luego cargarlo desde el directorio local como se indica a continuación:
```python
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
# Downloading model checkpoint to a local dir model_dir
# model_dir = snapshot_download('qwen/Qwen-7B', revision='v1.1.4')
# model_dir = snapshot_download('qwen/Qwen-7B-Chat', revision='v1.1.4')
# model_dir = snapshot_download('qwen/Qwen-14B', revision='v1.0.4')
model_dir = snapshot_download('qwen/Qwen-14B-Chat', revision='v1.0.4')
# Loading local checkpoints
# trust_remote_code is still set as True since we still load codes from local dir instead of transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_dir,
device_map="auto",
trust_remote_code=True
).eval()
```
### 🤖 ModelScope
ModelScope es una plataforma de código abierto para Model-as-a-Service (MaaS), que proporciona un servicio de modelos flexible y rentable a los desarrolladores de IA. Del mismo modo, puede ejecutar los modelos con ModelScope como se muestra a continuación:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig
# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "浙江的省会在哪里?", history=history)
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)
print(response)
```
### Inferencia por lotes
Qwen admite la inferencia por lotes. Con la atención flash activada, el uso de la inferencia por lotes puede suponer un aumento de velocidad del 40%. El código de ejemplo se muestra a continuación:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig
from qwen_generation_utils import make_context, decode_tokens, get_stop_words_ids
tokenizer = AutoTokenizer.from_pretrained(
'./',
pad_token='<|extra_0|>',
eos_token='<|endoftext|>',
padding_side='left',
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
'./',
pad_token_id=tokenizer.pad_token_id,
device_map="auto",
trust_remote_code=True
).eval()
model.generation_config = GenerationConfig.from_pretrained('./', pad_token_id=tokenizer.pad_token_id)
all_raw_text = ["我想听你说爱我。", "今天我想吃点啥,甜甜的,推荐下", "我马上迟到了,怎么做才能不迟到"]
batch_raw_text = []
for q in all_raw_text:
raw_text, _ = make_context(
tokenizer,
q,
system="You are a helpful assistant.",
max_window_size=model.generation_config.max_window_size,
chat_format=model.generation_config.chat_format,
)
batch_raw_text.append(raw_text)
batch_input_ids = tokenizer(batch_raw_text, padding='longest')
batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)
batch_out_ids = model.generate(
batch_input_ids,
return_dict_in_generate=False,
generation_config=model.generation_config
)
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
batch_response = [
decode_tokens(
batch_out_ids[i][padding_lens[i]:],
tokenizer,
raw_text_len=len(batch_raw_text[i]),
context_length=(batch_input_ids[i].size(0)-padding_lens[i]),
chat_format="chatml",
verbose=False,
errors='replace'
) for i in range(len(all_raw_text))
]
print(batch_response)
response, _ = model.chat(tokenizer, "我想听你说爱我。", history=None)
print(response)
response, _ = model.chat(tokenizer, "今天我想吃点啥,甜甜的,推荐下", history=None)
print(response)
response, _ = model.chat(tokenizer, "我马上迟到了,怎么做才能不迟到", history=None)
print(response)
```
### CPU
Para desplegar nuestros modelos en la CPU, le recomendamos encarecidamente que utilice [qwen.cpp](https://github.com/QwenLM/qwen.cpp), que es una implementación C++ pura de Qwen y tiktoken. Comprueba el repositorio para más detalles.
Además, también es sencillo ejecutar directamente el modelo en la CPU, lo que requiere que especifiques el dispositivo:
```python
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
```
Pero es probable que sufra una eficacia de inferencia extremadamente baja.
### Múltiples GPU
Si sufres de falta de memoria en la GPU y quieres ejecutar el modelo en más de 1 GPU, puedes utilizar directamente el método de carga por defecto, que ahora es soportado por Transformers. El método anterior basado en `utils.py` está obsoleto.
Sin embargo, aunque este método es sencillo, la eficiencia del paralelismo del pipeline nativo es baja. Le aconsejamos que utilice vLLM con FastChat y por favor lea la sección para el despliegue.
### DashScope
La forma más sencilla de utilizar Qwen a través de APIs es el servicio DashScope API a través de Alibaba Cloud. Damos una introducción al uso. Además, proporcionamos un script para que despliegues una API estilo OpenAI en tus propios servidores.
DashScope es el gran servicio de API de modelos lingüísticos proporcionado por Alibaba Cloud, que ahora es compatible con Qwen. Tenga en cuenta que los modelos detrás de DashScope son versiones internas temporalmente sin detalles proporcionados. Los servicios incluyen `qwen-turbo` y `qwen-plus`, donde el primero se ejecuta más rápido y el segundo consigue un mejor rendimiento. Para más información, visita la documentación [aquí](https://dashscope.aliyun.com).
Dirígete al sitio web oficial [enlace](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.6c2774fahtfXdn) para crear una cuenta DashScope y obtener la clave API (AK). Recomendamos configurar la AK con una variable de entorno:
```bash
export DASHSCOPE_API_KEY="YOUR_DASHSCOPE_API_KEY"
```
A continuación, instala los paquetes y haz clic [aquí](https://help.aliyun.com/zh/dashscope/developer-reference/install-dashscope-sdk) para consultar la documentación. Si utilizas Python, puedes instalar DashScope con pip:
```bash
pip install dashscope
```
Si utiliza JAVA SDK, puede instalarlo de esta forma:
```xml
<!-- https://mvnrepository.com/artifact/com.alibaba/dashscope-sdk-java -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>the-latest-version</version>
</dependency>
```
La forma más sencilla de utilizar DashScope es el uso con mensajes, que es similar a la API OpenAI. El ejemplo se muestra a continuación:
```python
import random
from http import HTTPStatus
from dashscope import Generation
def call_with_messages():
messages = [{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': '如何做西红柿鸡蛋?'}]
gen = Generation()
response = gen.call(
Generation.Models.qwen_turbo,
messages=messages,
seed=random.randint(1, 10000), # set the random seed, optional, default to 1234 if not set
result_format='message', # set the result to be "message" format.
)
return response
if __name__ == '__main__':
response = call_with_messages()
if response.status_code == HTTPStatus.OK:
print(response)
else:
print('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
```
Para más usos, visite el sitio web oficial.
<br><br>
## Cuantización
### GPTQ
Proporcionamos una solución basada en [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), y liberamos los modelos cuantificados Int4 e Int8, que consiguen efectos de modelo casi sin pérdidas pero un rendimiento mejorado tanto en costes de memoria como en velocidad de inferencia.
Aquí demostramos cómo utilizar los modelos cuantizados que proporcionamos para la inferencia. Antes de empezar, asegúrese de que cumple los requisitos de auto-gptq (por ejemplo, torch 2.0 y superior, transformers 4.32.0 y superior, etc.) e instale los paquetes necesarios:
```bash
pip install auto-gptq optimum
```
Si tiene problemas para instalar `auto-gptq`, le aconsejamos que consulte el [repo] oficial (https://github.com/PanQiWei/AutoGPTQ) para encontrar una rueda.
> Nota: Los paquetes `auto-gptq` precompilados dependen en gran medida de la versión de `torch` y de su versión CUDA. Además, debido a la reciente actualización
> también puede encontrar errores de versión no soportada de `transformers`, `optimum`, o `peft`.
> Recomendamos utilizar las últimas versiones que cumplan los siguientes requisitos:
> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
> - antorcha>=2.0,<2.1 auto-gptq<0.5.0 transformadores<4.35.0 óptimo<1.14.0 peft>=0.5.0,<0.6.0
A continuación, puede cargar el modelo cuantizado fácilmente y ejecutar la inferencia como de costumbre:
```python
# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-7B-Chat-Int4",
device_map="auto",
t
gitextract_zv37hsfa/ ├── .dockerignore ├── .github/ │ ├── ISSUE_TEMPLATE/ │ │ ├── bug_report.yaml │ │ ├── config.yaml │ │ └── feature_request.yaml │ └── workflows/ │ └── stale.yml ├── .gitignore ├── FAQ.md ├── FAQ_ja.md ├── FAQ_zh.md ├── LICENSE ├── NOTICE ├── README.md ├── README_CN.md ├── README_ES.md ├── README_FR.md ├── README_JA.md ├── Tongyi Qianwen LICENSE AGREEMENT ├── Tongyi Qianwen RESEARCH LICENSE AGREEMENT ├── ascend-support/ │ ├── README.md │ └── docker_qwen.sh ├── cli_demo.py ├── dcu-support/ │ ├── README.md │ ├── cli_demo.py │ ├── cli_demo_batch.py │ ├── model.properties │ ├── package/ │ │ ├── fastllm_pytools/ │ │ │ ├── __init__.py │ │ │ ├── hf_model.py │ │ │ ├── llm.py │ │ │ └── torch2flm.py │ │ └── setup.py │ ├── qwen2flm.py │ ├── requirements.txt │ └── web_demo.py ├── docker/ │ ├── Dockerfile │ ├── Dockerfile-cu114 │ ├── Dockerfile-cu121 │ ├── docker_cli_demo.sh │ ├── docker_openai_api.sh │ └── docker_web_demo.sh ├── eval/ │ ├── EVALUATION.md │ ├── evaluate_ceval.py │ ├── evaluate_chat_ceval.py │ ├── evaluate_chat_gsm8k.py │ ├── evaluate_chat_humaneval.py │ ├── evaluate_chat_mmlu.py │ ├── evaluate_cmmlu.py │ ├── evaluate_gsm8k.py │ ├── evaluate_humaneval.py │ ├── evaluate_mmlu.py │ ├── evaluate_plugin.py │ └── gsm8k_prompt.txt ├── examples/ │ ├── add_merges.py │ ├── auto_comments.md │ ├── auto_comments.py │ ├── function_call_examples.py │ ├── function_call_finetune_examples.py │ ├── langchain_tooluse.ipynb │ ├── qwen_extra.tiktoken │ ├── qwen_extra_vocab.txt │ ├── react_demo.py │ ├── react_prompt.md │ ├── system_prompt.md │ ├── tokenizer_showcase.ipynb │ ├── transformers_agent.md │ └── vllm_wrapper.py ├── finetune/ │ ├── ds_config_zero2.json │ ├── ds_config_zero3.json │ ├── finetune_ds.sh │ ├── finetune_lora_ds.sh │ ├── finetune_lora_single_gpu.sh │ ├── finetune_qlora_ds.sh │ └── finetune_qlora_single_gpu.sh ├── finetune.py ├── openai_api.py ├── recipes/ │ ├── applications/ │ │ ├── chatbot/ │ │ │ └── qwen_chatbot.ipynb │ │ ├── domain_finetune/ │ │ │ └── qwen_domain_finetune.ipynb │ │ └── retrieval/ │ │ └── retrieval.ipynb │ ├── finetune/ │ │ ├── ascend/ │ │ │ └── README.md │ │ ├── deepspeed/ │ │ │ ├── finetune_fullparameter_multi_gpu.ipynb │ │ │ ├── finetune_fullparameter_single_gpu.ipynb │ │ │ ├── finetune_lora_multi_gpu.ipynb │ │ │ ├── finetune_lora_single_gpu.ipynb │ │ │ ├── finetune_qlora_multi_gpu.ipynb │ │ │ ├── finetune_qlora_single_gpu.ipynb │ │ │ ├── readme.md │ │ │ └── requirements.txt │ │ └── swift/ │ │ ├── README.md │ │ └── README_CN.md │ ├── inference/ │ │ ├── dashscope/ │ │ │ └── README.md │ │ ├── hf_modelscope/ │ │ │ └── README.md │ │ ├── quantization/ │ │ │ └── README.md │ │ ├── tensorrt/ │ │ │ ├── README.md │ │ │ └── docker/ │ │ │ └── Dockerfile │ │ └── vllm/ │ │ ├── README.md │ │ ├── template_chatml.jinja │ │ └── vllm_wrapper.py │ ├── quickstart/ │ │ └── qwen.ipynb │ └── tests/ │ ├── README.md │ ├── __init__.py │ ├── assets/ │ │ └── test_sampled_qwen.json │ ├── test_finetune/ │ │ └── test_finetune_ds.py │ ├── test_inference/ │ │ ├── test_inference_api.py │ │ └── test_inference_vllm_fschat.py │ ├── ut_config.py │ └── utils.py ├── requirements.txt ├── requirements_web_demo.txt ├── run_gptq.py ├── tech_memo.md ├── tokenization_note.md ├── tokenization_note_ja.md ├── tokenization_note_zh.md ├── utils.py └── web_demo.py
SYMBOL INDEX (218 symbols across 33 files)
FILE: cli_demo.py
function _load_model_tokenizer (line 44) | def _load_model_tokenizer(args):
function _gc (line 68) | def _gc():
function _clear_screen (line 75) | def _clear_screen():
function _print_history (line 82) | def _print_history(history):
function _get_input (line 91) | def _get_input() -> str:
function main (line 105) | def main():
FILE: dcu-support/cli_demo.py
function args_parser (line 5) | def args_parser():
FILE: dcu-support/cli_demo_batch.py
function args_parser (line 5) | def args_parser():
FILE: dcu-support/package/fastllm_pytools/hf_model.py
function create (line 17) | def create(model,
FILE: dcu-support/package/fastllm_pytools/llm.py
function set_cpu_threads (line 70) | def set_cpu_threads(threads: int):
function get_cpu_threads (line 73) | def get_cpu_threads() -> int:
function print_ins_info (line 76) | def print_ins_info():
function set_cpu_kvcache (line 79) | def set_cpu_kvcache(cpu_kvcache):
function get_cpu_kvcache (line 82) | def get_cpu_kvcache():
function set_cpu_low_mem (line 85) | def set_cpu_low_mem(low_mem):
function get_cpu_low_mem (line 88) | def get_cpu_low_mem():
function set_device_map (line 91) | def set_device_map(device_map):
function from_hf (line 112) | def from_hf(model,
class model (line 118) | class model:
method __init__ (line 119) | def __init__ (self, path : str,
method get_prompt (line 140) | def get_prompt(self,
method save (line 151) | def save(self, path : str):
method eval (line 154) | def eval(self):
method build_tokenizer_decode_token_cache (line 157) | def build_tokenizer_decode_token_cache(self):
method tokenizer_encode_string (line 168) | def tokenizer_encode_string(self, content: str) -> List[int]:
method tokenizer_decode_token (line 191) | def tokenizer_decode_token(self, token_id: int) -> bytes:
method response_logits (line 219) | def response_logits(self,
method response (line 241) | def response(self,
method stream_response (line 257) | def stream_response(self,
method stream_response_raw (line 290) | def stream_response_raw(self,
method chat (line 317) | def chat(self, tokenizer, query: str, history: List[Tuple[str, str]] =...
method stream_chat (line 358) | def stream_chat(self, tokenizer, query: str, history: List[Tuple[str, ...
method set_adapter (line 406) | def set_adapter(self, name: str):
method disable_adapter (line 409) | def disable_adapter(self):
method process_chatglm3_response (line 412) | def process_chatglm3_response(self, output, history):
method build_chatglm3_input (line 433) | def build_chatglm3_input(self, tokenizer, query, history=None, role="u...
method response_batch (line 446) | def response_batch(self, querys: List[str],
method chat_batch (line 468) | def chat_batch(self, tokenizer, querys: List[str], historys: List[List...
FILE: dcu-support/package/fastllm_pytools/torch2flm.py
function writeString (line 5) | def writeString(fo, s):
function writeKeyValue (line 9) | def writeKeyValue(fo, key, value):
function write_int8 (line 30) | def write_int8(fo, v):
function write_int4 (line 41) | def write_int4(fo, v):
function tofile (line 64) | def tofile(exportPath,
FILE: dcu-support/web_demo.py
function get_model (line 12) | def get_model():
FILE: eval/evaluate_ceval.py
function load_models_tokenizer (line 21) | def load_models_tokenizer(args):
function format_example (line 43) | def format_example(line, include_answer=True):
function generate_few_shot_prompt (line 55) | def generate_few_shot_prompt(k, subject, dev_df):
function get_logits (line 67) | def get_logits(tokenizer, model, inputs: List[str]):
function eval_subject (line 80) | def eval_subject(
function cal_ceval (line 158) | def cal_ceval(res):
function main (line 384) | def main(args):
FILE: eval/evaluate_chat_ceval.py
function load_models_tokenizer (line 23) | def load_models_tokenizer(args):
function process_before_extraction (line 37) | def process_before_extraction(gen, question, choice_dict):
function count_substr (line 64) | def count_substr(gen, pattern):
function extract_choice (line 68) | def extract_choice(gen, prompt, choice_list):
function format_example (line 95) | def format_example(line):
function extract_answer (line 102) | def extract_answer(response, row):
function eval_subject (line 114) | def eval_subject(
function cal_ceval (line 179) | def cal_ceval(res):
function main (line 405) | def main(args):
FILE: eval/evaluate_chat_gsm8k.py
function doc_to_text (line 20) | def doc_to_text(doc, use_fewshot):
function generate_sample (line 37) | def generate_sample(model, tokenizer, question):
function extract_answer (line 49) | def extract_answer(s):
function is_correct (line 62) | def is_correct(completion, answer):
FILE: eval/evaluate_chat_humaneval.py
function extract_code (line 22) | def extract_code(text, entry_point):
function generate_sample (line 46) | def generate_sample(model, tokenizer, question, entry_point):
FILE: eval/evaluate_chat_mmlu.py
function load_models_tokenizer (line 23) | def load_models_tokenizer(args):
function format_example (line 42) | def format_example(line):
function process_before_extraction (line 53) | def process_before_extraction(gen, choice_dict):
function extract_choice (line 62) | def extract_choice(gen, choice_list):
function extract_answer (line 89) | def extract_answer(response, row):
function eval_subject (line 98) | def eval_subject(
function cal_mmlu (line 159) | def cal_mmlu(res):
function main (line 185) | def main(args):
FILE: eval/evaluate_cmmlu.py
function load_models_tokenizer (line 24) | def load_models_tokenizer(args):
function format_example (line 49) | def format_example(line, include_answer=True):
function generate_few_shot_prompt (line 61) | def generate_few_shot_prompt(k, subject, dev_df):
function get_logits (line 73) | def get_logits(tokenizer, model, inputs: List[str]):
function eval_subject (line 86) | def eval_subject(
function cal_cmmlu (line 164) | def cal_cmmlu(res):
function main (line 281) | def main(args):
FILE: eval/evaluate_gsm8k.py
function doc_to_text (line 16) | def doc_to_text(doc):
function decode (line 25) | def decode(tokens_list, tokenizer, raw_text_len):
function generate_sample (line 39) | def generate_sample(model, tokenizer, input_txt):
function extract_answer_hf (line 50) | def extract_answer_hf(completion):
function extract_answer (line 60) | def extract_answer(completion):
function is_correct (line 68) | def is_correct(completion, answer):
FILE: eval/evaluate_humaneval.py
function decode (line 15) | def decode(tokens_list, tokenizer, raw_text_len):
function generate_sample (line 29) | def generate_sample(model, tokenizer, input_txt):
FILE: eval/evaluate_mmlu.py
function load_models_tokenizer (line 22) | def load_models_tokenizer(args):
function format_example (line 44) | def format_example(line, include_answer=True):
function generate_few_shot_prompt (line 56) | def generate_few_shot_prompt(k, subject, dev_df):
function get_logits (line 78) | def get_logits(tokenizer, model, inputs: List[str]):
function eval_subject (line 94) | def eval_subject(
function cal_mmlu (line 166) | def cal_mmlu(res):
function main (line 194) | def main(args):
FILE: eval/evaluate_plugin.py
function is_callable (line 18) | def is_callable(response, golden):
function process_res (line 22) | def process_res(response):
class _DummyTokenizer (line 61) | class _DummyTokenizer:
method tokenize (line 62) | def tokenize(self, text: str):
function _get_tokenized_string (line 66) | def _get_tokenized_string(tokenizer, text_list):
function eval_action (line 79) | def eval_action(job):
function eval_action_input (line 90) | def eval_action_input(job, tokenizer):
class QWenAgent (line 118) | class QWenAgent(Agent):
method __init__ (line 130) | def __init__(
method generate_one (line 164) | def generate_one(self, prompt, stop):
function load_models_tokenizer (line 185) | def load_models_tokenizer(args):
function load_jobs (line 203) | def load_jobs(filename):
function react_inference (line 211) | def react_inference(filename, model, tokenizer):
function main (line 228) | def main(args):
FILE: examples/add_merges.py
function load_tiktoken_bpe (line 21) | def load_tiktoken_bpe(tiktoken_bpe_file: str) -> "dict[bytes, int]":
function dump_tiktoken_bpe (line 29) | def dump_tiktoken_bpe(bpe_ranks: "dict[bytes, int]", tiktoken_bpe_file: ...
function bytes_to_pieces (line 35) | def bytes_to_pieces(the_bytes: bytes) -> "tuple[bytes]":
function get_pairs (line 39) | def get_pairs(pieces: "tuple[bytes]") -> "set[tuple[bytes, bytes]]":
function get_stats (line 43) | def get_stats(
function merge_vocab (line 53) | def merge_vocab(
function apply_bp (line 59) | def apply_bp(
function bpe (line 84) | def bpe(word: bytes, merges: "dict[bytes,int]") -> "tuple[bytes, ...]":
function best_pair_sort_key (line 97) | def best_pair_sort_key(
function learn_bpe (line 109) | def learn_bpe(
function load_expand_vocab (line 133) | def load_expand_vocab(path: Path) -> "dict[str, int]":
function make_new_merges_by_bpe (line 161) | def make_new_merges_by_bpe(
function main (line 194) | def main():
FILE: examples/auto_comments.py
function parse_args (line 14) | def parse_args():
class QWenChat (line 21) | class QWenChat():
method __init__ (line 22) | def __init__(self):
method chat (line 38) | def chat(self, query, system = ""):
function gen_code_comments (line 49) | def gen_code_comments(context, model = None, **kwargs):
function read_file (line 53) | def read_file(path):
function write_file (line 58) | def write_file(path, context):
function split_context_by_maxline (line 63) | def split_context_by_maxline(text):
function split_context_by_splitkey (line 75) | def split_context_by_splitkey(text):
function merge_code_and_comments (line 80) | def merge_code_and_comments(original_file, comments_path):
function deal_one_file (line 138) | def deal_one_file(model, path, args):
function deal_folder (line 164) | def deal_folder(model, path, args):
function transfer (line 175) | def transfer(args):
FILE: examples/function_call_examples.py
function call_qwen (line 18) | def call_qwen(messages, functions=None):
function test_1 (line 37) | def test_1():
function test_2 (line 58) | def test_2():
function test_3 (line 171) | def test_3():
function test_4 (line 220) | def test_4():
FILE: examples/function_call_finetune_examples.py
function format_train_sample (line 19) | def format_train_sample(messages):
function build_react_instruction (line 60) | def build_react_instruction(functions):
function main (line 86) | def main():
FILE: examples/react_demo.py
function llm_with_plugin (line 72) | def llm_with_plugin(prompt: str, history, list_of_plugin_info=()):
function build_input_text (line 99) | def build_input_text(chat_history, list_of_plugin_info) -> str:
function text_completion (line 144) | def text_completion(input_text: str, stop_words) -> str: # 作为一个文本续写模型来使用
function parse_latest_plugin_call (line 165) | def parse_latest_plugin_call(text):
function call_plugin (line 190) | def call_plugin(plugin_name: str, plugin_args: str) -> str:
function test (line 210) | def test():
FILE: examples/vllm_wrapper.py
function get_stop_words_ids (line 23) | def get_stop_words_ids(chat_format, tokenizer):
function make_context (line 32) | def make_context(
class vLLMWrapper (line 104) | class vLLMWrapper:
method __init__ (line 105) | def __init__(self,
method chat (line 147) | def chat(self,
FILE: finetune.py
class ModelArguments (line 25) | class ModelArguments:
class DataArguments (line 30) | class DataArguments:
class TrainingArguments (line 41) | class TrainingArguments(transformers.TrainingArguments):
class LoraArguments (line 54) | class LoraArguments:
function maybe_zero_3 (line 66) | def maybe_zero_3(param):
function get_peft_state_maybe_zero_3 (line 77) | def get_peft_state_maybe_zero_3(named_params, bias):
function rank0_print (line 104) | def rank0_print(*args):
function safe_save_model_for_hf_trainer (line 109) | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output...
function preprocess (line 125) | def preprocess(
class SupervisedDataset (line 179) | class SupervisedDataset(Dataset):
method __init__ (line 182) | def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokeniz...
method __len__ (line 193) | def __len__(self):
method __getitem__ (line 196) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
class LazySupervisedDataset (line 204) | class LazySupervisedDataset(Dataset):
method __init__ (line 207) | def __init__(self, raw_data, tokenizer: transformers.PreTrainedTokeniz...
method __len__ (line 217) | def __len__(self):
method __getitem__ (line 220) | def __getitem__(self, i) -> Dict[str, torch.Tensor]:
function make_supervised_data_module (line 235) | def make_supervised_data_module(
function train (line 256) | def train():
FILE: openai_api.py
class BasicAuthMiddleware (line 29) | class BasicAuthMiddleware(BaseHTTPMiddleware):
method __init__ (line 31) | def __init__(self, app, username: str, password: str):
method dispatch (line 36) | async def dispatch(self, request: Request, call_next):
function _gc (line 50) | def _gc(forced: bool = False):
function lifespan (line 63) | async def lifespan(app: FastAPI): # collects GPU memory
class ModelCard (line 79) | class ModelCard(BaseModel):
class ModelList (line 89) | class ModelList(BaseModel):
class ChatMessage (line 94) | class ChatMessage(BaseModel):
class DeltaMessage (line 100) | class DeltaMessage(BaseModel):
class ChatCompletionRequest (line 105) | class ChatCompletionRequest(BaseModel):
class ChatCompletionResponseChoice (line 117) | class ChatCompletionResponseChoice(BaseModel):
class ChatCompletionResponseStreamChoice (line 123) | class ChatCompletionResponseStreamChoice(BaseModel):
class ChatCompletionResponse (line 129) | class ChatCompletionResponse(BaseModel):
function list_models (line 138) | async def list_models():
function add_extra_stop_words (line 145) | def add_extra_stop_words(stop_words):
function trim_stop_words (line 157) | def trim_stop_words(response, stop_words):
function parse_messages (line 191) | def parse_messages(messages, functions):
function parse_response (line 310) | def parse_response(response):
function text_complete_last_message (line 356) | def text_complete_last_message(history, stop_words_ids, gen_kwargs, syst...
function create_chat_completion (line 386) | async def create_chat_completion(request: ChatCompletionRequest):
function _dump_json (line 460) | def _dump_json(data: BaseModel, *args, **kwargs) -> str:
function predict (line 467) | async def predict(
function _get_args (line 536) | def _get_args():
FILE: recipes/inference/vllm/vllm_wrapper.py
function get_stop_words_ids (line 23) | def get_stop_words_ids(chat_format, tokenizer):
function make_context (line 32) | def make_context(
class vLLMWrapper (line 104) | class vLLMWrapper:
method __init__ (line 105) | def __init__(self,
method chat (line 147) | def chat(self,
FILE: recipes/tests/test_finetune/test_finetune_ds.py
function test_finetune (line 50) | def test_finetune(num_gpus, train_type, is_chat, docker_version, deepspe...
FILE: recipes/tests/test_inference/test_inference_api.py
function test_inference_api (line 34) | def test_inference_api(docker_version, use_cpu, use_int4):
FILE: recipes/tests/test_inference/test_inference_vllm_fschat.py
function test_inference_vllm_fschat (line 29) | def test_inference_vllm_fschat(num_gpus, use_int4):
FILE: recipes/tests/utils.py
function run_in_subprocess (line 7) | def run_in_subprocess(cmd):
function simple_openai_api (line 37) | def simple_openai_api(model):
function TelnetPort (line 51) | def TelnetPort(server_ip, port):
FILE: run_gptq.py
function preprocess (line 13) | def preprocess(
FILE: utils.py
function _device_map (line 6) | def _device_map(num_gpus, num_layers):
function load_model_on_gpus (line 28) | def load_model_on_gpus(model_name_or_path, num_gpus: int = 2):
FILE: web_demo.py
function _get_args (line 21) | def _get_args():
function _load_model_tokenizer (line 40) | def _load_model_tokenizer(args):
function postprocess (line 64) | def postprocess(self, y):
function _parse_text (line 78) | def _parse_text(text):
function _gc (line 110) | def _gc():
function _launch_demo (line 117) | def _launch_demo(args, model, tokenizer, config):
function main (line 200) | def main():
Condensed preview — 114 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,039K chars).
[
{
"path": ".dockerignore",
"chars": 119,
"preview": "__pycache__\n*.so\nbuild\n.coverage_*\n*.egg-info\n*~\n.vscode/\n.idea/\n.git/\n.github/\n.DS_Store\n\n/private/\n/README-docker.md\n"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.yaml",
"chars": 2806,
"preview": "name: 🐞 Bug\ndescription: 提交错误报告 | File a bug/issue\ntitle: \"[BUG] <title>\"\nlabels: []\nbody:\n - type: checkboxes\n attr"
},
{
"path": ".github/ISSUE_TEMPLATE/config.yaml",
"chars": 27,
"preview": "blank_issues_enabled: true\n"
},
{
"path": ".github/ISSUE_TEMPLATE/feature_request.yaml",
"chars": 2063,
"preview": "name: \"💡 Feature Request\"\ndescription: 创建新功能请求 | Create a new ticket for a new feature request\ntitle: \"💡 [REQUEST] - <ti"
},
{
"path": ".github/workflows/stale.yml",
"chars": 807,
"preview": "name: Close stale issues\non:\n schedule:\n - cron: \"0 8 * * *\"\n\njobs:\n close-issues:\n runs-on: ubuntu-latest\n p"
},
{
"path": ".gitignore",
"chars": 86,
"preview": "__pycache__\n*.so\nbuild\n.coverage_*\n*.egg-info\n*~\n.vscode/\n.idea/\n.DS_Store\n\n/private/\n"
},
{
"path": "FAQ.md",
"chars": 3626,
"preview": "# FAQ\n\n## Installation & Environment\n\n#### Failure in installing flash attention\n\nFlash attention is an option for accel"
},
{
"path": "FAQ_ja.md",
"chars": 2314,
"preview": "# FAQ\n\n## インストールと環境\n\n#### Flash attention 導入の失敗例\n\nFlash attention は、トレーニングと推論を加速するオプションです。H100、A100、RTX 3090、T4、RTX 2080"
},
{
"path": "FAQ_zh.md",
"chars": 2404,
"preview": "# FAQ\n\n## 安装&环境\n\n#### flash attention 安装失败\n\nflash attention是一个用于加速模型训练推理的可选项,且仅适用于Turing、Ampere、Ada、Hopper架构的Nvidia GPU显"
},
{
"path": "LICENSE",
"chars": 11342,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "NOTICE",
"chars": 15305,
"preview": "------------- LICENSE FOR NVIDIA Megatron-LM code --------------\n\nCopyright (c) 2022, NVIDIA CORPORATION. All rights re"
},
{
"path": "README.md",
"chars": 74467,
"preview": "<p align=\"left\">\n <a href=\"README_CN.md\">中文</a>  |  English  |  <a href=\"README_JA.md\">日本語</a> | &nbs"
},
{
"path": "README_CN.md",
"chars": 55084,
"preview": "<p align=\"left\">\n 中文</a>  |  <a href=\"README.md\">English</a>  |  <a href=\"README_JA.md\">日本語</a> | &nb"
},
{
"path": "README_ES.md",
"chars": 71639,
"preview": "<p align=\"left\">\n <a href=\"README_CN.md\">中文</a>  |  <a href=\"README.md\">English</a>  |  <a href=\"READ"
},
{
"path": "README_FR.md",
"chars": 73351,
"preview": "<p align=\"left\">\n <a href=\"README_CN.md\">中文</a>  |  <a href=\"README.md\">English</a>  |  <a href=\"READ"
},
{
"path": "README_JA.md",
"chars": 55905,
"preview": "<p align=\"left\">\n <a href=\"README_CN.md\">中文</a>  |  <a href=\"README.md\">English</a>  |  日本語 |  <a"
},
{
"path": "Tongyi Qianwen LICENSE AGREEMENT",
"chars": 6894,
"preview": "Tongyi Qianwen LICENSE AGREEMENT\n\nTongyi Qianwen Release Date: August 3, 2023\n\nBy clicking to agree or by using or distr"
},
{
"path": "Tongyi Qianwen RESEARCH LICENSE AGREEMENT",
"chars": 7280,
"preview": "Tongyi Qianwen RESEARCH LICENSE AGREEMENT\n\nTongyi Qianwen Release Date: November 30, 2023\n\nBy clicking to agree or by us"
},
{
"path": "ascend-support/README.md",
"chars": 690,
"preview": "# 昇腾910架构基于mindformers推理Qwen-7B-Chat模型\n\n## 环境要求\n\n- 硬件:Ascend 910A/B\n\n## 运行步骤\n\n首先参考Qwen README下载官方模型到`/path/to/Qwen-7B-Ch"
},
{
"path": "ascend-support/docker_qwen.sh",
"chars": 1607,
"preview": "#!/bin/bash\n\nIMAGE_NAME=qwenllm/qwen-mindspore:v23.0.RC3\nCONTAINER_NAME=qwen-mindspore\nCHECKPOINT_PATH='NOT_SET'\n\nDOCKER"
},
{
"path": "cli_demo.py",
"chars": 7259,
"preview": "# Copyright (c) Alibaba Cloud.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the roo"
},
{
"path": "dcu-support/README.md",
"chars": 1216,
"preview": "# DCU 架构基于 fastllm 推理 Qwen 模型\n\n\n## 环境配置\n\n### 环境准备\n\n```\ndocker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13."
},
{
"path": "dcu-support/cli_demo.py",
"chars": 1034,
"preview": "# coding=utf-8\nimport argparse\nfrom fastllm_pytools import llm\n\ndef args_parser():\n parser = argparse.ArgumentParser("
},
{
"path": "dcu-support/cli_demo_batch.py",
"chars": 1039,
"preview": "import argparse\nfrom fastllm_pytools import llm\nimport time\n\ndef args_parser():\n parser = argparse.ArgumentParser(des"
},
{
"path": "dcu-support/model.properties",
"chars": 181,
"preview": "# 模型唯一标识\nmodelCode = 411\n# 模型名称\nmodelName=qwen-7b_fastllm\n# 模型描述\nmodelDescription=qwen-7b是阿里云研发的通义千问大模型系列的70亿参数规模的模型\n# 应"
},
{
"path": "dcu-support/package/fastllm_pytools/__init__.py",
"chars": 17,
"preview": "__all__ = [\"llm\"]"
},
{
"path": "dcu-support/package/fastllm_pytools/hf_model.py",
"chars": 6908,
"preview": "from fastllm_pytools import llm;\nimport torch;\nimport ctypes;\nimport numpy as np;\n\nfastllm_data_type_dict = {\n \"int4\""
},
{
"path": "dcu-support/package/fastllm_pytools/llm.py",
"chars": 24113,
"preview": "import ctypes;\nimport math\nimport os;\nimport threading\nfrom typing import Optional, Tuple, Union, List, Callable, Dict, "
},
{
"path": "dcu-support/package/fastllm_pytools/torch2flm.py",
"chars": 8142,
"preview": "import struct\nimport numpy as np\nimport torch\n\ndef writeString(fo, s):\n fo.write(struct.pack('i', len(s)))\n fo.wri"
},
{
"path": "dcu-support/package/setup.py",
"chars": 307,
"preview": "from setuptools import setup, find_packages\n\nsetup (\n name = \"fastllm_pytools\",\n version = \"0.0.1\",\n descriptio"
},
{
"path": "dcu-support/qwen2flm.py",
"chars": 755,
"preview": "import sys\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom transformers.generation import GenerationCo"
},
{
"path": "dcu-support/requirements.txt",
"chars": 141,
"preview": "transformers==4.32.0\ntiktoken\nstreamlit>=1.24.0\nsentencepiece\nurllib3==1.26.16\ntransformers_stream_generator==0.0.4\nacce"
},
{
"path": "dcu-support/web_demo.py",
"chars": 1087,
"preview": "import streamlit as st\nfrom streamlit_chat import message\nfrom fastllm_pytools import llm\nimport sys\n\nst.set_page_config"
},
{
"path": "docker/Dockerfile",
"chars": 2457,
"preview": "ARG CUDA_VERSION=11.7.1\nARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04\n\nFROM ${from} as base\n\nARG from\n\nR"
},
{
"path": "docker/Dockerfile-cu114",
"chars": 2307,
"preview": "ARG CUDA_VERSION=11.4.3\nARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04\n\nFROM ${from} as base\n\nARG from\n\nR"
},
{
"path": "docker/Dockerfile-cu121",
"chars": 2758,
"preview": "ARG CUDA_VERSION=12.1.0\nARG from=nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04\n\nFROM ${from} as base\n\nARG from\n\nR"
},
{
"path": "docker/docker_cli_demo.sh",
"chars": 1357,
"preview": "#!/usr/bin/env bash\n#\n# This script will automatically pull docker image from DockerHub, and start a container to run th"
},
{
"path": "docker/docker_openai_api.sh",
"chars": 1818,
"preview": "#!/usr/bin/env bash\n#\n# This script will automatically pull docker image from DockerHub, and start a daemon container to"
},
{
"path": "docker/docker_web_demo.sh",
"chars": 1797,
"preview": "#!/usr/bin/env bash\n#\n# This script will automatically pull docker image from DockerHub, and start a daemon container to"
},
{
"path": "eval/EVALUATION.md",
"chars": 3281,
"preview": "## 评测复现\n\n- CEVAL\n\n```Shell\nwget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip\nmkdir data/"
},
{
"path": "eval/evaluate_ceval.py",
"chars": 14054,
"preview": "import os\nfrom typing import List\nimport argparse\nimport torch\nimport pandas as pd\nimport numpy as np\nfrom tqdm import t"
},
{
"path": "eval/evaluate_chat_ceval.py",
"chars": 14676,
"preview": "import os\r\nimport argparse\r\nimport re\r\nimport torch\r\nimport pandas as pd\r\nfrom thefuzz import process\r\nfrom tqdm import "
},
{
"path": "eval/evaluate_chat_gsm8k.py",
"chars": 8745,
"preview": "import json\r\nimport re\r\nfrom pathlib import Path\r\nimport argparse\r\nimport requests\r\nimport math\r\nimport numpy as np\r\nimp"
},
{
"path": "eval/evaluate_chat_humaneval.py",
"chars": 4213,
"preview": "import re\r\nimport textwrap\r\nimport argparse\r\nfrom pathlib import Path\r\nimport tqdm\r\nimport jsonlines\r\nfrom transformers "
},
{
"path": "eval/evaluate_chat_mmlu.py",
"chars": 9684,
"preview": "import os\r\nimport argparse\r\nimport re\r\nimport torch\r\nimport pandas as pd\r\nfrom tqdm import tqdm\r\nfrom thefuzz import pro"
},
{
"path": "eval/evaluate_cmmlu.py",
"chars": 10703,
"preview": "import os\nimport pandas as pd\nimport numpy as np\nimport argparse\nimport datasets\nimport torch\nfrom collections import de"
},
{
"path": "eval/evaluate_gsm8k.py",
"chars": 3770,
"preview": "import re\nimport torch\nimport argparse\nimport jsonlines\nimport numpy as np\nimport datasets\nfrom datasets import load_fro"
},
{
"path": "eval/evaluate_humaneval.py",
"chars": 2683,
"preview": "import argparse\nimport tqdm\nimport torch\nimport jsonlines\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nf"
},
{
"path": "eval/evaluate_mmlu.py",
"chars": 9890,
"preview": "import os\nfrom typing import List\nimport pandas as pd\nimport numpy as np\nimport argparse\nimport torch\nfrom tqdm import t"
},
{
"path": "eval/evaluate_plugin.py",
"chars": 10233,
"preview": "import argparse\nimport json\nimport os\nimport pprint\n\nimport json5\nimport jsonlines\nfrom rouge_score import rouge_scorer\n"
},
{
"path": "eval/gsm8k_prompt.txt",
"chars": 4047,
"preview": "Question: In 2004, there were 60 kids at a cookout. In 2005, half the number of kids came to the cookout as compared to "
},
{
"path": "examples/add_merges.py",
"chars": 7664,
"preview": "import argparse\r\nimport base64\r\nimport collections\r\nimport logging\r\nimport unicodedata\r\nfrom pathlib import Path\r\n\r\nimpo"
},
{
"path": "examples/auto_comments.md",
"chars": 1323,
"preview": "# Auto Comments \n本文档介绍Auto Comments,这是一个利用Qwen模型为代码文件自动生成注释的使用案例。\n\n# 使用方法\n您可以直接执行如下命令,为提供的代码文件生成注释:\n```\npython auto_comm"
},
{
"path": "examples/auto_comments.py",
"chars": 7212,
"preview": "# 运行方式:python auto_comments.py --path 'path of file or folder'\n# 脚本功能:使用QWen-7B-Chat为提供的代码文件自动生成注释。(详见auto_comments.md)\n"
},
{
"path": "examples/function_call_examples.py",
"chars": 7292,
"preview": "# Reference: https://openai.com/blog/function-calling-and-other-api-updates\nimport json\nfrom pprint import pprint\n\nimpor"
},
{
"path": "examples/function_call_finetune_examples.py",
"chars": 7515,
"preview": "#\n# # Fine-tuning Script:\n# Please start by reading the Fine-tuning section of README.md.\n#\n# # Fine-tuning Data Prepara"
},
{
"path": "examples/langchain_tooluse.ipynb",
"chars": 33498,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"30e24ef3\",\n \"metadata\": {\n \"tags\": []\n },\n \"source\": [\n"
},
{
"path": "examples/qwen_extra.tiktoken",
"chars": 144,
"preview": "5LiA5Y+q54yr 151851\n5Y+q54yr 151852\n5piv5LiA5Y+q54yr 151853\n5oiR5piv5LiA5Y+q54yr 151854\n5L2g5piv5LiA5Y+q54yr 151855\n5LuW"
},
{
"path": "examples/qwen_extra_vocab.txt",
"chars": 57,
"preview": "我是一只猫\t20\r\n你是一只猫\t10\r\n他是一只猫\t5\r\n一只\t200\r\n一只猫\t100\r\n夸张的 比喻手法\t20"
},
{
"path": "examples/react_demo.py",
"chars": 10287,
"preview": "#\n# 相关材料:\n# ReAct Prompting 原理简要介绍,不包含代码实现:\n# https://github.com/QwenLM/Qwen-7B/blob/main/examples/react_prompt."
},
{
"path": "examples/react_prompt.md",
"chars": 8981,
"preview": "# ReAct Prompting 示例\n\n本文档将介绍如何用 ReAct Prompting 技术命令千问使用工具。\n\n本文档主要基本的原理概念介绍,并在文末附上了一些具体实现相关的 FAQ,但不含被调用插件的实际实现。如果您更喜欢一边调"
},
{
"path": "examples/system_prompt.md",
"chars": 3842,
"preview": "# 系统指令 (System Prompts)\n\n## 什么是系统指令? (What is the System Prompts?)\n\n系统指令设定了AI助手的行为模式,例如人物设定、语言风格、任务模式、甚至针对具体问题的具体行为。\n\nSy"
},
{
"path": "examples/tokenizer_showcase.ipynb",
"chars": 25819,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [\n {\n \"name\":"
},
{
"path": "examples/transformers_agent.md",
"chars": 4941,
"preview": "## 什么是HuggingFace Agent\n使用大模型作为Agent,仅需自然语言就可调用HuggingFace中的模型,目前支持两种模式:\n\n- run模式:单轮对话,没有上下文,单个prompt多tool组合调用能力好\n- chat"
},
{
"path": "examples/vllm_wrapper.py",
"chars": 9867,
"preview": "from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList\r\nfrom typing import Optional, Calla"
},
{
"path": "finetune/ds_config_zero2.json",
"chars": 1227,
"preview": "{\n \"fp16\": {\n \"enabled\": \"auto\",\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"initial_"
},
{
"path": "finetune/ds_config_zero3.json",
"chars": 1498,
"preview": "{\n \"fp16\": {\n \"enabled\": \"auto\",\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"initial_"
},
{
"path": "finetune/finetune_ds.sh",
"chars": 2595,
"preview": "#!/bin/bash\nexport CUDA_DEVICE_MAX_CONNECTIONS=1\nDIR=`pwd`\n\n# Guide:\n# This script supports distributed training on mult"
},
{
"path": "finetune/finetune_lora_ds.sh",
"chars": 2760,
"preview": "#!/bin/bash\nexport CUDA_DEVICE_MAX_CONNECTIONS=1\nDIR=`pwd`\n\n# Guide:\n# This script supports distributed training on mult"
},
{
"path": "finetune/finetune_lora_single_gpu.sh",
"chars": 1615,
"preview": "#!/bin/bash\nexport CUDA_DEVICE_MAX_CONNECTIONS=1\n\nMODEL=\"Qwen/Qwen-7B\" # Set the path if you do not want to load from hu"
},
{
"path": "finetune/finetune_qlora_ds.sh",
"chars": 2696,
"preview": "#!/bin/bash\nexport CUDA_DEVICE_MAX_CONNECTIONS=1\nDIR=`pwd`\n\n# Guide:\n# This script supports distributed training on mult"
},
{
"path": "finetune/finetune_qlora_single_gpu.sh",
"chars": 1637,
"preview": "#!/bin/bash\nexport CUDA_DEVICE_MAX_CONNECTIONS=1\nDIR=`pwd`\n\nMODEL=\"Qwen/Qwen-7B-Chat-Int4\" # Set the path if you do not "
},
{
"path": "finetune.py",
"chars": 12491,
"preview": "# This code is based on the revised code from fastchat based on tatsu-lab/stanford_alpaca.\n\n\nfrom dataclasses import dat"
},
{
"path": "openai_api.py",
"chars": 21097,
"preview": "# Requirement:\n# pip install \"openai<1.0\"\n# Usage:\n# python openai_api.py\n# Visit http://localhost:8000/docs for doc"
},
{
"path": "recipes/applications/chatbot/qwen_chatbot.ipynb",
"chars": 3059,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"54d5d255-aa98-4655-8dd1-bc726430d86a\",\n \"metadata\": {},\n "
},
{
"path": "recipes/applications/domain_finetune/qwen_domain_finetune.ipynb",
"chars": 15779,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"0e7993c3-3999-4ac5-b1dc-77875d80e4c8\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/applications/retrieval/retrieval.ipynb",
"chars": 14556,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"245ab07a-fb2f-4cf4-ab9a-5c05a9b44daa\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/finetune/ascend/README.md",
"chars": 4095,
"preview": "# Fine-tuning Qwen by Ascend NPU\nBelow, we provide a simple example to show how to finetune Qwen by Ascend NPU. Currentl"
},
{
"path": "recipes/finetune/deepspeed/finetune_fullparameter_multi_gpu.ipynb",
"chars": 6494,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"6e6981ab-2d9a-4280-923f-235a166855ba\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/finetune/deepspeed/finetune_fullparameter_single_gpu.ipynb",
"chars": 7233,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"6e6981ab-2d9a-4280-923f-235a166855ba\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/finetune/deepspeed/finetune_lora_multi_gpu.ipynb",
"chars": 8287,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"6e6981ab-2d9a-4280-923f-235a166855ba\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/finetune/deepspeed/finetune_lora_single_gpu.ipynb",
"chars": 8147,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"6e6981ab-2d9a-4280-923f-235a166855ba\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/finetune/deepspeed/finetune_qlora_multi_gpu.ipynb",
"chars": 9044,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"6e6981ab-2d9a-4280-923f-235a166855ba\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/finetune/deepspeed/finetune_qlora_single_gpu.ipynb",
"chars": 8693,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"6e6981ab-2d9a-4280-923f-235a166855ba\",\n \"metadata\": {},\n \"so"
},
{
"path": "recipes/finetune/deepspeed/readme.md",
"chars": 21044,
"preview": "# Fine-tuning Qwen Using Deepspeed\n\n\n## TL;DR\n\nWe provide the official training script `finetune.py` and serveral notebo"
},
{
"path": "recipes/finetune/deepspeed/requirements.txt",
"chars": 14,
"preview": "deepspeed\npeft"
},
{
"path": "recipes/finetune/swift/README.md",
"chars": 12895,
"preview": "## Introduction\n[SWIFT](https://github.com/modelscope/swift) (Scalable lightWeight Infrastructure for Fine-Tuning) is an"
},
{
"path": "recipes/finetune/swift/README_CN.md",
"chars": 11381,
"preview": "## 介绍\n[SWIFT](https://github.com/modelscope/swift)(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展的轻量级一站式训练、推"
},
{
"path": "recipes/inference/dashscope/README.md",
"chars": 2682,
"preview": "# Inference Qwen Using DashScope\n\nThe most simple way to use Qwen through APIs is DashScope API service through Alibaba "
},
{
"path": "recipes/inference/hf_modelscope/README.md",
"chars": 10295,
"preview": "# Inference Qwen Using 🤖 ModelScope and 🤗 Transformers\n\nBelow, we provide simple examples to show how to inference Qwen "
},
{
"path": "recipes/inference/quantization/README.md",
"chars": 5897,
"preview": "# Quantization\n\n## GPTQ\n\nWe provide a solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release th"
},
{
"path": "recipes/inference/tensorrt/README.md",
"chars": 1858,
"preview": "# Inference Qwen Using TensorRT-LLM\nBelow, we provide a simple example to show how to inference Qwen by TensorRT-LLM. We"
},
{
"path": "recipes/inference/tensorrt/docker/Dockerfile",
"chars": 565,
"preview": "FROM nvidia/cuda:12.1.0-devel-ubuntu22.04\n\nRUN apt-get update && \\\n apt-get -y install python3.10 python3-pip openmpi"
},
{
"path": "recipes/inference/vllm/README.md",
"chars": 7292,
"preview": "# Inference Qwen Using vLLM\n\nFor deployment and fast inference, we suggest using vLLM. \n\n## Installation\n\nIf you use cud"
},
{
"path": "recipes/inference/vllm/template_chatml.jinja",
"chars": 345,
"preview": "{% for message in messages %}\n{% if loop.first and message['role'] != 'system' %}{{ '<|im_start|>system\\nYou are a helpf"
},
{
"path": "recipes/inference/vllm/vllm_wrapper.py",
"chars": 9959,
"preview": "from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList\r\nfrom typing import Optional, Calla"
},
{
"path": "recipes/quickstart/qwen.ipynb",
"chars": 13935,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"## Qwen Quick Start Notebook\\n\",\n "
},
{
"path": "recipes/tests/README.md",
"chars": 243,
"preview": "# Unit testing\n- Run all unit testing\n```bash\ncd tests && pytest -s \n```\n- Run unit testing under a single folder\n```bas"
},
{
"path": "recipes/tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "recipes/tests/assets/test_sampled_qwen.json",
"chars": 278,
"preview": "[{\"conversations\": [{\"from\": \"user\", \"value\": \"你好\"}, {\"from\": \"assistant\", \"value\": \"你好!很高兴为你提供帮助。\"}], \"id\": \"identity_0"
},
{
"path": "recipes/tests/test_finetune/test_finetune_ds.py",
"chars": 3508,
"preview": "import os\nimport sys\nimport pytest\nimport shutil\nfrom itertools import product\nimport torch\nfrom modelscope.hub.snapshot"
},
{
"path": "recipes/tests/test_inference/test_inference_api.py",
"chars": 2836,
"preview": "import os\nimport sys\nimport time\nimport pytest\nimport subprocess\nimport torch\nfrom modelscope.hub.snapshot_download impo"
},
{
"path": "recipes/tests/test_inference/test_inference_vllm_fschat.py",
"chars": 2479,
"preview": "import os\nimport sys\nimport time\nimport pytest\nimport subprocess\nimport torch\nfrom modelscope.hub.snapshot_download impo"
},
{
"path": "recipes/tests/ut_config.py",
"chars": 568,
"preview": "import os\n\n# common\nMODEL_TYPE = \"Qwen/Qwen-1_8B\"\nDOCKER_VERSION_CU114 = \"qwenllm/qwen:cu114\"\nDOCKER_VERSION_CU117 = \"qw"
},
{
"path": "recipes/tests/utils.py",
"chars": 1918,
"preview": "import logging\nimport subprocess\nimport socket\nimport openai\n\n\ndef run_in_subprocess(cmd):\n try:\n with subproc"
},
{
"path": "requirements.txt",
"chars": 99,
"preview": "transformers>=4.32.0,<4.38.0\naccelerate\ntiktoken\neinops\ntransformers_stream_generator==0.0.4\nscipy\n"
},
{
"path": "requirements_web_demo.txt",
"chars": 23,
"preview": "gradio<3.42\nmdtex2html\n"
},
{
"path": "run_gptq.py",
"chars": 4263,
"preview": "import argparse\r\nimport json\r\nfrom typing import Dict\r\nimport logging\r\n\r\nimport torch\r\nimport transformers\r\nfrom transfo"
},
{
"path": "tech_memo.md",
"chars": 21695,
"preview": "# Introducing Qwen-7B: Open foundation and human-aligned models (of the state-of-the-arts)\n\nLarge language models have r"
},
{
"path": "tokenization_note.md",
"chars": 12465,
"preview": "# Tokenization\r\n\r\nQwen-7B uses BPE tokenization on UTF-8 bytes using the `tiktoken` package.\r\nThere are two types of tok"
},
{
"path": "tokenization_note_ja.md",
"chars": 4597,
"preview": "# トークン化\r\n\r\nQwen-7B は `tiktoken` パッケージを使用して、UTF-8 バイトを BPE トークン化します。\r\nQwen-7B には 2 種類のトークンがあります。BPE の通常のトークン (`bytes` 型) "
},
{
"path": "tokenization_note_zh.md",
"chars": 8426,
"preview": "# Tokenization\r\n\r\n> 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。\r\n\r\nQwen-7B采用UTF-8字节级别的BPE tokenization方式,并依赖`tikto"
},
{
"path": "utils.py",
"chars": 1385,
"preview": "import torch\nfrom transformers import AutoModelForCausalLM\nfrom accelerate import dispatch_model\n\n\ndef _device_map(num_g"
},
{
"path": "web_demo.py",
"chars": 7287,
"preview": "# Copyright (c) Alibaba Cloud.\n#\n# This source code is licensed under the license found in the\n# LICENSE file in the roo"
}
]
About this extraction
This page contains the full source code of the QwenLM/Qwen GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 114 files (957.2 KB), approximately 299.7k tokens, and a symbol index with 218 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.