Repository: createmomo/Open-Source-Language-Model-Pocket
Branch: main
Commit: 5e89e7703ab7
Files: 3
Total size: 724.3 KB

Directory structure:
gitextract_7o7yxv9l/

├── 01-colossalai-sft-kaggle.ipynb
├── 02-colossalai-sft-colab.ipynb
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: 01-colossalai-sft-kaggle.ipynb
================================================
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"**Language Model Playground 01:** ColossalAI (SFT part)\n- https://github.com/hpcaitech/ColossalAI\n\n**Note:**\n- This notebook **only demonstrates how to get the SFT part of ColossalAI running in a Kaggle notebook**; it does not cover hyperparameter tuning, analysis of results, etc.\n- In theory, the same steps should also work in Google Colab\n- **If you have your own machine**, this notebook may not be of much help (you don't need to train inside a notebook)\n- The intended audience is **people without GPU resources who still want to get familiar with ColossalAI and give it a quick try**","metadata":{}},{"cell_type":"markdown","source":"**Preparing the data:**\n1. As noted in the [official documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples), the data must be prepared before running\n2. The data can be downloaded [here](https://github.com/XueFuzhao/InstructionWild/tree/main/data). Do not download the seed files (they contain only instructions, with no responses); download the json files mentioned in the README instead, e.g. instinwild_ch.json\n3. Upload the data to a Kaggle Dataset (you need to create your own Dataset, which can be private); just follow Kaggle's steps","metadata":{}},{"cell_type":"markdown","source":"**Preparing the Kaggle notebook:**\n1. **Add the Dataset** on the right side of the UI and select the one you created. After that, it can be accessed in code via an absolute path. For example, if the dataset you just created is called \"instructdata\" and contains the file \"instinwild_ch_small.json\", you can access it in code at: /kaggle/input/instructdata/instinwild_ch_small.json\n2. It is best to choose **GPU T4x2** (under Accelerator on the right side of the Kaggle notebook UI). Choosing the P100 may cause errors during installation","metadata":{}},{"cell_type":"markdown","source":"**Important caveats:**\n1. You get **30 hours** of GPU time per week\n2. Each notebook session can **run for at most 12 hours** (if you start it but barely use it, e.g. run no cells and do no editing, it may be force-terminated before the 12 hours are up; unlike Google Colab, running cells for a long time is fine)\n3. Once the session is terminated, **the output data cannot be recovered**! (Outputs are written to /kaggle/working/; if you need anything there, download it before the session ends)","metadata":{}},{"cell_type":"markdown","source":"## Setting up the environment","metadata":{}},{"cell_type":"markdown","source":"We broadly follow the official documentation for installation. **However**, following the documentation to the letter will produce errors. ColossalAI is a very active project with code changes every day, and parts of the documentation may not be updated in time. We have therefore made small adjustments to the installation order and details for the current state of the repo. **(As of: April 30, 2023)**\n\nNo doubt the ColossalAI team will soon fix these small bugs and bring the documentation up to date.","metadata":{}},{"cell_type":"markdown","source":"### 1 Install ColossalAI\n\nAfter running the command below, you will find that the downloaded files are placed under /kaggle/working/ColossalAI","metadata":{}},{"cell_type":"code","source":"!git clone https://github.com/hpcaitech/ColossalAI.git","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:57:00.220637Z","iopub.execute_input":"2023-04-30T13:57:00.221204Z","iopub.status.idle":"2023-04-30T13:57:03.206251Z","shell.execute_reply.started":"2023-04-30T13:57:00.221169Z","shell.execute_reply":"2023-04-30T13:57:03.205056Z"},"trusted":true},"execution_count":1,"outputs":[{"name":"stdout","text":"Cloning into 'ColossalAI'...\nremote: Enumerating objects: 24949, done.\u001b[K\nremote: Counting objects: 100% (2362/2362), done.\u001b[K\nremote: Compressing objects: 100% (479/479), done.\u001b[K\nremote: Total 24949 (delta 1987), reused 2083 (delta 1881), pack-reused 22587\u001b[K\nReceiving objects: 100% (24949/24949), 23.09 MiB | 29.19 MiB/s, done.\nResolving deltas: 100% (16582/16582), done.\n","output_type":"stream"}]},{"cell_type":"markdown","source":"**Install ColossalAI**\n\nIf you skip this install step, you may hit this [error](https://github.com/hpcaitech/ColossalAI/issues/3629): \"ImportError: cannot import name 'ColoInitContext' from 'colossalai.zero'\"","metadata":{}},{"cell_type":"code","source":"import os\nos.chdir('./ColossalAI')\n!pip install 
.","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:57:03.208770Z","iopub.execute_input":"2023-04-30T13:57:03.209133Z","iopub.status.idle":"2023-04-30T13:57:28.240848Z","shell.execute_reply.started":"2023-04-30T13:57:03.209098Z","shell.execute_reply":"2023-04-30T13:57:28.239606Z"},"trusted":true},"execution_count":2,"outputs":[{"name":"stdout","text":"Processing /kaggle/working/ColossalAI\n  Preparing metadata (setup.py) ... \u001b[?25ldone\n\u001b[?25hRequirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (1.21.6)\nRequirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (4.64.1)\nRequirement already satisfied: psutil in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (5.9.3)\nRequirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (23.0)\nCollecting pre-commit\n  Downloading pre_commit-2.21.0-py2.py3-none-any.whl (201 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m201.9/201.9 kB\u001b[0m \u001b[31m6.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: rich in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (13.2.0)\nRequirement already satisfied: click in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (8.1.3)\nCollecting fabric\n  Downloading fabric-3.0.1-py3-none-any.whl (53 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.3/53.3 kB\u001b[0m \u001b[31m4.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hCollecting contexttimer\n  Downloading contexttimer-0.3.3.tar.gz (4.9 kB)\n  Preparing metadata (setup.py) ... 
\u001b[?25ldone\n\u001b[?25hRequirement already satisfied: ninja in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (1.11.1)\nRequirement already satisfied: torch>=1.11 in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (1.13.0)\nCollecting safetensors\n  Downloading safetensors-0.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m33.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n\u001b[?25hRequirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from torch>=1.11->colossalai==0.2.8) (4.4.0)\nRequirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from click->colossalai==0.2.8) (4.11.4)\nCollecting paramiko>=2.4\n  Downloading paramiko-3.1.0-py3-none-any.whl (211 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.2/211.2 kB\u001b[0m \u001b[31m17.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hCollecting invoke>=2.0\n  Downloading invoke-2.1.0-py3-none-any.whl (159 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m159.9/159.9 kB\u001b[0m \u001b[31m17.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai==0.2.8) (6.0)\nCollecting cfgv>=2.0.0\n  Downloading cfgv-3.3.1-py2.py3-none-any.whl (7.3 kB)\nCollecting identify>=1.0.0\n  Downloading identify-2.5.23-py2.py3-none-any.whl (98 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.8/98.8 kB\u001b[0m \u001b[31m9.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: virtualenv>=20.10.0 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai==0.2.8) (20.17.1)\nCollecting 
nodeenv>=0.11.1\n  Downloading nodeenv-1.7.0-py2.py3-none-any.whl (21 kB)\nRequirement already satisfied: markdown-it-py<3.0.0,>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai==0.2.8) (2.1.0)\nRequirement already satisfied: pygments<3.0.0,>=2.6.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai==0.2.8) (2.14.0)\nRequirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.7/site-packages (from markdown-it-py<3.0.0,>=2.1.0->rich->colossalai==0.2.8) (0.1.2)\nRequirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from nodeenv>=0.11.1->pre-commit->colossalai==0.2.8) (59.8.0)\nCollecting pynacl>=1.5\n  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m856.7/856.7 kB\u001b[0m \u001b[31m41.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hCollecting bcrypt>=3.2\n  Downloading bcrypt-4.0.1-cp36-abi3-manylinux_2_28_x86_64.whl (593 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m593.7/593.7 kB\u001b[0m \u001b[31m42.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: cryptography>=3.3 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai==0.2.8) (38.0.2)\nRequirement already satisfied: filelock<4,>=3.4.1 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (3.9.0)\nRequirement already satisfied: platformdirs<3,>=2.4 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (2.6.2)\nRequirement already satisfied: distlib<1,>=0.3.6 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (0.3.6)\nRequirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from 
importlib-metadata->click->colossalai==0.2.8) (3.11.0)\nRequirement already satisfied: cffi>=1.12 in /opt/conda/lib/python3.7/site-packages (from cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (1.15.1)\nRequirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (2.21)\nBuilding wheels for collected packages: colossalai, contexttimer\n  Building wheel for colossalai (setup.py) ... \u001b[?25ldone\n\u001b[?25h  Created wheel for colossalai: filename=colossalai-0.2.8-py3-none-any.whl size=1059097 sha256=50ed72a86bf2ae29440764d5c46f67d000fcafbe8d4a5d8f6a947f5e57e85c70\n  Stored in directory: /tmp/pip-ephem-wheel-cache-3th5gmed/wheels/3e/97/46/e40c7da8c6931df2650672912b14531b399ef776670745f133\n  Building wheel for contexttimer (setup.py) ... \u001b[?25ldone\n\u001b[?25h  Created wheel for contexttimer: filename=contexttimer-0.3.3-py3-none-any.whl size=5818 sha256=15b3da44f55d3cf68ee6b623d09715157fa4a60ff4f805c607bca5acf41c83f4\n  Stored in directory: /root/.cache/pip/wheels/f4/67/63/f276d2acab046618878e3eaf13c5a356c9a500baf21403f345\nSuccessfully built colossalai contexttimer\nInstalling collected packages: safetensors, contexttimer, nodeenv, invoke, identify, cfgv, bcrypt, pynacl, pre-commit, paramiko, fabric, colossalai\nSuccessfully installed bcrypt-4.0.1 cfgv-3.3.1 colossalai-0.2.8 contexttimer-0.3.3 fabric-3.0.1 identify-2.5.23 invoke-2.1.0 nodeenv-1.7.0 paramiko-3.1.0 pre-commit-2.21.0 pynacl-1.5.0 safetensors-0.3.1\n\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n\u001b[0m","output_type":"stream"}]},{"cell_type":"markdown","source":"**Install transformers**\n\nHere we install the transformers fork under hpcaitech; whether a plain pip install transformers would also work has not been tested","metadata":{}},{"cell_type":"code","source":"!git clone https://github.com/hpcaitech/transformers\nos.chdir('./transformers')\n!pip install .","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:57:28.243773Z","iopub.execute_input":"2023-04-30T13:57:28.244481Z","iopub.status.idle":"2023-04-30T13:58:15.962347Z","shell.execute_reply.started":"2023-04-30T13:57:28.244440Z","shell.execute_reply":"2023-04-30T13:58:15.960975Z"},"trusted":true},"execution_count":3,"outputs":[{"name":"stdout","text":"Cloning into 'transformers'...\nremote: Enumerating objects: 124468, done.\u001b[K\nremote: Total 124468 (delta 0), reused 0 (delta 0), pack-reused 124468\u001b[K\nReceiving objects: 100% (124468/124468), 127.08 MiB | 26.91 MiB/s, done.\nResolving deltas: 100% (93344/93344), done.\nProcessing /kaggle/working/ColossalAI/transformers\n  Installing build dependencies ... \u001b[?25ldone\n\u001b[?25h  Getting requirements to build wheel ... \u001b[?25ldone\n\u001b[?25h  Preparing metadata (pyproject.toml) ... 
\u001b[?25ldone\n\u001b[?25hRequirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (23.0)\nRequirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (3.9.0)\nRequirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (1.21.6)\nRequirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (4.11.4)\nRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (6.0)\nRequirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (0.13.2)\nRequirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (0.13.3)\nRequirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (4.64.1)\nRequirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (2.28.2)\nRequirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (2021.11.10)\nRequirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.7/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.28.0.dev0) (4.4.0)\nRequirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->transformers==4.28.0.dev0) (3.11.0)\nRequirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (2022.12.7)\nRequirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (2.1.1)\nRequirement already 
satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (1.26.14)\nRequirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (3.4)\nBuilding wheels for collected packages: transformers\n  Building wheel for transformers (pyproject.toml) ... \u001b[?25ldone\n\u001b[?25h  Created wheel for transformers: filename=transformers-4.28.0.dev0-py3-none-any.whl size=6790611 sha256=4220935232e4fb5bbdd639242eec8975f925c105da87c0d4d0137e013c5479a5\n  Stored in directory: /tmp/pip-ephem-wheel-cache-u7rm17k7/wheels/f8/7e/62/d660e4bfe297957f2a56ddb6284d5815eba12ca9dfe5b1cf73\nSuccessfully built transformers\nInstalling collected packages: transformers\n  Attempting uninstall: transformers\n    Found existing installation: transformers 4.27.4\n    Uninstalling transformers-4.27.4:\n      Successfully uninstalled transformers-4.27.4\nSuccessfully installed transformers-4.28.0.dev0\n\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n\u001b[0m","output_type":"stream"}]},{"cell_type":"markdown","source":"**Install the libraries required by the Chat application**","metadata":{}},{"cell_type":"code","source":"os.chdir('/kaggle/working/ColossalAI/applications/Chat/')\n!pip install .","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:58:15.966733Z","iopub.execute_input":"2023-04-30T13:58:15.967109Z","iopub.status.idle":"2023-04-30T13:58:28.362540Z","shell.execute_reply.started":"2023-04-30T13:58:15.967072Z","shell.execute_reply":"2023-04-30T13:58:28.361342Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"Processing /kaggle/working/ColossalAI/applications/Chat\n  Preparing metadata (setup.py) ... 
\u001b[?25ldone\n\u001b[?25hRequirement already satisfied: transformers>=4.20.1 in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (4.28.0.dev0)\nRequirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (4.64.1)\nRequirement already satisfied: datasets in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (2.1.0)\nCollecting loralib\n  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)\nRequirement already satisfied: colossalai>=0.2.4 in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.2.8)\nRequirement already satisfied: torch<2.0.0,>=1.12.1 in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (1.13.0)\nCollecting langchain\n  Downloading langchain-0.0.27-py3-none-any.whl (124 kB)\n\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m124.9/124.9 kB\u001b[0m \u001b[31m4.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n\u001b[?25hRequirement already satisfied: tokenizers in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.13.2)\nRequirement already satisfied: fastapi in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.89.1)\nCollecting sse_starlette\n  Downloading sse_starlette-0.10.3-py3-none-any.whl (8.0 kB)\nRequirement already satisfied: wandb in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.14.0)\nRequirement already satisfied: sentencepiece in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.1.97)\nRequirement already satisfied: gpustat in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (1.0.0)\nRequirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (1.21.6)\nRequirement already satisfied: safetensors in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.1)\nRequirement already satisfied: click in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) 
(8.1.3)\nRequirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (23.0)\nRequirement already satisfied: fabric in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (3.0.1)\nRequirement already satisfied: psutil in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (5.9.3)\nRequirement already satisfied: pre-commit in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (2.21.0)\nRequirement already satisfied: rich in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (13.2.0)\nRequirement already satisfied: ninja in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (1.11.1)\nRequirement already satisfied: contexttimer in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.3)\nRequirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from torch<2.0.0,>=1.12.1->coati==1.0.0) (4.4.0)\nRequirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (4.11.4)\nRequirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (2.28.2)\nRequirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (3.9.0)\nRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (6.0)\nRequirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (2021.11.10)\nRequirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (0.13.3)\nRequirement already satisfied: multiprocess in /opt/conda/lib/python3.7/site-packages (from 
datasets->coati==1.0.0) (0.70.14)\nRequirement already satisfied: aiohttp in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (3.8.3)\nRequirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (1.3.5)\nRequirement already satisfied: pyarrow>=5.0.0 in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (5.0.0)\nRequirement already satisfied: xxhash in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (3.2.0)\nRequirement already satisfied: responses<0.19 in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (0.18.0)\nRequirement already satisfied: dill in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (0.3.6)\nRequirement already satisfied: fsspec[http]>=2021.05.0 in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (2023.1.0)\nRequirement already satisfied: starlette==0.22.0 in /opt/conda/lib/python3.7/site-packages (from fastapi->coati==1.0.0) (0.22.0)\nRequirement already satisfied: pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2 in /opt/conda/lib/python3.7/site-packages (from fastapi->coati==1.0.0) (1.10.4)\nRequirement already satisfied: anyio<5,>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from starlette==0.22.0->fastapi->coati==1.0.0) (3.6.2)\nRequirement already satisfied: nvidia-ml-py<=11.495.46,>=11.450.129 in /opt/conda/lib/python3.7/site-packages (from gpustat->coati==1.0.0) (11.495.46)\nRequirement already satisfied: six>=1.7 in /opt/conda/lib/python3.7/site-packages (from gpustat->coati==1.0.0) (1.16.0)\nRequirement already satisfied: blessed>=1.17.1 in /opt/conda/lib/python3.7/site-packages (from gpustat->coati==1.0.0) (1.19.1)\nRequirement already satisfied: sqlalchemy in /opt/conda/lib/python3.7/site-packages (from langchain->coati==1.0.0) (1.4.46)\nRequirement already satisfied: GitPython!=3.1.29,>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from 
wandb->coati==1.0.0) (3.1.30)\nRequirement already satisfied: pathtools in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (0.1.2)\nRequirement already satisfied: protobuf!=4.21.0,<5,>=3.12.0 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (3.20.3)\nRequirement already satisfied: setproctitle in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (1.3.2)\nRequirement already satisfied: sentry-sdk>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (1.18.0)\nRequirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (59.8.0)\nRequirement already satisfied: appdirs>=1.4.3 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (1.4.4)\nRequirement already satisfied: docker-pycreds>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (0.4.0)\nRequirement already satisfied: wcwidth>=0.1.4 in /opt/conda/lib/python3.7/site-packages (from blessed>=1.17.1->gpustat->coati==1.0.0) (0.2.6)\nRequirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (1.3.1)\nRequirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (6.0.4)\nRequirement already satisfied: charset-normalizer<3.0,>=2.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (2.1.1)\nRequirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (4.0.2)\nRequirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (1.8.2)\nRequirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (22.2.0)\nRequirement already satisfied: asynctest==0.13.0 in /opt/conda/lib/python3.7/site-packages 
(from aiohttp->datasets->coati==1.0.0) (0.13.0)\nRequirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (1.3.3)\nRequirement already satisfied: gitdb<5,>=4.0.1 in /opt/conda/lib/python3.7/site-packages (from GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0) (4.0.10)\nRequirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->transformers>=4.20.1->coati==1.0.0) (3.4)\nRequirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->transformers>=4.20.1->coati==1.0.0) (2022.12.7)\nRequirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->transformers>=4.20.1->coati==1.0.0) (1.26.14)\nRequirement already satisfied: paramiko>=2.4 in /opt/conda/lib/python3.7/site-packages (from fabric->colossalai>=0.2.4->coati==1.0.0) (3.1.0)\nRequirement already satisfied: invoke>=2.0 in /opt/conda/lib/python3.7/site-packages (from fabric->colossalai>=0.2.4->coati==1.0.0) (2.1.0)\nRequirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->transformers>=4.20.1->coati==1.0.0) (3.11.0)\nRequirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas->datasets->coati==1.0.0) (2.8.2)\nRequirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas->datasets->coati==1.0.0) (2023.3)\nRequirement already satisfied: virtualenv>=20.10.0 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (20.17.1)\nRequirement already satisfied: cfgv>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (3.3.1)\nRequirement already satisfied: nodeenv>=0.11.1 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (1.7.0)\nRequirement already satisfied: 
identify>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (2.5.23)\nRequirement already satisfied: markdown-it-py<3.0.0,>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.1.0)\nRequirement already satisfied: pygments<3.0.0,>=2.6.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.14.0)\nRequirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.7/site-packages (from sqlalchemy->langchain->coati==1.0.0) (2.0.1)\nRequirement already satisfied: sniffio>=1.1 in /opt/conda/lib/python3.7/site-packages (from anyio<5,>=3.4.0->starlette==0.22.0->fastapi->coati==1.0.0) (1.3.0)\nRequirement already satisfied: smmap<6,>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from gitdb<5,>=4.0.1->GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0) (5.0.0)\nRequirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.7/site-packages (from markdown-it-py<3.0.0,>=2.1.0->rich->colossalai>=0.2.4->coati==1.0.0) (0.1.2)\nRequirement already satisfied: cryptography>=3.3 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (38.0.2)\nRequirement already satisfied: pynacl>=1.5 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.5.0)\nRequirement already satisfied: bcrypt>=3.2 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (4.0.1)\nRequirement already satisfied: platformdirs<3,>=2.4 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (2.6.2)\nRequirement already satisfied: distlib<1,>=0.3.6 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (0.3.6)\nRequirement already satisfied: cffi>=1.12 in /opt/conda/lib/python3.7/site-packages (from 
cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.15.1)\nRequirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (2.21)\nBuilding wheels for collected packages: coati\n  Building wheel for coati (setup.py) ... \u001b[?25ldone\n\u001b[?25h  Created wheel for coati: filename=coati-1.0.0-py3-none-any.whl size=73195 sha256=114e624d66aa7c22966144210804c68036dda419db454e1fb5a6b657d58b5879\n  Stored in directory: /tmp/pip-ephem-wheel-cache-pshxknq0/wheels/19/ab/40/58b2528cfb9dab45fa2cdceeff3538d85c2c72d65872c4de6a\nSuccessfully built coati\nInstalling collected packages: loralib, sse_starlette, langchain, coati\nSuccessfully installed coati-1.0.0 langchain-0.0.27 loralib-0.1.1 sse_starlette-0.10.3\n\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n\u001b[0m","output_type":"stream"}]},{"cell_type":"markdown","source":"We also need to copy the **analyzer** directory into the corresponding **python packages** location of the current system (otherwise you may later get an error saying analyzer cannot be found)\n\nIf you are wondering how to know where the packages path is: it shows up in the output of the various pip install commands executed above.","metadata":{}},{"cell_type":"code","source":"!cp -r /kaggle/working/ColossalAI/colossalai/_analyzer/ /opt/conda/lib/python3.7/site-packages/colossalai/","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:58:28.364761Z","iopub.execute_input":"2023-04-30T13:58:28.365247Z","iopub.status.idle":"2023-04-30T13:58:29.347282Z","shell.execute_reply.started":"2023-04-30T13:58:28.365204Z","shell.execute_reply":"2023-04-30T13:58:29.345935Z"},"trusted":true},"execution_count":5,"outputs":[]},{"cell_type":"markdown","source":"### 2 Downloading the pretrained model (using bloom as an example)","metadata":{}},{"cell_type":"markdown","source":"First, **you need to run this command** (if you are trying this in Colab, you may need it too)\n\nWhat happens if you skip the command below?\n- The model git cloned from Hugging Face appears to have downloaded, but what you actually get are not the real model files (if you check the file sizes, they are only a few bytes)\n- Once the downloaded files are not the real model, running the SFT code fails with: **safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge**","metadata":{}},{"cell_type":"code","source":"!sudo apt-get install git-lfs\n!git lfs install","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:58:29.349334Z","iopub.execute_input":"2023-04-30T13:58:29.349759Z","iopub.status.idle":"2023-04-30T13:58:33.723349Z","shell.execute_reply.started":"2023-04-30T13:58:29.349713Z","shell.execute_reply":"2023-04-30T13:58:33.722173Z"},"trusted":true},"execution_count":6,"outputs":[{"name":"stdout","text":"Reading package lists... Done\nBuilding dependency tree       \nReading state information... Done\ngit-lfs is already the newest version (2.9.2-1).\n0 upgraded, 0 newly installed, 0 to remove and 76 not upgraded.\nUpdated git hooks.\nGit LFS initialized.\n","output_type":"stream"}]},{"cell_type":"markdown","source":"**Download one of the model series supported by ColossalAI**; we take bloomz-560m as an example. Below we place the model in /kaggle/working/, but this is not mandatory; you can change the location as you prefer.","metadata":{}},{"cell_type":"code","source":"os.chdir('/kaggle/working/')\n!git clone https://huggingface.co/bigscience/bloomz-560m","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:58:33.726131Z","iopub.execute_input":"2023-04-30T13:58:33.726802Z","iopub.status.idle":"2023-04-30T13:59:00.114858Z","shell.execute_reply.started":"2023-04-30T13:58:33.726758Z","shell.execute_reply":"2023-04-30T13:59:00.113640Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"Cloning into 'bloomz-560m'...\nremote: Enumerating objects: 1332, done.\u001b[K\nremote: Counting objects: 100% (10/10), done.\u001b[K\nremote: Compressing objects: 100% (10/10), done.\u001b[K\nremote: Total 1332 (delta 3), reused 0 (delta 0), pack-reused 1322\u001b[K\nReceiving objects: 100% (1332/1332), 7.18 MiB | 22.55 MiB/s, done.\nResolving deltas: 100% (616/616), done.\nFiltering content: 100% (8/8), 2.11 GiB | 88.79 MiB/s, 
done.\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### 3 Run SFT","metadata":{}},{"cell_type":"code","source":"os.chdir('/kaggle/working/ColossalAI/applications/Chat/examples')","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:59:00.116823Z","iopub.execute_input":"2023-04-30T13:59:00.117188Z","iopub.status.idle":"2023-04-30T13:59:00.124033Z","shell.execute_reply.started":"2023-04-30T13:59:00.117154Z","shell.execute_reply":"2023-04-30T13:59:00.123036Z"},"trusted":true},"execution_count":8,"outputs":[]},{"cell_type":"markdown","source":"**Run the SFT code**\n\nHere we run the py file directly. Following [the documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples) and running the sh script (!bash train_sft.sh) works as well; either way it ultimately runs this train_sft.py file.\n\nIn the command below, for demonstration purposes\n- we use an extremely small dataset (--dataset) to run the program (only 5 examples in total)\n- model: changed to 'bloom'\n- pretrain: changed to the path of the model we downloaded ourselves\n- save_path: changed to the directory where we want the output\n\n**Things to note:**\n- The Kaggle Notebook GPU is a T4x2, with about 15 GB of VRAM per GPU (roughly 30 GB in total), so we can set **--nproc_per_node=2** for training (with 1, one GPU sits idle and the usable VRAM is halved)\n- Other options, such as LoRA and gradient checkpointing, are worth exploring on your own; see the [official documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples) for more parameter descriptions","metadata":{}},{"cell_type":"code","source":"!torchrun --standalone --nproc_per_node=2 train_sft.py \\\n    --pretrain \"/kaggle/working/bloomz-560m\" \\\n    --model 'bloom' \\\n    --strategy colossalai_zero2 \\\n    --log_interval 50 \\\n    --save_path  \"/kaggle/working/bloomz-560m-finetuned\" \\\n    --dataset \"/kaggle/input/instructdata/instinwild_ch_small.json\" \\\n    --batch_size 4 \\\n    --accumulation_steps 8 \\\n    --lr 2e-5 \\\n    --max_datasets_size 512 \\\n    --max_epochs 
1","metadata":{"execution":{"iopub.status.busy":"2023-04-30T13:59:00.125449Z","iopub.execute_input":"2023-04-30T13:59:00.126251Z","iopub.status.idle":"2023-04-30T14:03:20.357385Z","shell.execute_reply.started":"2023-04-30T13:59:00.126214Z","shell.execute_reply":"2023-04-30T14:03:20.354320Z"},"trusted":true},"execution_count":9,"outputs":[{"name":"stdout","text":"\u001b[2;36m[04/30/23 13:59:19]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/colossalai/c\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35montext/\u001b[0m\u001b[95mparallel_context.py\u001b[0m:\u001b[1;36m522\u001b[0m set_device          \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: process rank \u001b[1;36m0\u001b[0m is  \n\u001b[2;36m                    \u001b[0m         bound to device \u001b[1;36m0\u001b[0m                                  \n\u001b[2;36m[04/30/23 13:59:24]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/colossalai/c\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35montext/\u001b[0m\u001b[95mparallel_context.py\u001b[0m:\u001b[1;36m558\u001b[0m set_seed            \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: initialized seed on\n\u001b[2;36m                    \u001b[0m         rank \u001b[1;36m0\u001b[0m, numpy: \u001b[1;36m42\u001b[0m, python random: \u001b[1;36m42\u001b[0m,              \n\u001b[2;36m                    \u001b[0m         ParallelMode.DATA: \u001b[1;36m42\u001b[0m, ParallelMode.TENSOR: \u001b[1;36m42\u001b[0m,the \n\u001b[2;36m                    
\u001b[0m         default parallel seed is ParallelMode.DATA.        \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/colossalai/\u001b[0m\u001b[95mi\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[95mnitialize.py\u001b[0m:\u001b[1;36m119\u001b[0m launch                            \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Distributed        \n\u001b[2;36m                    \u001b[0m         environment is initialized, data parallel size: \u001b[1;36m1\u001b[0m, \n\u001b[2;36m                    \u001b[0m         pipeline parallel size: \u001b[1;36m1\u001b[0m, tensor parallel size: \u001b[1;36m1\u001b[0m \n\u001b[2;36m[04/30/23 14:03:09]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35mt/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m121\u001b[0m __init__                      \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Loading data\u001b[33m...\u001b[0m    \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35mt/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m123\u001b[0m __init__                      \n\u001b[2;36m                   \u001b[0m\u001b[2;36m 
\u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Loaded \u001b[1;36m6\u001b[0m examples. \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35mt/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m126\u001b[0m __init__                      \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Limiting dataset to\n\u001b[2;36m                    \u001b[0m         \u001b[1;36m512\u001b[0m examples.                                      \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35mt/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m129\u001b[0m __init__                      \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Formatting         \n\u001b[2;36m                    \u001b[0m         inputs\u001b[33m...\u001b[0m                                          \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35mt/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m137\u001b[0m __init__                      \n\u001b[2;36m                   \u001b[0m\u001b[2;36m 
\u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Tokenizing         \n\u001b[2;36m                    \u001b[0m         inputs\u001b[33m...\u001b[0m This may take some time\u001b[33m...\u001b[0m               \nsteps: 0it [00:00, ?it/s]\u001b[2;36m[04/30/23 14:03:13]\u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING:                 \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/coati/traine\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35mr/\u001b[0m\u001b[95msft.py\u001b[0m:\u001b[1;36m86\u001b[0m fit                                    \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING: batch_i\u001b[1;92md:0\u001b[0m,     \n\u001b[2;36m                    \u001b[0m         abnormal loss: \u001b[1;36m3.6484375\u001b[0m                           \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING:                 \n\u001b[2;36m                    \u001b[0m         \u001b[35m/opt/conda/lib/python3.7/site-packages/coati/traine\u001b[0m\n\u001b[2;36m                    \u001b[0m         \u001b[35mr/\u001b[0m\u001b[95msft.py\u001b[0m:\u001b[1;36m86\u001b[0m fit                                    \n\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING: batch_i\u001b[1;92md:1\u001b[0m,     \n\u001b[2;36m                    \u001b[0m         abnormal loss: \u001b[1;36m4.1015625\u001b[0m                           \nsteps: 0it [00:02, ?it/s]\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### 4 
Download the Fine-tuned Model","metadata":{}},{"cell_type":"code","source":"os.chdir('/kaggle/working/')","metadata":{"execution":{"iopub.status.busy":"2023-04-30T14:03:20.361678Z","iopub.execute_input":"2023-04-30T14:03:20.362417Z","iopub.status.idle":"2023-04-30T14:03:20.368009Z","shell.execute_reply.started":"2023-04-30T14:03:20.362368Z","shell.execute_reply":"2023-04-30T14:03:20.367092Z"},"trusted":true},"execution_count":10,"outputs":[]},{"cell_type":"markdown","source":"In theory we could pick the files we want from the panel on the right side of the Kaggle Notebook and download them there. However, the Kaggle UI is not great and **the directory frequently fails to display**. Fortunately, you know where the trained model was saved, so we can use another way to download the files.","metadata":{}},{"cell_type":"markdown","source":"First, we **package the finished model** (so that we can download everything in one go instead of file by file).","metadata":{}},{"cell_type":"code","source":"!tar -czvf bloomz-560m-finetuned.tar.gz bloomz-560m-finetuned","metadata":{"execution":{"iopub.status.busy":"2023-04-30T14:03:20.369504Z","iopub.execute_input":"2023-04-30T14:03:20.369861Z","iopub.status.idle":"2023-04-30T14:04:11.809532Z","shell.execute_reply.started":"2023-04-30T14:03:20.369826Z","shell.execute_reply":"2023-04-30T14:04:11.808272Z"},"trusted":true},"execution_count":11,"outputs":[{"name":"stdout","text":"bloomz-560m-finetuned/\nbloomz-560m-finetuned/config.json\nbloomz-560m-finetuned/pytorch_model.bin\nbloomz-560m-finetuned/tokenizer_config.json\nbloomz-560m-finetuned/generation_config.json\nbloomz-560m-finetuned/special_tokens_map.json\nbloomz-560m-finetuned/tokenizer.json\n","output_type":"stream"}]},{"cell_type":"markdown","source":"Once packaging is done, we **obtain a download link**. After running the command below, just click the link to download.","metadata":{}},{"cell_type":"code","source":"from IPython.display import 
FileLink\nFileLink(r'bloomz-560m-finetuned.tar.gz')","metadata":{"execution":{"iopub.status.busy":"2023-04-30T14:04:11.811207Z","iopub.execute_input":"2023-04-30T14:04:11.811637Z","iopub.status.idle":"2023-04-30T14:04:11.821318Z","shell.execute_reply.started":"2023-04-30T14:04:11.811584Z","shell.execute_reply":"2023-04-30T14:04:11.820184Z"},"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"/kaggle/working/bloomz-560m-finetuned.tar.gz","text/html":"<a href='bloomz-560m-finetuned.tar.gz' target='_blank'>bloomz-560m-finetuned.tar.gz</a><br>"},"metadata":{}}]},{"cell_type":"markdown","source":"### 5 Wrap-up\nGetting it to run is only the beginning. May every one of you eventually get better GPU resources, better data, and an even more impressive language model of your own!","metadata":{}}]}
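The download step at the end of this notebook (`!tar -czvf …` followed by `IPython.display.FileLink`) can be sketched in plain Python. This is a minimal sketch, not the notebook's code: `shutil.make_archive` reproduces the gzip'd tar, and the returned path is what you would hand to `FileLink` inside a Kaggle notebook. The directory names below are placeholders, not the notebook's actual paths.

```python
import pathlib
import shutil
import tempfile

def package_for_download(model_dir: str, archive_base: str) -> str:
    """Bundle a model directory into <archive_base>.tar.gz and return its path.

    Mirrors `!tar -czvf bloomz-560m-finetuned.tar.gz bloomz-560m-finetuned`;
    in a Kaggle notebook you would then display
    IPython.display.FileLink(<returned path>) to get a clickable link.
    """
    model_path = pathlib.Path(model_dir)
    return shutil.make_archive(
        archive_base,                     # output path without the .tar.gz suffix
        "gztar",                          # gzip-compressed tar, same as tar -czvf
        root_dir=str(model_path.parent),  # archive entries are relative to here
        base_dir=model_path.name,         # directory to include in the archive
    )

# Demo with a throwaway directory standing in for the fine-tuned model.
work = pathlib.Path(tempfile.mkdtemp())
(work / "model-finetuned").mkdir()
(work / "model-finetuned" / "config.json").write_text("{}")
archive = package_for_download(str(work / "model-finetuned"),
                               str(work / "model-finetuned"))
print(archive)  # ends with model-finetuned.tar.gz
```

`make_archive` is used here instead of shelling out to `tar` so the same cell works on any worker image; either way, the archive must land under /kaggle/working/ for the link to be downloadable.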


================================================
FILE: 02-colossalai-sft-colab.ipynb
================================================
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4",
      "collapsed_sections": [
        "NY7XsonIG8Ev",
        "bYe-_gnDHEEa"
      ]
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Language Model Playground 02: ColossalAI (SFT Part), Google Colab Edition"
      ],
      "metadata": {
        "id": "AamIytcNnxL1"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "- [https://github.com/hpcaitech/ColossalAI](https://github.com/hpcaitech/ColossalAI)"
      ],
      "metadata": {
        "id": "yAw_VzT6pVwW"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Note:**\n",
        "- This notebook **only demonstrates how to get the SFT part of ColossalAI running on google colab**; it does not cover hyperparameter tuning, result analysis, etc.\n",
        "- If you want to run it in a **Kaggle Notebook**, see the Kaggle version of this article: [how GPU-poor folks can try ColossalAI SFT (Kaggle edition)](https://mp.weixin.qq.com/s/Q29uSNxvPMy0rC-QxHiGZA)\n",
        "- **If you have your own machine**, this notebook probably won't help you much (since you don't need to train in a notebook)\n",
        "- The intended audience is people who have no GPU resources of their own but still want to get familiar with ColossalAI and give it a light try\n",
        "- The free compute Google Colab currently offers is a **T4 (about 15 GB of VRAM)**\n",
        "- **Sessions that run too long are terminated automatically**, so save important results to a path inside your own google drive; less important files can simply go to the default directories (they are deleted automatically after termination)"
      ],
      "metadata": {
        "id": "hmoMi8BspejU"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Preparing the data:**\n",
        "- As noted in [the official documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples), the data must be prepared before running\n",
        "- The data can be [downloaded here](https://github.com/XueFuzhao/InstructionWild/tree/main/data). Do not download the seed files (they contain only instructions, no responses); download the json files mentioned in the README instead, e.g. instinwild_ch.json"
      ],
      "metadata": {
        "id": "lvubgzEuGch2"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Setting Up the Environment"
      ],
      "metadata": {
        "id": "zN_47P5fG3c_"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 1 Install ColossalAI"
      ],
      "metadata": {
        "id": "NY7XsonIG8Ev"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "PhpvqFt06DSm",
        "outputId": "62102879-51e9-4014-a248-7e2020d83ded"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Cloning into 'ColossalAI'...\n",
            "remote: Enumerating objects: 26116, done.\u001b[K\n",
            "remote: Counting objects: 100% (5/5), done.\u001b[K\n",
            "remote: Compressing objects: 100% (5/5), done.\u001b[K\n",
            "remote: Total 26116 (delta 0), reused 0 (delta 0), pack-reused 26111\u001b[K\n",
            "Receiving objects: 100% (26116/26116), 23.36 MiB | 34.12 MiB/s, done.\n",
            "Resolving deltas: 100% (17448/17448), done.\n"
          ]
        }
      ],
      "source": [
        "!git clone https://github.com/hpcaitech/ColossalAI.git"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "os.chdir('./ColossalAI')\n",
        "!pip install ."
      ],
      "metadata": {
        "id": "kWmQqVbu6Ncc",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "7554b306-038f-4b06-a5e0-a7734c8e8945"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Processing /content/ColossalAI\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (1.22.4)\n",
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (4.65.0)\n",
            "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (5.9.5)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (23.1)\n",
            "Collecting pre-commit (from colossalai==0.2.8)\n",
            "  Downloading pre_commit-3.3.2-py2.py3-none-any.whl (202 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m202.8/202.8 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (13.3.4)\n",
            "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (8.1.3)\n",
            "Collecting fabric (from colossalai==0.2.8)\n",
            "  Downloading fabric-3.0.1-py3-none-any.whl (53 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.3/53.3 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting contexttimer (from colossalai==0.2.8)\n",
            "  Downloading contexttimer-0.3.3.tar.gz (4.9 kB)\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "Collecting ninja (from colossalai==0.2.8)\n",
            "  Downloading ninja-1.11.1-py2.py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (145 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m146.0/146.0 kB\u001b[0m \u001b[31m21.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: torch>=1.11 in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (2.0.1+cu118)\n",
            "Collecting safetensors (from colossalai==0.2.8)\n",
            "  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m39.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (3.12.0)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (4.5.0)\n",
            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (1.11.1)\n",
            "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (3.1)\n",
            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (3.1.2)\n",
            "Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (2.0.0)\n",
            "Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.11->colossalai==0.2.8) (3.25.2)\n",
            "Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.11->colossalai==0.2.8) (16.0.5)\n",
            "Collecting invoke>=2.0 (from fabric->colossalai==0.2.8)\n",
            "  Downloading invoke-2.1.2-py3-none-any.whl (160 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m160.1/160.1 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting paramiko>=2.4 (from fabric->colossalai==0.2.8)\n",
            "  Downloading paramiko-3.1.0-py3-none-any.whl (211 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.2/211.2 kB\u001b[0m \u001b[31m26.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting cfgv>=2.0.0 (from pre-commit->colossalai==0.2.8)\n",
            "  Downloading cfgv-3.3.1-py2.py3-none-any.whl (7.3 kB)\n",
            "Collecting identify>=1.0.0 (from pre-commit->colossalai==0.2.8)\n",
            "  Downloading identify-2.5.24-py2.py3-none-any.whl (98 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.8/98.8 kB\u001b[0m \u001b[31m13.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting nodeenv>=0.11.1 (from pre-commit->colossalai==0.2.8)\n",
            "  Downloading nodeenv-1.8.0-py2.py3-none-any.whl (22 kB)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai==0.2.8) (6.0)\n",
            "Collecting virtualenv>=20.10.0 (from pre-commit->colossalai==0.2.8)\n",
            "  Downloading virtualenv-20.23.0-py3-none-any.whl (3.3 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.3/3.3 MB\u001b[0m \u001b[31m81.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: markdown-it-py<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai==0.2.8) (2.2.0)\n",
            "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai==0.2.8) (2.14.0)\n",
            "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py<3.0.0,>=2.2.0->rich->colossalai==0.2.8) (0.1.2)\n",
            "Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nodeenv>=0.11.1->pre-commit->colossalai==0.2.8) (67.7.2)\n",
            "Collecting bcrypt>=3.2 (from paramiko>=2.4->fabric->colossalai==0.2.8)\n",
            "  Downloading bcrypt-4.0.1-cp36-abi3-manylinux_2_28_x86_64.whl (593 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m593.7/593.7 kB\u001b[0m \u001b[31m54.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: cryptography>=3.3 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai==0.2.8) (40.0.2)\n",
            "Collecting pynacl>=1.5 (from paramiko>=2.4->fabric->colossalai==0.2.8)\n",
            "  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m856.7/856.7 kB\u001b[0m \u001b[31m67.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting distlib<1,>=0.3.6 (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8)\n",
            "  Downloading distlib-0.3.6-py2.py3-none-any.whl (468 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m468.5/468.5 kB\u001b[0m \u001b[31m50.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: platformdirs<4,>=3.2 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (3.3.0)\n",
            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11->colossalai==0.2.8) (2.1.2)\n",
            "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11->colossalai==0.2.8) (1.3.0)\n",
            "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/dist-packages (from cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (1.15.1)\n",
            "Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (2.21)\n",
            "Building wheels for collected packages: colossalai, contexttimer\n",
            "  Building wheel for colossalai (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for colossalai: filename=colossalai-0.2.8-py3-none-any.whl size=1106778 sha256=c48863e1d61e5a17a6d7d3fbd80022a7990e07e012015591dd40042ddb3b2760\n",
            "  Stored in directory: /tmp/pip-ephem-wheel-cache-bobgacrn/wheels/b1/fb/16/e46aa3127ee272b8cac710c8f76aa02445d96aaeed9da956ea\n",
            "  Building wheel for contexttimer (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for contexttimer: filename=contexttimer-0.3.3-py3-none-any.whl size=5803 sha256=21ac5cf4b74bf53b9a96c4f8705106a7107d03a3c3f118b2515e3f7419170e06\n",
            "  Stored in directory: /root/.cache/pip/wheels/72/1c/da/cfd97201d88ccce214427fa84a5caeb91fef7c5a1b4c4312b4\n",
            "Successfully built colossalai contexttimer\n",
            "Installing collected packages: safetensors, ninja, distlib, contexttimer, virtualenv, nodeenv, invoke, identify, cfgv, bcrypt, pynacl, pre-commit, paramiko, fabric, colossalai\n",
            "Successfully installed bcrypt-4.0.1 cfgv-3.3.1 colossalai-0.2.8 contexttimer-0.3.3 distlib-0.3.6 fabric-3.0.1 identify-2.5.24 invoke-2.1.2 ninja-1.11.1 nodeenv-1.8.0 paramiko-3.1.0 pre-commit-3.3.2 pynacl-1.5.0 safetensors-0.3.1 virtualenv-20.23.0\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 2 Install transformers\n",
        "We install the transformers fork under hpcaitech here; whether a plain pip install transformers would also work has not been tested."
      ],
      "metadata": {
        "id": "bYe-_gnDHEEa"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!git clone https://github.com/hpcaitech/transformers\n",
        "os.chdir('./transformers')\n",
        "!pip install ."
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "PJTkDrHUhFpP",
        "outputId": "11383ade-42af-4f76-ab3e-f47f955a863d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Cloning into 'transformers'...\n",
            "remote: Enumerating objects: 124468, done.\u001b[K\n",
            "remote: Total 124468 (delta 0), reused 0 (delta 0), pack-reused 124468\u001b[K\n",
            "Receiving objects: 100% (124468/124468), 127.28 MiB | 27.99 MiB/s, done.\n",
            "Resolving deltas: 100% (93320/93320), done.\n",
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Processing /content/ColossalAI/transformers\n",
            "  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
            "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
            "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (3.12.0)\n",
            "Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0.dev0)\n",
            "  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m224.5/224.5 kB\u001b[0m \u001b[31m6.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (1.22.4)\n",
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (23.1)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (6.0)\n",
            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (2022.10.31)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (2.27.1)\n",
            "Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0.dev0)\n",
            "  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m64.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (4.65.0)\n",
            "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.28.0.dev0) (2023.4.0)\n",
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.28.0.dev0) (4.5.0)\n",
            "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (1.26.15)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (2022.12.7)\n",
            "Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (2.0.12)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (3.4)\n",
            "Building wheels for collected packages: transformers\n",
            "  Building wheel for transformers (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for transformers: filename=transformers-4.28.0.dev0-py3-none-any.whl size=6790611 sha256=92968d77f2b7dc7aa1557f1ab5447bb5ddbf00b2936e257790df7e00d1659f10\n",
            "  Stored in directory: /tmp/pip-ephem-wheel-cache-pyix5uo5/wheels/87/f3/6f/b220a07b1eb427c5c698eed3338325ec784fe66427d0989fa6\n",
            "Successfully built transformers\n",
            "Installing collected packages: tokenizers, huggingface-hub, transformers\n",
            "Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.0.dev0\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 3 Install the Libraries Required by Chat"
      ],
      "metadata": {
        "id": "Y_KsSNVYHM_I"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "os.chdir('../applications/Chat/')\n",
        "!pip install ."
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "F73wlOCahZRz",
        "outputId": "e6c92639-0202-47d4-d917-637dc8ba064e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Processing /content/ColossalAI/applications/Chat\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "Requirement already satisfied: transformers>=4.20.1 in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (4.28.0.dev0)\n",
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (4.65.0)\n",
            "Collecting datasets (from coati==1.0.0)\n",
            "  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m474.6/474.6 kB\u001b[0m \u001b[31m10.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting loralib (from coati==1.0.0)\n",
            "  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)\n",
            "Requirement already satisfied: colossalai>=0.2.4 in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (0.2.8)\n",
            "Collecting torch<2.0.0,>=1.12.1 (from coati==1.0.0)\n",
            "  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m887.5/887.5 MB\u001b[0m \u001b[31m1.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting langchain (from coati==1.0.0)\n",
            "  Downloading langchain-0.0.178-py3-none-any.whl (892 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m892.2/892.2 kB\u001b[0m \u001b[31m15.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: tokenizers in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (0.13.3)\n",
            "Collecting fastapi (from coati==1.0.0)\n",
            "  Downloading fastapi-0.95.2-py3-none-any.whl (56 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m57.0/57.0 kB\u001b[0m \u001b[31m7.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting sse_starlette (from coati==1.0.0)\n",
            "  Downloading sse_starlette-1.6.1-py3-none-any.whl (9.6 kB)\n",
            "Collecting wandb (from coati==1.0.0)\n",
            "  Downloading wandb-0.15.3-py3-none-any.whl (2.0 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m17.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting sentencepiece (from coati==1.0.0)\n",
            "  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m16.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting gpustat (from coati==1.0.0)\n",
            "  Downloading gpustat-1.1.tar.gz (97 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m97.9/97.9 kB\u001b[0m \u001b[31m13.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
            "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
            "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (1.22.4)\n",
            "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (5.9.5)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (23.1)\n",
            "Requirement already satisfied: pre-commit in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (3.3.2)\n",
            "Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (13.3.4)\n",
            "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (8.1.3)\n",
            "Requirement already satisfied: fabric in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (3.0.1)\n",
            "Requirement already satisfied: contexttimer in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.3)\n",
            "Requirement already satisfied: ninja in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (1.11.1)\n",
            "Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.1)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch<2.0.0,>=1.12.1->coati==1.0.0) (4.5.0)\n",
            "Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\n",
            "  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m849.3/849.3 kB\u001b[0m \u001b[31m16.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting nvidia-cudnn-cu11==8.5.0.96 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\n",
            "  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m557.1/557.1 MB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting nvidia-cublas-cu11==11.10.3.66 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\n",
            "  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m317.1/317.1 MB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\n",
            "  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.0/21.0 MB\u001b[0m \u001b[31m26.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2.0.0,>=1.12.1->coati==1.0.0) (67.7.2)\n",
            "Requirement already satisfied: wheel in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2.0.0,>=1.12.1->coati==1.0.0) (0.40.0)\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (3.12.0)\n",
            "Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (0.14.1)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (6.0)\n",
            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (2022.10.31)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (2.27.1)\n",
            "Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets->coati==1.0.0) (9.0.0)\n",
            "Collecting dill<0.3.7,>=0.3.0 (from datasets->coati==1.0.0)\n",
            "  Downloading dill-0.3.6-py3-none-any.whl (110 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m110.5/110.5 kB\u001b[0m \u001b[31m13.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets->coati==1.0.0) (1.5.3)\n",
            "Collecting xxhash (from datasets->coati==1.0.0)\n",
            "  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.5/212.5 kB\u001b[0m \u001b[31m20.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting multiprocess (from datasets->coati==1.0.0)\n",
            "  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.3/134.3 kB\u001b[0m \u001b[31m16.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: fsspec[http]>=2021.11.1 in /usr/local/lib/python3.10/dist-packages (from datasets->coati==1.0.0) (2023.4.0)\n",
            "Collecting aiohttp (from datasets->coati==1.0.0)\n",
            "  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/1.0 MB\u001b[0m \u001b[31m28.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting responses<0.19 (from datasets->coati==1.0.0)\n",
            "  Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
            "Requirement already satisfied: pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2 in /usr/local/lib/python3.10/dist-packages (from fastapi->coati==1.0.0) (1.10.7)\n",
            "Collecting starlette<0.28.0,>=0.27.0 (from fastapi->coati==1.0.0)\n",
            "  Downloading starlette-0.27.0-py3-none-any.whl (66 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.0/67.0 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting nvidia-ml-py>=11.450.129 (from gpustat->coati==1.0.0)\n",
            "  Downloading nvidia_ml_py-11.525.112-py3-none-any.whl (35 kB)\n",
            "Collecting blessed>=1.17.1 (from gpustat->coati==1.0.0)\n",
            "  Downloading blessed-1.20.0-py2.py3-none-any.whl (58 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.4/58.4 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: SQLAlchemy<3,>=1.4 in /usr/local/lib/python3.10/dist-packages (from langchain->coati==1.0.0) (2.0.10)\n",
            "Collecting async-timeout<5.0.0,>=4.0.0 (from langchain->coati==1.0.0)\n",
            "  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)\n",
            "Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain->coati==1.0.0)\n",
            "  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)\n",
            "Requirement already satisfied: numexpr<3.0.0,>=2.8.4 in /usr/local/lib/python3.10/dist-packages (from langchain->coati==1.0.0) (2.8.4)\n",
            "Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain->coati==1.0.0)\n",
            "  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m90.0/90.0 kB\u001b[0m \u001b[31m12.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: tenacity<9.0.0,>=8.1.0 in /usr/local/lib/python3.10/dist-packages (from langchain->coati==1.0.0) (8.2.2)\n",
            "Collecting GitPython!=3.1.29,>=1.0.0 (from wandb->coati==1.0.0)\n",
            "  Downloading GitPython-3.1.31-py3-none-any.whl (184 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m184.3/184.3 kB\u001b[0m \u001b[31m21.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting sentry-sdk>=1.0.0 (from wandb->coati==1.0.0)\n",
            "  Downloading sentry_sdk-1.24.0-py2.py3-none-any.whl (206 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m206.5/206.5 kB\u001b[0m \u001b[31m18.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting docker-pycreds>=0.4.0 (from wandb->coati==1.0.0)\n",
            "  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)\n",
            "Collecting pathtools (from wandb->coati==1.0.0)\n",
            "  Downloading pathtools-0.1.2.tar.gz (11 kB)\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "Collecting setproctitle (from wandb->coati==1.0.0)\n",
            "  Downloading setproctitle-1.3.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)\n",
            "Requirement already satisfied: appdirs>=1.4.3 in /usr/local/lib/python3.10/dist-packages (from wandb->coati==1.0.0) (1.4.4)\n",
            "Requirement already satisfied: protobuf!=4.21.0,<5,>=3.19.0 in /usr/local/lib/python3.10/dist-packages (from wandb->coati==1.0.0) (3.20.3)\n",
            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->coati==1.0.0) (23.1.0)\n",
            "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->coati==1.0.0) (2.0.12)\n",
            "Collecting multidict<7.0,>=4.5 (from aiohttp->datasets->coati==1.0.0)\n",
            "  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m114.5/114.5 kB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting yarl<2.0,>=1.0 (from aiohttp->datasets->coati==1.0.0)\n",
            "  Downloading yarl-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (268 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m268.8/268.8 kB\u001b[0m \u001b[31m26.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting frozenlist>=1.1.1 (from aiohttp->datasets->coati==1.0.0)\n",
            "  Downloading frozenlist-1.3.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (149 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m149.6/149.6 kB\u001b[0m \u001b[31m19.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting aiosignal>=1.1.2 (from aiohttp->datasets->coati==1.0.0)\n",
            "  Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB)\n",
            "Requirement already satisfied: wcwidth>=0.1.4 in /usr/local/lib/python3.10/dist-packages (from blessed>=1.17.1->gpustat->coati==1.0.0) (0.2.6)\n",
            "Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from blessed>=1.17.1->gpustat->coati==1.0.0) (1.16.0)\n",
            "Collecting marshmallow<4.0.0,>=3.3.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\n",
            "  Downloading marshmallow-3.19.0-py3-none-any.whl (49 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.1/49.1 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting marshmallow-enum<2.0.0,>=1.5.1 (from dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\n",
            "  Downloading marshmallow_enum-1.5.1-py2.py3-none-any.whl (4.2 kB)\n",
            "Collecting typing-inspect>=0.4.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\n",
            "  Downloading typing_inspect-0.8.0-py3-none-any.whl (8.7 kB)\n",
            "Collecting gitdb<5,>=4.0.1 (from GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0)\n",
            "  Downloading gitdb-4.0.10-py3-none-any.whl (62 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m62.7/62.7 kB\u001b[0m \u001b[31m8.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.20.1->coati==1.0.0) (1.26.15)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.20.1->coati==1.0.0) (2022.12.7)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.20.1->coati==1.0.0) (3.4)\n",
            "Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy<3,>=1.4->langchain->coati==1.0.0) (2.0.2)\n",
            "Requirement already satisfied: anyio<5,>=3.4.0 in /usr/local/lib/python3.10/dist-packages (from starlette<0.28.0,>=0.27.0->fastapi->coati==1.0.0) (3.6.2)\n",
            "Requirement already satisfied: invoke>=2.0 in /usr/local/lib/python3.10/dist-packages (from fabric->colossalai>=0.2.4->coati==1.0.0) (2.1.2)\n",
            "Requirement already satisfied: paramiko>=2.4 in /usr/local/lib/python3.10/dist-packages (from fabric->colossalai>=0.2.4->coati==1.0.0) (3.1.0)\n",
            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->coati==1.0.0) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->coati==1.0.0) (2022.7.1)\n",
            "Requirement already satisfied: cfgv>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (3.3.1)\n",
            "Requirement already satisfied: identify>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (2.5.24)\n",
            "Requirement already satisfied: nodeenv>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (1.8.0)\n",
            "Requirement already satisfied: virtualenv>=20.10.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (20.23.0)\n",
            "Requirement already satisfied: markdown-it-py<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.2.0)\n",
            "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.14.0)\n",
            "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette<0.28.0,>=0.27.0->fastapi->coati==1.0.0) (1.3.0)\n",
            "Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0)\n",
            "  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)\n",
            "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py<3.0.0,>=2.2.0->rich->colossalai>=0.2.4->coati==1.0.0) (0.1.2)\n",
            "Requirement already satisfied: bcrypt>=3.2 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (4.0.1)\n",
            "Requirement already satisfied: cryptography>=3.3 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (40.0.2)\n",
            "Requirement already satisfied: pynacl>=1.5 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.5.0)\n",
            "Collecting mypy-extensions>=0.3.0 (from typing-inspect>=0.4.0->dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\n",
            "  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)\n",
            "Requirement already satisfied: distlib<1,>=0.3.6 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (0.3.6)\n",
            "Requirement already satisfied: platformdirs<4,>=3.2 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (3.3.0)\n",
            "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/dist-packages (from cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.15.1)\n",
            "Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (2.21)\n",
            "Building wheels for collected packages: coati, gpustat, pathtools\n",
            "  Building wheel for coati (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for coati: filename=coati-1.0.0-py3-none-any.whl size=75334 sha256=13d5d09bdbb4b2b47045312ca337e68b88dfd2044087fbaf0c05bd593c1b4a35\n",
            "  Stored in directory: /tmp/pip-ephem-wheel-cache-tez1zypt/wheels/49/ba/eb/98b39707d3bcca1d3ecf646b531cdb25f480bd44ec5c0edafb\n",
            "  Building wheel for gpustat (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for gpustat: filename=gpustat-1.1-py3-none-any.whl size=26280 sha256=a748ad4da7967293e3504c25de991a95a408f0d9a9caf2a91b7d4e09b0ee58fd\n",
            "  Stored in directory: /root/.cache/pip/wheels/ee/d0/2c/1e02440645c2318ba03aea99993a44a9108dc8f74de0bd370b\n",
            "  Building wheel for pathtools (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for pathtools: filename=pathtools-0.1.2-py3-none-any.whl size=8791 sha256=48ee994acbcb9e407204168f303fd9c252c0f7f64f4948b689f062e60891e7e3\n",
            "  Stored in directory: /root/.cache/pip/wheels/e7/f3/22/152153d6eb222ee7a56ff8617d80ee5207207a8c00a7aab794\n",
            "Successfully built coati gpustat pathtools\n",
            "Installing collected packages: sentencepiece, pathtools, nvidia-ml-py, xxhash, smmap, setproctitle, sentry-sdk, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cublas-cu11, mypy-extensions, multidict, marshmallow, loralib, frozenlist, docker-pycreds, dill, blessed, async-timeout, yarl, typing-inspect, starlette, responses, openapi-schema-pydantic, nvidia-cudnn-cu11, multiprocess, marshmallow-enum, gpustat, gitdb, aiosignal, torch, sse_starlette, GitPython, fastapi, dataclasses-json, aiohttp, wandb, langchain, datasets, coati\n",
            "  Attempting uninstall: torch\n",
            "    Found existing installation: torch 2.0.1+cu118\n",
            "    Uninstalling torch-2.0.1+cu118:\n",
            "      Successfully uninstalled torch-2.0.1+cu118\n",
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\n",
            "torchdata 0.6.1 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\n",
            "torchtext 0.15.2 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\n",
            "torchvision 0.15.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0mSuccessfully installed GitPython-3.1.31 aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 blessed-1.20.0 coati-1.0.0 dataclasses-json-0.5.7 datasets-2.12.0 dill-0.3.6 docker-pycreds-0.4.0 fastapi-0.95.2 frozenlist-1.3.3 gitdb-4.0.10 gpustat-1.1 langchain-0.0.178 loralib-0.1.1 marshmallow-3.19.0 marshmallow-enum-1.5.1 multidict-6.0.4 multiprocess-0.70.14 mypy-extensions-1.0.0 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 nvidia-ml-py-11.525.112 openapi-schema-pydantic-1.2.4 pathtools-0.1.2 responses-0.18.0 sentencepiece-0.1.99 sentry-sdk-1.24.0 setproctitle-1.3.2 smmap-5.0.0 sse_starlette-1.6.1 starlette-0.27.0 torch-1.13.1 typing-inspect-0.8.0 wandb-0.15.3 xxhash-3.2.0 yarl-1.9.2\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Downloading the pretrained model (using bloom as an example)"
      ],
      "metadata": {
        "id": "qbtHtV4pHXQ4"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "os.chdir('/content/ColossalAI/applications/Chat/examples')"
      ],
      "metadata": {
        "id": "WxRdvkAvhlJW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "What happens if you do not run the commands below?\n",
        "- The model cloned from Hugging Face with git clone appears to have been downloaded, but the downloaded files are not the actual model weights (if you check the file sizes, they are only a few bytes)\n",
        "- If the downloaded files are not the real model, running the SFT code will fail with: **safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge**"
      ],
      "metadata": {
        "id": "GdlFgQzmHo2t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!sudo apt-get install git-lfs\n",
        "!git lfs install"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "C4Pu4RmJiEK6",
        "outputId": "e49226b3-fd13-4f48-9f41-43f23fa017d8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "git-lfs is already the newest version (2.9.2-1).\n",
            "0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.\n",
            "Updated git hooks.\n",
            "Git LFS initialized.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Download one of the model series supported by ColossalAI.** We use bloomz-560m as an example. Below, we place the model in the current directory (ColossalAI/applications/Chat/examples), but this is not mandatory; you can change the location as you prefer."
      ],
      "metadata": {
        "id": "9JeE6R7QH4dJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!git clone https://huggingface.co/bigscience/bloomz-560m"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3xUK88K7iydi",
        "outputId": "90626fc0-dd6c-46d9-d68b-08289ae034f6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Cloning into 'bloomz-560m'...\n",
            "remote: Enumerating objects: 1332, done.\u001b[K\n",
            "remote: Counting objects: 100% (10/10), done.\u001b[K\n",
            "remote: Compressing objects: 100% (7/7), done.\u001b[K\n",
            "remote: Total 1332 (delta 3), reused 10 (delta 3), pack-reused 1322\u001b[K\n",
            "Receiving objects: 100% (1332/1332), 7.18 MiB | 22.90 MiB/s, done.\n",
            "Resolving deltas: 100% (616/616), done.\n",
            "Filtering content: 100% (8/8), 2.11 GiB | 57.58 MiB/s, done.\n"
          ]
        }
      ]
    },
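    {
      "cell_type": "code",
      "source": [
        "# Optional sanity check (a sketch added for illustration, not part of the original run):\n",
        "# without git-lfs, the cloned *.bin/*.safetensors files are only a few bytes\n",
        "# (LFS pointer files); the real weights should be hundreds of MB.\n",
        "import os\n",
        "for name in os.listdir('./bloomz-560m'):\n",
        "    path = os.path.join('./bloomz-560m', name)\n",
        "    if os.path.isfile(path):\n",
        "        print(name, os.path.getsize(path), 'bytes')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },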
    {
      "cell_type": "markdown",
      "source": [
        "## Running SFT"
      ],
      "metadata": {
        "id": "wnZ6xjQfIEpL"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here we run the .py file directly. If you follow the [documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples) and run the shell script (!bash train_sft.sh) instead, that also works; under the hood, both run the same train_sft.py file.\n",
        "\n",
        "**Where the trained model is saved (save_path):** the example path below is a temporary path that becomes invalid once the notebook stops running. So either download the trained model before the notebook stops, or consider setting this to a directory on Google Drive.\n",
        "\n",
        "In the command below, for demonstration purposes:\n",
        "- we use a very, very small dataset (--dataset) to run the program (it contains only 5 samples)\n",
        "- model: changed to \"bloom\"\n",
        "- pretrain: changed to the path of the model we downloaded ourselves\n",
        "- save_path: changed to the directory where we want the model saved\n",
        "- other arguments may require your own exploration, e.g. LoRA, gradient checkpointing, etc. See the [official documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples) for more details on the arguments"
      ],
      "metadata": {
        "id": "9w6QPTmIIr8J"
      }
    },
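    {
      "cell_type": "code",
      "source": [
        "# Optional (a sketch, assuming you run this on Colab and want to keep the trained model):\n",
        "# mount Google Drive and point --save_path at a Drive directory, e.g.\n",
        "# --save_path \"/content/drive/MyDrive/bloomz-560m-finetuned\"\n",
        "from google.colab import drive\n",
        "drive.mount('/content/drive')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },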
    {
      "cell_type": "code",
      "source": [
        "!torchrun --standalone --nproc_per_node=1 train_sft.py \\\n",
        "    --pretrain \"./bloomz-560m\" \\\n",
        "    --model 'bloom' \\\n",
        "    --strategy colossalai_zero2 \\\n",
        "    --log_interval 50 \\\n",
        "    --save_path  \"./bloomz-560m-finetuned\" \\\n",
        "    --dataset \"./instinwild_ch_small.json\" \\\n",
        "    --batch_size 4 \\\n",
        "    --accumulation_steps 8 \\\n",
        "    --lr 2e-5 \\\n",
        "    --max_datasets_size 512 \\\n",
        "    --max_epochs 1"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WVu629ZLjf6G",
        "outputId": "6fd1ef2f-6483-42a4-b43a-65f9181bac2c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "2023-05-24 11:46:01.994320: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
            "\u001b[2;36m[05/24/23 11:46:05]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/colossalai/\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35mcontext/\u001b[0m\u001b[95mparallel_context.py\u001b[0m:\u001b[1;36m522\u001b[0m set_device         \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: process rank \u001b[1;36m0\u001b[0m is  \n",
            "\u001b[2;36m                    \u001b[0m         bound to device \u001b[1;36m0\u001b[0m                                  \n",
            "\u001b[2;36m[05/24/23 11:46:13]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/colossalai/\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35mcontext/\u001b[0m\u001b[95mparallel_context.py\u001b[0m:\u001b[1;36m558\u001b[0m set_seed           \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: initialized seed on\n",
            "\u001b[2;36m                    \u001b[0m         rank \u001b[1;36m0\u001b[0m, numpy: \u001b[1;36m42\u001b[0m, python random: \u001b[1;36m42\u001b[0m,              \n",
            "\u001b[2;36m                    \u001b[0m         ParallelMode.DATA: \u001b[1;36m42\u001b[0m, ParallelMode.TENSOR: \u001b[1;36m42\u001b[0m,the \n",
            "\u001b[2;36m                    \u001b[0m         default parallel seed is ParallelMode.DATA.        \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/colossalai/\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[95minitialize.py\u001b[0m:\u001b[1;36m115\u001b[0m launch                           \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Distributed        \n",
            "\u001b[2;36m                    \u001b[0m         environment is initialized, data parallel size: \u001b[1;36m1\u001b[0m, \n",
            "\u001b[2;36m                    \u001b[0m         pipeline parallel size: \u001b[1;36m1\u001b[0m, tensor parallel size: \u001b[1;36m1\u001b[0m \n",
            "/usr/local/lib/python3.10/dist-packages/colossalai/kernel/op_builder/utils.py:94: UserWarning: [extension] The CUDA version on the system (11.8) does not match with the version (11.7) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions\n",
            "  warnings.warn(\n",
            "/usr/local/lib/python3.10/dist-packages/colossalai/kernel/op_builder/utils.py:94: UserWarning: [extension] The CUDA version on the system (11.8) does not match with the version (11.7) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions\n",
            "  warnings.warn(\n",
            "\u001b[2;36m[05/24/23 11:52:04]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35met/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m121\u001b[0m __init__                     \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Loading data\u001b[33m...\u001b[0m    \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35met/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m123\u001b[0m __init__                     \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Loaded \u001b[1;36m6\u001b[0m examples. \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35met/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m126\u001b[0m __init__                     \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Limiting dataset to\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[1;36m512\u001b[0m examples.                                      \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35met/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m129\u001b[0m __init__                     \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Formatting         \n",
            "\u001b[2;36m                    \u001b[0m         inputs\u001b[33m...\u001b[0m                                          \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO:                    \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35met/\u001b[0m\u001b[95msft_dataset.py\u001b[0m:\u001b[1;36m137\u001b[0m __init__                     \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO    \u001b[0m colossalai - colossalai - INFO: Tokenizing         \n",
            "\u001b[2;36m                    \u001b[0m         inputs\u001b[33m...\u001b[0m This may take some time\u001b[33m...\u001b[0m               \n",
            "steps: 0it [00:00, ?it/s]\u001b[2;36m[05/24/23 11:52:09]\u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING:                 \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/coati/train\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35mer/\u001b[0m\u001b[95msft.py\u001b[0m:\u001b[1;36m86\u001b[0m fit                                   \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING: batch_i\u001b[1;92md:0\u001b[0m,     \n",
            "\u001b[2;36m                    \u001b[0m         abnormal loss: \u001b[1;36m3.6484375\u001b[0m                           \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING:                 \n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35m/usr/local/lib/python3.10/dist-packages/coati/train\u001b[0m\n",
            "\u001b[2;36m                    \u001b[0m         \u001b[35mer/\u001b[0m\u001b[95msft.py\u001b[0m:\u001b[1;36m86\u001b[0m fit                                   \n",
            "\u001b[2;36m                   \u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m colossalai - colossalai - WARNING: batch_i\u001b[1;92md:1\u001b[0m,     \n",
            "\u001b[2;36m                    \u001b[0m         abnormal loss: \u001b[1;36m4.109375\u001b[0m                            \n",
            "steps: 0it [00:03, ?it/s]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Wrap-up\n",
        "Getting the example to run is only the beginning. May every reader eventually get better GPU resources, better data, and an even better language model of their own!"
      ],
      "metadata": {
        "id": "RQKDNZogLUNs"
      }
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "H7HM6E5VLW2u"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}


================================================
FILE: README.md
================================================
# 开源语言模型百宝袋 (Ver. 3.6)
Open-Source Language Model Pocket

**Note**: This file is very long, so browsing it directly on the GitHub web page may truncate the content (making parts of it unsearchable). We recommend downloading it locally, or uploading it to a language-model assistant for reference.

**Github**: https://github.com/createmomo/Open-Source-Language-Model-Pocket

## 开源模型一览 (Table of Contents)

*中文友好或国内主创的开源模型(Chinese Open Source Language Models)*

|Multi-domain / General|||
|---|---|---|
|百川|中文Alpaca Luotuo|中文LLaMA&Alpaca大模型|
|中文LLaMA&Alpaca大模型2|流萤Firefly|凤凰|
|复旦MOSS|复旦MOSS-RLHF|悟道·天鹰Aquila&Aquila2|
|雅意大模型| 通义千问Qwen| 活字3.0|
| Anima |BayLing|BELLE|
|Bloom|BiLLa |BLOOMChat176B|
|Chinese-Llama-2-7b (LinkSoul-AI) |GPT2 for Multiple Language |InternLM 书生・浦语|
|Llama2-chat-Chinese-50W|Llama2-Chinese (FlagAlpha) |Linly伶荔说 中文 LLaMA1-2 & OpenLLaMA & Falcon 大模型 |
|ChatRWKV|ChatYuan|ChatGLM-6B|
|ChatGLM2-6B|Chinese-Transformer-XL|OpenKG-KnowLLM |
|PromptCLUE|SkyText-Chinese-GPT3|CPM-Bee|
|TigerBot|XVERSE-13B|YuLan-Chat & YuLan-Chat-2|
|Ziya-LLaMA |TechGPT|EVA|
|FLM-101B|TinyLlama|Colossal-LLaMA-2|
|OpenBA (Encoder-Decoder)|Ziya-Reader-13B|Firefly-LLaMA2-Chinese|
|MindLLM|ChatGLM3|Skywork大模型|
|Yi-6B/34B(零一万物)|Nanbeige-16B(南北阁-16B)|OrionStar-Yi-34B-Chat|
|源2.0|TechGPT2.0|SUS-Chat-34B|
|Alaya 元识|OpenBuddy|MiniGPT4Qwen|
|ChatLM-Chinese-0.2B|YAYI 2|DeepSeek LLM&MoE|
|MachineMindset(MBTI)|星辰语义(电信)|Chinese-Mixtral-8x7B|
|Baby-Llama2-Chinese|XVERSE-13B-256K|Eagle 7B(RWKV-v5)|
|iFlytekSpark-13B|MiniCPM|通义千问Qwen1.5|
|RethinkTinyLM|Chinese-Mixtral|RWKV_Pytorch|
|Qwen1.5-MoE-A2.7B|Symbol-LLM|Qwen1.5-32b|
|build_MiniLLM_from_scratch|RWKV-6 World|Mengzi3|
|Eurus|Chinese Tiny LLM|HammerLLM|
|360智脑|Steel-LLM|XVERSE-MoE-A4.2B|
|llama3-Chinese-chat|Llama3-Chinese-Chat(ORPO)|DeepSeek-V2|
|PanGu-π|Eurux-8x22B|Chinese-LLaMA-Alpaca-3|
|OpenBuddy-Llama3-70B-v21.1-8k|MAP-NEO|llms-from-scratch-cn|
|Yi-1.5|Yuan2.0-M32|Skywork-MoE|
|Index-1.9B|Qwen2|Gemma-2-9B-Chinese-Chat|
|Gemma-2-27B-Chinese-Chat|RWKV-6-World 14B|Tele-FLM-1T|
|Llama3.1-Chinese-Chat|INF-34B|InternLM2.5|
|*【LongWriter】|*【Hunyuan-Large】|*【Qwen2.5】|
|*【TeleChat2】|*【Marco-o1】|*【Skywork-o1】|
|*【YuLan-Mini】|*【DeepSeek-R1】|*【simpleRL-reason】|
|*【TinyZero】|*【STILL-3-1.5B-Preview】|*【MiniMax-01】|
|*【SmallThinker-3B-preview】|*【DeepSeek-V3】|*【RWKV-7】|
|*【FOX-1】|*【mini_qwen】|*【Qwen 0.5b on GRPO】|
|*【Qwen2.5-Max】|*【minimind】|*【Nano】|

| Healthcare |  |  |
|---|---|---|
| 本草 |华佗  |扁鹊  |
| 灵心 | 启真 | “巧板” (children's emotional-companion LLM) |
| OpenMEDLab 浦医 | 明医 (MING): Chinese medical consultation LLM (formerly MedicalGPT-zh) | PICA (emotion LLM) |
|Chinese-Vicuna-medical|MedicalGPT| DISC-MedLLM (复旦)|
|DoctorGLM|ChatMed-TCM&ChatMed-Consult|ChatGLM-Med|
|MeChat|ShenNong-TCM-LLM|MindChat (漫谈): mental-health LLM|
|WiNGPT|CareGPT|孙思邈|
|MolGen (drug R&D)|Taiyi (太一)|MedAgents|
|Molecule Optimization|MolTC|Mol-Instructions|
|Multilingual Medicine|Sequel|Gene editing|
|Llama-3-8B-UltraMedical|PH-LLM|ProLLM|
|MolecularGPT|*【CHIEF(Clinical Histopathology Imaging Evaluation Foundation)】|*【HuatuoGPT-o1】|
|*【Baichuan-14B-M1】|*【MedFound】||

|Economics / Finance|||
|---|---|---|
|貔貅FinMA & PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance|轩辕|BBT-FinCUGE-Applications|
|Cornucopia-LLaMA-Fin-Chinese|EcomGPT|FinGLM|
|DISC-FinLLM|Deepmoney||

|Law|||
|---|---|---|
| 韩非 HanFei| 智海 录问|ChatLaw 法律大模型|
|LaWGPT|Lawyer LLaMA|LexiLaw|
|LawGPT_zh|夫子•明察司法大模型|DISC-LawLLM|
|LawBench|*【HK-O1aw】||

|Transportation|Urban|
|---|---|
|TransGPT · 致远|UrbanGPT|

|Education & Math||
|---|---|
|桃李|EduChat|
|chatglm-maths|Abel|
|InternLM-Math|DeepSeekMath|
|LeerooDedicated-Math-7b|SimpleGeometry|
|Rho-1|ChatGLM-Math|
|JiuZhang3.0|InternLM2-WQX|
|Math-Minos|NuminaMath 7B TIR|
|MathΣtral|LLaMAX (translation)|
|Qwen2-Math|*【AIMO-CMU_MATH】|
|*【Qwen2.5-Math】|*【SocraticLM】|
|*【Open Thoughts】|*【simpleRL-reason】|
|*【DRT-o1 (translation)】||

|Tables / Data Analysis||
|---|---|
|TableGPT|Data-Copilot|
|Tabular LLM|Chain-of-table|
|Data Interpreter|TableLLM|
|Lag-Llama|TabuLa-8B|
|*【Time-MoE】||

|Self-media / Role-play / Style / Story|
|---|
|MediaGPT|
|CharacterGLM-6B|
|Haruhi-Zero|
|Translational-Style-ChatLLM (Western translationese style)|
|StyleLLM|
|Tianji来事儿AI|
|TinyStories|
|Higgs-Llama-3-70B|
|persona-hub|
|Peach-9B-8k-Roleplay|
|*【Hermes 3】|
|*【SkyReels (short drama)】|

|Classical Chinese|
|---|
|尔雅 Erya|
|荀子|

|Programming / Code / Systems / Devices||
|---|---|
|CodeShell|CODEFUSION-75M|
|DeepSeek Coder|DevOps-Model(运维)|
|Magicoder|LLaMA-Pro|
|HuixiangDou|CodeAct|
|Design2Code|bGPT|
|MobileLLM|Stable Code Instruct 3B|
|ReALM|aiXcoder|
|CodeQwen1.5|AutoCodeRover|
|CodeGemma|Snowflake Arctic|
|dolphin-2.9-llama3-70b|Granite|
|StarCoder2-15B-Instruct-v0.1|AutoCoder|
|CodeGeeX4|xLAM|
|*【deepin V23】|*【WaveCoder】|
|*【Llama-3.1-Storm-8B】|*【OpenCoder】|
|*【Qwen2.5-Coder】|*【Ministraux】|
|*【Reader-LM】|*【珠算】|
|*【Lingma SWE-GPT】|*【GLM-Edge】|
|*【SEMIKONG (semiconductor)】|*【ReaderLM-v2】|
|*【O1-CODER】||

|Astronomy / Ocean / Earth Science / General Science|
|---|
|星语StarWhisper|
|OceanGPT|
|K2&GeoGalactica|
|SciGLM|
|*【KAN 2.0】|

*Recommendation/IR/Information Extraction*
|||
|---|---|
|LLM for Recommendation Systems|Transformer Index for GEnerative Recommenders (TIGER)|
|EasyRL4Rec|RLMRec|
|RecAI|Actions Speak Louder than Words|
|PPM|LLaRA|
|Awesome Information Retrieval in the Age of Large Language Model|LLMs heart MIR|
|When to Retrieve|Lite-LLM4Rec|
|A Comprehensive Survey on Self-Supervised Learning for Recommendation|NoteLLM|
|LEARN|YAYI-UIE|
|XRec|Wukong|
|Leveraging LLM Reasoning Enhances Personalized Recommender Systems|*【Transformers in music recommendation】|

*Text Embeddings / RAG*
|  |  |
|---|---|
| Matryoshka Representation Learning |Jina Embeddings|
|BGE-M3|Nomic Embed|
|Moka Massive Mixed Embedding(M3E)|GRIT|
|TinyRAG|RAFT|
|Chat with MLX|LLocalSearch|
|RAGFlow|Dot|
|Ollama Embedding Models|LLM2Vec|
|gecko|Cognita|
|Piccolo2|NV-Embed|
|RankRAG|LightRAG|
|GraphRAG|*【gte-multilingual】|
|*【nano-graphrag】|*【MaxKB】|
|*【Langchain-Chatchat】|*【RAGLite】|
|*【OpenScholar】|*【MasteringRAG】|
|*【FlashRAG-Paddle】|*【MiniRAG】|
|*【XRAG】|*【Chronos】|
|*【DeepRAG】|*【UltraRAG】|
|*【CAG】|*【FlexRAG】|

*Agent*
|  |  |
|---|---|
| Auto-GPT | ToolBench&ToolLLM |
|HuggingGPT |CAMEL:Communicative Agents for “Mind” Exploration of Large Scale Language Model Society|
|AgentLM (AgentTuning, AgentInstruct) |XAgent|
|OpenAgents|Personal LLM Agents - Survey|
|AUTOACT|MetaGPT|
|Multi-LLM-Agent|KwaiAgents|
|Mistral-Interact|AgentLite|
|KnowAgent|LlamaGym|
|WorkArena|STE(Simulated Trial and Error)|
|More Agents Is All You Need|AIOS|
|TwoStep|Agent-FLAN|
|Jan|APAM|
|AgentStudio|AnyTool|
|TinyAgent|Octopus v2|
|ReadAgent|STORM|
|AgentRun|OS-Copilot|
|AutoWebGLM|Agent Hospital|
|CodeR|Mobile-Agent-v2|
|Husky|TinyAgent|
|Tree Search for Language Model Agents|octo-planner|
|MindSearch|*【AgentInstruct】|
|*【AgentCourt】|*【AI-Scientist】|
|*【RD-Agent】|*【AFlow: Automating Agentic Workflow Generation】|
|*【swarm】|*【FinVision】|
|*【Agent Mental Clinic (AMC)】|*【MedAI】|
|*【Agent-0】|*【Large Language Model-Brained GUI Agents: A Survey】|
|*【Building effective agents】|*【UI-TARS】|
|*【PaSa】|*【Docling】|
|*【Eko】|*【Search-o1】|
|*【CogAgent】|*【Proactive Agent】|
|*【Open-source DeepResearch】|*【RAGEN】|
|*【smolagents】|*【Open Deep Research】|

*Other open-source models for reference (mostly international)*
|  |  |
|---|---|
| Cerebras | MPT-7B |
| ChatDoctor | OpenGPT |
| Code Llama (Meta AI)| Orca |
| Dolly 1&2 | OpenChatKit |
| FinGPT | Open-Assistant |
| Falcon | Platypus|
| Facebook/Meta LLaMA/LLaMA2 | MedLLaMA-13B & PMC-LLaMA: Continue Training LLaMA on Medical Papers |
| Giraffe| RedPajama |
| GALACTICA | SQLCoder (Defog)|
| Goat-7B for Arithmetic Tasks | StableLM |
| HuggingChat | StableVicuna |
| Koala: A Dialogue Model for Academic Research | Stanford Alpaca |
| LongLLaMA | UltraLM-13B |
| LLaMA复刻版OpenLLaMA | Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality |
| Llama-X: Open Academic Research on Improving LLaMA to SOTA LLM | Wombat |
| Lit-LLaMA ️ | WizardMath|
| MammoTH | XGen-7B |
|Mistral 7B|Xwin-LM|
|LLaMA 2 Long|UltraLM-13B (UltraFeedback)|
|Llemma: An Open Language Model For Mathematics|Mistral-Trismegistus-7B (occultism/esoterica/spirituality)|
|Memory-GPT(MemGPT)|MetaMath|
|ChipNeMo (chip design)|Zephyr|
|neural-chat-7b-v3-1(Intel)|SteerLM|
|Llama Coder|Meditron|
|RankZephyr|StableLM Zephyr 3B|
|Orca 2|Mixtral 7b 8 Expert|
|Phi|LLM360(Amber,CrystalCoder,Diamond)|
|Mamba|SOLAR|
|NexusRaven(function calling LLM)|LLaMA-MoE|
|TinyLlama|Nous-Hermes-2 Mixtral 8x7B|
|AlphaGeometry|MoE-Mamba|
|StarCoder|OLMo|
|H2O-Danube-1.8B|OpenMathInstruct-1|
|Smaug-72B|Gemma|
|Aya Model|MobiLlama|
|StarCoder2|SmallLanguageModel-project|
|Command-R|Grok|
|DBRX|Jamba|
|BioMedLM|JetMoE|
|MicroLlama-300M|Mistral 7B v0.2 JAX|
|gemma-1.1-7b-it|h2o-danube2-1.8b-chat|
|WizardLM-2|RecurrentGemma|
|CodecLM|MEGALODON|
|Stable LM 2 12B|Mixtral 8x22B|
|Phi-3|Llama 3|
|OpenELM|base-7b-v0.2|
|FILM-7B|llama3 implemented from scratch|
|2.3MParams-LLM-From-Scratch-Python|KAN-GPT|
|Aya-23|Mamba-2|
|Recurrentgemma|Nemotron-4 340B|
|Gemma-2|Gemini Nano|
|TTT|Arcee-Spark|
|Mistral NeMo|Llama 3.1 405B|
|Mistral Large 2|SmolLM|
|DCLM-7B|Minitron|
|Gemma 2 2B/ShieldGemma/Gemma Scope|SmolLM|
|nano-llama31|*【instant-smollm】|
|*【Jamba 1.5】|*【Phi-3.5】|
|*【1.5-Pints】|*【Llama-3.1-Minitron 4B】|
|*【SmolLm2】|*【Ministral 3B/8B】|
|*【Zamba2-7B】|*【IBM Granite 3.0】|
|*【Tülu3】|*【Open-O1】|
|*【open-r1】|*【sky-t1】|
|*【Phi-4】|*【Dolphin 3.0】|
|*【Falcon 3】|*【Bamba】|
|*【Byte Latent Transformer】|*【Llama-3.3-70B-Instruct】|
|*【Granite 3.1】|*【mini-deepseek-r1】|
|*【RL, Reasoning & Writing: GRPO on Base model】|*【encoder-decoder-slm】|

*Training / Inference*
|  |  |
|---|---|
| Alpaca-LoRA | llama2.mojo |
| AlpacaFarm | LightLLM |
| ColossalAI | Medusa |
| ChatLLaMA | Megatron-LLaMA |
| Chinese-Guanaco | MeZO: Fine-Tuning Language Models with Just Forward Passes |
| DPO (Direct Preference Optimization) | MLC LLM |
| DialogADV:Evaluate What You Can't Evaluate: Unassessable Generated Responses Quality | PKU-Beaver 河狸 (Safe RLHF) |
| DeepSpeed-Chat | PaLM + RLHF (Pytorch) |
| FlexGen | RL4LMs |
| FlagAI and FlagData | Reinforcement Learning with Language Model |
| Guanaco & QloRA | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |
| GPT4All | Scikit-LLM: Sklearn Meets Large Language Models |
| HugNLP | Transformer Reinforcement Learning |
| INSTRUCTEVAL | Train_Transformers_with_INT4 |
| LOw-Memory Optimization (LOMO) | Transformer Reinforcement Learning X |
| llama.cpp | vLLM |
| llama2.c | LongLoRA |
|RLLTE: Long-Term Evolution Project of Reinforcement Learning|FlashAttention|
|ExecuTorch|TensorRT-LLM|
|BPO(Black-Box Prompt Optimization)|S-LoRA|
|SoRA|XuanCe (玄策): an open-source deep reinforcement learning (DRL) library|
|EasyLM(JAX/Flax)|FATE-LLM - Federated Learning for LLMs|
|DeepSpeed-FastGen|NVIDIA NeMo-Aligner|
|RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback|MLX|
|OpenRLHF|CoLLiE: Collaborative Training of Large Language Models in an Efficient Way|
|Superalignment|LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models|
|Large Language Model Unlearning|PowerInfer|
|m-LoRA|LASER|
|StripedHyena-7B|SwiftInfer|
|SPIN(Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models)|Self-Rewarding Language Models|
|OPO(On-the-fly Preference Optimization)|ASPIRE|
|The Impact of Reasoning Step Length on Large Language Models|SliceGPT|
|FuseLLM|Tree of Thoughts|
|CogGPT|KTO(Kahneman-Tversky Optimisation)|
|Aligner|RPO(Robust Prompt Optimization)|
|Inference-Time Training Helps Long Text Generation|LiPO|
|ChatLLM.cpp|Self-Discover|
|DoRA|GPO(Generalized Preference Optimization)|
|CoT-decoding|FSDP&QLoRA(Answer)|
|MindNLP|GaLore|
|Mixture-of-LoRAs|LLaMA Factory|
|InfLLM|MediaPipe|
|OneBit|RWKV_Pytorch|
|HQQ|Uni-RLHF|
|LLMLingua-2|REST|
|MetaAligner|DiJiang|
|LISA(Layerwise Importance Sampled AdamW)|edge-infer|
|NeFT|Aligning Large Language Models with Recommendation Knowledge|
|llamafile|summarize_from_feedback_details|
|EvoLLM|llm.c|
|Mergoo|qwen-vllm|
|SiLLM|How to Train Data-Efficient LLMs|
|sDPO|PiSSA|
|LongRoPE|ORPO|
|How to Train Data-Efficient LLMs|Better & Faster Large Language Models via Multi-token Prediction|
|Llama-3 70B Gradient Adapter|Unsloth|
|RLHF Workflow|SimPO|
|ODPO|ΨPO|
|MoRA|LOFIT|
|MEFT|PowerInfer-2|
|Emulated Disalignment|Aligning Large Language Models with Representation Editing: A Control Perspective|
|Q\*|TDPO|
|ExCP|MindStar|
|LaMDA|MInference|
|Instruction Pre-Training|PEER|
|Step-DPO|Data, Data Everywhere|
|Prover-Verifier Games|Mem0|
|EAGLE-2|LoRA-GA|
|Q-GaLore|*【rStar】|
|*【T-MAC】|*【LLM-zero2hero】|
|*【MobileQuant】|*【min-p sampling】|
|*【Fast Best-of-N Decoding】|*【UNA: Unifying Alignments of RLHF/PPO, DPO and KTO】|
|*【LongReward】|*【HybridFlow】|
|*【The Surprising Effectiveness of Test-Time Training for Abstract Reasoning】|*【OpenR】|
|*【A Theoretical Understanding of Self-Correction through In-context Alignment】|*【EfficientQAT】|
|*【Cautious Optimizers】|*【Optimizing Large Language Model Training Using FP4 Quantization】|
|*【Evolving Deeper LLM Thinking】|*【rStar-Math】|
|*【Transformer²: Self-Adaptive LLMs】|*【test-time compute scaling】|
|*【XGrammar】|*【Reverse Thinking Makes LLMs Stronger Reasoners】|
|*【noise_step】||

*Evaluation*
|  ||
|---|---|
| 天秤 (FlagEval) | 獬豸 (Xiezhi) Benchmark |
| C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models|
| KoLA: Carefully Benchmarking World Knowledge of Large Language Models |LucyEval - Chinese LLM Maturity Evaluation|
|CMB: A Comprehensive Medical Benchmark in Chinese|Multiscale Positive-Unlabeled Detection of AI-Generated Texts |
| PandaLM |Auto-J|
|CLEVA: Chinese Language Models EVAluation Platform|ALCUNA: Large Language Models Meet New Knowledge|
|HalluQA:Evaluating Hallucinations in Chinese Large Language Models|GLoRE: Evaluating Logical Reasoning of Large Language Models|
|HelpSteer|AlignBench: a multi-dimensional Chinese alignment benchmark|
|UHGEval|Purple Llama (Meta)|
|OMGEval|SciGuard&SciMT-Safety|
|HaluEval 2.0, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models|DebugBench: Evaluating Debugging Capability of Large Language Models|
|GenMedicalEval||
|R-Judge|TravelPlanner|
|EasyJailbreak|AgentBench|
|Chinese MT-Bench|E-EVAL|
|ConflictingQA|Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE)|
|∞Bench|Red Teaming Resistance Benchmark|
|Fin-Eva|Cappy|
|BAMBOO|Fast-DetectGPT|
|GAMA-Bench|FineMath|
|ToolEmu|ClongEval|
|Counting-Stars|InfiCoder-Eval|
|MathVerse|CoderUJB|
|LooGLE|McEval|
|CRAG|BigCodeBench|
|Prometheus 2|Open LLM Leaderboard|
|CriticGPT|Test of Time|
|WebCanvas|Lynx|
|ComplexBench|Mr-Ben|
|*【SimpleQA】|*【AppBench】|
|*【CompassJudger/JudgerBench】|*【CMCOQA】|
|*【CodevBench】|*【FrontierMath】|
|*【GIFT-Eval】|*【LightEval】|
|*【RMB-Reward-Model-Benchmark】|*【Chinese SimpleQA】|
|*【Evalchemy】|*【WebWalker】|
|*【Getting a Judge-LLM】|*【PRMBench】|
|*【OmniDocBench】|*【CodeArena】|
|*【HALoGEN】||

*Miscellaneous*
|  |  |
|---|---|
| Alpaca-CoT | Self-Instruct |
| ChatPiXiu | Wanda (Pruning by Weights and activations) |
| Gorilla | Streaming LLM |
| Sheared LLAMA (Structured Pruning) |gpu_poor|
| LLMPruner: an LLM pruning toolkit | QA-LoRA |
| LLM-Pruner: On the Structural Pruning of Large Language Models |KnowPAT|
|AuthentiGPT: Detecting Machine-Generated Text|Curiosity-driven Red-teaming for Large Language Models|
|Language Models are Super Mario (DARE, Drop And REscale)||
|TinyGSM|MathPile|
|Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM|Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding|
|QAnything|Meta-Prompting|
|Lepton Search|Transformer Debugger|
|Open-Source AI Cookbook|MaLA-500|
|NVIDIA Chat with RTX|RAG vs Fine-tuning|
|Chain of Abstraction|序列猴子 (Sequence Monkey) open-source dataset|
|synthetic-data-save-costs|Data is Better Together|
|Large Language Models in Finance|WanJuan-CC|
|Larimar|Financial Datasets|
|LLM-UM-Reading|so-large-lm|
|Fine-tune Llama 3 with ORPO|COIG-CQIA|
|tiny-universe|llmc|
|LLMBox|MarkLLM|
|MobileCPM|LLM-Select|
|Transformer Architecture (LLMs: Zero-to-Hero)|Build a Large Language Model (From Scratch)|
|*【SynthID Text】|*【Small Language Models: Survey, Measurements, and Insights】|
|*【Multi-IF (Multi-turn and multilingual instruction following)】|*【LLM from scratch with Pytorch】|
|*【A Survey on Data Synthesis and Augmentation for Large Language Models】|*【A Survey of Small Language Models】|
|*【LLMForEverybody】|*【Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner】|
|*【CCI3.0-HQ】|*【rlhfbook】|
|*【Deepseek R1 may have found a way to surpass humans】|*【train-llm-from-scratch】|
|*【The Big Book of LLMs】|*【Primers • DeepSeek-R1】|
|*【A vision researcher’s guide to some RL stuff: PPO & GRPO】|*【group relative policy optimization (GRPO)】|
|*【DeepSeek R1 and R1-Zero Explained】|*【DeepSeek R1 reading list】|
|*【DeepSeek R1 Explained to your grandma】|*【Deepseek R1 for Everyone】|
|*【llm-course】|*【O1-Journey】|
|*【a reinforcement learning guide】|*【llm-universe】|
|*【smol-course】|*【self-llm】|
|*【Agents(Chip Huyen)】|*【Building effective agents】|
|*【LLMInterviewQuestions】|*【Transformers Laid Out】|

## Related Articles
- How a broke student can try ColossalAI SFT ([Kaggle edition](https://mp.weixin.qq.com/s/Q29uSNxvPMy0rC-QxHiGZA), [Colab edition](https://mp.weixin.qq.com/s/NS4yySeYd7QUYb7CB9V0lA))
- [A plain-language guide to common decoding strategies for text generation](https://mp.weixin.qq.com/s/sVZuEkYXQ9ZZYXJCQz7F4A)
- [A plain-language guide to P-tuning (GPT Understands)](https://mp.weixin.qq.com/s/EvD9OW115XMnrxOcC2BKDA)
- [A plain-language guide to gradient checkpointing (with code)](https://mp.weixin.qq.com/s/IwcfUP_j6JYFXH_xhnWWJQ)
- 千“垂”百炼: vertical domains and language models
  - [Introduction](https://mp.weixin.qq.com/s/G24skuUbyrSatxWczVxEAg)
  - Vertical-domain applications
    - 【Domain-agnostic】Improving instruction-following language models with unlabeled text ([1](https://mp.weixin.qq.com/s/50wtP--W_cy-682g8cOYww) [2](https://mp.weixin.qq.com/s/q7nKnwtEKPahABiLFLWuSw) [3](https://mp.weixin.qq.com/s/CE8YNx19dc0EyNfTK_HYHQ) [4](https://mp.weixin.qq.com/s/yj4gnoymNLFuLE1v94VJ9A) [5](https://mp.weixin.qq.com/s/N4mUe7hrvXGFArl20kKRCA))
    - 【Healthcare】ChatDoctor (walkthrough [part 1](https://mp.weixin.qq.com/s/zSeRKUZ2te1wxwpvByhcvg) [part 2](https://mp.weixin.qq.com/s/TcwiQoIex7SDY5Teri9xnw) [part 3](https://mp.weixin.qq.com/s/I1hXRS7gBMLUyOWMObfpBg) / PDF slides [part 1](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20ChatDoctor%EF%BC%88%E4%B8%8A%EF%BC%89.pdf) [part 2](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20ChatDoctor%EF%BC%88%E4%B8%AD%EF%BC%89.pdf) [part 3](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20ChatDoctor%EF%BC%88%E4%B8%8B%EF%BC%89.pdf))
    - 【Healthcare】MedicalGPT-zh ([walkthrough](https://mp.weixin.qq.com/s/QJKZYKh16fqLTC367WhzdA) / [PDF slides](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20MedicalGPT-zh.pdf))
    - 【Healthcare】明医 (MING) ([walkthrough](https://mp.weixin.qq.com/s/uM4FZeDhAc6JuMlW7NCvUA) / [PDF slides](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20MING.pdf))
    - 【Healthcare】灵心 (SoulChat) ([walkthrough](https://mp.weixin.qq.com/s/0HOYSr-zQsGLFL_H9UZ2HA) / [PDF slides](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20SoulChat.pdf))
    - 【Mobile interaction】ReALM ([1](https://mp.weixin.qq.com/s/gOmUi4_MGvU1Nx3KxXdxVQ) [2](https://mp.weixin.qq.com/s/wTPMwtRVWIrioile-rFzQA) [3](https://mp.weixin.qq.com/s/NgyZG0439UGFoVE7InrX9g) [4](https://mp.weixin.qq.com/s/v1NEovURZr4v8R4_v7TjdA))
  - Automatic evaluation models
    - 【Domain-agnostic】[Evaluating language models with language models (1): Introduction](https://mp.weixin.qq.com/s/SUN_ywkI8ld1edXY7uq_1Q)
    - 【Domain-agnostic】[Evaluating language models with language models (2): PandaLM](https://mp.weixin.qq.com/s/NTFu53MdVD9NusFJaORHcw)
    - 【Domain-agnostic】Evaluating language models with language models (3): Shepherd ([1](https://mp.weixin.qq.com/s/pbK1Zsv9j_DVtOJaTm_tPw) [2](https://mp.weixin.qq.com/s/n4_kVw8j42ZQv6VjQ_P-Dw) [3](https://mp.weixin.qq.com/s/PeGJOmQPyAhwl7czJgKnQQ) [4](https://mp.weixin.qq.com/s/7_NX7S2AHabX-xU254sq5g))
    - 【Healthcare】[Comparing ChatDoctor with ChatGPT-3.5 using BERT-Score](https://mp.weixin.qq.com/s/I1hXRS7gBMLUyOWMObfpBg)

## All Articles
- Chinese: [https://mp.weixin.qq.com/s/hAqDqqwIHrCVwz4PYSd72A](https://mp.weixin.qq.com/s/hAqDqqwIHrCVwz4PYSd72A)
- English: [https://createmomo.github.io/](https://createmomo.github.io/)

---

## Chinese Open Source Language Models

### 本草 (BenTsao)
- https://zhuanlan.zhihu.com/p/626536996
- https://github.com/scir-hi/huatuo-llama-med-chinese

An instruction-tuned LLaMA model based on Chinese medical knowledge.

In the biomedical domain, general LLMs (e.g. LLaMA, ChatGLM) perform poorly because they lack medical-domain training corpora. This project builds a Chinese medical instruction dataset from a medical knowledge graph and the GPT-3.5 API, then instruction-tunes LLaMA on it to obtain HuaTuo, a medical consultation model. Compared with the original LLaMA without medical instruction tuning, HuaTuo performs markedly better on medical Q&A and generates more reliable medical answers. Using the same medical data, the project also trained a medical version of ChatGLM: ChatGLM-6B-Med.
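Instruction data of this kind is commonly stored in the Alpaca-style instruction/input/output schema. The record below is a hypothetical illustration of that schema, not an item from the released dataset:

```python
import json

# Hypothetical record in the common Alpaca-style instruction schema.
# The question would be derived from a knowledge-graph fact and the
# answer generated via an API model, as described above.
record = {
    "instruction": "高血压患者在饮食上需要注意什么?",
    "input": "",  # optional extra context, empty here
    "output": "高血压患者应低盐低脂饮食,控制体重,戒烟限酒,并遵医嘱规律服药。",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Training code typically reads a JSON array of such records and concatenates the fields into a prompt/response pair.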

The team will also soon release PienChueh (扁鹊), another large model trained on medical data; everyone is welcome to try it when it becomes available.

### 百川 Baichuan-7B
- https://github.com/baichuan-inc/baichuan-7B
- https://huggingface.co/baichuan-inc/baichuan-7B

baichuan-7B is an open-source, commercially usable large-scale pre-trained language model developed by Baichuan Intelligence. Built on the Transformer architecture, it has 7B parameters trained on roughly 1.2 trillion tokens, supports both Chinese and English, and has a 4096-token context window. It achieves the best results among models of the same size on standard authoritative Chinese and English benchmarks (C-Eval/MMLU).

The raw data include open-source Chinese and English corpora, Chinese web data crawled by the team, and a portion of high-quality knowledge-rich data.

Following prior data-curation work, frequency and quality are the two key dimensions considered in the data-processing pipeline. The raw dataset is filtered at both document and sentence granularity using heuristic rules and a quality-scoring model, and deduplicated at both granularities over the full corpus using locality-sensitive hashing.
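The deduplication step can be illustrated with a MinHash-style locality-sensitive hashing sketch over character shingles (illustrative only, not Baichuan's actual pipeline):

```python
import hashlib

def shingles(text, n=3):
    # character n-grams as the document's shingle set
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(shingle_set, num_perm=64):
    # one seeded hash per "permutation"; keep the minimum value per seed
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return sig

def est_jaccard(sig_a, sig_b):
    # fraction of matching signature slots estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "我们基于启发式规则和质量模型打分,对原始数据集进行过滤。"
doc2 = "我们基于启发式规则和质量模型打分,对原始数据进行过滤。"  # near-duplicate
doc3 = "今天天气不错,适合出去散步。"  # unrelated

s1, s2, s3 = (minhash(shingles(d)) for d in (doc1, doc2, doc3))
print(est_jaccard(s1, s2))  # high: near-duplicates mostly collide
print(est_jaccard(s1, s3))  # low: unrelated documents rarely collide
```

In a production pipeline the signatures would be banded into hash buckets so that candidate duplicate pairs are found without all-pairs comparison.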

### 华佗 (HuatuoGPT)
- https://mp.weixin.qq.com/s/lwJb8N420xfMTvXJPM2gtg
- https://arxiv.org/pdf/2305.15075.pdf
- https://github.com/FreedomIntelligence/HuatuoGPT
- https://www.huatuogpt.cn/ 

The training method proposed in the paper combines data from doctors and from ChatGPT, exploiting their complementary strengths: it preserves the professionalism and accuracy of real medical data while benefiting from ChatGPT's diversity and richness of content.

### 扁鹊 (BianQue)
- https://github.com/scutcyr/BianQue

Guided by the six characteristics of proactive health (proactivity, prevention, precision, personalization, co-construction and sharing, and self-discipline), the School of Future Technology at South China University of Technology (Guangdong Provincial Key Laboratory of Digital Twin Humans) has open-sourced ProactiveHealthGPT, a Chinese living-space proactive-health large-model base, which includes:
- BianQue (扁鹊), a living-space health LLM instruction-tuned on tens of millions of Chinese health-dialogue instructions
- SoulChat (灵心), a mental-health LLM jointly instruction-tuned on million-scale Chinese long-text counseling instructions and multi-turn empathetic dialogue data

We hope ProactiveHealthGPT can help academia accelerate research on and applications of large models in proactive-health areas such as chronic disease and psychological counseling. This project is BianQue (扁鹊), the living-space health LLM.

### 灵心(SoulChat)
- https://github.com/scutcyr/SoulChat

We surveyed common online counseling platforms and found that when users seek psychological help online, they usually have to write a long self-description, after which the counselor replies with an equally long answer, missing a gradual, step-by-step process of opening up. In real counseling, however, users and counselors communicate over many turns, during which the counselor guides the user to talk and offers empathy, for example: "That's great", "I understand how you feel", "Of course you can".

Given the current scarcity of multi-turn empathetic dialogue datasets, we built SoulChatCorpus-single_turn, over 150k single-turn long-text counseling instructions with more than 500k answers (6.7x the instruction count of PsyQA, the common counseling dataset), and used ChatGPT and GPT-4 to generate about 1 million turns of multi-turn answers (SoulChatCorpus-multi_turn). Notably, pilot experiments showed that a model driven purely by single-turn long texts produces responses of a length users find tiresome and cannot guide users to open up, while a model driven purely by multi-turn counseling dialogues gives weaker advice. We therefore mixed SoulChatCorpus-single_turn and SoulChatCorpus-multi_turn into SoulChatCorpus, an empathetic dialogue dataset of over 1.2 million samples combining single-turn and multi-turn data. All data are unified into one instruction format of the form "用户:xxx\n心理咨询师:xxx\n用户:xxx\n心理咨询师:".
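That unified format can be produced mechanically; a minimal sketch (the helper name is ours, not SoulChat's actual preprocessing code):

```python
def to_instruction(turns):
    # Alternate 用户 / 心理咨询师 labels over the turns, then end with a
    # bare counselor label so the model is prompted to produce the reply.
    roles = ("用户", "心理咨询师")
    lines = [f"{roles[i % 2]}:{text}" for i, text in enumerate(turns)]
    lines.append("心理咨询师:")
    return "\n".join(lines)

sample = to_instruction([
    "最近总是睡不着,压力很大。",
    "我理解你的感受,可以说说最近发生了什么吗?",
    "工作上项目很多,总担心做不好。",
])
print(sample)
```

The counselor's actual reply then serves as the training target for the flattened prompt.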

We chose ChatGLM-6B as the initialization model and performed full-parameter instruction fine-tuning, aiming to improve the model's empathy, its ability to guide users to open up, and its ability to give reasonable advice. More training details will appear in our forthcoming paper.

### 启真 (QiZhenGPT) Medical LLM
- https://github.com/CMKRG/QiZhenGPT

This project builds a Chinese medical instruction dataset from the QiZhen medical knowledge base and instruction-tunes Chinese-LLaMA-Plus-7B, CaMA-13B, and ChatGLM-6B on it, substantially improving performance in Chinese medical scenarios. An evaluation set for drug-knowledge Q&A has been released first; future plans include improving Q&A on diseases, surgery, and lab tests, and extending to applications such as doctor-patient Q&A and automatic medical-record generation.

### 貔貅FinMA & PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance
- https://github.com/chancefocus/PIXIU
- https://arxiv.org/abs/2306.05443
- https://huggingface.co/spaces/ChanceFocus/FLARE

The advancement of Natural Language Processing (NLP) and machine learning (ML) techniques in financial technology (FinTech) has enabled a diverse set of capabilities from predicting stock price movements to advanced financial analytics. However, to effectively understand the complex financial language and concepts, domain-specific LLMs are necessary.

Despite prior efforts, there is a lack of open-source financial LLMs and benchmarks to evaluate them. Additionally, these models are not fine-tuned to follow natural language instructions, limiting their performance in downstream financial tasks.

To address these gaps, we introduce PIXIU, providing:
- Open-source LLMs tailored for finance called FinMA, by fine-tuning LLaMA with the dataset constructed in PIXIU.
- Large-scale, high-quality multi-task and multi-modal financial instruction tuning data FIT.
- Holistic financial evaluation benchmarks FLARE for assessing financial LLMs.

Key Features
- Open resources: PIXIU openly provides the financial LLM, instruction tuning data, and datasets included in the evaluation benchmark to encourage open research and transparency.
- Multi-task: The instruction tuning data in PIXIU cover a diverse set of financial tasks, including four financial NLP tasks and one financial prediction task.
- Multi-modality: PIXIU's instruction tuning data consist of multi-modality financial data, including time series data from the stock movement prediction task. It covers various types of financial texts, including reports, news articles, tweets, and regulatory filings.
- Diversity: Unlike previous benchmarks focusing mainly on financial NLP tasks, PIXIU's evaluation benchmark includes critical financial prediction tasks aligned with real-world scenarios, making it more challenging.

### Luotuo: A Chinese Alpaca Model
- https://sota.jiqizhixin.com/project/luotuo
- https://github.com/LC1332/Luotuo-Chinese-LLM

Alpaca is a model the Stanford team fine-tuned from LLaMA 7B on 52k instructions, and it adapts well to many natural-language applications. Luotuo is an open-source Chinese language model recently released by SenseTime and Huazhong University of Science and Technology: the Alpaca instruction-tuning data was translated via the ChatGPT API, and the model was then fine-tuned with LoRA. The project has released its training corpus and model weights (two sizes), so developers can train their own language models on corpora of various sizes and adapt them to their vertical domains.

### Chinese LLaMA & Alpaca LLMs
- https://github.com/ymcui/Chinese-LLaMA-Alpaca

Large language models (LLMs) such as ChatGPT and GPT-4 have set off a new wave of NLP research and shown AGI-like capabilities, drawing broad attention from industry and academia. However, because LLMs are extremely expensive to train and deploy, they pose an obstacle to transparent, open academic research.

To promote open LLM research in the Chinese NLP community, this project open-sources a Chinese LLaMA model and an instruction-tuned Alpaca model. Building on the original LLaMA, these models extend the Chinese vocabulary and undergo secondary pre-training on Chinese data, improving basic Chinese semantic understanding. On top of the Chinese LLaMA, the project further performs instruction tuning with Chinese instruction data, significantly improving the model's ability to understand and follow instructions.

### Chinese LLaMA & Alpaca LLMs 2
- https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
- https://mp.weixin.qq.com/s/s8bOcwRYiRA88kPlJKeAKA
- https://arxiv.org/abs/2304.08177v2

The Chinese-LLaMA-Alpaca-2 project has released v1.0, open-sourcing Chinese-LLaMA-2-7B (base model) and Chinese-Alpaca-2-7B (instruction/chat model). Building on the original Llama-2, these models extend and optimize the Chinese vocabulary and undergo incremental pre-training on large-scale Chinese data, further improving basic Chinese semantics and instruction understanding, with significant performance gains over the first-generation models. They support a 4K context, extensible up to 18K+ via the NTK method.

### Firefly: A Chinese Conversational LLM
- https://mp.weixin.qq.com/s/tyH9Ifcvw4DKqoIoYjT6Kg
- https://github.com/yangjianxin1/Firefly

Firefly (流萤) is an open-source Chinese conversational LLM tuned on Chinese datasets via instruction tuning. It applies vocabulary pruning, ZeRO, and tensor parallelism to reduce GPU memory consumption and improve training efficiency, using a smaller parameter count and fewer compute resources during training.

The project constructed a large amount of data related to Chinese culture, such as couplets, poetry composition, classical-Chinese translation, prose, and Jin Yong's novels, to improve the model's performance in these areas.

### Phoenix (凤凰)
- https://mp.weixin.qq.com/s/beAAh_MdqssV8bEKsccElg
- https://github.com/FreedomIntelligence/LLMZoo

LLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.

### MOSS (Fudan)
- https://github.com/OpenLMLab/MOSS
- https://mp.weixin.qq.com/s/LjToZVWjQ-ot5KJFCFtA3g

MOSS is an open-source conversational language model supporting Chinese, English, and various plugins. The moss-moon series has 16 billion parameters and can run on a single A100/A800 or two RTX 3090 GPUs at FP16 precision, or on a single RTX 3090 at INT4/8 precision. The MOSS base language model was pre-trained on roughly 700 billion Chinese, English, and code tokens, then acquired multi-turn dialogue and plugin-use capabilities through dialogue instruction tuning, plugin-augmented learning, and human-preference training.

### MOSS-RLHF (Fudan)
- https://mp.weixin.qq.com/s/BjXtnEEVCQiPOy-_qCNM4g
- https://openlmlab.github.io/MOSS-RLHF/paper/SecretsOfRLHFPart1.pdf
- https://openlmlab.github.io/MOSS-RLHF/

Through extensive, detailed work, the FudanNLP team designed experiments to explore the complete RLHF workflow for large models, dissected the inner workings of the PPO reinforcement-learning algorithm and its role within RLHF, and studied how various optimizations affect training. These efforts identified the key factors that make PPO effective for aligning large models with humans.

Based on these findings, the team distilled a more stable version of PPO for training large models: PPO-max. A comprehensive evaluation on the Helpful and Harmless datasets shows that models trained with PPO-max exhibit excellent human-alignment performance.

### XuanYuan (Du Xiaoman): The First Hundred-Billion-Parameter Chinese Financial Chat Model
- https://arxiv.org/pdf/2305.12002.pdf
- https://huggingface.co/xyz-nlp/XuanYuan2.0
- https://github.com/Duxiaoman-DI/XuanYuan
- https://zhuanlan.zhihu.com/p/632780608

XuanYuan is the first open-source hundred-billion-parameter Chinese chat LLM, and also the first at that scale optimized for the Chinese financial domain. Built on BLOOM-176B, it was pre-trained and fine-tuned for both the general Chinese domain and the financial domain, so it can handle general questions as well as all kinds of finance-related ones, providing users with accurate, comprehensive financial information and advice.

### Aquila (Wudao·Tianying)
- https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila

This is the first open-source LLM with bilingual Chinese-English knowledge that supports a commercial-use license and meets domestic data-compliance requirements. The Aquila series includes the Aquila base models (7B, 33B), the AquilaChat chat models (7B, 33B), and the AquilaCode text-to-code models.

- https://github.com/FlagAI-Open/Aquila2

We announce that our Aquila2 series is now open source, comprising Aquila2 (the base language models: Aquila2-7B and Aquila2-34B) and AquilaChat2 (the chat models, namely AquilaChat2-7B and AquilaChat2-34B, as well as the long-text chat models, namely AquilaChat2-7B-16k and AquilaChat2-34B-16k). You can find the links in the following table. Kindly click on them to access the model cards.

### Taoli: An LLM for International Chinese Education
- https://github.com/blcuicall/taoli

As ChatGPT draws society-wide attention and various large language models appear one after another, general-domain NLP tasks have seen great success, attracting widespread interest from the field of international Chinese education.

Practitioners in international Chinese education have begun discussing what large models can do: can a model provide language appropriate to a learner's level, or give detailed answers to a learner's questions, and thereby assist or even act as a study partner or language teacher? For now, however, general-domain large models remain limited in this vertical domain.

To address this, we release Taoli (桃李) 1.0, a large model for international Chinese education, additionally trained on data from this domain.

We built a resource library for international Chinese education from more than 500 textbooks and teaching aids currently in circulation, HSK (Chinese proficiency test) exam questions, and learner dictionaries. We designed instructions in multiple forms to make full use of this knowledge, constructed a high-quality question-answering dataset of 88,000 examples for international Chinese education, and used the collected data to instruction-tune the model so that it learns to apply this domain knowledge in concrete scenarios.

### PICA: An Emotional-Support LLM
- https://mp.weixin.qq.com/s/E37EFe10185THHa3pSqBig
- https://github.com/NEU-DataMining/PICA
- https://huggingface.co/NEUDM/PICA-V1

PICA is built on Tsinghua's open-source ChatGLM2-6B, trained with prompt tuning for roughly 15 hours on 4 A6000 GPUs. In a comparison with SoulChat (see the last section), our model has advantages in user experience and safety. We used only 2K examples for P-Tuning fine-tuning, which speaks to the high quality of the data we constructed. The model weights are available on Hugging Face; feedback and suggestions are welcome.

### YaYi
- https://github.com/wenge-research/YaYi
- https://yayi.wenge.com/

The YaYi LLM was instruction-tuned on millions of manually constructed, high-quality domain examples, covering five major domains (media and publicity, public-opinion analysis, public safety, financial risk control, and urban governance) and over a hundred natural-language instruction tasks. Through iterations from pre-trained initialization weights to the domain model, we progressively strengthened its basic Chinese abilities and its domain analysis abilities, and added multi-turn dialogue and some plugin capabilities. Continuous human feedback from hundreds of internal testers further improved the model's performance and safety.

By open-sourcing YaYi we hope to contribute to the growth of the open-source community around Chinese pre-trained LLMs, and to build the YaYi ecosystem together with every partner.

### QiaoBan: A Children's Emotional-Companionship LLM
- https://github.com/HIT-SCIR-SC/QiaoBan

QiaoBan is a 7B-parameter LLM. "QiaoBan" (巧板) refers to the tangram, a puzzle toy that carries traditional Chinese wisdom and serves as an educational tool. This children's model aims to build deeper emotional bonds with children through companionship, enrichment, and education. The name also follows the SCIR lab's naming convention for released models, and conveys our careful attention to children's growth: like a tangram, it helps them piece together a bright future.

QiaoBan has three distinguishing features:
1. Guided by child psychology: its emotional-companionship dialogue data is built on emotion-coaching theory, better safeguarding children's mental health.

2. High-quality children's dialogue data: the dialogues were produced with the participation of volunteers and experts with backgrounds in child psychology, ensuring authenticity and validity.

3. A warm companionship experience: its interaction style is attentive, building genuine emotional connections so that children feel warmth and recognition, making it a reliable companion on their path to growth.

### Tongyi Qianwen (Qwen)
- https://github.com/QwenLM/Qwen-7B
- https://qwenlm.github.io/

We have open-sourced the Qwen-7B series on both 🤖 ModelScope and 🤗 Hugging Face (see the links at the top of this document). This repository covers an introduction to Qwen-7B, a usage guide, and a technical memo; see the linked memo for more details about the model.

Qwen-7B is the 7B-parameter model of Alibaba Cloud's Tongyi Qianwen series. It is a Transformer-based LLM pre-trained on a very large, diverse corpus covering web text, professional books, code, and more. On top of Qwen-7B, an alignment process was used to build the AI assistant Qwen-7B-Chat. Features of the Qwen-7B series include:
1. Large-scale, high-quality pre-training data: a self-built pre-training dataset of over 2.2 trillion tokens, spanning text and code across general and specialized domains.
2. Strong performance: Qwen-7B significantly outperforms open-source models of comparable size on multiple benchmark datasets, and even surpasses larger 12-13B models, in capabilities including natural language understanding and generation, mathematical problem-solving, and code generation.
3. Better multilingual support: a tokenizer with a larger vocabulary is more efficient at tokenization and friendlier to other languages, making it easier to train language-specific 7B models on top of Qwen-7B.
4. 8K context length: both Qwen-7B and Qwen-7B-Chat support an 8K context, allowing longer prompts.
5. Plugin support: Qwen-7B-Chat is specifically optimized with plugin-related alignment data, so it can use plugins effectively and be upgraded into an agent.

### Huozi
- https://mp.weixin.qq.com/s/WEitgZjOxZpp7KIbRU0ewg
- https://github.com/HIT-SCIR/huozi

Large language models (LLMs) have achieved remarkable success in general-domain NLP and show strong potential across a wide range of applications, with growing interest from academia and industry. More than 30 faculty members and students of the HIT NLP research group developed the general-purpose chat model Huozi 1.0, and the HIT Research Center for Social Computing and Information Retrieval (HIT-SCIR) developed Huozi 2.0, aiming to provide more possibilities and options for NLP research and applications.

Huozi 3.0 is fine-tuned from Chinese-Mixtral-8x7B on roughly 300,000 lines of instruction data. It supports a 32K context and handles long text effectively. Huozi 3.0 inherits the base model's rich Chinese and English knowledge and performs strongly on mathematical reasoning and code generation; instruction tuning also brought notable gains in instruction following and safety.

### HanFei (韩非)
- https://github.com/siat-nlp/HanFei

HanFei-1.0 (韩非) is the first fully-parameter-trained legal LLM in China, with 7B parameters. Its main features include legal question answering, multi-turn dialogue, article writing, and retrieval (coming soon).

### wisdomInterrogatory (智海-录问)
- https://github.com/zhihaiLLM/wisdomInterrogatory

wisdomInterrogatory (智海-录问) is a legal LLM jointly designed and developed by Zhejiang University, Alibaba DAMO Academy, and UnionBigData. Its core goal is to promote shared legal knowledge and improve judicial efficiency, supporting the integration of legal intelligence into judicial practice, digital case-base construction, and virtual legal consultation services, thereby forming a digital and intelligent judicial foundation.

### Anima: A QLoRA-Based 33B Chinese LLM
- https://github.com/lyogavin/Anima

The AI community has always been very open. AI would not be where it is today without a great deal of important earlier open-source work, openly shared papers, and open data and code. We believe the future of AI will be open as well, and we hope to contribute to the open-source community.

Why does a 33B model matter? Is QLoRA a game changer?

Most previously open-sourced fine-tunable models have been relatively small, at 7B or 13B. Although they can perform decently on simple chatbot benchmarks after fine-tuning, their limited scale means their core reasoning ability remains weak, which is why many such small models feel like toys in real applications. As argued in the QLoRA work: chatbot benchmarks are fairly easy, and on complex logical reasoning and math problems, which truly test a model, the gap between small and large models is still obvious.

We therefore believe QLoRA is important, perhaps even a game changer. Its optimizations make it possible, for the first time, to fine-tune a 33B model in a democratized, low-cost way and put it into widespread use. A 33B model can both exploit the stronger reasoning ability of larger models and still be flexibly fine-tuned on private domain data for better control over the LLM.

### BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models
- https://github.com/ictnlp/BayLing
- https://arxiv.org/abs/2306.10968

BayLing (百聆, bǎi líng) is an instruction-following large language model equipped with advanced language alignment, showing superior capability in English/Chinese generation, instruction following and multi-turn interaction. BayLing can be effortlessly deployed on a consumer-grade GPU with 16GB of memory, and assists users with tasks such as translation, writing, creation, suggestion...

### BBT-FinCUGE-Applications
- https://github.com/ssymmetry/BBT-FinCUGE-Applications
- https://arxiv.org/abs/2302.09432
- https://bbt.ssymmetry.com/index.html

1. BBT-FinCorpus, currently the largest open-source Chinese financial corpus. The scale and diversity of the pre-training corpus matter greatly for a PLM's performance and generalization, so training a good PLM first requires collecting a large, diverse corpus. However, the Chinese financial domain has lacked such an open-source corpus: most existing Chinese financial models are based on small private corpora, which severely limits progress. We therefore built BBT-FinCorpus, a large, diverse corpus of roughly 300GB of text drawn from four heterogeneous sources. To determine the corpus coverage and the set of sources, we first collected all Chinese financial NLP task datasets available on the Chinese internet and derived the set of text sources to crawl from the distribution of their origins, then crawled web text at scale using proxy-based distributed crawlers.

2. BBT-FinT5, currently the largest knowledge-enhanced Chinese financial pre-trained language model. A PLM's architecture and parameter count strongly influence its performance. Existing Chinese financial PLMs are based on the relatively primitive BERT architecture with small parameter counts, which cannot meet increasingly rich domain NLP needs. We therefore built BBT-FinT5, a billion-parameter Chinese financial PLM based on the T5 architecture. To use limited hardware as efficiently as possible, we optimized the pre-training process with the DeepSpeed acceleration framework. We also designed a knowledge-enhanced pre-training method specific to T5 and demonstrated its effectiveness experimentally.

3. CFLEB, the first Chinese financial NLP evaluation benchmark. Existing benchmarks are mostly general-domain, and no publicly available Chinese financial benchmark existed, so existing Chinese financial PLMs were evaluated on different task sets and could not be compared with one another, hindering rapid progress. We therefore built CFLEB, the first Chinese financial NLP evaluation benchmark, with six different tasks covering both understanding and generation. On the question of task selection and its criteria, we believe a domain benchmark should emphasize practical utility, to better reflect how PLM improvements help the real world. We invited financial-domain experts to rate the practicality of all available Chinese financial tasks, kept those with high practicality ratings, and finalized six task datasets based on their open-source availability. An early version of the benchmark, named FinCUGE, contained eight tasks and has since been deprecated.


### BELLE: Bloom-Enhanced Large Language model Engine
- https://huggingface.co/BelleGroup
- https://github.com/LianjiaTech/BELLE
- https://zhuanlan.zhihu.com/p/616079388

The goal of this project is to promote the development of the open-source community for Chinese chat LLMs, with the vision of an LLM engine that can help everyone. At this stage, the project builds on open-source pre-trained LLMs (such as BLOOM), optimizes them for Chinese, and tunes the models using only data produced by ChatGPT (and no other data).

The project is based on Stanford Alpaca, whose goal was to build and open-source a LLaMA-based model. Alpaca's seed tasks and collected data are all in English, so the resulting model is not optimized for Chinese.

### Bloom
- https://huggingface.co/blog/bloom
- https://huggingface.co/bigscience/bloom

BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks.

### BiLLa: A Bilingual LLaMA with Enhanced Reasoning Ability
- https://zhuanlan.zhihu.com/p/628688680
- https://github.com/Neutralzz/BiLLa

BiLLa is an open-source bilingual (Chinese-English) LLaMA model with enhanced reasoning ability. Its main features:
- Substantially improves LLaMA's Chinese understanding while minimizing damage to its original English ability;
- Adds a large amount of task-oriented data during training, using ChatGPT-generated explanations to strengthen the model's grasp of task-solving logic;
- Updates all parameters, aiming for better generation quality.

### BLOOMChat176B
- https://mp.weixin.qq.com/s/cY6ORD8CUyXRL0l20EjwqQ
- https://sambanova.ai/blog/introducing-bloomchat-176b-the-multilingual-chat-based-llm/
- https://huggingface.co/spaces/sambanovasystems/BLOOMChat
- https://github.com/sambanova/bloomchat

Open-source chat models have long lagged behind closed-source ones in multilingual ability. SambaNova and Stanford's Together Computer open-sourced BLOOMChat 176B, a commercially usable multilingual chat model that supports Chinese. BLOOMChat was trained on SambaNova's in-house RDU chips, leveraging SambaNova's unique reconfigurable dataflow architecture and the core capabilities of the open-source BLOOM model, fine-tuned on OIG from OpenChatKit, Dolly 2.0, and OASST1. In early double-blind tests across six languages, BLOOMChat's responses were preferred over recent open-source chat models on 66% of the evaluation data. In human evaluations against GPT-4 across six languages, BLOOMChat achieved a 45% vs. 55% win rate, greatly narrowing the multilingual chat gap between open- and closed-source models. The BLOOMChat model files are open-sourced, with online inference available for trial on Hugging Face.

### ChatLaw: A Legal LLM
- https://www.chatlaw.cloud/
- https://github.com/PKU-YuanGroup/ChatLaw
- https://arxiv.org/pdf/2306.16092.pdf

May the world be free of strife, and may the law books be left to gather dust.

In the wake of ChatGPT, the continued expansion of AI has provided fertile ground for the spread of LLMs. Medicine, education, and finance have gradually gained their own models, but the legal domain has long shown no visible progress.

To promote open research on applying LLMs in law and other vertical domains, this project open-sources a Chinese legal LLM and offers a practical solution for combining an LLM with a knowledge base in legal scenarios.

The currently open-sourced versions of ChatLaw, for academic reference only, are based on Ziya-13B and Anima-33B. We constructed dialogue data from large amounts of raw text, including legal news, legal forums, statutes, judicial interpretations, legal consultations, bar-exam questions, and court judgments.

The Ziya-13B-based model is the first version. Thanks to Ziya's strong Chinese ability and our strict requirements for data cleaning and augmentation, it performs well on legally simple tasks, but it often falls short on legal reasoning that involves complex logic.

We then trained ChatLaw-33B on Anima-33B with additional data and found that logical reasoning improved substantially, which suggests that a large-parameter Chinese LLM is essential.

Our technical report is here: arXiv: ChatLaw

A version trained on a commercially usable base model will be used internally in our products and will not be open-sourced; the open-source version is available for trial.

### Chinese-Llama-2-7b (LinkSoul-AI)
- https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
- https://huggingface.co/spaces/LinkSoul/Chinese-Llama-2-7b

A fully open-source, fully commercially usable Chinese Llama2 model with Chinese-English SFT datasets. The input format strictly follows llama-2-chat, so it is compatible with all optimizations targeting the original llama-2-chat model.

### Chinese-Vicuna-medical
- https://github.com/Facico/Chinese-Vicuna/blob/master/docs/performance-medical.md

Continue-finetuned on cMedQA2 from our checkpoint-11600.

Continue-finetuning from the 2-epoch Vicuna currently yields more professional answers on medical QA data than the 3-epoch version. Owing to how the dataset was constructed, the answers are also more standardized, e.g. frequently adding advice such as "go to a proper hospital for an examination".

- This also validates the effectiveness of instruction tuning
- Continue-finetuning with a single instruction preserves more of the original capabilities

### Cornucopia-LLaMA-Fin-Chinese
- https://github.com/jerry1993-tech/Cornucopia-LLaMA-Fin-Chinese

Cornucopia (聚宝盆): a LLaMA model fine-tuned on Chinese financial knowledge.
This project open-sources a LLaMA-7B model instruction-tuned on Chinese financial knowledge. We built an instruction dataset from public Chinese financial data plus crawled financial data, then instruction-tuned LLaMA on it, improving its question answering in the financial domain.

Based on the same data, we will later build a higher-quality dataset with the GPT-3.5 API, and further expand the high-quality instruction data with the Chinese financial knowledge graph.

New models under development (next-pretrain, multi-task SFT, RLHF Optimize) will be released over time; you are welcome to try them when they arrive.

### chatglm-maths
- https://github.com/yongzhuo/chatglm-maths

Fine-tuning, LoRA, PPO, and inference for chatglm-6b; training samples are auto-generated integer/decimal arithmetic problems (addition, subtraction, multiplication, division); runs on GPU or CPU.

### Abel
- https://github.com/GAIR-NLP/abel

Abel is created as a tribute to Niels Henrik Abel for his groundbreaking work in algebra and analysis, at which our model is relatively better as well. There is still a long way for us to go, though 🏃‍♂️🏃‍♀️🏁🏃‍♂️🏃‍♀️.

We show that:
- without tools
- without continuing pretraining
- without reward model
- without RLHF
- ONLY using SFT

We have established a new state-of-the-art performance across open-source LLMs (that do not use external tools) on the GSM8k (83.62) and MATH (28.26) benchmarks.

### InternLM-Math
- https://github.com/InternLM/InternLM-Math

State-of-the-art bilingual open-sourced Math reasoning LLMs. A solver, prover, verifier, augmentor.

### DeepSeekMath
- https://arxiv.org/abs/2402.03300
- https://github.com/deepseek-ai/DeepSeek-Math

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
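The core of GRPO described above can be sketched in a few lines: instead of a learned value network, the advantage of each sampled response is its reward normalized within the group of responses drawn for the same question. This is a minimal illustration under stated assumptions (the function name and the 0/1 checker rewards are made up), not DeepSeek's implementation.

```python
# Sketch of GRPO's group-relative advantage (illustrative, not DeepSeek's code).
# For one question, a group of responses is sampled and scored; the advantage
# of each response is its reward normalized by the group's mean and std.
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers to one math question, scored 0/1 by a checker.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# correct answers get positive advantage, incorrect ones negative
```

These advantages then weight the usual policy-gradient update, which is what lets GRPO drop PPO's value network and save memory.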

### LeerooDedicated-Math-7b
- https://huggingface.co/leeroo/LeerooDedicated-Math-7b
- https://arxiv.org/abs/2401.13979

In this paper, we propose an architecture to harness the collective knowledge of multiple trained LLMs to create a new state-of-the-art. At the core of this framework is a LLM-based orchestrator that is adept at picking the right underlying LLM experts for optimal task execution. Inspired by self-play in reinforcement learning, we created a loop of query generation, orchestration, and evaluation to generate training data for the orchestrator. Our evaluation focused on the MMLU benchmark, employing models with 7B, 13B, and 34B parameters available on Hugging Face. The results demonstrate new state-of-the-art open-source models: Our Leeroo orchestrator achieves performance on par with the Mixtral model while incurring only two-thirds of its cost. Moreover, increasing the allowed cost surpasses Mixtral's accuracy by over 5% at the same cost level, reaching an accuracy of 75.9%. Further enhancements were observed when integrating GPT4 into the underlying model pool. The Leeroo orchestrator nearly matches GPT4's performance at half the cost and even exceeds GPT4's results with a 25% cost reduction. These findings illustrate the potential of our architecture in creating state-of-the-art and cost-effective LLMs by optimizing the synergy between multiple LLMs to achieve superior performance outcomes.

### SimpleGeometry
- https://huggingface.co/datasets/bethgelab/simplegeometry
- https://arxiv.org/abs/2404.06405

Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 problems by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.

### Rho-1
- https://arxiv.org/abs/2404.07965

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis delves into the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continually pretraining on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves a 6.8% average enhancement across 15 diverse tasks, increasing both the efficiency and performance of language model pre-training.
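The token-selection step of SLM described above can be sketched as follows. This is a hypothetical illustration: the keep ratio, function name, and list-based losses are assumptions, not the paper's actual code.

```python
# Sketch of Selective Language Modeling's token selection (illustrative).
# Keep only tokens whose "excess loss" (training-model loss minus
# reference-model loss) ranks in the top fraction; the loss is then
# computed only on the kept tokens.

def select_tokens(train_losses, ref_losses, keep_ratio=0.5):
    """Return sorted indices of tokens kept for the focused loss."""
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    # rank tokens by excess loss, highest first
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    return sorted(ranked[:k])

# Tokens 0 and 2 have the largest gap between training and reference loss.
kept = select_tokens([2.0, 0.5, 3.0, 1.0], [1.0, 0.6, 1.0, 0.9])
```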

### ChatGLM-Math
- https://github.com/THUDM/ChatGLM-Math
- https://arxiv.org/pdf/2404.02893.pdf

Large language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets to enhance LLMs' mathematics are developed, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger.

### JiuZhang3.0
- https://arxiv.org/abs/2405.14365
- https://github.com/RUCAIBox/JiuZhang3.0
- https://huggingface.co/ToheartZhang/JiuZhang3.0-8B
- https://huggingface.co/datasets/ToheartZhang/JiuZhang3.0-Corpus-PT-CoT

JiuZhang3.0 is a series of fine-tuned models for math reasoning continually pre-trained on corpus synthesized by our carefully trained small LLM.

### InternLM2-WQX
- https://github.com/InternLM/InternLM-WQX

InternLM2-WQX and InternLM2-WQX-VL are the Wenquxing series models released by the InternLM team on the eve of the 2024 Gaokao (Chinese college entrance exam).

The Gaokao covers many subjects and question types and, because the papers are kept strictly confidential until the exam begins, is regarded as one of China's most authoritative exams and a touchstone for assessing candidates' overall ability. This difficult, comprehensive test designed for humans is now widely used by researchers to gauge the intelligence of large models. The InternLM2-WQX models achieved excellent results on GAOKAO-Eval, a benchmark built from the 2024 exam, performing on par with GPT-4o overall and surpassing a range of open-source models at home and abroad, demonstrating the strong capabilities of the series.

### Math-Minos
- https://arxiv.org/abs/2406.14024
- https://github.com/KbsdJames/MATH-Minos

A mathematical verifier achieves success in mathematical reasoning tasks by validating the correctness of solutions. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess solutions. To mitigate this insufficiency of binary labels, we introduce step-wise natural language feedback as rationale labels (i.e., the correctness of the current step and the explanations). In this paper, we propose Math-Minos, a natural-language-feedback-enhanced verifier built by constructing automatically generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set (30k) of natural language feedback examples can significantly boost the verifier's accuracy, by 1.6% (86.6% → 88.2%) on GSM8K and 0.8% (37.8% → 38.6%) on MATH.

### NuminaMath 7B TIR
- https://huggingface.co/AI-MO/NuminaMath-7B-TIR

NuminaMath is a series of language models that are trained to solve math problems using tool-integrated reasoning (TIR). NuminaMath 7B TIR won the first progress prize of the AI Math Olympiad (AIMO), with a score of 29/50 on the public and private test sets.

### MathΣtral
- https://mistral.ai/news/mathstral/

Mathstral can achieve significantly better results with more inference-time computation: Mathstral 7B scores 68.37% on MATH with majority voting and 74.59% with a strong reward model among 64 candidates.
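Majority voting, as used in the MATH score above, simply returns the most frequent final answer among sampled candidates. A minimal sketch (the sampled answers below are made up):

```python
# Sketch of majority voting over sampled final answers (illustrative).
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among sampled candidates."""
    return Counter(answers).most_common(1)[0][0]

best = majority_vote(["68", "74", "68", "68", "74"])
```

A reward model improves on this by scoring each candidate and picking the best-scored one instead of the most frequent one.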

Mathstral is an instructed model – use it or fine-tune it as such, referring to our documentation. Weights are hosted on HuggingFace. You can try Mathstral now with mistral-inference and adapt it with mistral-finetune.

### LLaMAX
- https://arxiv.org/pdf/2407.05975
- https://github.com/CONE-MT/LLaMAX/

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours in conducting extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs (by more than 10 spBLEU points) and performs on-par with the specialized translation model M2M-100-12B on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model.

### Qwen2-Math
- https://huggingface.co/collections/Qwen/qwen2-math-66b4c9e072eda65b5ec7534d
- https://github.com/QwenLM/Qwen2-Math

Over the past year, we have dedicated significant effort to researching and enhancing the reasoning capabilities of large language models, with a particular focus on their ability to solve arithmetic and mathematical problems. Today, we are delighted to introduce a series of math-specific large language models of our Qwen2 series, Qwen2-Math and Qwen2-Math-Instruct-1.5B/7B/72B. Qwen2-Math is a series of specialized math language models built upon the Qwen2 LLMs, which significantly outperform open-source models and even closed-source models (e.g., GPT-4o) in mathematical capability. We hope that Qwen2-Math can contribute to the scientific community by solving advanced mathematical problems that require complex, multi-step logical reasoning.

### AIMO-CMU_MATH
- https://github.com/AIMO-CMU-MATH/CMU_MATH-AIMO

We are the proud winners of 2nd place in the AI Mathematical Olympiad (AIMO).

We are pleased to share all the datasets and code used in our competition. This repository contains the resources needed to reproduce our models and solutions.

### Qwen2.5-Math
- https://qwenlm.github.io/zh/blog/qwen2.5-math/

Qwen2.5-Math is designed mainly for solving Chinese and English math problems via CoT or TIR; we do not recommend using this model series for other tasks.

### SocraticLM
- https://openreview.net/pdf?id=qkoZgJhxsA
- https://github.com/Ljyustc/SocraticLM

### Open Thoughts
- https://github.com/open-thoughts/open-thoughts

Our first goal is to curate a reasoning dataset to train state-of-the-art small reasoning models that surpass DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-7B on math and code reasoning benchmarks.

### simpleRL-reason
- https://github.com/hkust-nlp/simpleRL-reason

This repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only a rule-based reward is used; the recipe is almost the same as the one used in DeepSeek-R1, except that the code currently uses PPO rather than GRPO. We have used this code to train small models (7B) on limited data (8K examples), achieving surprisingly strong results. For example, starting from Qwen2.5-Math-7B (base model), we perform RL on it directly, with no SFT and no reward model, just 8K MATH examples for verification. The resulting model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-Math-7B-Instruct and being comparable to previous baselines that use >50x more data and more complicated components. You may check our Notion blog or the Introduction below for more details.
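A rule-based reward of the kind described above can be sketched as follows; the regex, the `\boxed{}` convention, and the 0/1 reward values are illustrative assumptions, not the repo's actual code.

```python
# Sketch of a rule-based reward for math RL (illustrative assumptions):
# extract the model's final boxed answer and compare it to the reference.
import re

def rule_based_reward(output, reference):
    """1.0 if the last \\boxed{...} answer matches the reference, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", output)
    if not matches:
        return 0.0  # no parseable final answer
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0

r = rule_based_reward(r"so the final answer is \boxed{42}", "42")
```

Because the reward is a deterministic check rather than a learned model, there is no reward model to train and little room for reward hacking.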

### DRT-o1
- https://github.com/krystalan/DRT-o1

This repository contains the resources for our paper "DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought"

### ChatRWKV
- https://github.com/BlinkDL/ChatRWKV

ChatRWKV is like ChatGPT but powered by my RWKV (100% RNN) language model, which is the only RNN (as of now) that can match transformers in quality and scaling, while being faster and saving VRAM. Training sponsored by Stability EleutherAI :)

### ChatYuan
- https://github.com/clue-ai/ChatYuan
- https://modelscope.cn/models/ClueAI/ChatYuan-large

ChatYuan is Yuanyu's functional dialogue LLM. It can answer questions, hold context-aware conversations, and perform a variety of generation tasks, including creative writing; it can also answer questions in domains such as law and COVID-19. It was obtained by further training PromptCLUE-large on hundreds of millions of functional multi-turn dialogue examples.

PromptCLUE-large was pre-trained on a 100-billion-token Chinese corpus, learning 1.5 trillion Chinese tokens cumulatively, with prompt-based training on hundreds of tasks. For understanding tasks such as classification, sentiment analysis, and extraction, the label scheme can be customized; for many generation tasks, it can generate freely via sampling.

### ChatGLM-6B
- https://github.com/THUDM/ChatGLM-6B
- https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning

ChatGLM-6B is an open-source bilingual (Chinese-English) dialogue language model based on the General Language Model (GLM) architecture, with 6.2 billion parameters. With model quantization, it can be deployed locally on consumer GPUs (requiring as little as 6GB of VRAM at the INT4 quantization level). ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese QA and dialogue. Trained on roughly 1T tokens of Chinese and English, supplemented by supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback, the 6.2B-parameter ChatGLM-6B can already generate answers well aligned with human preferences. See our blog for more information.

### ChatGLM2-6B
- https://github.com/THUDM/ChatGLM2-6B

ChatGLM2-6B is the second-generation version of the open-source bilingual dialogue model ChatGLM-6B. It retains the smooth conversation and low deployment barrier of the first generation, and introduces the following new features:

- Stronger performance: building on development experience with the first-generation model, the base model of ChatGLM2-6B was fully upgraded. It uses GLM's hybrid objective function and was pre-trained on 1.4T Chinese-English tokens with human-preference alignment training. Evaluations show large performance gains over the first generation on datasets such as MMLU (+23%), C-Eval (+33%), GSM8K (+571%), and BBH (+60%), making it highly competitive among open-source models of the same size.
- Longer context: using FlashAttention, the base model's context length was extended from ChatGLM-6B's 2K to 32K, with 8K-context training in the dialogue stage to allow more conversation turns. The current version's understanding of single-turn very long documents remains limited; this will be a focus of optimization in future iterations.
- More efficient inference: with Multi-Query Attention, ChatGLM2-6B has faster inference and lower memory use: in the official model implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6GB of VRAM increases from 1K to 8K.
- A more open license: ChatGLM2-6B weights are fully open for academic research, and commercial use is also permitted after obtaining official written permission. If you find our open-source model useful for your business, we welcome donations toward the development of the next-generation model, ChatGLM3.

### Chinese-Transformer-XL
- https://github.com/THUDM/Chinese-Transformer-XL

This project provides the pre-training and text-generation code for Chinese-Transformer-XL, the pre-trained model behind BAAI's "Wenhui".

### ChatMed-TCM & ChatMed-Consult
- https://github.com/michael-wzhu/ChatMed

🚀 ChatMed-Consult: trained on ChatMed_Consult_Dataset, 500k+ online medical consultations from Chinese online clinics paired with ChatGPT responses. The backbone is LLaMA-7B with the LoRA weights and extended Chinese vocabulary of Chinese-LLaMA-Alpaca merged in, followed by parameter-efficient LoRA fine-tuning. All code is public, and an online Gradio demo will be deployed; stay tuned.

⏳ ChatMed-TCM: bringing LLMs to the inheritance of traditional Chinese medicine (TCM). Its training data is the TCM instruction dataset ChatMed_TCM_Dataset: starting from our open-source TCM knowledge graph, we used an entity-centric self-instruct method to call ChatGPT and obtain 26k+ TCM-centered instruction examples. ChatMed-TCM is likewise based on LLaMA and fine-tuned with LoRA.

### ChatGLM-Med
- https://github.com/SCIR-HI/Med-ChatGLM

ChatGLM fine-tuned on Chinese medical knowledge. This project open-sources a ChatGLM-6B model instruction-tuned on Chinese medical instructions. We built a Chinese medical instruction dataset from a medical knowledge graph and the GPT-3.5 API, then instruction-tuned ChatGLM-6B on it, improving ChatGLM's question answering in the medical domain.

### CPM-Bee
- https://mp.weixin.qq.com/s/UCW1BT60Lr9x24Rj0cLuxw
- https://huggingface.co/openbmb/cpm-bee-10b
- https://github.com/OpenBMB/CPM-Bee

CPM-Bee is a fully open-source, commercially usable Chinese-English base model with ten billion parameters. It adopts an auto-regressive Transformer architecture and is pre-trained on a high-quality corpus of trillions of tokens, giving it strong foundational capabilities. CPM-Bee's characteristics can be summarized as follows:

Open-source and commercially usable: in the spirit of OpenBMB's motto of "bringing large models to every household", the CPM-Bee base model is fully open-source and commercially usable, to advance the field; enterprises need only apply by real-name email to obtain an official authorization certificate for commercial use.

Excellent bilingual performance: the pre-training corpus was strictly filtered and balanced, and the model performs impressively in both Chinese and English; see the evaluation tasks and results for details.

Very large, high-quality corpus: CPM-Bee was trained on a trillion-token-scale corpus, making it one of the most data-rich models in the open-source community, with strict filtering, cleaning, and post-processing applied to ensure quality.

OpenBMB ecosystem support: the OpenBMB large-model ecosystem provides a suite of tools for high-performance pre-training, adaptation, compression, and deployment; CPM-Bee ships with all the corresponding tool scripts to efficiently support advanced use by developers.

Strong dialogue and tool-use abilities: building on OpenBMB's explorations in instruction tuning and tool learning, we fine-tuned the CPM-Bee base model into instance models with strong dialogue and tool-use abilities, now in invitation-only beta and gradually opening to the public.

### DISC-MedLLM(复旦)
- https://med.fudan-disc.com
- https://github.com/FudanDISC/DISC-MedLLM
- https://arxiv.org/abs/2308.14346

DISC-MedLLM is a medical LLM trained on our high-quality DISC-Med-SFT dataset on top of the general-domain Chinese LLM Baichuan-13B. Notably, our training data and training method can be adapted to any base LLM.

DISC-MedLLM has three key features:

- Reliable, rich professional knowledge: we use a medical knowledge graph as the information source, sampling triples and using a general LLM's language ability to construct dialogue samples.
- Multi-turn inquiry ability: we use real consultation records as the information source and reconstruct the dialogues with an LLM, requiring the model to stay fully aligned with the medical information in the original conversations.
- Responses aligned with human preference: patients want richer supporting information and background knowledge during consultations, while human doctors' answers are often terse; we manually curated a small set of high-quality instruction samples aligned with patients' needs.

### Data-Copilot
- https://github.com/zwq2018/Data-Copilot
- https://arxiv.org/abs/2306.07209
- https://huggingface.co/spaces/zwq2018/Data-Copilot

Data-Copilot is an LLM-based system for data-related tasks, connecting billions of data records with diverse user needs. It designs interface tools on its own to efficiently manage, invoke, process, and visualize data. On receiving a complex request, Data-Copilot autonomously calls these self-designed interfaces to build a workflow that fulfills the user's intent. Without human assistance, it can skillfully turn raw data from different sources and formats into human-friendly outputs such as charts, tables, and text.

### Tabular LLM
- https://github.com/SpursGoZmy/Tabular-LLM

We propose the Tabular-LLM project, with the following core plan:

- Explore representations for different table types: training an LLM requires serializing a table into a text sequence. LLMs such as ChatGPT use Markdown to represent simple tables, but Markdown cannot properly represent more complex tables, such as hierarchical tables with merged cells, so we need to explore how to (uniformly) represent different table types; see the next section for more discussion.
- Collect and organize data covering multiple table types and table-intelligence tasks: gather open-source datasets for the table-intelligence tasks most studied in academia and convert them into instruction-tuning format, so users can select what they need.
- Open-source and analyze table-intelligence LLMs: use the collected data to fine-tune models such as Alpaca-CoT, building the first batch of open-source LLMs for table-intelligence tasks, then test and analyze the trained models, e.g. on academic test sets, and write up the experimental results as documentation, hoping to offer useful experience to the community.
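The Markdown serialization discussed in the first point can be sketched for flat tables (hierarchical tables with merged cells need a richer format); this is an illustrative helper, not code from the project:

```python
# Sketch: serialize a flat table into Markdown for LLM input (illustrative).

def table_to_markdown(headers, rows):
    """Render headers and rows as a GitHub-style Markdown table string."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)

md = table_to_markdown(["name", "score"], [["A", 1], ["B", 2]])
```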

### Chain-of-table
- https://blog.research.google/2024/03/chain-of-table-evolving-tables-in.html
- https://arxiv.org/abs/2401.04398

Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
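The iterative plan-and-apply loop described above can be sketched as follows, with a mock planner standing in for the LLM; the operation names and table encoding are illustrative assumptions, not the paper's exact operation set.

```python
# Sketch of the Chain-of-Table loop (illustrative): a planner (the LLM in the
# paper, mocked here) picks one table operation at a time; each operation
# rewrites the table, and the evolving table is fed back as context.

def apply_op(table, op):
    name, arg = op
    if name == "select_row":     # keep rows satisfying a predicate
        return [row for row in table if arg(row)]
    if name == "select_column":  # project the table onto one column
        return [{arg: row[arg]} for row in table]
    return table

def chain_of_table(table, planner):
    """Apply planner-chosen operations until the planner returns None."""
    while True:
        op = planner(table)
        if op is None:
            return table
        table = apply_op(table, op)

# Mock planner: filter to 2023 rows, then project the "sales" column, then stop.
ops = iter([("select_row", lambda r: r["year"] == 2023),
            ("select_column", "sales"),
            None])
result = chain_of_table(
    [{"year": 2022, "sales": 10}, {"year": 2023, "sales": 20}],
    lambda table: next(ops))
```

The chain of intermediate tables plays the role of the textual reasoning chain in Chain-of-Thought, but carries structured state.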

### Data Interpreter
- https://arxiv.org/abs/2402.18679
- https://github.com/geekan/MetaGPT

Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness. However, their performance can be compromised in data science scenarios that require real-time data adjustment, expertise in optimization due to complex dependencies among various tasks, and the ability to identify logical errors for precise reasoning. In this study, we introduce the Data Interpreter, a solution designed to solve with code that emphasizes three pivotal techniques to augment problem-solving in data science: 1) dynamic planning with hierarchical graph structures for real-time data adaptability; 2) tool integration dynamically to enhance code proficiency during execution, enriching the requisite expertise; 3) logical inconsistency identification in feedback, and efficiency enhancement through experience recording. We evaluate the Data Interpreter on various data science and real-world tasks. Compared to open-source baselines, it demonstrated superior performance, exhibiting significant improvements in machine learning tasks, increasing from 0.86 to 0.95. Additionally, it showed a 26% increase in the MATH dataset and a remarkable 112% improvement in open-ended tasks.

### TableLLM
- https://arxiv.org/abs/2403.19318
- https://github.com/TableLLM/TableLLM

We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for proficiently handling tabular data manipulation tasks, whether they are embedded within documents or spreadsheets, catering to real-world office scenarios. We propose a distant supervision method for training, which comprises a reasoning process extension strategy, aiding in training LLMs to understand reasoning patterns more effectively as well as a cross-way validation strategy, ensuring the quality of the automatically generated data. To evaluate the performance of TableLLM, we have crafted a benchmark tailored to address both document and spreadsheet formats as well as constructed a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM when compared to various existing general-purpose and tabular data-focused LLMs. We have publicly released the model checkpoint, source code, benchmarks, and a web application for user interaction.

### Lag-Llama
- https://github.com/time-series-foundation-models/lag-llama
- https://arxiv.org/abs/2310.08278

Lag-Llama is the first open-source foundation model for time series forecasting!

### TabuLa-8B
- https://github.com/mlfoundations/rtfm
- https://arxiv.org/abs/2406.12031
- https://huggingface.co/mlfoundations/tabula-8b


Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more data. 
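Tabular prediction with an LLM hinges on serializing rows into text. A minimal sketch of that step (the exact prompt format and the packing/attention scheme TabuLa-8B uses are described in the paper; the helper and wording below are assumptions for illustration only):

```python
def serialize_row(row, target):
    """Turn one table row into a text prompt, holding out the target column
    so the LLM can predict it (zero-/few-shot tabular prediction)."""
    feats = " ".join(f"The {k} is {v}." for k, v in row.items() if k != target)
    return f"{feats} What is the {target}?"

row = {"age": 42, "occupation": "teacher", "income": ">50K"}
prompt = serialize_row(row, target="income")
print(prompt)
```

Few-shot prompting would simply prepend several fully serialized rows (with their targets) before the query row.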

### Time-MoE
- https://arxiv.org/pdf/2409.16040
- https://github.com/Time-MoE/Time-MoE

1️⃣ Time-MoE is the first work to scale time series foundation models up to 2.4 billion parameters, trained from scratch.

2️⃣ Time-300B is the largest open-access time series data collection comprising over 300 billion time points across more than 9 domains.

### DoctorGLM
- https://github.com/xionghonglin/DoctorGLM

DoctorGLM is a Chinese medical consultation model based on ChatGLM-6B.

### EduChat
- https://github.com/icalk-nlp/EduChat

Education is a social practice that shapes a person's physical and mental development, aiming to draw out one's innate or latent qualities. It must therefore follow a people-centered philosophy, focusing on personalized, guided, and holistic development. To better support people-centered education, the EduNLP team at the School of Computer Science and Technology, East China Normal University, developed EduChat, a dialogue LLM for the education vertical. The project studies education-oriented dialogue LLM techniques built on pre-trained foundation models, fuses diverse education-domain data, and applies methods such as instruction fine-tuning and value alignment. It provides rich functions for education scenarios, including automatic question generation, homework grading, emotional support, course tutoring, and Gaokao counseling, serving teachers, students, and parents and helping realize personalized, fair, and caring intelligent education.

### EVA: Large-Scale Chinese Open-Domain Dialogue System
- https://github.com/thu-coai/EVA

EVA is currently the largest open-source Chinese pre-trained dialogue model, with 2.8 billion parameters, specializing in open-domain chitchat. Two versions are available: 1.0, trained on WudaoCorpus-Dialog, and 2.0, trained on higher-quality dialogue data cleaned from WudaoCorpus-Dialog, with clearly better performance than EVA 1.0.

### EcomGPT
- https://arxiv.org/abs/2308.06966
- https://github.com/Alibaba-NLP/EcomGPT

- We proposed the first e-commerce instruction dataset, EcomInstruct, with a total of 2.5 million instruction examples.
- EcomInstruct scales up data size and task diversity by constructing atomic tasks from basic e-commerce data types, such as product information and user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks.
- We developed EcomGPT by training the backbone model BLOOMZ on EcomInstruct. Benefiting from the fundamental semantic-understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization.

### FinGLM
- https://github.com/MetaGLM/FinGLM/

📈 A dialogue-based intelligent system for in-depth analysis of listed companies' annual reports. Facing the jargon and implicit information of financial text, we aim to deliver expert-level financial analysis with AI.

🚀 While AI has made progress in text dialogue, real financial interaction scenarios remain a major challenge. Multiple institutions jointly organized this competition to explore the frontier of AI in finance.

📘 A listed company's annual report presents investors with its operations, financial position, and future plans. Professional knowledge is the key to interpretation, and our goal is to make that process simpler and more accurate through AI.

### DISC-FinLLM
- https://fin.fudan-disc.com
- https://github.com/FudanDISC/DISC-FinLLM

DISC-FinLLM is a finance-domain LLM that provides users with professional, intelligent, and comprehensive financial consulting services in financial scenarios. It was developed and open-sourced by the Data Intelligence and Social Computing Lab of Fudan University (Fudan-DISC).

### Deepmoney
- https://sota.jiqizhixin.com/project/deepmoney
- https://huggingface.co/TriadParty

DeepMoney is a large language model project focused on investment in the financial domain. The models are built on Yi-34B, DeepSeek 67B, and miqu-70b; the author has so far fine-tuned three versions: base and sft (based on Yi-34B), deepmoney-67b-chat (based on DeepSeek), and deepmoney-miqu-70b (based on miqu-70b). The base model was trained with full-parameter training. The training data consists of high-quality research reports covering 2019 through December 2023, mostly from traditional brokerages and professional research institutions; most are paid and available only to institutions. Unlike most financial models trained on public knowledge, DeepMoney can provide in-depth market interpretation, compensating for the gaps public knowledge leaves in real-world finance. The project also integrates multimodal models to extract key information.

### GPT2 for Multiple Language
- https://github.com/imcaspar/gpt2-ml

- Simplified and cleaned-up GPT2 training code (based on Grover, supporting TPUs)
- Ported the BERT tokenizer, with multilingual support
- 1.5B-parameter GPT2 Chinese pre-trained model (15 GB corpus, 100k training steps)
- Ready-to-use model generation demo
- 1.5B-parameter GPT2 Chinese pre-trained model (30 GB corpus, 220k training steps)

### InternLM 书生・浦语
- https://github.com/InternLM
- https://mp.weixin.qq.com/s/oTXnvWZJVdoOpFLHngbTYQ
- https://intern-ai.org.cn/home

InternLM has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
- It provides a versatile toolset for users to flexibly build their own workflows.

Additionally, a lightweight training framework is offered to support model pre-training without the need for extensive dependencies. With a single codebase, it supports pre-training on large-scale clusters with thousands of GPUs, and fine-tuning on a single GPU while achieving remarkable performance optimizations. InternLM achieves nearly 90% acceleration efficiency during training on 1024 GPUs.

### Llama2-chat-Chinese-50W
- https://mp.weixin.qq.com/s/r_hKK5_cYm8ClqYVApkUYQ
- https://huggingface.co/RicardoLee/Llama2-chat-Chinese-50W

Because current Llama2-chat models are hard to constrain to reply in Chinese, this model aims to provide a Llama2-chat 7B model that can answer questions in Chinese.

The model uses Llama2-chat 7B as the base and is trained with LoRA including the embedding and LM head. The LoRA weights have already been merged into the model, so it can be used directly; alternatively, sft_lora_model can be merged with Llama2-chat manually.

For training data, 500k SFT examples sampled from the BELLE project were used for SFT.

### Llama2-Chinese (FlagAlpha)
- https://github.com/FlagAlpha/Llama2-Chinese
- https://llama.family/

We are an advanced technical community focused on optimizing Llama2 for Chinese and building on top of it. *Starting from pre-training on large-scale Chinese data, we continuously iterate on and upgrade Llama2's Chinese capabilities.* We warmly welcome developers and researchers passionate about LLMs to join us.

### LaWGPT
- https://github.com/pengxiao-song/LaWGPT

LaWGPT is a series of open-source large language models grounded in Chinese legal knowledge.

Built on general-purpose Chinese base models (such as Chinese-LLaMA and ChatGLM), the series expands the legal-domain vocabulary and performs large-scale pre-training on Chinese legal corpora, strengthening the models' foundational semantic understanding of the legal domain. On that basis, instruction fine-tuning on a legal dialogue QA dataset and a Chinese judicial examination dataset improves the models' ability to understand and carry out legal tasks.

### 夫子•明察 (Fuzi Mingcha) Judicial LLM
- https://github.com/irlab-sdu/fuzi.mingcha

Fuzi Mingcha is a Chinese judicial LLM jointly developed by Shandong University, Inspur Cloud, and China University of Political Science and Law. Built on the ChatGLM base model, it is trained on massive unsupervised Chinese judicial corpora (including court judgments and statutes) and supervised judicial fine-tuning data (including legal QA and similar-case retrieval). The model supports statute retrieval, case analysis, syllogistic judgment reasoning, and judicial dialogue, aiming to provide comprehensive, high-precision legal consultation and answers.

### DISC-LawLLM
- https://law.fudan-disc.com
- https://github.com/FudanDISC/DISC-LawLLM
- https://arxiv.org/abs/2309.11325

The Data Intelligence and Social Computing Lab of Fudan University (FudanDISC) released DISC-LawLLM, a Chinese intelligent legal system driven by a large language model. The system serves different user groups with a variety of legal services. The team also built the DISC-Law-Eval benchmark, which evaluates legal LLMs from both objective and subjective angles; the model shows a clear advantage over existing legal LLMs in this evaluation.

The team also released DISC-Law-SFT, a high-quality supervised fine-tuning (SFT) dataset of 300k examples, along with the model weights and a technical report.

### LawBench
- https://github.com/open-compass/LawBench
- https://arxiv.org/abs/2309.16289

LawBench is carefully designed for precise assessment of LLMs' legal capabilities. In designing the test tasks, we simulated three dimensions of judicial cognition and selected 20 tasks to evaluate model abilities. In contrast to some existing benchmarks composed solely of multiple-choice questions, we include more task types closely tied to real-world applications, such as legal entity recognition, reading comprehension, crime-amount calculation, and consultation. We recognize that current LLMs' safety policies may refuse to respond to certain legal queries, or the models may struggle to understand instructions, leading to missing answers. We therefore developed a separate evaluation metric, the abstention rate, to measure how often a model declines to answer or fails to understand an instruction correctly. We report the performance of 51 LLMs on LawBench, including 20 multilingual models, 22 Chinese models, and 9 legal-specialized LLMs.

### HK-O1aw
- https://github.com/HKAIR-Lab/HK-O1aw

HK-O1aw is a legal assistant designed to handle complex legal reasoning, specifically for the Hong Kong legal system. It is built using the Align-Anything framework and trained on the O1aw-Dataset, based on the LLaMA-3.1-8B model. The primary goal of HK-O1aw is to improve the reasoning and problem-solving abilities of large language models in the legal domain. Importantly, all training data, code, and prompts used for synthetic data generation have been open-sourced, facilitating research and collaboration within the community.

This model addresses the need for intelligent legal assistance in Hong Kong, where legal issues require in-depth analysis and precise reasoning. HK-O1aw integrates advanced O1-style reasoning capabilities, allowing it to perform complex legal analysis, understand context, identify precedents, and interpret statutes. As the first complex reasoning model tailored for Hong Kong's common law system, it is particularly valuable for improving legal services and education.

### Lawyer LLaMA
- https://github.com/AndrewZhe/lawyer-llama

Lawyer LLaMA first underwent continual pretraining on a large-scale legal corpus so that it systematically learns China's legal knowledge system. On that basis, we used ChatGPT to collect analyses of objective questions from China's National Unified Legal Professional Qualification Examination (the bar exam) and answers to legal consultations, then instruction-tuned the model on the collected data so it learns to apply legal knowledge to concrete scenarios.

Our model can:
- Master Chinese legal knowledge: it correctly understands legal concepts in common areas such as civil law, criminal law, administrative law, and procedural law. For example, having mastered the theory of crime constitution in criminal law, it can identify constitutive elements such as the criminal subject, the object, the criminal act, and the mental state from the factual description of a criminal case. Using the legal concepts and theories it has learned, the model answers most bar-exam questions well.
- Apply law in Chinese legal practice: it explains legal concepts in plain language and offers basic legal consultation covering marriage, lending, maritime, criminal, and other areas of law.

To contribute to open research on Chinese legal LLMs, this project will open-source a series of legal-domain instruction fine-tuning datasets and the parameters of Chinese legal LLMs trained from LLaMA.

### LexiLaw
- https://github.com/CSHaitao/LexiLaw

LexiLaw is a fine-tuned Chinese legal LLM based on the ChatGLM-6B architecture. Fine-tuning on legal-domain datasets gives it higher performance and greater professionalism in providing legal consultation and support.

The model aims to provide accurate and reliable legal consulting for legal practitioners, students, and ordinary users. Whether you need advice on a specific legal question or queries about statutes, case analysis, or regulatory interpretation, LexiLaw can offer useful suggestions and guidance.

We will also share our experience and best practices for fine-tuning on top of large models, to help the community build more high-quality Chinese legal LLMs and advance intelligent legal services in Chinese.

### LawGPT_zh: Chinese Legal LLM (獬豸)
- https://mp.weixin.qq.com/s/Pk4NdFQq5G6iZ3QmcyyFUg
- https://github.com/LiuHC0428/LAW-GPT

Our vision is for everyone to get professional, reliable answers the moment they encounter a legal problem. Only when professional legal services are truly within reach will people grow used to using them, just like search engines twenty years ago and express delivery ten years ago. We hope to bring law into daily life and contribute to building a society under the rule of law. The project poster was generated with Midjourney.

The open-source Chinese legal general model in this project was obtained by 16-bit LoRA instruction fine-tuning of ChatGLM-6B. The dataset combines existing legal QA datasets with high-quality legal QA constructed via self-instruct guided by statutes and real cases, improving general LLMs' performance in the legal domain and the reliability and professionalism of model answers.

### Linly (伶荔说): Chinese LLaMA 1-2, OpenLLaMA & Falcon LLMs
- https://github.com/CVI-SZU/Linly
- https://mp.weixin.qq.com/s/zSxsArP1pxYNubNDZua7iA
- https://mp.weixin.qq.com/s/AuAG3tw4JI8lHyLkSdM18g

This project provides the community with the Chinese dialogue model Linly-ChatFlow, the Chinese base models Chinese-LLaMA (1-2) and Chinese-Falcon, and their training data. The models are fully fine-tuned (full-tuning) with the TencentPretrain framework. The Chinese base models take LLaMA and Falcon as the foundation and use incremental pre-training on Chinese and Chinese-English parallel corpora to transfer their English language ability to Chinese. Furthermore, the project aggregates currently public multilingual instruction data and performs large-scale instruction-following training on the Chinese models, yielding the Linly-ChatFlow dialogue model.

In addition, the project releases Linly-OpenLLaMA models trained from scratch in 3B, 7B, and 13B sizes, pre-trained on 1 TB of Chinese and English corpora with a tokenizer optimized for Chinese characters and words; the models are released under the Apache 2.0 license.

### MediaGPT
- https://github.com/IMOSR/MediaGPT

Although instruction-tuned LLaMA models have shown impressive performance in general domains, their capabilities in self-media content creation, livestreaming, and operations remain limited by the lack of specialized training data. To address this, we propose MediaGPT, a model specially trained for the self-media domain.

MediaGPT (formerly Media LLaMA) first performs continual pre-training on a large-scale self-media corpus to systematically learn the domain's knowledge system. We then used ChatGPT to collect analyses and answers to domain-knowledge questions on topics such as Douyin operations, short-video creation, Juliang Qianchuan ad delivery, livestream operations, and livestream scripting, and instruction-tuned the model on this data so it learns to apply self-media knowledge to real scenarios.

Our model can:
1. Master self-media knowledge: it understands the core concepts and strategies of Douyin operations, short-video creation, Juliang Qianchuan ad delivery, livestream operations, and related areas.

2. Support practical operations: it explains self-media concepts in plain language and offers basic consulting on content creation, platform operations, ad delivery, and more.

To promote open research on Chinese self-media LLMs, we will open-source a series of self-media instruction fine-tuning datasets and the parameters of Chinese self-media LLMs trained from LLaMA.

### CharacterGLM-6B
- https://github.com/thu-coai/CharacterGLM-6B
- https://arxiv.org/pdf/2311.16832.pdf

In this paper, we present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs. On top of CharacterGLM, we can customize various AI characters or social agents by configuring their attributes (identities, interests, viewpoints, experiences, achievements, social relationships, etc.) and behaviors (linguistic features, emotional expressions, interaction patterns, etc.). Our model outperforms most mainstream closed-source large language models, including the GPT series, especially in terms of consistency, human-likeness, and engagement according to manual evaluations. We will release our 6B version of CharacterGLM and a subset of training data to facilitate further research development in the direction of character-based dialogue generation.

### Haruhi-Zero
- https://github.com/LC1332/Zero-Haruhi
- https://huggingface.co/silk-road/Haruhi-Zero-7B-0_3

Haruhi-Zero (凉宫春日-Zero) is a role-playing model that supports both zero-shot character construction and RAG-based character construction (the original ChatHaruhi).

### Translational-Style-ChatLLM (Western Translationese)
- https://github.com/Benson114/Translational-Style-ChatLLM

This is a small personal fine-tuning experiment. The instruction fine-tuning dataset was constructed entirely through prompt engineering and calls to the OpenAI API; the base model is Qwen1.5-7B-Chat, and fine-tuning uses the open-source LLaMA-Factory framework.

The project includes the full dataset-construction pipeline code and the fine-tuning scripts.

### StyleLLM
- https://github.com/stylellm/stylellm_models

stylellm is a text style transfer project based on large language models (LLMs). It uses LLMs to learn the writing style of specified literary works (habitual vocabulary, sentence structure, rhetorical devices, character dialogue, etc.), producing a series of style-specific models.

With a stylellm model, the learned style can be transplanted onto other general text: given an input passage, the model rewrites it and outputs text bearing that style, achieving polishing, embellishment, or style imitation.

### Tianji (来事儿AI)
- https://github.com/SocialAI-tianji/Tianji

Tianji is a free, non-commercial AI system built by SocialAI. You can use it for tasks involving traditional social etiquette, such as how to propose a toast, how to say the right thing, and how to navigate social situations, to improve your emotional intelligence and core competitiveness. We firmly believe that social savvy is the core technology of future AI, and only socially adept AI has a chance of reaching AGI. Let us witness the arrival of general artificial intelligence together. "The secrets of heaven must not be revealed."

### TinyStories
- https://github.com/Mxoder/TinyStories

This time the plan is to write pre-training code for a large (well, small) model with the Hugging Face API, i.e. use Trainer for pre-training. Since this is just practice, the idea is to pick an extremely small model plus a small dataset. To stay close to the mainstream, the plan is to pre-train a LLaMA 3, albeit a super-mini version of under 20M parameters.

This recalls Microsoft's TinyStories work, which explores how small a language model can be while still telling stories fluently. The work is very direct and fun and happens to fit this practice goal, so this project reproduces it.
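A quick back-of-the-envelope check that a LLaMA-style model can fit under 20M parameters. All sizes below (vocabulary, hidden size, layer count, FFN width) are assumed values for illustration, not the project's actual config:

```python
# Rough parameter count for a mini LLaMA-style decoder, assuming tied
# input/output embeddings. Sizes are illustrative assumptions.

def llama_param_count(vocab, d, n_layers, d_ffn, tie_embeddings=True):
    embed = vocab * d
    per_layer = (4 * d * d          # q, k, v, o projections
                 + 3 * d * d_ffn    # gate, up, down projections
                 + 2 * d)           # two RMSNorm weight vectors
    total = embed + n_layers * per_layer + d  # + final norm
    if not tie_embeddings:
        total += vocab * d          # separate LM head
    return total

n = llama_param_count(vocab=32000, d=256, n_layers=4, d_ffn=688)
print(f"{n / 1e6:.1f}M parameters")
assert n < 20_000_000  # comfortably under the 20M budget
```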

### Higgs-Llama-3-70B
- https://huggingface.co/bosonai/Higgs-Llama-3-70B
- https://boson.ai/higgs-opensource/

We perform supervised fine-tuning with our in-house instruction-following and chat datasets. Afterwards, we construct preference pairs with a semi-automated pipeline that relies on both human-labelers and our private LLMs. We conduct iterative preference optimization to align the model. During alignment, we adopted a special strategy to align the model’s behavior with the system message. Compared with other instruct models, Higgs models follow their roles more closely.

### persona-hub
- https://github.com/tencent-ailab/persona-hub
- https://arxiv.org/pdf/2406.20094

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
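The core pattern, prefixing a data-synthesis instruction with one of many personas, can be sketched as below. The template wording and example personas are invented for illustration, not taken from the paper:

```python
# Persona-driven synthesis sketch: the same task prompt, steered by
# different personas, yields diverse synthetic data when sent to an LLM.

def persona_prompt(persona, task):
    """Combine a persona with a synthesis task into one LLM prompt."""
    return f"Assume you are {persona}. {task}"

personas = [
    "a chemical kinetics researcher",
    "a freight train dispatcher",
]
task = "Create a challenging math word problem grounded in your daily work."
prompts = [persona_prompt(p, task) for p in personas]
print(prompts[0])
```

At Persona Hub's scale, the same loop runs over a billion personas rather than two.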

### Peach-9B-8k-Roleplay
- https://huggingface.co/ClosedCharacter/Peach-9B-8k-Roleplay

Peach-9B-8k-Roleplay is a chat large language model obtained by fine-tuning the 01-ai/Yi-1.5-9B model on more than 100K conversations created through our data-synthesis approach.

### Hermes 3
- https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf
- https://nousresearch.com/hermes3/

Hermes 3 contains advanced long-term context retention and multi-turn conversation capability, complex roleplaying and internal monologue abilities, and enhanced agentic function-calling. Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner. Hermes 3 was created by fine-tuning Llama 3.1 8B, 70B and 405B, and training on a dataset of primarily synthetically generated responses. The model boasts comparable and superior performance to Llama 3.1 while unlocking deeper capabilities in reasoning and creativity.

### SkyReels (Short Drama)
- https://github.com/vaew/skyscript-100m
- https://skyreels.ai/beta

Generating high-quality shooting scripts containing information such as scene and shot language is essential for short drama script generation. We collect 6,660 popular short drama episodes from the Internet, each with an average of 100 short episodes, and the total number of short episodes is about 80,000, with a total duration of about 2,000 hours and totaling 10 terabytes (TB). We perform keyframe extraction and annotation on each episode to obtain about 10,000,000 shooting scripts. We perform 100 script restorations on the extracted shooting scripts based on our self-developed large short drama generation model SkyReels. This leads to a dataset containing 1,000,000,000 pairs of scripts and shooting scripts for short dramas, called SkyScript-100M. We compare SkyScript-100M with the existing dataset in detail and demonstrate some deeper insights that can be achieved based on SkyScript-100M. Based on SkyScript-100M, researchers can achieve several deeper and more far-reaching script optimization goals, which may drive a paradigm shift in the entire field of text-to-video and significantly advance the field of short drama video generation.

### MeChat (Mental Health Support Chatbot)
- https://github.com/qiuhuachuan/smile
- https://huggingface.co/qiuhuachuan/MeChat
- https://mechat.fly.dev/

Our vision is for everyone facing mental health problems to receive timely and effective listening and support. We believe mental health is everyone's right, not a luxury. Our mission is to provide equal, comprehensive, and accessible mental health services, wherever people are and whatever challenges they face. Our vision also includes raising public awareness and understanding of mental health issues, breaking the stigma and discrimination around them, and contributing to a healthier, more inclusive, and more equal society. The project poster is taken from flaticon.

### MedicalGPT
- https://github.com/shibing624/MedicalGPT

MedicalGPT trains a medical LLM through continued pre-training, supervised fine-tuning, reward modeling, and reinforcement learning.

Based on the ChatGPT training pipeline, this project implements four-stage training of a domain (medical) model:

Stage 1: PT (Continued PreTraining): incremental pre-training of the GPT model on massive in-domain document data to inject domain knowledge.

Stage 2: SFT (Supervised Fine-tuning): build an instruction fine-tuning dataset and instruction-tune the pre-trained model to align it with instruction intent.

Stage 3: RM (Reward Model): build a human-preference ranking dataset and train a reward model to align with human preferences, mainly the "HHH" principle: helpful, honest, harmless.

Stage 4: RL (Reinforcement Learning): reinforcement learning from human feedback (RLHF), using the reward model to train the SFT model; the generator updates its policy with rewards or penalties so as to produce higher-quality text better aligned with human preferences.
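The four stages form a strictly ordered pipeline in which each stage starts from the previous stage's checkpoint. A schematic sketch (stage names mirror the description above; the rest is placeholder, not the project's actual scripts):

```python
# Schematic of the four-stage training pipeline; each stage consumes the
# checkpoint produced by the previous one.

PIPELINE = [
    ("PT",  "continued pre-training on in-domain documents"),
    ("SFT", "supervised fine-tuning on instruction data"),
    ("RM",  "reward modeling on human preference rankings"),
    ("RL",  "RLHF: optimize the SFT model against the reward model"),
]

def run_pipeline(base_checkpoint):
    ckpt = base_checkpoint
    for stage, desc in PIPELINE:
        ckpt = f"{ckpt}->{stage}"  # placeholder for the real training step
    return ckpt

print(run_pipeline("gpt-base"))
```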

### MING (明医): Chinese Medical Consultation LLM (formerly MedicalGPT-zh)
- https://github.com/MediaBrain-SJTU/MING

**MedicalGPT-zh**
This project open-sourced a general Chinese medical model instruction-tuned from ChatGLM-6B with 16-bit LoRA. From Chinese medical consensus and clinical guideline texts spanning 28 departments, we generated a high-quality instruction dataset with broader medical-knowledge coverage and more precise answers.

**MING (明医)**
This project open-sources MING, a Chinese medical consultation model fine-tuned on medical instructions.

### OpenKG-KnowLLM
- https://github.com/zjunlp/KnowLLM

Knowledgeable Large Language Model Series.

With the rapid development of deep learning technology, large language models such as ChatGPT have achieved significant success in the field of natural language processing. However, these large models still face some challenges and issues in learning and understanding knowledge, including the difficulty of knowledge updating, and issues with potential errors and biases within the model, known as knowledge fallacies. The Deep Model series aims to release a series of open-source large models to mitigate these knowledge fallacy issues. The first phase of this project released a knowledge extraction large model based on LLaMA, named Zhishi. To provide Chinese capabilities without disrupting the original model's distribution, we firstly (1) use Chinese corpora for the full-scale pre-training of LLaMA (13B), in order to improve the model's understanding of Chinese and knowledge reserve as much as possible while retaining its original English and code capabilities; Then (2) we fine-tune the model from the first step using an instruction dataset, to enhance the language model's understanding of human extraction instructions.

### OpenMEDLab 浦医
- https://github.com/OpenMEDLab
- https://github.com/openmedlab/PULSE
- https://stcsm.sh.gov.cn/xwzx/kjzl/20230630/c783c30d8e62494e83073535f841675f.html

OpenMEDLab is an open-source platform to share medical foundation models in multi-modalities, e.g., medical imaging, medical NLP, bioinformatics, protein, etc. It targets promoting novel approaches to long-tail problems in medicine, and meanwhile, it seeks solutions to achieve lower cost, higher efficiency, and better generalizability in training medical AI models. The new learning paradigm of adapting foundation models to downstream applications makes it possible to develop innovative solutions for cross-domain and cross-modality diagnostic tasks efficiently. OpenMEDLab is distinguished by several features:
- World's first open-source platform for medical foundation models.
- 10+ medical data modalities targeting a variety of clinical and research problems.
- Pioneering works of the new learning paradigm using foundation models, including pre-trained models, code, and data.
- Releasing multiple sets of medical data for pre-training and downstream applications.
- Collaboration with top medical institutes and facilities.

### PromptCLUE
- https://github.com/clue-ai/PromptCLUE

PromptCLUE: a large-scale multi-task prompt pre-trained open-source Chinese model.

Three unifications for Chinese: a unified model framework, a unified task format, and a unified way of application.

It supports dozens of task types with good zero-shot and few-shot learning ability. For understanding tasks such as classification, sentiment analysis, and extraction, the label set can be customized; for generation tasks, free-form sampling generation is supported.

Pre-trained at scale on hundreds of billions of Chinese tokens, it has cumulatively learned 1.5 trillion Chinese tokens and was trained on hundreds of millions of Chinese task examples across 150+ training tasks. It improves over the base version by 7+ points on average across tasks, has better understanding, generation, and extraction abilities, and supports text rewriting, correction, and knowledge-graph QA.

### SkyText-Chinese-GPT3
- https://github.com/SkyWorkAIGC/SkyText-Chinese-GPT3

SkyText is a Chinese GPT-3 pre-trained large model released by Singularity-AI that can perform tasks such as chat, QA, and Chinese-English translation. Beyond basic chat, dialogue, and Q&A, the model also supports Chinese-English translation, content continuation, couplet writing, classical Chinese poetry, recipe generation, third-person restatement, interview-question generation, and more.

### ShenNong-TCM-LLM
- https://github.com/michael-wzhu/ShenNong-TCM-LLM

To advance LLM development and adoption in traditional Chinese medicine (TCM), improve LLMs' TCM knowledge and ability to answer medical consultations, and help large models support the inheritance of TCM, we present the ShenNong large-scale language model for TCM:

🚀 ShenNong-TCM:
- The training data is the TCM instruction dataset ShenNong_TCM_Dataset.
- ChatMed_TCM_Dataset is built on our open-source TCM knowledge graph;
- Using an entity-centric self-instruct method, we called ChatGPT to obtain 110k+ TCM-centered instruction examples;
- The ShenNong-TCM model is based on LLaMA and fine-tuned with LoRA (rank=16). The fine-tuning code is the same as in the ChatMed repository.
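The LoRA (rank=16) fine-tuning mentioned above trains only a low-rank update to each frozen weight matrix. A dependency-free toy sketch of the idea (the hidden size is an assumed value; a real setup would use a library such as peft):

```python
# LoRA in miniature: the frozen weight W is updated only through the
# low-rank product B @ A, so only d*r + r*d parameters are trained.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 256, 16                       # hidden size (assumed), LoRA rank
A = [[0.01] * d for _ in range(r)]   # r x d, trainable
B = [[0.0] * r for _ in range(d)]    # d x r, initialized to zero

delta_W = matmul(B, A)               # d x d low-rank update to add to W
# B starts at zero, so the update is exactly zero: training begins
# from the unmodified base model.
assert all(v == 0.0 for row in delta_W for v in row)

trainable = d * r + r * d            # parameters actually trained
full = d * d                         # parameters in a full-rank update
print(f"trainable fraction: {trainable / full:.3f}")
```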

### TableGPT
- https://github.com/ZJU-M3/TableGPT-techreport

TableGPT is a model specifically designed for table analysis. By unifying tables, natural language, and commands into one model, TableGPT comprehends tabular data, understands user intent through natural language, dissects the desired actions, and executes external commands on the table. It subsequently returns the processed results in both tabular and textual explanations to the user. This novel approach simplifies the way users engage with table data, bringing an intuitive feel to data analysis.

### TransGPT · 致远
- https://mp.weixin.qq.com/s/WvzyjHqI0lOGIyPlCIFNQg
- https://github.com/DUOMO/TransGPT

TransGPT · 致远 was trained on about 346k transportation-domain text records (for in-domain pre-training) and 58k transportation dialogue records (for fine-tuning), and can support integration with real-time apps (maps, public transit, and the like). TransGPT · 致远 is now open-source; its resources are fully open for academic research, and free commercial use is available after applying by email and obtaining an official commercial license.

Unlike general-purpose multimodal transportation models, TransGPT focuses on delivering practical value in real traffic scenarios, including traffic forecasting, intelligent consulting assistants, public transit services, transportation planning and design, traffic safety education, management support, traffic accident reporting and analysis, and driver-assistance capabilities.

### UrbanGPT
- https://urban-gpt.github.io/
- https://github.com/HKUDS/UrbanGPT
- https://arxiv.org/abs/2403.00813
- https://sites.google.com/view/chaoh/home

In this work, we present a spatio-temporal large language model that can exhibit exceptional generalization capabilities across a wide range of downstream urban tasks. To achieve this objective, we present the UrbanGPT, which seamlessly integrates a spatio-temporal dependency encoder with the instruction-tuning paradigm. This integration enables large language models (LLMs) to comprehend the complex inter-dependencies across time and space, facilitating more comprehensive and accurate predictions under data scarcity. Extensive experimental findings highlight the potential of building LLMs for spatio-temporal learning, particularly in zero-shot scenarios.

### TechGPT
- https://mp.weixin.qq.com/s/nF1He7jhAHfh7PzhjqHoZg
- https://huggingface.co/neukg/TechGPT-7B
- https://github.com/neukg/TechGPT

On June 26, 2023, the Knowledge Graph Group of Northeastern University officially released the large language model TechGPT.

The name TechGPT mainly comes from TechKG, the group's large-scale Chinese multi-domain academic knowledge base released in 2018.

Compared with other current large models, TechGPT mainly strengthens information-extraction tasks centered on knowledge-graph construction, such as relation triple extraction; intelligent QA tasks centered on logical reasoning, such as machine reading comprehension; and sequence-generation tasks centered on text understanding, such as keyword generation.

Within these three core NLP capabilities, TechGPT can also process natural-language text from more than ten vertical domains, including computer science, materials, machinery, metallurgy, finance, and aerospace.

### TigerBot
- https://github.com/TigerResearch/TigerBot

TigerBot is a multilingual, multi-task large language model (LLM). In automatic evaluation on public NLP datasets following the OpenAI InstructGPT paper, TigerBot-7B reaches 96% of the overall performance of the same-sized OpenAI model, and this is only our MVP. We open-source the following:

- Models: TigerBot-7B, TigerBot-7B-base, TigerBot-180B (research version);
- Code: basic training and inference code, including quantization and inference code for serving the 180B model on two GPUs;
- Data: 100 GB of pre-training data, denoised and deduplicated from 2 TB of filtered data; 1 GB (one million examples) of supervised fine-tuning data, proportionally covering 10 major and 120 minor categories of common user instructions;
- API: chat, plugin, and finetune APIs that let users train and use their own large model and data, code-free, within half an hour;
- Domain data: covering finance, law, and encyclopedias; we invite LLM application developers to build world-class Chinese applications together.

On top of BLOOM, we made the following optimizations to the architecture and algorithms:

- A novel algorithm for instruction-completion supervised fine-tuning to achieve better learnability;
- Ensemble and probabilistic modeling methods for more controllable factuality and generativeness;
- For parallel training, we resolved several memory and communication issues in mainstream frameworks such as DeepSpeed, enabling months of uninterrupted training on thousand-GPU clusters;
- Algorithmic optimizations, from the tokenizer to the training algorithm, better suited to the more irregular distribution of the Chinese language.

### XVERSE-13B
- https://github.com/xverse-ai/XVERSE-13B

XVERSE-13B is a multilingual large language model independently developed by Shenzhen Yuanxiang Technology. Its main features are:
- Model architecture: XVERSE-13B uses the mainstream decoder-only standard Transformer architecture and supports an 8K context length, the longest among models of its size, meeting the needs of longer multi-turn dialogue, knowledge QA, and summarization and broadening its application scenarios.
- Training data: the model is fully trained on 1.4 trillion tokens of high-quality, diverse data covering more than 40 languages, including Chinese, English, Russian, and Spanish. Finely tuned sampling ratios across data types make Chinese and English performance excellent while still accounting for other languages.
- Tokenization: based on the BPE (Byte-Pair Encoding) algorithm, a tokenizer with a vocabulary of 100,278 was trained on hundreds of gigabytes of corpus; it supports multiple languages simultaneously without requiring additional vocabulary extension.
- Training framework: several key technologies were developed in-house, including efficient operators, memory optimization, parallel scheduling strategies, data-compute-communication overlap, and platform-framework synergy, yielding higher training efficiency and strong model stability, with peak compute utilization of 58.5% on a thousand-GPU cluster, among the best in the industry.

### YuLan-Chat & YuLan-Chat-2
- https://github.com/RUC-GSAI/YuLan-Chat
- https://huggingface.co/yulan-team
- https://mp.weixin.qq.com/s/nPS4N3stAAG_51fnZANbMA

**YuLan-Chat**
A research team at the Gaoling School of Artificial Intelligence, Renmin University of China (jointly advised by several faculty members), conducted a series of studies on instruction-tuning techniques and released the school's first large dialogue model, YuLan-Chat, aiming to explore and improve the bilingual Chinese-English dialogue ability of large language models.

We open-sourced the 13B and 65B YuLan-Chat model files and related code; with quantization they can be deployed on a single RTX3090-24G and A800-80G GPU respectively. YuLan-Chat is based on the LLaMA base model, fine-tuned on carefully optimized, high-quality mixed Chinese-English instructions; YuLan-Chat-65B currently outperforms existing open-source models significantly on Chinese and English evaluation datasets. We will continue to improve the instruction-tuning method and the base model and keep updating YuLan-Chat.

**YuLan-Chat-2**
After releasing the first version of YuLan-Chat in June 2023, the team continued exploring LLM pre-training and instruction tuning, and on top of LLaMA-2 trained the new base model YuLan-LLaMA-2-13B and the dialogue model YuLan-Chat-2-13B. These models extend the original LLaMA-2's Chinese vocabulary and context length (to 8k) and use large-scale Chinese and English data for incremental pre-training and instruction fine-tuning, improving both Chinese and English basic semantics and understanding. Performance improves significantly over the previous generation, with clear advantages over other contemporaneous LLaMA-2-based models as well. After quantization, the model can be deployed on a single RTX-3090 24G GPU.

### Ziya-LLaMA
- https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1
- https://github.com/IDEA-CCNL/Fengshenbang-LM
- https://mp.weixin.qq.com/s/IeXgq8blGoeVbpIlAUCAjA

The Ziya-LLaMA-13B-v1 is a large-scale pre-trained model based on LLaMA with 13 billion parameters. It has the ability to perform tasks such as translation, programming, text classification, information extraction, summarization, copywriting, common sense Q&A, and mathematical calculation. The Ziya-LLaMA-13B-v1 has undergone three stages of training: large-scale continual pre-training (PT), multi-task supervised fine-tuning (SFT), and human feedback learning (RM, PPO).

### FLM-101B
- https://arxiv.org/pdf/2309.03852.pdf
- https://huggingface.co/CofeAI/FLM-101B

FLM-101B is an open-source decoder-only LLM with 101 billion parameters. During the training process, a model growth technique was employed. The model rapidly acquires knowledge on a small-scale model (16B) in the early stages of training and gradually scales up to 101B, resulting in cost-effective 100B-scale LLM training (costing approximately $100,000). FLM-101B supports both Chinese and English languages. It has a context window length of 2048 in training. Thanks to the use of xPos rotary position embedding, it allows for efficient expansion of the window size during inference.

To advance the development of 100B-scale Large Language Models (LLMs), FLM-101B has now been fully open-sourced.

### MindChat (漫谈): Mental Health LLM
- https://github.com/X-D-Lab/MindChat

MindChat (漫谈) hopes to help people relieve psychological stress and resolve psychological confusion along four dimensions: psychological counseling, assessment, diagnosis, and therapy, improving their mental health. As a mental-health LLM, MindChat builds trust and understanding with users by creating a relaxed, open conversational environment for unwinding, sharing feelings, and exchanging experiences. MindChat aims to offer a private, warm, safe, timely, and convenient conversational environment, helping users overcome difficulties and challenges and achieve personal growth and development.

Whether at work or in personal life, MindChat hopes to use professional psychological knowledge and large-model technology, under strict protection of user privacy, to provide comprehensive, round-the-clock psychological support and counseling assistance, while fostering personal growth, thereby contributing to a healthier, more inclusive, and more equal society.

### WiNGPT
- https://github.com/winninghealth/WiNGPT2

WiNGPT is a GPT-based LLM for the medical vertical that aims to integrate professional medical knowledge, medical information, and medical data, providing the healthcare industry with intelligent medical QA, diagnostic support, and medical-knowledge services to improve diagnostic efficiency and the quality of medical care.

### CareGPT
- https://github.com/WangRongsheng/CareGPT

CareGPT is a medical large language model that also aggregates dozens of publicly available medical fine-tuning datasets and openly available medical LLMs, covering LLM training, evaluation, and deployment, to accelerate the development of medical LLMs.

### Sunsimiao (孙思邈)
- https://github.com/thomas-yanxin/Sunsimiao

The Sunsimiao Chinese medical LLM hopes to follow the life path of Sun Simiao: valuing folk medical experience, continually accumulating Chinese medical data, and imparting this data to models, committed to providing a safe, reliable, and inclusive Chinese medical LLM.

Currently, Sunsimiao is obtained by fine-tuning the baichuan-7B and ChatGLM-6B series on hundreds of thousands of high-quality Chinese medical data records; more data will be collected to expand the model's capabilities through continued iteration. Details are being written up; stay tuned.

### MolGen (Drug Discovery)
- https://github.com/zjunlp/Mol-Instructions

Mol-Instructions comprises three cardinal components:

🔬 Molecule-oriented instructions: This component delves into the world of small molecules, emphasizing their inherent properties and behaviors. It sheds light on the fundamental challenges of diverse chemical reactions and molecular design, with 148.4K instructions across six tasks.

🧬 Protein-oriented instructions: Rooted in the biosciences, this component presents 505K instructions across five distinct categories of tasks. These tasks aim to predict the structure, function, and activity of proteins, and facilitate protein design based on textual directives.

🥼 Biomolecular text instructions: Predominantly designed to cater to NLP tasks within the fields of bioinformatics and chemoinformatics, this part encapsulates six information extraction and Q&A tasks represented through 53K instructions.

### Multilingual Medicine
- https://github.com/FreedomIntelligence/Apollo/tree/main

Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Especially, Apollo-7B is the state-of-the-art multilingual medical LLMs up to 70B. Additionally, these lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion. We will open-source training corpora, code, model weights and evaluation benchmark.

### Sequel
- https://github.com/SequelHQ/Sequel

Sequel is an open-source software application meticulously designed to be your ultimate companion in taking control of your health through personalized nutrition. By leveraging our cutting-edge platform, users can effortlessly analyze lab reports, track supplement and nutrient intake, and access a comprehensive library of evidence-based information. Our mission is to empower you with the tools and knowledge necessary to make informed decisions about your well-being, guiding you towards a healthier, longer life.

### Gene editing
- https://www.biorxiv.org/content/10.1101/2024.04.22.590591v1.full.pdf

Gene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI) enabled design provides a powerful alternative with potential to bypass evolutionary constraints and generate editors with optimal properties. Here, using large language models (LLMs) trained on biological diversity at scale, we demonstrate the first successful precision editing of the human genome with a programmable gene editor designed with AI. To achieve this goal, we curated a dataset of over one million CRISPR operons through systematic mining of 26 terabases of assembled genomes and meta-genomes. We demonstrate the capacity of our models by generating 4.8x the number of protein clusters across CRISPR-Cas families found in nature and tailoring single-guide RNA sequences for Cas9-like effector proteins. Several of the generated gene editors show comparable or improved activity and specificity relative to SpCas9, the prototypical gene editing effector, while being 400 mutations away in sequence. Finally, we demonstrate an AI-generated gene editor, denoted as OpenCRISPR-1, exhibits compatibility with base editing. We release OpenCRISPR-1 publicly to facilitate broad, ethical usage across research and commercial applications.

### Llama-3-8B-UltraMedical
- https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical

Llama-3-8B-UltraMedical is an open-access large language model (LLM) specialized in biomedicine. Developed by the Tsinghua C3I Lab, this model aims to enhance medical examination access, literature comprehension, and clinical knowledge.

### PH-LLM
- https://research.google/blog/advancing-personal-health-and-wellness-insights-with-ai/
- https://arxiv.org/abs/2406.06474
- https://arxiv.org/abs/2406.06464

The Personal Health Large Language Model (PH-LLM) is a fine-tuned version of Gemini, designed to generate insights and recommendations to improve personal health behaviors related to sleep and fitness patterns. By using a multimodal encoder, PH-LLM is optimized for both textual understanding and reasoning as well as interpretation of raw time-series sensor data such as heart rate variability and respiratory rate from wearables.

### ProLLM
- https://github.com/MingyuJ666/ProLLM
- https://arxiv.org/html/2405.06649v1

The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases. Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions, ignoring the broader context of nonphysical connections through intermediate proteins, thus limiting their effectiveness. The emergence of Large Language Models (LLMs) provides a new opportunity for addressing this complex biological challenge. By transforming structured data into natural language prompts, we can map the relationships between proteins into text. This approach allows LLMs to identify indirect connections between proteins, tracing the path from upstream to downstream. Therefore, we propose ProLLM, a novel framework that, for the first time, employs an LLM tailored for PPI. Specifically, we propose Protein Chain of Thought (ProCoT), which replicates the biological mechanism of signaling pathways as natural language prompts. ProCoT treats a signaling pathway as a protein reasoning process, which starts from upstream proteins and passes through several intermediate proteins to transmit biological signals to downstream proteins. Thus, we can use ProCoT to predict the interaction between upstream proteins and downstream proteins. The training of ProLLM employs the ProCoT format, which enhances the model's understanding of complex biological problems. In addition to ProCoT, this paper also contributes to the exploration of embedding replacement of protein sites in natural language prompts, and instruction fine-tuning on protein knowledge datasets. We demonstrate the efficacy of ProLLM through rigorous validation against benchmark datasets, showing significant improvement over existing methods in terms of prediction accuracy and generalizability. Our results highlight the potential of LLMs to transform the field of PPI, serving as a robust tool for various categories of biological and medical research.
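The core of ProCoT is rendering a signaling pathway as a step-by-step natural language prompt, walking from the upstream protein through intermediates to a candidate downstream target. A minimal sketch of such prompt construction (the template wording and protein names are illustrative placeholders, not the paper's exact format):

```python
def procot_prompt(pathway, downstream):
    """Render a protein signaling chain as a chain-of-thought style prompt.

    pathway: ordered list of protein names, upstream first.
    downstream: candidate downstream protein to query about.
    """
    # One reasoning step per adjacent pair in the pathway.
    steps = [f"{up} transmits a signal to {down}."
             for up, down in zip(pathway, pathway[1:])]
    chain = " ".join(steps)
    return (
        f"Consider the signaling pathway starting at {pathway[0]}. "
        f"{chain} "
        f"Does {pathway[-1]} interact with {downstream}?"
    )

# Illustrative pathway; the protein names are placeholders.
prompt = procot_prompt(["EGFR", "RAS", "RAF", "MEK"], "ERK")
```

Training on prompts shaped like this is what lets the model trace indirect, multi-hop connections rather than only direct physical interactions.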

### MolecularGPT
- https://arxiv.org/abs/2406.12950
- https://github.com/NYUSHCS/MolecularGPT

Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize to unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From an instruction-tuning perspective, we fine-tune large language models (LLMs) on curated molecular instructions spanning over 1,000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning, through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also surpasses state-of-the-art LLM baselines by up to a 16.6% increase in classification accuracy and a decrease of 199.17 in regression metrics (e.g., RMSE) under zero-shot settings. This st
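The few-shot ICL setting described above boils down to assembling a prompt from a handful of labeled molecules followed by the query molecule. A minimal sketch (the template, SMILES strings, and labels are all hypothetical; MolecularGPT's actual instruction format may differ):

```python
def few_shot_prompt(examples, query_smiles, property_name):
    """Assemble a k-shot in-context-learning prompt from labeled molecules.

    examples: list of (smiles, label) pairs used as in-context demonstrations.
    query_smiles: the molecule whose property the model should predict.
    """
    lines = [f"Task: predict {property_name} for a molecule given its SMILES."]
    for smiles, label in examples:
        lines.append(f"SMILES: {smiles}\nAnswer: {label}")
    # The prompt ends at "Answer:" so the model completes the prediction.
    lines.append(f"SMILES: {query_smiles}\nAnswer:")
    return "\n\n".join(lines)

# Two-shot example with placeholder molecules and labels.
shots = [("CCO", "non-toxic"), ("c1ccccc1", "toxic")]
prompt = few_shot_prompt(shots, "CCN", "toxicity")
```

This is the "two-shot" regime the abstract refers to: no gradient updates, only demonstrations in the context window.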