[
  {
    "path": "01-colossalai-sft-kaggle.ipynb",
    "content": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"python\",\"version\":\"3.7.12\",\"mimetype\":\"text/x-python\",\"codemirror_mode\":{\"name\":\"ipython\",\"version\":3},\"pygments_lexer\":\"ipython3\",\"nbconvert_exporter\":\"python\",\"file_extension\":\".py\"}},\"nbformat_minor\":4,\"nbformat\":4,\"cells\":[{\"cell_type\":\"markdown\",\"source\":\"**Language Model Playground 01:** ColossalAI (SFT part)\\n- https://github.com/hpcaitech/ColossalAI\\n\\n**Note:**\\n- This notebook **only demonstrates how to get the SFT part of ColossalAI running in a Kaggle notebook**; it does not cover hyperparameter tuning, analysis of the results, and so on\\n- In theory, the same steps should also work on Google Colab\\n- **If you have your own machine**, this notebook may not help you much (since you do not need to train inside a notebook)\\n- The intended audience is **people without GPU resources of their own who still want to get familiar with ColossalAI and give it a quick try**\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"**Preparing the data:**\\n1. As noted in [the official documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples), the data must be prepared before running\\n2. The data can be downloaded [here](https://github.com/XueFuzhao/InstructionWild/tree/main/data). Do not download the seed files (they contain only instructions, without responses); download the JSON files mentioned in the README instead, e.g. instinwild_ch.json\\n3. Upload the data to a Kaggle Dataset (you need to create your own Dataset, which can be visible only to you); just follow Kaggle's steps\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"**Preparing the Kaggle notebook:**\\n1. **Add a Dataset** on the right-hand side of the interface and select the one you created. Once added, it can be accessed in code via an absolute path. For example, if the dataset we just created is named “instructdata” and we uploaded the file “instinwild_ch_small.json” to it, then in code we can access the dataset at this path: /kaggle/input/instructdata/instinwild_ch_small.json\\n2. Preferably choose **GPU T4x2** (under Accelerator on the right of the Kaggle notebook interface). Choosing P100 may cause errors during installation\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"**Things to pay special attention to:**\\n1. You get **30 hours** of GPU time per week\\n2. Each notebook session **can run for at most 12 hours** (if you start it but barely use it, e.g. run no cells and make no edits, it may also be force-terminated before the 12 hours are up. Unlike Google Colab, letting a cell run for a long time is fine)\\n3. Once the session is terminated, **the output data cannot be recovered**! (Outputs are written to /kaggle/working/; if you need anything there, you must download it before the session ends)\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"## Setting up the environment\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"We largely follow the official documentation for installation. **However**, following it to the letter causes errors. The reason is that ColossalAI is a very active project with code changes landing every day, and parts of the documentation may not be updated in time. We therefore make small adjustments to the installation order and details to match the current state of the project. **(As of: April 30, 2023)**\\n\\nNo doubt the ColossalAI team will soon fix these small bugs and bring the documentation fully up to date.\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"### 1 Install ColossalAI\\n\\nAfter running the command below, you will see that the downloaded files end up under /kaggle/working/ColossalAI\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"!git clone https://github.com/hpcaitech/ColossalAI.git\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:57:00.220637Z\",\"iopub.execute_input\":\"2023-04-30T13:57:00.221204Z\",\"iopub.status.idle\":\"2023-04-30T13:57:03.206251Z\",\"shell.execute_reply.started\":\"2023-04-30T13:57:00.221169Z\",\"shell.execute_reply\":\"2023-04-30T13:57:03.205056Z\"},\"trusted\":true},\"execution_count\":1,\"outputs\":[{\"name\":\"stdout\",\"text\":\"Cloning into 'ColossalAI'...\\nremote: Enumerating objects: 24949, done.\\u001b[K\\nremote: Counting objects: 100% (2362/2362), done.\\u001b[K\\nremote: Compressing objects: 100% (479/479), done.\\u001b[K\\nremote: Total 24949 (delta 1987), reused 2083 (delta 1881), pack-reused 22587\\u001b[K\\nReceiving objects: 100% (24949/24949), 23.09 MiB | 29.19 MiB/s, done.\\nResolving deltas: 100% (16582/16582), done.\\n\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"**Install ColossalAI**\\n\\nIf you skip this install step, you may hit this [error](https://github.com/hpcaitech/ColossalAI/issues/3629): “ImportError: cannot import name 'ColoInitContext' from 'colossalai.zero'”\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"import os\\nos.chdir('./ColossalAI')\\n!pip install 
.\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:57:03.208770Z\",\"iopub.execute_input\":\"2023-04-30T13:57:03.209133Z\",\"iopub.status.idle\":\"2023-04-30T13:57:28.240848Z\",\"shell.execute_reply.started\":\"2023-04-30T13:57:03.209098Z\",\"shell.execute_reply\":\"2023-04-30T13:57:28.239606Z\"},\"trusted\":true},\"execution_count\":2,\"outputs\":[{\"name\":\"stdout\",\"text\":\"Processing /kaggle/working/ColossalAI\\n  Preparing metadata (setup.py) ... \\u001b[?25ldone\\n\\u001b[?25hRequirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (1.21.6)\\nRequirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (4.64.1)\\nRequirement already satisfied: psutil in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (5.9.3)\\nRequirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (23.0)\\nCollecting pre-commit\\n  Downloading pre_commit-2.21.0-py2.py3-none-any.whl (201 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m201.9/201.9 kB\\u001b[0m \\u001b[31m6.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hRequirement already satisfied: rich in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (13.2.0)\\nRequirement already satisfied: click in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (8.1.3)\\nCollecting fabric\\n  Downloading fabric-3.0.1-py3-none-any.whl (53 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m53.3/53.3 kB\\u001b[0m \\u001b[31m4.6 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hCollecting contexttimer\\n  Downloading contexttimer-0.3.3.tar.gz (4.9 kB)\\n  Preparing metadata (setup.py) ... 
\\u001b[?25ldone\\n\\u001b[?25hRequirement already satisfied: ninja in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (1.11.1)\\nRequirement already satisfied: torch>=1.11 in /opt/conda/lib/python3.7/site-packages (from colossalai==0.2.8) (1.13.0)\\nCollecting safetensors\\n  Downloading safetensors-0.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m1.3/1.3 MB\\u001b[0m \\u001b[31m33.5 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m00:01\\u001b[0m\\n\\u001b[?25hRequirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from torch>=1.11->colossalai==0.2.8) (4.4.0)\\nRequirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from click->colossalai==0.2.8) (4.11.4)\\nCollecting paramiko>=2.4\\n  Downloading paramiko-3.1.0-py3-none-any.whl (211 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m211.2/211.2 kB\\u001b[0m \\u001b[31m17.4 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hCollecting invoke>=2.0\\n  Downloading invoke-2.1.0-py3-none-any.whl (159 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m159.9/159.9 kB\\u001b[0m \\u001b[31m17.3 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai==0.2.8) (6.0)\\nCollecting cfgv>=2.0.0\\n  Downloading cfgv-3.3.1-py2.py3-none-any.whl (7.3 kB)\\nCollecting identify>=1.0.0\\n  Downloading identify-2.5.23-py2.py3-none-any.whl (98 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m98.8/98.8 kB\\u001b[0m \\u001b[31m9.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hRequirement already satisfied: virtualenv>=20.10.0 in /opt/conda/lib/python3.7/site-packages (from 
pre-commit->colossalai==0.2.8) (20.17.1)\\nCollecting nodeenv>=0.11.1\\n  Downloading nodeenv-1.7.0-py2.py3-none-any.whl (21 kB)\\nRequirement already satisfied: markdown-it-py<3.0.0,>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai==0.2.8) (2.1.0)\\nRequirement already satisfied: pygments<3.0.0,>=2.6.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai==0.2.8) (2.14.0)\\nRequirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.7/site-packages (from markdown-it-py<3.0.0,>=2.1.0->rich->colossalai==0.2.8) (0.1.2)\\nRequirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from nodeenv>=0.11.1->pre-commit->colossalai==0.2.8) (59.8.0)\\nCollecting pynacl>=1.5\\n  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m856.7/856.7 kB\\u001b[0m \\u001b[31m41.8 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hCollecting bcrypt>=3.2\\n  Downloading bcrypt-4.0.1-cp36-abi3-manylinux_2_28_x86_64.whl (593 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m593.7/593.7 kB\\u001b[0m \\u001b[31m42.9 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hRequirement already satisfied: cryptography>=3.3 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai==0.2.8) (38.0.2)\\nRequirement already satisfied: filelock<4,>=3.4.1 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (3.9.0)\\nRequirement already satisfied: platformdirs<3,>=2.4 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (2.6.2)\\nRequirement already satisfied: distlib<1,>=0.3.6 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (0.3.6)\\nRequirement already satisfied: 
zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click->colossalai==0.2.8) (3.11.0)\\nRequirement already satisfied: cffi>=1.12 in /opt/conda/lib/python3.7/site-packages (from cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (1.15.1)\\nRequirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (2.21)\\nBuilding wheels for collected packages: colossalai, contexttimer\\n  Building wheel for colossalai (setup.py) ... \\u001b[?25ldone\\n\\u001b[?25h  Created wheel for colossalai: filename=colossalai-0.2.8-py3-none-any.whl size=1059097 sha256=50ed72a86bf2ae29440764d5c46f67d000fcafbe8d4a5d8f6a947f5e57e85c70\\n  Stored in directory: /tmp/pip-ephem-wheel-cache-3th5gmed/wheels/3e/97/46/e40c7da8c6931df2650672912b14531b399ef776670745f133\\n  Building wheel for contexttimer (setup.py) ... \\u001b[?25ldone\\n\\u001b[?25h  Created wheel for contexttimer: filename=contexttimer-0.3.3-py3-none-any.whl size=5818 sha256=15b3da44f55d3cf68ee6b623d09715157fa4a60ff4f805c607bca5acf41c83f4\\n  Stored in directory: /root/.cache/pip/wheels/f4/67/63/f276d2acab046618878e3eaf13c5a356c9a500baf21403f345\\nSuccessfully built colossalai contexttimer\\nInstalling collected packages: safetensors, contexttimer, nodeenv, invoke, identify, cfgv, bcrypt, pynacl, pre-commit, paramiko, fabric, colossalai\\nSuccessfully installed bcrypt-4.0.1 cfgv-3.3.1 colossalai-0.2.8 contexttimer-0.3.3 fabric-3.0.1 identify-2.5.23 invoke-2.1.0 nodeenv-1.7.0 paramiko-3.1.0 pre-commit-2.21.0 pynacl-1.5.0 safetensors-0.3.1\\n\\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\\u001b[0m\\u001b[33m\\n\\u001b[0m\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"**Install transformers**\\n\\nHere we install the transformers fork under hpcaitech; whether a plain pip install transformers would also work has not been tested\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"!git clone https://github.com/hpcaitech/transformers\\nos.chdir('./transformers')\\n!pip install .\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:57:28.243773Z\",\"iopub.execute_input\":\"2023-04-30T13:57:28.244481Z\",\"iopub.status.idle\":\"2023-04-30T13:58:15.962347Z\",\"shell.execute_reply.started\":\"2023-04-30T13:57:28.244440Z\",\"shell.execute_reply\":\"2023-04-30T13:58:15.960975Z\"},\"trusted\":true},\"execution_count\":3,\"outputs\":[{\"name\":\"stdout\",\"text\":\"Cloning into 'transformers'...\\nremote: Enumerating objects: 124468, done.\\u001b[K\\nremote: Total 124468 (delta 0), reused 0 (delta 0), pack-reused 124468\\u001b[K\\nReceiving objects: 100% (124468/124468), 127.08 MiB | 26.91 MiB/s, done.\\nResolving deltas: 100% (93344/93344), done.\\nProcessing /kaggle/working/ColossalAI/transformers\\n  Installing build dependencies ... \\u001b[?25ldone\\n\\u001b[?25h  Getting requirements to build wheel ... \\u001b[?25ldone\\n\\u001b[?25h  Preparing metadata (pyproject.toml) ... 
\\u001b[?25ldone\\n\\u001b[?25hRequirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (23.0)\\nRequirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (3.9.0)\\nRequirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (1.21.6)\\nRequirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (4.11.4)\\nRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (6.0)\\nRequirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (0.13.2)\\nRequirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (0.13.3)\\nRequirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (4.64.1)\\nRequirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (2.28.2)\\nRequirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.7/site-packages (from transformers==4.28.0.dev0) (2021.11.10)\\nRequirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.7/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.28.0.dev0) (4.4.0)\\nRequirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->transformers==4.28.0.dev0) (3.11.0)\\nRequirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (2022.12.7)\\nRequirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (2.1.1)\\nRequirement 
already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (1.26.14)\\nRequirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->transformers==4.28.0.dev0) (3.4)\\nBuilding wheels for collected packages: transformers\\n  Building wheel for transformers (pyproject.toml) ... \\u001b[?25ldone\\n\\u001b[?25h  Created wheel for transformers: filename=transformers-4.28.0.dev0-py3-none-any.whl size=6790611 sha256=4220935232e4fb5bbdd639242eec8975f925c105da87c0d4d0137e013c5479a5\\n  Stored in directory: /tmp/pip-ephem-wheel-cache-u7rm17k7/wheels/f8/7e/62/d660e4bfe297957f2a56ddb6284d5815eba12ca9dfe5b1cf73\\nSuccessfully built transformers\\nInstalling collected packages: transformers\\n  Attempting uninstall: transformers\\n    Found existing installation: transformers 4.27.4\\n    Uninstalling transformers-4.27.4:\\n      Successfully uninstalled transformers-4.27.4\\nSuccessfully installed transformers-4.28.0.dev0\\n\\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\\u001b[0m\\u001b[33m\\n\\u001b[0m\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"**Install the libraries required by the Chat application**\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"os.chdir('/kaggle/working/ColossalAI/applications/Chat/')\\n!pip install .\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:58:15.966733Z\",\"iopub.execute_input\":\"2023-04-30T13:58:15.967109Z\",\"iopub.status.idle\":\"2023-04-30T13:58:28.362540Z\",\"shell.execute_reply.started\":\"2023-04-30T13:58:15.967072Z\",\"shell.execute_reply\":\"2023-04-30T13:58:28.361342Z\"},\"trusted\":true},\"execution_count\":4,\"outputs\":[{\"name\":\"stdout\",\"text\":\"Processing /kaggle/working/ColossalAI/applications/Chat\\n  Preparing metadata (setup.py) ... \\u001b[?25ldone\\n\\u001b[?25hRequirement already satisfied: transformers>=4.20.1 in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (4.28.0.dev0)\\nRequirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (4.64.1)\\nRequirement already satisfied: datasets in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (2.1.0)\\nCollecting loralib\\n  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)\\nRequirement already satisfied: colossalai>=0.2.4 in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.2.8)\\nRequirement already satisfied: torch<2.0.0,>=1.12.1 in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (1.13.0)\\nCollecting langchain\\n  Downloading langchain-0.0.27-py3-none-any.whl (124 kB)\\n\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m124.9/124.9 kB\\u001b[0m \\u001b[31m4.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\\u001b[?25hRequirement already satisfied: tokenizers in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.13.2)\\nRequirement already satisfied: fastapi in 
/opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.89.1)\\nCollecting sse_starlette\\n  Downloading sse_starlette-0.10.3-py3-none-any.whl (8.0 kB)\\nRequirement already satisfied: wandb in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.14.0)\\nRequirement already satisfied: sentencepiece in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (0.1.97)\\nRequirement already satisfied: gpustat in /opt/conda/lib/python3.7/site-packages (from coati==1.0.0) (1.0.0)\\nRequirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (1.21.6)\\nRequirement already satisfied: safetensors in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.1)\\nRequirement already satisfied: click in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (8.1.3)\\nRequirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (23.0)\\nRequirement already satisfied: fabric in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (3.0.1)\\nRequirement already satisfied: psutil in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (5.9.3)\\nRequirement already satisfied: pre-commit in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (2.21.0)\\nRequirement already satisfied: rich in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (13.2.0)\\nRequirement already satisfied: ninja in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (1.11.1)\\nRequirement already satisfied: contexttimer in /opt/conda/lib/python3.7/site-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.3)\\nRequirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from torch<2.0.0,>=1.12.1->coati==1.0.0) (4.4.0)\\nRequirement already satisfied: importlib-metadata 
in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (4.11.4)\\nRequirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (2.28.2)\\nRequirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (3.9.0)\\nRequirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (6.0)\\nRequirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (2021.11.10)\\nRequirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.7/site-packages (from transformers>=4.20.1->coati==1.0.0) (0.13.3)\\nRequirement already satisfied: multiprocess in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (0.70.14)\\nRequirement already satisfied: aiohttp in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (3.8.3)\\nRequirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (1.3.5)\\nRequirement already satisfied: pyarrow>=5.0.0 in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (5.0.0)\\nRequirement already satisfied: xxhash in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (3.2.0)\\nRequirement already satisfied: responses<0.19 in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (0.18.0)\\nRequirement already satisfied: dill in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (0.3.6)\\nRequirement already satisfied: fsspec[http]>=2021.05.0 in /opt/conda/lib/python3.7/site-packages (from datasets->coati==1.0.0) (2023.1.0)\\nRequirement already satisfied: starlette==0.22.0 in /opt/conda/lib/python3.7/site-packages (from fastapi->coati==1.0.0) (0.22.0)\\nRequirement already satisfied: 
pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2 in /opt/conda/lib/python3.7/site-packages (from fastapi->coati==1.0.0) (1.10.4)\\nRequirement already satisfied: anyio<5,>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from starlette==0.22.0->fastapi->coati==1.0.0) (3.6.2)\\nRequirement already satisfied: nvidia-ml-py<=11.495.46,>=11.450.129 in /opt/conda/lib/python3.7/site-packages (from gpustat->coati==1.0.0) (11.495.46)\\nRequirement already satisfied: six>=1.7 in /opt/conda/lib/python3.7/site-packages (from gpustat->coati==1.0.0) (1.16.0)\\nRequirement already satisfied: blessed>=1.17.1 in /opt/conda/lib/python3.7/site-packages (from gpustat->coati==1.0.0) (1.19.1)\\nRequirement already satisfied: sqlalchemy in /opt/conda/lib/python3.7/site-packages (from langchain->coati==1.0.0) (1.4.46)\\nRequirement already satisfied: GitPython!=3.1.29,>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (3.1.30)\\nRequirement already satisfied: pathtools in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (0.1.2)\\nRequirement already satisfied: protobuf!=4.21.0,<5,>=3.12.0 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (3.20.3)\\nRequirement already satisfied: setproctitle in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (1.3.2)\\nRequirement already satisfied: sentry-sdk>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (1.18.0)\\nRequirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (59.8.0)\\nRequirement already satisfied: appdirs>=1.4.3 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (1.4.4)\\nRequirement already satisfied: docker-pycreds>=0.4.0 in /opt/conda/lib/python3.7/site-packages (from wandb->coati==1.0.0) (0.4.0)\\nRequirement already satisfied: wcwidth>=0.1.4 in /opt/conda/lib/python3.7/site-packages (from blessed>=1.17.1->gpustat->coati==1.0.0) 
(0.2.6)\\nRequirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (1.3.1)\\nRequirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (6.0.4)\\nRequirement already satisfied: charset-normalizer<3.0,>=2.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (2.1.1)\\nRequirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (4.0.2)\\nRequirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (1.8.2)\\nRequirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (22.2.0)\\nRequirement already satisfied: asynctest==0.13.0 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (0.13.0)\\nRequirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from aiohttp->datasets->coati==1.0.0) (1.3.3)\\nRequirement already satisfied: gitdb<5,>=4.0.1 in /opt/conda/lib/python3.7/site-packages (from GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0) (4.0.10)\\nRequirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->transformers>=4.20.1->coati==1.0.0) (3.4)\\nRequirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->transformers>=4.20.1->coati==1.0.0) (2022.12.7)\\nRequirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->transformers>=4.20.1->coati==1.0.0) (1.26.14)\\nRequirement already satisfied: paramiko>=2.4 in /opt/conda/lib/python3.7/site-packages (from fabric->colossalai>=0.2.4->coati==1.0.0) (3.1.0)\\nRequirement already satisfied: invoke>=2.0 in /opt/conda/lib/python3.7/site-packages (from 
fabric->colossalai>=0.2.4->coati==1.0.0) (2.1.0)\\nRequirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->transformers>=4.20.1->coati==1.0.0) (3.11.0)\\nRequirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas->datasets->coati==1.0.0) (2.8.2)\\nRequirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas->datasets->coati==1.0.0) (2023.3)\\nRequirement already satisfied: virtualenv>=20.10.0 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (20.17.1)\\nRequirement already satisfied: cfgv>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (3.3.1)\\nRequirement already satisfied: nodeenv>=0.11.1 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (1.7.0)\\nRequirement already satisfied: identify>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (2.5.23)\\nRequirement already satisfied: markdown-it-py<3.0.0,>=2.1.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.1.0)\\nRequirement already satisfied: pygments<3.0.0,>=2.6.0 in /opt/conda/lib/python3.7/site-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.14.0)\\nRequirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.7/site-packages (from sqlalchemy->langchain->coati==1.0.0) (2.0.1)\\nRequirement already satisfied: sniffio>=1.1 in /opt/conda/lib/python3.7/site-packages (from anyio<5,>=3.4.0->starlette==0.22.0->fastapi->coati==1.0.0) (1.3.0)\\nRequirement already satisfied: smmap<6,>=3.0.1 in /opt/conda/lib/python3.7/site-packages (from gitdb<5,>=4.0.1->GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0) (5.0.0)\\nRequirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.7/site-packages (from 
markdown-it-py<3.0.0,>=2.1.0->rich->colossalai>=0.2.4->coati==1.0.0) (0.1.2)\\nRequirement already satisfied: cryptography>=3.3 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (38.0.2)\\nRequirement already satisfied: pynacl>=1.5 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.5.0)\\nRequirement already satisfied: bcrypt>=3.2 in /opt/conda/lib/python3.7/site-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (4.0.1)\\nRequirement already satisfied: platformdirs<3,>=2.4 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (2.6.2)\\nRequirement already satisfied: distlib<1,>=0.3.6 in /opt/conda/lib/python3.7/site-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (0.3.6)\\nRequirement already satisfied: cffi>=1.12 in /opt/conda/lib/python3.7/site-packages (from cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.15.1)\\nRequirement already satisfied: pycparser in /opt/conda/lib/python3.7/site-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (2.21)\\nBuilding wheels for collected packages: coati\\n  Building wheel for coati (setup.py) ... \\u001b[?25ldone\\n\\u001b[?25h  Created wheel for coati: filename=coati-1.0.0-py3-none-any.whl size=73195 sha256=114e624d66aa7c22966144210804c68036dda419db454e1fb5a6b657d58b5879\\n  Stored in directory: /tmp/pip-ephem-wheel-cache-pshxknq0/wheels/19/ab/40/58b2528cfb9dab45fa2cdceeff3538d85c2c72d65872c4de6a\\nSuccessfully built coati\\nInstalling collected packages: loralib, sse_starlette, langchain, coati\\nSuccessfully installed coati-1.0.0 langchain-0.0.27 loralib-0.1.1 sse_starlette-0.10.3\\n\\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\\u001b[0m\\u001b[33m\\n\\u001b[0m\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"We also need to copy the **analyzer** part into the corresponding **python packages** directory of the current system (otherwise you may later get an error saying analyzer cannot be found)\\n\\nIf you are wondering how to find out where the packages path is: it shows up in the output of the various pip install commands executed above.\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"!cp -r /kaggle/working/ColossalAI/colossalai/_analyzer/ /opt/conda/lib/python3.7/site-packages/colossalai/\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:58:28.364761Z\",\"iopub.execute_input\":\"2023-04-30T13:58:28.365247Z\",\"iopub.status.idle\":\"2023-04-30T13:58:29.347282Z\",\"shell.execute_reply.started\":\"2023-04-30T13:58:28.365204Z\",\"shell.execute_reply\":\"2023-04-30T13:58:29.345935Z\"},\"trusted\":true},\"execution_count\":5,\"outputs\":[]},{\"cell_type\":\"markdown\",\"source\":\"### 2 Download a pretrained model (using bloom as an example)\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"First, **the following command must be run** (if you are trying this on Colab, you may need it as well)\\n\\nWhat happens if you skip it?\\n- A model git-cloned from Hugging Face will look as if it has been downloaded, but the downloaded files are not the actual model files (check the file sizes: they are only a few bytes)\\n- Since the downloaded files are not the real model, running the SFT code then fails with: **safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge**\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"!sudo apt-get install git-lfs\\n!git lfs install\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:58:29.349334Z\",\"iopub.execute_input\":\"2023-04-30T13:58:29.349759Z\",\"iopub.status.idle\":\"2023-04-30T13:58:33.723349Z\",\"shell.execute_reply.started\":\"2023-04-30T13:58:29.349713Z\",\"shell.execute_reply\":\"2023-04-30T13:58:33.722173Z\"},\"trusted\":true},\"execution_count\":6,\"outputs\":[{\"name\":\"stdout\",\"text\":\"Reading package lists... Done\\nBuilding dependency tree       \\nReading state information... 
Done\\ngit-lfs is already the newest version (2.9.2-1).\\n0 upgraded, 0 newly installed, 0 to remove and 76 not upgraded.\\nUpdated git hooks.\\nGit LFS initialized.\\n\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"**Download one of the models supported by ColossalAI**; we use bloomz-560m as an example. Below we place the model in /kaggle/working/, but this is not mandatory; feel free to put it wherever you prefer.\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"os.chdir('/kaggle/working/')\\n!git clone https://huggingface.co/bigscience/bloomz-560m\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:58:33.726131Z\",\"iopub.execute_input\":\"2023-04-30T13:58:33.726802Z\",\"iopub.status.idle\":\"2023-04-30T13:59:00.114858Z\",\"shell.execute_reply.started\":\"2023-04-30T13:58:33.726758Z\",\"shell.execute_reply\":\"2023-04-30T13:59:00.113640Z\"},\"trusted\":true},\"execution_count\":7,\"outputs\":[{\"name\":\"stdout\",\"text\":\"Cloning into 'bloomz-560m'...\\nremote: Enumerating objects: 1332, done.\\u001b[K\\nremote: Counting objects: 100% (10/10), done.\\u001b[K\\nremote: Compressing objects: 100% (10/10), done.\\u001b[K\\nremote: Total 1332 (delta 3), reused 0 (delta 0), pack-reused 1322\\u001b[K\\nReceiving objects: 100% (1332/1332), 7.18 MiB | 22.55 MiB/s, done.\\nResolving deltas: 100% (616/616), done.\\nFiltering content: 100% (8/8), 2.11 GiB | 88.79 MiB/s, done.\\n\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"### 3 Run SFT\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"os.chdir('/kaggle/working/ColossalAI/applications/Chat/examples')\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:59:00.116823Z\",\"iopub.execute_input\":\"2023-04-30T13:59:00.117188Z\",\"iopub.status.idle\":\"2023-04-30T13:59:00.124033Z\",\"shell.execute_reply.started\":\"2023-04-30T13:59:00.117154Z\",\"shell.execute_reply\":\"2023-04-30T13:59:00.123036Z\"},\"trusted\":true},\"execution_count\":8,\"outputs\":[]},{\"cell_type\":\"markdown\",\"source\":\"**Run the SFT code**\\n\\nHere we run the .py file directly; if you follow [the documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples) and run the shell script instead (!bash train_sft.sh), that works too. Either way, train_sft.py is what ends up being run.\\n\\nIn the command below, for demonstration purposes\\n- we use an extremely small dataset (--dataset) just to get the program running (it has only 5 samples)\\n- model: changed to “bloom”\\n- pretrain: changed to the path of the model we downloaded ourselves\\n- save_path: changed to the directory where we want the output\\n\\n**Keep in mind:**\\n- The Kaggle notebook GPU is T4x2, so there is roughly 12+12=24 GB of VRAM. We can therefore set **--nproc_per_node=2** for training (with 1, one GPU would sit idle and the available VRAM would be halved)\\n- You may need to explore the other options on your own, e.g. lora, gradient checkpointing, etc. See the [official documentation](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples) for more details on the parameters\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"!torchrun --standalone --nproc_per_node=2 train_sft.py \\\\\\n    --pretrain \\\"/kaggle/working/bloomz-560m\\\" \\\\\\n    --model 'bloom' \\\\\\n    --strategy colossalai_zero2 \\\\\\n    --log_interval 50 \\\\\\n    --save_path  \\\"/kaggle/working/bloomz-560m-finetuned\\\" \\\\\\n    --dataset \\\"/kaggle/input/instructdata/instinwild_ch_small.json\\\" \\\\\\n    --batch_size 4 \\\\\\n    --accumulation_steps 8 \\\\\\n    --lr 2e-5 \\\\\\n    --max_datasets_size 512 \\\\\\n    --max_epochs 
1\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T13:59:00.125449Z\",\"iopub.execute_input\":\"2023-04-30T13:59:00.126251Z\",\"iopub.status.idle\":\"2023-04-30T14:03:20.357385Z\",\"shell.execute_reply.started\":\"2023-04-30T13:59:00.126214Z\",\"shell.execute_reply\":\"2023-04-30T14:03:20.354320Z\"},\"trusted\":true},\"execution_count\":9,\"outputs\":[{\"name\":\"stdout\",\"text\":\"\\u001b[2;36m[04/30/23 13:59:19]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/colossalai/c\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35montext/\\u001b[0m\\u001b[95mparallel_context.py\\u001b[0m:\\u001b[1;36m522\\u001b[0m set_device          \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: process rank \\u001b[1;36m0\\u001b[0m is  \\n\\u001b[2;36m                    \\u001b[0m         bound to device \\u001b[1;36m0\\u001b[0m                                  \\n\\u001b[2;36m[04/30/23 13:59:24]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/colossalai/c\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35montext/\\u001b[0m\\u001b[95mparallel_context.py\\u001b[0m:\\u001b[1;36m558\\u001b[0m set_seed            \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: initialized seed on\\n\\u001b[2;36m                    \\u001b[0m         rank \\u001b[1;36m0\\u001b[0m, numpy: \\u001b[1;36m42\\u001b[0m, python random: \\u001b[1;36m42\\u001b[0m,              \\n\\u001b[2;36m                    \\u001b[0m         ParallelMode.DATA: 
\\u001b[1;36m42\\u001b[0m, ParallelMode.TENSOR: \\u001b[1;36m42\\u001b[0m,the \\n\\u001b[2;36m                    \\u001b[0m         default parallel seed is ParallelMode.DATA.        \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/colossalai/\\u001b[0m\\u001b[95mi\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[95mnitialize.py\\u001b[0m:\\u001b[1;36m119\\u001b[0m launch                            \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Distributed        \\n\\u001b[2;36m                    \\u001b[0m         environment is initialized, data parallel size: \\u001b[1;36m1\\u001b[0m, \\n\\u001b[2;36m                    \\u001b[0m         pipeline parallel size: \\u001b[1;36m1\\u001b[0m, tensor parallel size: \\u001b[1;36m1\\u001b[0m \\n\\u001b[2;36m[04/30/23 14:03:09]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35mt/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m121\\u001b[0m __init__                      \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Loading data\\u001b[33m...\\u001b[0m    \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\\u001b[0m\\n\\u001b[2;36m                    
\\u001b[0m         \\u001b[35mt/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m123\\u001b[0m __init__                      \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Loaded \\u001b[1;36m6\\u001b[0m examples. \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35mt/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m126\\u001b[0m __init__                      \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Limiting dataset to\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[1;36m512\\u001b[0m examples.                                      \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35mt/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m129\\u001b[0m __init__                      \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Formatting         \\n\\u001b[2;36m                    \\u001b[0m         inputs\\u001b[33m...\\u001b[0m                                          \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\\u001b[2;36m                    \\u001b[0m         
\\u001b[35m/opt/conda/lib/python3.7/site-packages/coati/datase\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35mt/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m137\\u001b[0m __init__                      \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Tokenizing         \\n\\u001b[2;36m                    \\u001b[0m         inputs\\u001b[33m...\\u001b[0m This may take some time\\u001b[33m...\\u001b[0m               \\nsteps: 0it [00:00, ?it/s]\\u001b[2;36m[04/30/23 14:03:13]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING:                 \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/coati/traine\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35mr/\\u001b[0m\\u001b[95msft.py\\u001b[0m:\\u001b[1;36m86\\u001b[0m fit                                    \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING: batch_i\\u001b[1;92md:0\\u001b[0m,     \\n\\u001b[2;36m                    \\u001b[0m         abnormal loss: \\u001b[1;36m3.6484375\\u001b[0m                           \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING:                 \\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/opt/conda/lib/python3.7/site-packages/coati/traine\\u001b[0m\\n\\u001b[2;36m                    \\u001b[0m         \\u001b[35mr/\\u001b[0m\\u001b[95msft.py\\u001b[0m:\\u001b[1;36m86\\u001b[0m fit                                    \\n\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING: batch_i\\u001b[1;92md:1\\u001b[0m,     \\n\\u001b[2;36m                    \\u001b[0m         
abnormal loss: \\u001b[1;36m4.1015625\\u001b[0m                           \\nsteps: 0it [00:02, ?it/s]\\n\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"### 4 下载SFT完成的模型\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"os.chdir('/kaggle/working/')\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T14:03:20.361678Z\",\"iopub.execute_input\":\"2023-04-30T14:03:20.362417Z\",\"iopub.status.idle\":\"2023-04-30T14:03:20.368009Z\",\"shell.execute_reply.started\":\"2023-04-30T14:03:20.362368Z\",\"shell.execute_reply\":\"2023-04-30T14:03:20.367092Z\"},\"trusted\":true},\"execution_count\":10,\"outputs\":[]},{\"cell_type\":\"markdown\",\"source\":\"理论上，我们是可以通过Kaggle Notebook右边的界面，选择想要下载的文件进行下载。但是由于Kaggle界面做得并不是很好，**经常会出现目录不能显示的问题**。好在，你应该知道训练好的模型放在了哪里，所以我们可以用另一种办法将文件下载下来。\",\"metadata\":{}},{\"cell_type\":\"markdown\",\"source\":\"首先，我们把**完成的模型打包**（这样我们就可以一下子全都下载下来，而不用一个一个地下载）\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"!tar -czvf bloomz-560m-finetuned.tar.gz bloomz-560m-finetuned\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T14:03:20.369504Z\",\"iopub.execute_input\":\"2023-04-30T14:03:20.369861Z\",\"iopub.status.idle\":\"2023-04-30T14:04:11.809532Z\",\"shell.execute_reply.started\":\"2023-04-30T14:03:20.369826Z\",\"shell.execute_reply\":\"2023-04-30T14:04:11.808272Z\"},\"trusted\":true},\"execution_count\":11,\"outputs\":[{\"name\":\"stdout\",\"text\":\"bloomz-560m-finetuned/\\nbloomz-560m-finetuned/config.json\\nbloomz-560m-finetuned/pytorch_model.bin\\nbloomz-560m-finetuned/tokenizer_config.json\\nbloomz-560m-finetuned/generation_config.json\\nbloomz-560m-finetuned/special_tokens_map.json\\nbloomz-560m-finetuned/tokenizer.json\\n\",\"output_type\":\"stream\"}]},{\"cell_type\":\"markdown\",\"source\":\"完成打包后，我们**想办法获得下载链接**。执行下面的命令后，直接点击链接便可下载。\",\"metadata\":{}},{\"cell_type\":\"code\",\"source\":\"from IPython.display import 
FileLink\\nFileLink(r'bloomz-560m-finetuned.tar.gz')\",\"metadata\":{\"execution\":{\"iopub.status.busy\":\"2023-04-30T14:04:11.811207Z\",\"iopub.execute_input\":\"2023-04-30T14:04:11.811637Z\",\"iopub.status.idle\":\"2023-04-30T14:04:11.821318Z\",\"shell.execute_reply.started\":\"2023-04-30T14:04:11.811584Z\",\"shell.execute_reply\":\"2023-04-30T14:04:11.820184Z\"},\"trusted\":true},\"execution_count\":12,\"outputs\":[{\"execution_count\":12,\"output_type\":\"execute_result\",\"data\":{\"text/plain\":\"/kaggle/working/bloomz-560m-finetuned.tar.gz\",\"text/html\":\"<a href='bloomz-560m-finetuned.tar.gz' target='_blank'>bloomz-560m-finetuned.tar.gz</a><br>\"},\"metadata\":{}}]},{\"cell_type\":\"markdown\",\"source\":\"### 5 小结\\n仅仅跑起来只是一个开始。祝愿每个小伙伴最终都能获得更好的GPU资源、更棒的数据和更出色的属于自己的语言模型！\",\"metadata\":{}}]}\n"
  },
  {
    "path": "02-colossalai-sft-colab.ipynb",
    "content": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": [],\n      \"gpuType\": \"T4\",\n      \"collapsed_sections\": [\n        \"NY7XsonIG8Ev\",\n        \"bYe-_gnDHEEa\"\n      ]\n    },\n    \"kernelspec\": {\n      \"name\": \"python3\",\n      \"display_name\": \"Python 3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    },\n    \"accelerator\": \"GPU\",\n    \"gpuClass\": \"standard\"\n  },\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# 语言模型练习场02：ColossalAI（SFT部分）Google Colab版本\"\n      ],\n      \"metadata\": {\n        \"id\": \"AamIytcNnxL1\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"- [https://github.com/hpcaitech/ColossalAI](https://github.com/hpcaitech/ColossalAI)\"\n      ],\n      \"metadata\": {\n        \"id\": \"yAw_VzT6pVwW\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"**注意：**\\n\",\n        \"- 此notebook**只演示在google colab下如何跑通ColossalAI的SFT部分**，并不会包含超参数的调整、对结果的分析等\\n\",\n        \"- 如果想在**Kaggle Notebook**中跑，可以看这篇文章的Kaggle版：[穷穷穷孩子如何体验ColossalAI SFT（Kaggle篇）](https://mp.weixin.qq.com/s/Q29uSNxvPMy0rC-QxHiGZA)\\n\",\n        \"- **如果你有自己的机器**，则此notebook对你的帮助可能不大（因为你不需要在notebook上进行训练）\\n\",\n        \"- 此notebook的受众是手里没有GPU资源，但是又想熟悉和浅浅尝试ColossalAI的小伙伴\\n\",\n        \"- Google Colab目前可以获得的免费运算资源是**T4（大概有15GB的显存）**\\n\",\n        \"- **运行太长时间会自动终止**。所以建议重要的运行结果保存路径要指向自己的google drive目录中，而不那么重要的文件可以直接安装到默认目录下（终止后会被自动删除）\"\n      ],\n      \"metadata\": {\n        \"id\": \"hmoMi8BspejU\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"**数据的准备：**\\n\",\n        \"- 根据[官方文档的提示](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples)，在运行前需要准备好数据（https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples）\\n\",\n        \"- 
数据可以在[这里下载](https://github.com/XueFuzhao/InstructionWild/tree/main/data)。注意不要下载seed文件（因为seed文件只有instruction，而没有response），要下载README里面提到的json文件，例如instinwild_ch.json（https://github.com/XueFuzhao/InstructionWild/tree/main/data）\"\n      ],\n      \"metadata\": {\n        \"id\": \"lvubgzEuGch2\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"## 安装环境\"\n      ],\n      \"metadata\": {\n        \"id\": \"zN_47P5fG3c_\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"### 1 安装ColossalAI\"\n      ],\n      \"metadata\": {\n        \"id\": \"NY7XsonIG8Ev\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"PhpvqFt06DSm\",\n        \"outputId\": \"62102879-51e9-4014-a248-7e2020d83ded\"\n      },\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"Cloning into 'ColossalAI'...\\n\",\n            \"remote: Enumerating objects: 26116, done.\\u001b[K\\n\",\n            \"remote: Counting objects: 100% (5/5), done.\\u001b[K\\n\",\n            \"remote: Compressing objects: 100% (5/5), done.\\u001b[K\\n\",\n            \"remote: Total 26116 (delta 0), reused 0 (delta 0), pack-reused 26111\\u001b[K\\n\",\n            \"Receiving objects: 100% (26116/26116), 23.36 MiB | 34.12 MiB/s, done.\\n\",\n            \"Resolving deltas: 100% (17448/17448), done.\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"!git clone https://github.com/hpcaitech/ColossalAI.git\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"import os\\n\",\n        \"os.chdir('./ColossalAI')\\n\",\n        \"!pip install .\"\n      ],\n      \"metadata\": {\n        \"id\": \"kWmQqVbu6Ncc\",\n        \"colab\": {\n     
     \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"outputId\": \"7554b306-038f-4b06-a5e0-a7734c8e8945\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\\n\",\n            \"Processing /content/ColossalAI\\n\",\n            \"  Preparing metadata (setup.py) ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (1.22.4)\\n\",\n            \"Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (4.65.0)\\n\",\n            \"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (5.9.5)\\n\",\n            \"Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (23.1)\\n\",\n            \"Collecting pre-commit (from colossalai==0.2.8)\\n\",\n            \"  Downloading pre_commit-3.3.2-py2.py3-none-any.whl (202 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m202.8/202.8 kB\\u001b[0m \\u001b[31m6.5 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (13.3.4)\\n\",\n            \"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (8.1.3)\\n\",\n            \"Collecting fabric (from colossalai==0.2.8)\\n\",\n            \"  Downloading fabric-3.0.1-py3-none-any.whl (53 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m53.3/53.3 kB\\u001b[0m \\u001b[31m7.0 
MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting contexttimer (from colossalai==0.2.8)\\n\",\n            \"  Downloading contexttimer-0.3.3.tar.gz (4.9 kB)\\n\",\n            \"  Preparing metadata (setup.py) ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"Collecting ninja (from colossalai==0.2.8)\\n\",\n            \"  Downloading ninja-1.11.1-py2.py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (145 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m146.0/146.0 kB\\u001b[0m \\u001b[31m21.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: torch>=1.11 in /usr/local/lib/python3.10/dist-packages (from colossalai==0.2.8) (2.0.1+cu118)\\n\",\n            \"Collecting safetensors (from colossalai==0.2.8)\\n\",\n            \"  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m1.3/1.3 MB\\u001b[0m \\u001b[31m39.8 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (3.12.0)\\n\",\n            \"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (4.5.0)\\n\",\n            \"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (1.11.1)\\n\",\n            \"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (3.1)\\n\",\n            \"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (3.1.2)\\n\",\n            
\"Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11->colossalai==0.2.8) (2.0.0)\\n\",\n            \"Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.11->colossalai==0.2.8) (3.25.2)\\n\",\n            \"Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.11->colossalai==0.2.8) (16.0.5)\\n\",\n            \"Collecting invoke>=2.0 (from fabric->colossalai==0.2.8)\\n\",\n            \"  Downloading invoke-2.1.2-py3-none-any.whl (160 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m160.1/160.1 kB\\u001b[0m \\u001b[31m19.9 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting paramiko>=2.4 (from fabric->colossalai==0.2.8)\\n\",\n            \"  Downloading paramiko-3.1.0-py3-none-any.whl (211 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m211.2/211.2 kB\\u001b[0m \\u001b[31m26.5 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting cfgv>=2.0.0 (from pre-commit->colossalai==0.2.8)\\n\",\n            \"  Downloading cfgv-3.3.1-py2.py3-none-any.whl (7.3 kB)\\n\",\n            \"Collecting identify>=1.0.0 (from pre-commit->colossalai==0.2.8)\\n\",\n            \"  Downloading identify-2.5.24-py2.py3-none-any.whl (98 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m98.8/98.8 kB\\u001b[0m \\u001b[31m13.6 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting nodeenv>=0.11.1 (from pre-commit->colossalai==0.2.8)\\n\",\n            \"  Downloading nodeenv-1.8.0-py2.py3-none-any.whl (22 kB)\\n\",\n            \"Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from 
pre-commit->colossalai==0.2.8) (6.0)\\n\",\n            \"Collecting virtualenv>=20.10.0 (from pre-commit->colossalai==0.2.8)\\n\",\n            \"  Downloading virtualenv-20.23.0-py3-none-any.whl (3.3 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m3.3/3.3 MB\\u001b[0m \\u001b[31m81.4 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: markdown-it-py<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai==0.2.8) (2.2.0)\\n\",\n            \"Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai==0.2.8) (2.14.0)\\n\",\n            \"Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py<3.0.0,>=2.2.0->rich->colossalai==0.2.8) (0.1.2)\\n\",\n            \"Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nodeenv>=0.11.1->pre-commit->colossalai==0.2.8) (67.7.2)\\n\",\n            \"Collecting bcrypt>=3.2 (from paramiko>=2.4->fabric->colossalai==0.2.8)\\n\",\n            \"  Downloading bcrypt-4.0.1-cp36-abi3-manylinux_2_28_x86_64.whl (593 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m593.7/593.7 kB\\u001b[0m \\u001b[31m54.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: cryptography>=3.3 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai==0.2.8) (40.0.2)\\n\",\n            \"Collecting pynacl>=1.5 (from paramiko>=2.4->fabric->colossalai==0.2.8)\\n\",\n            \"  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m856.7/856.7 
kB\\u001b[0m \\u001b[31m67.3 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting distlib<1,>=0.3.6 (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8)\\n\",\n            \"  Downloading distlib-0.3.6-py2.py3-none-any.whl (468 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m468.5/468.5 kB\\u001b[0m \\u001b[31m50.1 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: platformdirs<4,>=3.2 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->colossalai==0.2.8) (3.3.0)\\n\",\n            \"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11->colossalai==0.2.8) (2.1.2)\\n\",\n            \"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11->colossalai==0.2.8) (1.3.0)\\n\",\n            \"Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/dist-packages (from cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (1.15.1)\\n\",\n            \"Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai==0.2.8) (2.21)\\n\",\n            \"Building wheels for collected packages: colossalai, contexttimer\\n\",\n            \"  Building wheel for colossalai (setup.py) ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Created wheel for colossalai: filename=colossalai-0.2.8-py3-none-any.whl size=1106778 sha256=c48863e1d61e5a17a6d7d3fbd80022a7990e07e012015591dd40042ddb3b2760\\n\",\n            \"  Stored in directory: /tmp/pip-ephem-wheel-cache-bobgacrn/wheels/b1/fb/16/e46aa3127ee272b8cac710c8f76aa02445d96aaeed9da956ea\\n\",\n            \"  Building wheel for contexttimer (setup.py) ... 
\\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Created wheel for contexttimer: filename=contexttimer-0.3.3-py3-none-any.whl size=5803 sha256=21ac5cf4b74bf53b9a96c4f8705106a7107d03a3c3f118b2515e3f7419170e06\\n\",\n            \"  Stored in directory: /root/.cache/pip/wheels/72/1c/da/cfd97201d88ccce214427fa84a5caeb91fef7c5a1b4c4312b4\\n\",\n            \"Successfully built colossalai contexttimer\\n\",\n            \"Installing collected packages: safetensors, ninja, distlib, contexttimer, virtualenv, nodeenv, invoke, identify, cfgv, bcrypt, pynacl, pre-commit, paramiko, fabric, colossalai\\n\",\n            \"Successfully installed bcrypt-4.0.1 cfgv-3.3.1 colossalai-0.2.8 contexttimer-0.3.3 distlib-0.3.6 fabric-3.0.1 identify-2.5.24 invoke-2.1.2 ninja-1.11.1 nodeenv-1.8.0 paramiko-3.1.0 pre-commit-3.3.2 pynacl-1.5.0 safetensors-0.3.1 virtualenv-20.23.0\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"### 2 安装transformers\\n\",\n        \"这里我们安装的是hpcaitech下的transformers，如果直接pip install transformers是否可行并没有测试。\"\n      ],\n      \"metadata\": {\n        \"id\": \"bYe-_gnDHEEa\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"!git clone https://github.com/hpcaitech/transformers\\n\",\n        \"os.chdir('./transformers')\\n\",\n        \"!pip install .\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"PJTkDrHUhFpP\",\n        \"outputId\": \"11383ade-42af-4f76-ab3e-f47f955a863d\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"Cloning into 'transformers'...\\n\",\n            \"remote: Enumerating objects: 124468, done.\\u001b[K\\n\",\n            \"remote: Total 124468 (delta 0), reused 0 (delta 0), pack-reused 
124468\\u001b[K\\n\",\n            \"Receiving objects: 100% (124468/124468), 127.28 MiB | 27.99 MiB/s, done.\\n\",\n            \"Resolving deltas: 100% (93320/93320), done.\\n\",\n            \"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\\n\",\n            \"Processing /content/ColossalAI/transformers\\n\",\n            \"  Installing build dependencies ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Getting requirements to build wheel ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Preparing metadata (pyproject.toml) ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (3.12.0)\\n\",\n            \"Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0.dev0)\\n\",\n            \"  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m224.5/224.5 kB\\u001b[0m \\u001b[31m6.3 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (1.22.4)\\n\",\n            \"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (23.1)\\n\",\n            \"Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (6.0)\\n\",\n            \"Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (2022.10.31)\\n\",\n            \"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (2.27.1)\\n\",\n            \"Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from 
transformers==4.28.0.dev0)\\n\",\n            \"  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m7.8/7.8 MB\\u001b[0m \\u001b[31m64.6 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers==4.28.0.dev0) (4.65.0)\\n\",\n            \"Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.28.0.dev0) (2023.4.0)\\n\",\n            \"Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.28.0.dev0) (4.5.0)\\n\",\n            \"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (1.26.15)\\n\",\n            \"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (2022.12.7)\\n\",\n            \"Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (2.0.12)\\n\",\n            \"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.28.0.dev0) (3.4)\\n\",\n            \"Building wheels for collected packages: transformers\\n\",\n            \"  Building wheel for transformers (pyproject.toml) ... 
\\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Created wheel for transformers: filename=transformers-4.28.0.dev0-py3-none-any.whl size=6790611 sha256=92968d77f2b7dc7aa1557f1ab5447bb5ddbf00b2936e257790df7e00d1659f10\\n\",\n            \"  Stored in directory: /tmp/pip-ephem-wheel-cache-pyix5uo5/wheels/87/f3/6f/b220a07b1eb427c5c698eed3338325ec784fe66427d0989fa6\\n\",\n            \"Successfully built transformers\\n\",\n            \"Installing collected packages: tokenizers, huggingface-hub, transformers\\n\",\n            \"Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.0.dev0\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"### 3 Install the libraries required by Chat\"\n      ],\n      \"metadata\": {\n        \"id\": \"Y_KsSNVYHM_I\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"os.chdir('../applications/Chat/')\\n\",\n        \"!pip install .\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"F73wlOCahZRz\",\n        \"outputId\": \"e6c92639-0202-47d4-d917-637dc8ba064e\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\\n\",\n            \"Processing /content/ColossalAI/applications/Chat\\n\",\n            \"  Preparing metadata (setup.py) ... 
\\u001b[?25l\\u001b[?25hdone\\n\",\n            \"Requirement already satisfied: transformers>=4.20.1 in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (4.28.0.dev0)\\n\",\n            \"Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (4.65.0)\\n\",\n            \"Collecting datasets (from coati==1.0.0)\\n\",\n            \"  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m474.6/474.6 kB\\u001b[0m \\u001b[31m10.9 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting loralib (from coati==1.0.0)\\n\",\n            \"  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)\\n\",\n            \"Requirement already satisfied: colossalai>=0.2.4 in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (0.2.8)\\n\",\n            \"Collecting torch<2.0.0,>=1.12.1 (from coati==1.0.0)\\n\",\n            \"  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m887.5/887.5 MB\\u001b[0m \\u001b[31m1.6 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting langchain (from coati==1.0.0)\\n\",\n            \"  Downloading langchain-0.0.178-py3-none-any.whl (892 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m892.2/892.2 kB\\u001b[0m \\u001b[31m15.9 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: tokenizers in /usr/local/lib/python3.10/dist-packages (from coati==1.0.0) (0.13.3)\\n\",\n            \"Collecting fastapi (from coati==1.0.0)\\n\",\n            \"  Downloading fastapi-0.95.2-py3-none-any.whl (56 kB)\\n\",\n            \"\\u001b[2K     
\\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m57.0/57.0 kB\\u001b[0m \\u001b[31m7.8 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting sse_starlette (from coati==1.0.0)\\n\",\n            \"  Downloading sse_starlette-1.6.1-py3-none-any.whl (9.6 kB)\\n\",\n            \"Collecting wandb (from coati==1.0.0)\\n\",\n            \"  Downloading wandb-0.15.3-py3-none-any.whl (2.0 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m2.0/2.0 MB\\u001b[0m \\u001b[31m17.2 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting sentencepiece (from coati==1.0.0)\\n\",\n            \"  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m1.3/1.3 MB\\u001b[0m \\u001b[31m16.3 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting gpustat (from coati==1.0.0)\\n\",\n            \"  Downloading gpustat-1.1.tar.gz (97 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m97.9/97.9 kB\\u001b[0m \\u001b[31m13.6 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25h  Installing build dependencies ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Getting requirements to build wheel ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Preparing metadata (pyproject.toml) ... 
\\u001b[?25l\\u001b[?25hdone\\n\",\n            \"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (1.22.4)\\n\",\n            \"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (5.9.5)\\n\",\n            \"Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (23.1)\\n\",\n            \"Requirement already satisfied: pre-commit in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (3.3.2)\\n\",\n            \"Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (13.3.4)\\n\",\n            \"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (8.1.3)\\n\",\n            \"Requirement already satisfied: fabric in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (3.0.1)\\n\",\n            \"Requirement already satisfied: contexttimer in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.3)\\n\",\n            \"Requirement already satisfied: ninja in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (1.11.1)\\n\",\n            \"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from colossalai>=0.2.4->coati==1.0.0) (0.3.1)\\n\",\n            \"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch<2.0.0,>=1.12.1->coati==1.0.0) (4.5.0)\\n\",\n            \"Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\\n\",\n            \"  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)\\n\",\n            \"\\u001b[2K     
\\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m849.3/849.3 kB\\u001b[0m \\u001b[31m16.3 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting nvidia-cudnn-cu11==8.5.0.96 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\\n\",\n            \"  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m557.1/557.1 MB\\u001b[0m \\u001b[31m2.6 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting nvidia-cublas-cu11==11.10.3.66 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\\n\",\n            \"  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m317.1/317.1 MB\\u001b[0m \\u001b[31m4.1 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch<2.0.0,>=1.12.1->coati==1.0.0)\\n\",\n            \"  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m21.0/21.0 MB\\u001b[0m \\u001b[31m26.5 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2.0.0,>=1.12.1->coati==1.0.0) (67.7.2)\\n\",\n            \"Requirement already satisfied: wheel in /usr/local/lib/python3.10/dist-packages (from nvidia-cublas-cu11==11.10.3.66->torch<2.0.0,>=1.12.1->coati==1.0.0) (0.40.0)\\n\",\n            \"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (3.12.0)\\n\",\n            \"Requirement already 
satisfied: huggingface-hub<1.0,>=0.11.0 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (0.14.1)\\n\",\n            \"Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (6.0)\\n\",\n            \"Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (2022.10.31)\\n\",\n            \"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers>=4.20.1->coati==1.0.0) (2.27.1)\\n\",\n            \"Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets->coati==1.0.0) (9.0.0)\\n\",\n            \"Collecting dill<0.3.7,>=0.3.0 (from datasets->coati==1.0.0)\\n\",\n            \"  Downloading dill-0.3.6-py3-none-any.whl (110 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m110.5/110.5 kB\\u001b[0m \\u001b[31m13.3 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets->coati==1.0.0) (1.5.3)\\n\",\n            \"Collecting xxhash (from datasets->coati==1.0.0)\\n\",\n            \"  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m212.5/212.5 kB\\u001b[0m \\u001b[31m20.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting multiprocess (from datasets->coati==1.0.0)\\n\",\n            \"  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m134.3/134.3 kB\\u001b[0m \\u001b[31m16.6 MB/s\\u001b[0m eta 
\\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: fsspec[http]>=2021.11.1 in /usr/local/lib/python3.10/dist-packages (from datasets->coati==1.0.0) (2023.4.0)\\n\",\n            \"Collecting aiohttp (from datasets->coati==1.0.0)\\n\",\n            \"  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m1.0/1.0 MB\\u001b[0m \\u001b[31m28.5 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting responses<0.19 (from datasets->coati==1.0.0)\\n\",\n            \"  Downloading responses-0.18.0-py3-none-any.whl (38 kB)\\n\",\n            \"Requirement already satisfied: pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2 in /usr/local/lib/python3.10/dist-packages (from fastapi->coati==1.0.0) (1.10.7)\\n\",\n            \"Collecting starlette<0.28.0,>=0.27.0 (from fastapi->coati==1.0.0)\\n\",\n            \"  Downloading starlette-0.27.0-py3-none-any.whl (66 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m67.0/67.0 kB\\u001b[0m \\u001b[31m8.7 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting nvidia-ml-py>=11.450.129 (from gpustat->coati==1.0.0)\\n\",\n            \"  Downloading nvidia_ml_py-11.525.112-py3-none-any.whl (35 kB)\\n\",\n            \"Collecting blessed>=1.17.1 (from gpustat->coati==1.0.0)\\n\",\n            \"  Downloading blessed-1.20.0-py2.py3-none-any.whl (58 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m58.4/58.4 kB\\u001b[0m \\u001b[31m7.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: SQLAlchemy<3,>=1.4 in /usr/local/lib/python3.10/dist-packages (from langchain->coati==1.0.0) 
(2.0.10)\\n\",\n            \"Collecting async-timeout<5.0.0,>=4.0.0 (from langchain->coati==1.0.0)\\n\",\n            \"  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)\\n\",\n            \"Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain->coati==1.0.0)\\n\",\n            \"  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)\\n\",\n            \"Requirement already satisfied: numexpr<3.0.0,>=2.8.4 in /usr/local/lib/python3.10/dist-packages (from langchain->coati==1.0.0) (2.8.4)\\n\",\n            \"Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain->coati==1.0.0)\\n\",\n            \"  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m90.0/90.0 kB\\u001b[0m \\u001b[31m12.1 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: tenacity<9.0.0,>=8.1.0 in /usr/local/lib/python3.10/dist-packages (from langchain->coati==1.0.0) (8.2.2)\\n\",\n            \"Collecting GitPython!=3.1.29,>=1.0.0 (from wandb->coati==1.0.0)\\n\",\n            \"  Downloading GitPython-3.1.31-py3-none-any.whl (184 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m184.3/184.3 kB\\u001b[0m \\u001b[31m21.7 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting sentry-sdk>=1.0.0 (from wandb->coati==1.0.0)\\n\",\n            \"  Downloading sentry_sdk-1.24.0-py2.py3-none-any.whl (206 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m206.5/206.5 kB\\u001b[0m \\u001b[31m18.6 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting docker-pycreds>=0.4.0 (from wandb->coati==1.0.0)\\n\",\n            \"  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)\\n\",\n            
\"Collecting pathtools (from wandb->coati==1.0.0)\\n\",\n            \"  Downloading pathtools-0.1.2.tar.gz (11 kB)\\n\",\n            \"  Preparing metadata (setup.py) ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"Collecting setproctitle (from wandb->coati==1.0.0)\\n\",\n            \"  Downloading setproctitle-1.3.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)\\n\",\n            \"Requirement already satisfied: appdirs>=1.4.3 in /usr/local/lib/python3.10/dist-packages (from wandb->coati==1.0.0) (1.4.4)\\n\",\n            \"Requirement already satisfied: protobuf!=4.21.0,<5,>=3.19.0 in /usr/local/lib/python3.10/dist-packages (from wandb->coati==1.0.0) (3.20.3)\\n\",\n            \"Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->coati==1.0.0) (23.1.0)\\n\",\n            \"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->coati==1.0.0) (2.0.12)\\n\",\n            \"Collecting multidict<7.0,>=4.5 (from aiohttp->datasets->coati==1.0.0)\\n\",\n            \"  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m114.5/114.5 kB\\u001b[0m \\u001b[31m16.0 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting yarl<2.0,>=1.0 (from aiohttp->datasets->coati==1.0.0)\\n\",\n            \"  Downloading yarl-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (268 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m268.8/268.8 kB\\u001b[0m \\u001b[31m26.7 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting frozenlist>=1.1.1 (from aiohttp->datasets->coati==1.0.0)\\n\",\n          
  \"  Downloading frozenlist-1.3.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (149 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m149.6/149.6 kB\\u001b[0m \\u001b[31m19.4 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting aiosignal>=1.1.2 (from aiohttp->datasets->coati==1.0.0)\\n\",\n            \"  Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB)\\n\",\n            \"Requirement already satisfied: wcwidth>=0.1.4 in /usr/local/lib/python3.10/dist-packages (from blessed>=1.17.1->gpustat->coati==1.0.0) (0.2.6)\\n\",\n            \"Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from blessed>=1.17.1->gpustat->coati==1.0.0) (1.16.0)\\n\",\n            \"Collecting marshmallow<4.0.0,>=3.3.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\\n\",\n            \"  Downloading marshmallow-3.19.0-py3-none-any.whl (49 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m49.1/49.1 kB\\u001b[0m \\u001b[31m7.2 MB/s\\u001b[0m eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hCollecting marshmallow-enum<2.0.0,>=1.5.1 (from dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\\n\",\n            \"  Downloading marshmallow_enum-1.5.1-py2.py3-none-any.whl (4.2 kB)\\n\",\n            \"Collecting typing-inspect>=0.4.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\\n\",\n            \"  Downloading typing_inspect-0.8.0-py3-none-any.whl (8.7 kB)\\n\",\n            \"Collecting gitdb<5,>=4.0.1 (from GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0)\\n\",\n            \"  Downloading gitdb-4.0.10-py3-none-any.whl (62 kB)\\n\",\n            \"\\u001b[2K     \\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\\u001b[0m \\u001b[32m62.7/62.7 kB\\u001b[0m \\u001b[31m8.8 MB/s\\u001b[0m 
eta \\u001b[36m0:00:00\\u001b[0m\\n\",\n            \"\\u001b[?25hRequirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.20.1->coati==1.0.0) (1.26.15)\\n\",\n            \"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.20.1->coati==1.0.0) (2022.12.7)\\n\",\n            \"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers>=4.20.1->coati==1.0.0) (3.4)\\n\",\n            \"Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy<3,>=1.4->langchain->coati==1.0.0) (2.0.2)\\n\",\n            \"Requirement already satisfied: anyio<5,>=3.4.0 in /usr/local/lib/python3.10/dist-packages (from starlette<0.28.0,>=0.27.0->fastapi->coati==1.0.0) (3.6.2)\\n\",\n            \"Requirement already satisfied: invoke>=2.0 in /usr/local/lib/python3.10/dist-packages (from fabric->colossalai>=0.2.4->coati==1.0.0) (2.1.2)\\n\",\n            \"Requirement already satisfied: paramiko>=2.4 in /usr/local/lib/python3.10/dist-packages (from fabric->colossalai>=0.2.4->coati==1.0.0) (3.1.0)\\n\",\n            \"Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->coati==1.0.0) (2.8.2)\\n\",\n            \"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets->coati==1.0.0) (2022.7.1)\\n\",\n            \"Requirement already satisfied: cfgv>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (3.3.1)\\n\",\n            \"Requirement already satisfied: identify>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (2.5.24)\\n\",\n            \"Requirement already satisfied: nodeenv>=0.11.1 in 
/usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (1.8.0)\\n\",\n            \"Requirement already satisfied: virtualenv>=20.10.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->colossalai>=0.2.4->coati==1.0.0) (20.23.0)\\n\",\n            \"Requirement already satisfied: markdown-it-py<3.0.0,>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.2.0)\\n\",\n            \"Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->colossalai>=0.2.4->coati==1.0.0) (2.14.0)\\n\",\n            \"Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.4.0->starlette<0.28.0,>=0.27.0->fastapi->coati==1.0.0) (1.3.0)\\n\",\n            \"Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->GitPython!=3.1.29,>=1.0.0->wandb->coati==1.0.0)\\n\",\n            \"  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)\\n\",\n            \"Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py<3.0.0,>=2.2.0->rich->colossalai>=0.2.4->coati==1.0.0) (0.1.2)\\n\",\n            \"Requirement already satisfied: bcrypt>=3.2 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (4.0.1)\\n\",\n            \"Requirement already satisfied: cryptography>=3.3 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (40.0.2)\\n\",\n            \"Requirement already satisfied: pynacl>=1.5 in /usr/local/lib/python3.10/dist-packages (from paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.5.0)\\n\",\n            \"Collecting mypy-extensions>=0.3.0 (from typing-inspect>=0.4.0->dataclasses-json<0.6.0,>=0.5.7->langchain->coati==1.0.0)\\n\",\n            \"  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)\\n\",\n            \"Requirement already 
satisfied: distlib<1,>=0.3.6 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (0.3.6)\\n\",\n            \"Requirement already satisfied: platformdirs<4,>=3.2 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->colossalai>=0.2.4->coati==1.0.0) (3.3.0)\\n\",\n            \"Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/dist-packages (from cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (1.15.1)\\n\",\n            \"Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4->fabric->colossalai>=0.2.4->coati==1.0.0) (2.21)\\n\",\n            \"Building wheels for collected packages: coati, gpustat, pathtools\\n\",\n            \"  Building wheel for coati (setup.py) ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Created wheel for coati: filename=coati-1.0.0-py3-none-any.whl size=75334 sha256=13d5d09bdbb4b2b47045312ca337e68b88dfd2044087fbaf0c05bd593c1b4a35\\n\",\n            \"  Stored in directory: /tmp/pip-ephem-wheel-cache-tez1zypt/wheels/49/ba/eb/98b39707d3bcca1d3ecf646b531cdb25f480bd44ec5c0edafb\\n\",\n            \"  Building wheel for gpustat (pyproject.toml) ... \\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Created wheel for gpustat: filename=gpustat-1.1-py3-none-any.whl size=26280 sha256=a748ad4da7967293e3504c25de991a95a408f0d9a9caf2a91b7d4e09b0ee58fd\\n\",\n            \"  Stored in directory: /root/.cache/pip/wheels/ee/d0/2c/1e02440645c2318ba03aea99993a44a9108dc8f74de0bd370b\\n\",\n            \"  Building wheel for pathtools (setup.py) ... 
\\u001b[?25l\\u001b[?25hdone\\n\",\n            \"  Created wheel for pathtools: filename=pathtools-0.1.2-py3-none-any.whl size=8791 sha256=48ee994acbcb9e407204168f303fd9c252c0f7f64f4948b689f062e60891e7e3\\n\",\n            \"  Stored in directory: /root/.cache/pip/wheels/e7/f3/22/152153d6eb222ee7a56ff8617d80ee5207207a8c00a7aab794\\n\",\n            \"Successfully built coati gpustat pathtools\\n\",\n            \"Installing collected packages: sentencepiece, pathtools, nvidia-ml-py, xxhash, smmap, setproctitle, sentry-sdk, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cublas-cu11, mypy-extensions, multidict, marshmallow, loralib, frozenlist, docker-pycreds, dill, blessed, async-timeout, yarl, typing-inspect, starlette, responses, openapi-schema-pydantic, nvidia-cudnn-cu11, multiprocess, marshmallow-enum, gpustat, gitdb, aiosignal, torch, sse_starlette, GitPython, fastapi, dataclasses-json, aiohttp, wandb, langchain, datasets, coati\\n\",\n            \"  Attempting uninstall: torch\\n\",\n            \"    Found existing installation: torch 2.0.1+cu118\\n\",\n            \"    Uninstalling torch-2.0.1+cu118:\\n\",\n            \"      Successfully uninstalled torch-2.0.1+cu118\\n\",\n            \"\\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\\n\",\n            \"torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\\n\",\n            \"torchdata 0.6.1 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\\n\",\n            \"torchtext 0.15.2 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\\n\",\n            \"torchvision 0.15.2+cu118 requires torch==2.0.1, but you have torch 1.13.1 which is incompatible.\\u001b[0m\\u001b[31m\\n\",\n            \"\\u001b[0mSuccessfully installed GitPython-3.1.31 aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 blessed-1.20.0 coati-1.0.0 dataclasses-json-0.5.7 datasets-2.12.0 dill-0.3.6 docker-pycreds-0.4.0 fastapi-0.95.2 frozenlist-1.3.3 gitdb-4.0.10 gpustat-1.1 langchain-0.0.178 loralib-0.1.1 marshmallow-3.19.0 marshmallow-enum-1.5.1 multidict-6.0.4 multiprocess-0.70.14 mypy-extensions-1.0.0 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 nvidia-ml-py-11.525.112 openapi-schema-pydantic-1.2.4 pathtools-0.1.2 responses-0.18.0 sentencepiece-0.1.99 sentry-sdk-1.24.0 setproctitle-1.3.2 smmap-5.0.0 sse_starlette-1.6.1 starlette-0.27.0 torch-1.13.1 typing-inspect-0.8.0 wandb-0.15.3 xxhash-3.2.0 yarl-1.9.2\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"## Downloading the pretrained model (bloom as an example)\"\n      ],\n      \"metadata\": {\n        \"id\": \"qbtHtV4pHXQ4\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"os.chdir('/content/ColossalAI/applications/Chat/examples')\"\n      ],\n      \"metadata\": {\n        \"id\": \"WxRdvkAvhlJW\"\n      },\n      \"execution_count\": null,\n      \"outputs\": []\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"What happens if you skip the installation below?\\n\",\n        \"- The model you git clone from huggingface appears to download successfully, but the files you actually get are not the real model files (if you check their size, they are only a few bytes)\\n\",\n        \"- Once the downloaded files are not the real model, running the SFT code will fail with: **safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge**\"\n      ],\n      \"metadata\": {\n        \"id\": \"GdlFgQzmHo2t\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"!sudo apt-get install git-lfs\\n\",\n        \"!git lfs install\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"C4Pu4RmJiEK6\",\n        \"outputId\": \"e49226b3-fd13-4f48-9f41-43f23fa017d8\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"Reading package lists... Done\\n\",\n            \"Building dependency tree       \\n\",\n            \"Reading state information... Done\\n\",\n            \"git-lfs is already the newest version (2.9.2-1).\\n\",\n            \"0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.\\n\",\n            \"Updated git hooks.\\n\",\n            \"Git LFS initialized.\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"**Download one of the model families supported by ColossalAI**; we use bloomz-560m as an example. Below, we place the model in the current directory (ColossalAI/applications/Chat/examples), but this is not mandatory; you can change the location as you like.\"\n      ],\n      \"metadata\": {\n        \"id\": \"9JeE6R7QH4dJ\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"!git clone https://huggingface.co/bigscience/bloomz-560m\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"3xUK88K7iydi\",\n        \"outputId\": \"90626fc0-dd6c-46d9-d68b-08289ae034f6\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n 
         \"name\": \"stdout\",\n          \"text\": [\n            \"Cloning into 'bloomz-560m'...\\n\",\n            \"remote: Enumerating objects: 1332, done.\\u001b[K\\n\",\n            \"remote: Counting objects: 100% (10/10), done.\\u001b[K\\n\",\n            \"remote: Compressing objects: 100% (7/7), done.\\u001b[K\\n\",\n            \"remote: Total 1332 (delta 3), reused 10 (delta 3), pack-reused 1322\\u001b[K\\n\",\n            \"Receiving objects: 100% (1332/1332), 7.18 MiB | 22.90 MiB/s, done.\\n\",\n            \"Resolving deltas: 100% (616/616), done.\\n\",\n            \"Filtering content: 100% (8/8), 2.11 GiB | 57.58 MiB/s, done.\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"## 运行SFT\"\n      ],\n      \"metadata\": {\n        \"id\": \"wnZ6xjQfIEpL\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"我们这里是直接运行的py文件，如果你按照[文档的说明](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples)，运行sh脚本文件（!bash train_sft.sh）也是可以的。其本质上，都是运行这个train_sft.py文件。\\n\",\n        \"\\n\",\n        \"**训练完成的模型存放地址（save_path）**这里写的示例路径是临时路径，一旦notebook停止运行，这个路径就失效了。所以要么在notebook停止前先下载训练好的模型，要么这里可以考虑填写到google drive的目录下。\\n\",\n        \"\\n\",\n        \"在下面的命令中，为了演示\\n\",\n        \"- 我们只是用一个非常非常非常小的数据集（--dataset）去跑程序（小到就只有5条数据）\\n\",\n        \"- model：改为了“bloom”\\n\",\n        \"- pretrain：改成了我们自己下载的模型地址\\n\",\n        \"- save_path: 改成了我们想放的目录地址\\n\",\n        \"- 其他参数可能需要你自己去多多探索：比如lora、gradient checkingpoint等。更多参数的说明见[官方文档](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples)\"\n      ],\n      \"metadata\": {\n        \"id\": \"9w6QPTmIIr8J\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"!torchrun --standalone --nproc_per_node=1 train_sft.py \\\\\\n\",\n        \"    --pretrain \\\"./bloomz-560m\\\" \\\\\\n\",\n        \"    --model 'bloom' \\\\\\n\",\n        \"    
--strategy colossalai_zero2 \\\\\\n\",\n        \"    --log_interval 50 \\\\\\n\",\n        \"    --save_path  \\\"./bloomz-560m-finetuned\\\" \\\\\\n\",\n        \"    --dataset \\\"./instinwild_ch_small.json\\\" \\\\\\n\",\n        \"    --batch_size 4 \\\\\\n\",\n        \"    --accumulation_steps 8 \\\\\\n\",\n        \"    --lr 2e-5 \\\\\\n\",\n        \"    --max_datasets_size 512 \\\\\\n\",\n        \"    --max_epochs 1\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"WVu629ZLjf6G\",\n        \"outputId\": \"6fd1ef2f-6483-42a4-b43a-65f9181bac2c\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"2023-05-24 11:46:01.994320: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\\n\",\n            \"\\u001b[2;36m[05/24/23 11:46:05]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/colossalai/\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35mcontext/\\u001b[0m\\u001b[95mparallel_context.py\\u001b[0m:\\u001b[1;36m522\\u001b[0m set_device         \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: process rank \\u001b[1;36m0\\u001b[0m is  \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         bound to device \\u001b[1;36m0\\u001b[0m                                  \\n\",\n            \"\\u001b[2;36m[05/24/23 11:46:13]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            
\"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/colossalai/\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35mcontext/\\u001b[0m\\u001b[95mparallel_context.py\\u001b[0m:\\u001b[1;36m558\\u001b[0m set_seed           \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: initialized seed on\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         rank \\u001b[1;36m0\\u001b[0m, numpy: \\u001b[1;36m42\\u001b[0m, python random: \\u001b[1;36m42\\u001b[0m,              \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         ParallelMode.DATA: \\u001b[1;36m42\\u001b[0m, ParallelMode.TENSOR: \\u001b[1;36m42\\u001b[0m,the \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         default parallel seed is ParallelMode.DATA.        \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/colossalai/\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[95minitialize.py\\u001b[0m:\\u001b[1;36m115\\u001b[0m launch                           \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Distributed        \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         environment is initialized, data parallel size: \\u001b[1;36m1\\u001b[0m, \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         pipeline parallel size: \\u001b[1;36m1\\u001b[0m, tensor parallel size: \\u001b[1;36m1\\u001b[0m \\n\",\n            
\"/usr/local/lib/python3.10/dist-packages/colossalai/kernel/op_builder/utils.py:94: UserWarning: [extension] The CUDA version on the system (11.8) does not match with the version (11.7) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions\\n\",\n            \"  warnings.warn(\\n\",\n            \"/usr/local/lib/python3.10/dist-packages/colossalai/kernel/op_builder/utils.py:94: UserWarning: [extension] The CUDA version on the system (11.8) does not match with the version (11.7) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions\\n\",\n            \"  warnings.warn(\\n\",\n            \"\\u001b[2;36m[05/24/23 11:52:04]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35met/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m121\\u001b[0m __init__                     \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Loading data\\u001b[33m...\\u001b[0m    \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\\u001b[0m\\n\",\n          
  \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35met/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m123\\u001b[0m __init__                     \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Loaded \\u001b[1;36m6\\u001b[0m examples. \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35met/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m126\\u001b[0m __init__                     \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Limiting dataset to\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[1;36m512\\u001b[0m examples.                                      
\\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35met/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m129\\u001b[0m __init__                     \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Formatting         \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         inputs\\u001b[33m...\\u001b[0m                                          \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO:                    \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/coati/datas\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35met/\\u001b[0m\\u001b[95msft_dataset.py\\u001b[0m:\\u001b[1;36m137\\u001b[0m __init__                     \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[34mINFO    \\u001b[0m colossalai - colossalai - INFO: Tokenizing         \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         inputs\\u001b[33m...\\u001b[0m This may take some time\\u001b[33m...\\u001b[0m               \\n\",\n            \"steps: 0it [00:00, ?it/s]\\u001b[2;36m[05/24/23 11:52:09]\\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING:                 \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         
\\u001b[35m/usr/local/lib/python3.10/dist-packages/coati/train\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35mer/\\u001b[0m\\u001b[95msft.py\\u001b[0m:\\u001b[1;36m86\\u001b[0m fit                                   \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING: batch_i\\u001b[1;92md:0\\u001b[0m,     \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         abnormal loss: \\u001b[1;36m3.6484375\\u001b[0m                           \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING:                 \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35m/usr/local/lib/python3.10/dist-packages/coati/train\\u001b[0m\\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         \\u001b[35mer/\\u001b[0m\\u001b[95msft.py\\u001b[0m:\\u001b[1;36m86\\u001b[0m fit                                   \\n\",\n            \"\\u001b[2;36m                   \\u001b[0m\\u001b[2;36m \\u001b[0m\\u001b[31mWARNING \\u001b[0m colossalai - colossalai - WARNING: batch_i\\u001b[1;92md:1\\u001b[0m,     \\n\",\n            \"\\u001b[2;36m                    \\u001b[0m         abnormal loss: \\u001b[1;36m4.109375\\u001b[0m                            \\n\",\n            \"steps: 0it [00:03, ?it/s]\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"## 小结\\n\",\n        \"仅仅跑起来只是一个开始。祝愿每个小伙伴最终都会获得更好的GPU资源、更棒的数据和更出色的属于自己的语言模型！\"\n      ],\n      \"metadata\": {\n        \"id\": \"RQKDNZogLUNs\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [],\n      \"metadata\": {\n        \"id\": \"H7HM6E5VLW2u\"\n      },\n      \"execution_count\": null,\n      \"outputs\": []\n    }\n  ]\n}\n"
  },
  {
    "path": "README.md",
    "content": "# 开源语言模型百宝袋 (Ver. 3.6)\nOpen-Source Language Model Pocket\n\n**注意**：由于此文本内容太多了，直接在Github网页阅览会出现内容不全（导致部分内容搜索不到）的问题。建议下载到本地或直接上传给语言模型助手查阅。\n\n**Github**: https://github.com/createmomo/Open-Source-Language-Model-Pocket\n\n## 开源模型一览 (Table of Contents)\n\n*中文友好或国内主创的开源模型（Chinese Open Source Language Models）*\n\n|多个领域/通用|||\n|---|---|---|\n|百川|中文Alpaca Luotuo|中文LLaMA&Alpaca大模型|\n|中文LLaMA&Alpaca大模型2|流萤Firefly|凤凰|\n|复旦MOSS|复旦MOSS-RLHF|悟道·天鹰Aquila&Aquila2|\n|雅意大模型| 通义千问Qwen| 活字3.0|\n| Anima |BayLing|BELLE|\n|Bloom|BiLLa |BLOOMChat176B|\n|Chinese-Llama-2-7b (LinkSoul-AI) |GPT2 for Multiple Language |InternLM 书生・浦语|\n|Llama2-chat-Chinese-50W|Llama2-Chinese (FlagAlpha) |Linly伶荔说 中文 LLaMA1-2 & OpenLLaMA & Falcon 大模型 |\n|ChatRWKV|ChatYuan|ChatGLM-6B|\n|ChatGLM2-6B|Chinese-Transformer-XL|OpenKG-KnowLLM |\n|PromptCLUE|SkyText-Chinese-GPT3|CPM-Bee|\n|TigerBot|XVERSE-13B|YuLan-Chat & YuLan-Chat-2|\n|Ziya-LLaMA |TechGPT|EVA|\n|FLM-101B|TinyLlama|Colossal-LLaMA-2|\n|OpenBA (Encoder-Decoder)|Ziya-Reader-13B|Firefly-LLaMA2-Chinese|\n|MindLLM|ChatGLM3|Skywork大模型|\n|Yi-6B/34B（零一万物）|Nanbeige-16B（南北阁-16B）|OrionStar-Yi-34B-Chat|\n|源2.0|TechGPT2.0|SUS-Chat-34B|\n|Alaya 元识|OpenBuddy|MiniGPT4Qwen|\n|ChatLM-Chinese-0.2B|YAYI 2|DeepSeek LLM&MoE|\n|MachineMindset(MBTI)|星辰语义（电信）|Chinese-Mixtral-8x7B|\n|Baby-Llama2-Chinese|XVERSE-13B-256K|Eagle 7B（RWKV-v5）|\n|iFlytekSpark-13B|MiniCPM|通义千问Qwen1.5|\n|RethinkTinyLM|Chinese-Mixtral|RWKV_Pytorch|\n|Qwen1.5-MoE-A2.7B|Symbol-LLM|Qwen1.5-32b|\n|build_MiniLLM_from_scratch|RWKV-6 World|Mengzi3|\n|Eurus|Chinese Tiny LLM|HammerLLM|\n|360智脑|Steel-LLM|XVERSE-MoE-A4.2B|\n|llama3-Chinese-chat|Llama3-Chinese-Chat（ORPO）|DeepSeek-V2|\n|PanGu-π|Eurux-8x22B|Chinese-LLaMA-Alpaca-3|\n|OpenBuddy-Llama3-70B-v21.1-8k|MAP-NEO|llms-from-scratch-cn|\n|Yi-1.5|Yuan2.0-M32|Skywork-MoE|\n|Index-1.9B|Qwen2|Gemma-2-9B-Chinese-Chat|\n|Gemma-2-27B-Chinese-Chat|RWKV-6-World 
14B|Tele-FLM-1T|\n|Llama3.1-Chinese-Chat|INF-34B|InternLM2.5|\n|*【LongWriter】|*【Hunyuan-Large】|*【Qwen2.5】|\n|*【TeleChat2】|*【Marco-o1】|*【Skywork-o1】|\n|*【YuLan-Mini】|*【DeepSeek-R1】|*【simpleRL-reason】|\n|*【TinyZero】|*【STILL-3-1.5B-Preview】|*【MiniMax-01】|\n|*【SmallThinker-3B-preview】|*【DeepSeek-V3】|*【RWKV-7】|\n|*【FOX-1】|*【mini_qwen】|*【Qwen 0.5b on GRPO】|\n|*【Qwen2.5-Max】|*【minimind】|*【Nano】|\n\n| 医疗健康 |  |  |\n|---|---|---|\n| 本草 |华佗  |扁鹊  |\n| 灵心 | 启真 | 儿童情感陪伴大模型“巧板” |\n| OpenMEDLab 浦医|明医 (MING)：中文医疗问诊大模型 (原名：MedicalGPT-zh) |情感大模型PICA|\n|Chinese-Vicuna-medical|MedicalGPT| DISC-MedLLM （复旦）|\n|DoctorGLM|ChatMed-TCM&ChatMed-Consult|ChatGLM-Med|\n|MeChat|ShenNong-TCM-LLM|MindChat(漫谈): 心理大模型|\n|WiNGPT|CareGPT|孙思邈|\n|MolGen（药物研发）|Taiyi（太一）|MedAgents|\n|Molecule Optimization|MolTC|Mol-Instructions|\n|Multilingual Medicine|Sequel|Gene editing|\n|Llama-3-8B-UltraMedical|PH-LLM|ProLLM|\n|MolecularGPT|*【CHIEF（Clinical Histopathology Imaging Evaluation Foundation）】|*【HuatuoGPT-o1】|\n|*【Baichuan-14B-M1】|*【MedFound】||\n\n|经济/金融|||\n|---|---|---|\n|貔貅FinMA & PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance|轩辕|BBT-FinCUGE-Applications|\n|Cornucopia-LLaMA-Fin-Chinese|EcomGPT|FinGLM|\n|DISC-FinLLM|Deepmoney||\n\n|法律|||\n|---|---|---|\n| 韩非 HanFei| 智海 录问|ChatLaw 法律大模型|\n|LaWGPT|Lawyer LLaMA|LexiLaw|\n|LawGPT_zh|夫子•明察司法大模型|DISC-LawLLM|\n|LawBench|*【HK-O1aw】||\n\n|交通|城市|\n|---|---|\n|TransGPT · 致远|UrbanGPT|\n\n|教育&数学||\n|---|---|\n|桃李|EduChat|\n|chatglm-maths|Abel|\n|InternLM-Math|DeepSeekMath|\n|LeerooDedicated-Math-7b|SimpleGeometry|\n|Rho-1|ChatGLM-Math|\n|JiuZhang3.0|InternLM2-WQX|\n|Math-Minos|NuminaMath 7B TIR|\n|MathΣtral|LLaMAX（翻译）|\n|Qwen2-Math|*【AIMO-CMU_MATH】|\n|*【Qwen2.5-Math】|*【SocraticLM】|\n|*【Open Thoughts】|*【simpleRL-reason】|\n|*【DRT-o1（翻译）】||\n\n|表格/数据分析||\n|---|---|\n|TableGPT|Data-Copilot|\n|Tabular LLM|Chain-of-table|\n|Data 
Interpreter|TableLLM|\n|Lag-Llama|TabuLa-8B|\n|*【Time-MoE】||\n\n|自媒体/角色扮演/风格/故事|\n|---|\n|MediaGPT|\n|CharacterGLM-6B|\n|Haruhi-Zero|\n|Translational-Style-ChatLLM西式翻译腔|\n|StyleLLM|\n|Tianji来事儿AI|\n|TinyStories|\n|Higgs-Llama-3-70B|\n|persona-hub|\n|Peach-9B-8k-Roleplay|\n|*【Hermes 3】|\n|*【SkyReels（短剧）】|\n\n|古汉语|\n|---|\n|尔雅 Erya|\n|荀子|\n\n|编程/代码/系统/设备||\n|---|---|\n|CodeShell|CODEFUSION-75M|\n|DeepSeek Coder|DevOps-Model（运维）|\n|Magicoder|LLaMA-Pro|\n|HuixiangDou|CodeAct|\n|Design2Code|bGPT|\n|MobileLLM|Stable Code Instruct 3B|\n|ReALM|aiXcoder|\n|CodeQwen1.5|AutoCodeRover|\n|CodeGemma|Snowflake Arctic|\n|dolphin-2.9-llama3-70b|Granite|\n|StarCoder2-15B-Instruct-v0.1|AutoCoder|\n|CodeGeeX4|xLAM|\n|*【deepin V23】|*【WaveCoder】|\n|*【Llama-3.1-Storm-8B】|*【OpenCoder】|\n|*【Qwen2.5-Coder】|*【Ministraux】|\n|*【Reader-LM】|*【珠算】|\n|*【Lingma SWE-GPT】|*【GLM-Edge】|\n|*【SEMIKONG（半导体）】|*【ReaderLM-v2】|\n|*【O1-CODER】||\n\n|天文/海洋/地球科学/科学|\n|---|\n|星语StarWhisper|\n|OceanGPT|\n|K2&GeoGalactica|\n|SciGLM|\n|*【KAN 2.0】|\n\n*Recommendation/IR/Information Extraction*\n|||\n|---|---|\n|LLM for Recommendation Systems|Transformer Index for GEnerative Recommenders (TIGER)|\n|EasyRL4Rec|RLMRec|\n|RecAI|Actions Speak Louder than Words|\n|PPM|LLaRA|\n|Awesome Information Retrieval in the Age of Large Language Model|LLMs heart MIR|\n|When to Retrieve|Lite-LLM4Rec|\n|A Comprehensive Survey on Self-Supervised Learning for Recommendation|NoteLLM|\n|LEARN|YAYI-UIE|\n|XRec|Wukong|\n|Leveraging LLM Reasoning Enhances Personalized Recommender Systems|*【Transformers in music recommendation】|\n\n*文本向量/RAG*\n|  |  |\n|---|---|\n| Matryoshka Representation Learning |Jina Embeddings|\n|BGE-M3|Nomic Embed|\n|Moka Massive Mixed Embedding（M3E）|GRIT|\n|TinyRAG|RAFT|\n|Chat with MLX|LLocalSearch|\n|RAGFlow|Dot|\n|Ollama Embedding 
Models|LLM2Vec|\n|gecko|Cognita|\n|Piccolo2|NV-Embed|\n|RankRAG|LightRAG|\n|GraphRAG|*【gte-multilinguial】|\n|*【nano-graphrag】|*【MaxKB】|\n|*【Langchain-Chatchat】|*【RAGLite】|\n|*【OpenScholar】|*【MasteringRAG】|\n|*【FlashRAG-Paddle】|*【MiniRAG】|\n|*【XRAG】|*【Chronos】|\n|*【DeepRAG】|*【UltraRAG】|\n|*【CAG】|*【FlexRAG】|\n\n*Agent*\n|  |  |\n|---|---|\n| Auto-GPT | ToolBench&ToolLLM |\n|HuggingGPT |CAMEL:Communicative Agents for “Mind” Exploration of Large Scale Language Model Society|\n|AgentLM (AgentTuning, AgentInstruct) |XAgent|\n|OpenAgents|Personal LLM Agents - Survey|\n|AUTOACT|MetaGPT|\n|Multi-LLM-Agent|KwaiAgents|\n|Mistral-Interact|AgentLite|\n|KnowAgent|LlamaGym|\n|WorkArena|STE（Simulated Trial and Error）|\n|More Agents Is All You Need|AIOS|\n|TwoStep|Agent-FLAN|\n|Jan|APAM|\n|AgentStudio|AnyTool|\n|TinyAgent|Octopus v2|\n|ReadAgent|STORM|\n|AgentRun|OS-Copilot|\n|AutoWebGLM|Agent Hospital|\n|CodeR|Mobile-Agent-v2|\n|Husky|TinyAgent|\n|Tree Search for Language Model Agents|octo-planner|\n|MindSearch|*【AgentInstruct】|\n|*【AgentCourt】|*【AI-Scientist】|\n|*【RD-Agent】|*【AFlow: Automating Agentic Workflow Generation】|\n|*【swarm】|*【FinVision】|\n|*【Agent Mental Clinic (AMC)】|*【MedAI】|\n|*【Agent-0】|*【Large Language Model-Brained GUI Agents: A Survey】|\n|*【Building effective agents】|*【UI-TARS】|\n|*【PaSa】|*【Docling】|\n|*【Eko】|*【Search-o1】|\n|*【CogAgent】|*【Proactive Agent】|\n|*【Open-source DeepResearch】|*【RAGEN】|\n|*【smolagents】|*【Open Deep Research】|\n\n*可参考的其它开源模型（国外为主）*\n|  |  |\n|---|---|\n| Cerebras | MPT-7B |\n| ChatDoctor | OpenGPT |\n| Code Llama (Meta AI)| Orca |\n| Dolly 1&2 | OpenChatKit |\n| FinGPT | Open-Assistant |\n| Falcon | Platypus|\n| Facebook/Meta LLaMA/LLaMA2 | MedLLaMA-13B & PMC-LLaMA: Continue Training LLaMA on Medical Papers |\n| Giraffe| RedPajama |\n| GALACTICA | SQLCoder (Defog)|\n| Goar-7B for Arithmetic Tasks | StableLM |\n| HuggingChat | StableVicuna |\n| Koala: A Dialogue Model for Academic Research | Stanford Alpaca |\n| LongLLaMA | UltraLM-13B |\n| 
LLaMA复刻版OpenLLaMA | Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality |\n| Llama-X: Open Academic Research on Improving LLaMA to SOTA LLM | Wombat |\n| Lit-LLaMA ️ | WizardMath|\n| MammoTH | XGen-7B |\n|Mistral 7B|Xwin-LM|\n|LLaMA 2 Long|UltraLM-13B (UltraFeedback)|\n|Llemma: An Open Language Model For Mathematics|Mistral-Trismegistus-7B （神秘学/玄学/灵性）|\n|Memory-GPT(MemGPT)|MetaMath|\n|ChipNeMo (芯片设计)|Zephyr|\n|neural-chat-7b-v3-1（Intel）|SteerLM|\n|Llama Coder|Meditron|\n|RankZephyr|StableLM Zephyr 3B|\n|Orca 2|Mixtral 7b 8 Expert|\n|Phi|LLM360（Amber,CrystalCoder,Diamond）|\n|Mamba|SOLAR|\n|NexusRaven（function calling LLM）|LLaMA-MoE|\n|TinyLlama|Nous-Hermes-2 Mixtral 8x7B|\n|AlphaGeometry|MoE-Mamba|\n|StarCoder|OLMo|\n|H2O-Danube-1.8B|OpenMathInstruct-1|\n|Smaug-72B|Gemma|\n|Aya Model|MobiLlama|\n|StarCoder2|SmallLanguageModel-project|\n|Command-R|Grok|\n|DBRX|Jamba|\n|BioMedLM|JetMoE|\n|MicroLlama-300M|Mistral 7B v0.2 JAX|\n|gemma-1.1-7b-it|h2o-danube2-1.8b-chat|\n|WizardLM-2|RecurrentGemma|\n|CodecLM|MEGALODON|\n|Stable LM 2 12B|Mixtral 8x22B|\n|Phi-3|Llama 3|\n|OpenELM|base-7b-v0.2|\n|FILM-7B|llama3 implemented from scratch|\n|2.3MParams-LLM-From-Scratch-Python|KAN-GPT|\n|Aya-23|Mamba-2|\n|Recurrentgemma|Nemotron-4 340B|\n|Gemma-2|Gemini Nano|\n|TTT|Arcee-Spark|\n|Mistral NeMo|Llama 3.1 405B|\n|Mistral Large 2|SmolLM|\n|DCLM-7B|Minitron|\n|Gemma 2 2B/ShieldGemma/Gemma Scope|SmolLM|\n|nano-llama31|*【instant-smollm】|\n|*【Jamba 1.5】|*【Phi-3.5】|\n|*【1.5-Pints】|*【Llama-3.1-Minitron 4B】|\n|*【SmolLm2】|*【Ministral 3B/8B】|\n|*【Zamba2-7B】|*【IBM Granite 3.0】|\n|*【Tülu3】|*【Open-O1】|\n|*【open-r1】|*【sky-t1】|\n|*【Phi-4】|*【Dolphin 3.0】|\n|*【Falcon 3】|*【Bamba】|\n|*【Byte Latent Transformer】|*【Llama-3.3-70B-Instruct】|\n|*【Granite 3.1】|*【mini-deepseek-r1】|\n|*【RL, Reasoning & Writing: GRPO on Base model】|*【encoder-decoder-slm】|\n\n*训练/推理*\n|  |  |\n|---|---|\n| Alpaca-LoRA | llama2.mojo |\n| AlpacaFarm | LightLLM |\n| ColossalAI | Medusa |\n| ChatLLaMA | 
Megatron-LLaMA |\n| Chinese-Guanaco | MeZO: Fine-Tuning Language Models with Just Forward Passes |\n| DPO (Direct Preference Optimization) | MLC LLM |\n| DialogADV：Evaluate What You Can't Evaluate: Unassessable Generated Responses Quality | PKU-Beaver 河狸 (Safe RLHF) |\n| DeepSpeed-Chat | PaLM + RLHF (Pytorch) |\n| FlexGen | RL4LMs |\n| FlagAI and FlagData | Reinforcement Learning with Language Model |\n| Guanaco & QloRA | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |\n| GPT4All | Scikit-LLM: Sklearn Meets Large Language Models |\n| HugNLP | Transformer Reinforcement Learning |\n| INSTRUCTEVAL | Train_Transformers_with_INT4 |\n| LOw-Memory Optimization (LOMO) | Transformer Reinforcement Learning X |\n| llama.cpp | vLLM |\n| llama2.c | LongLoRA |\n|RLLTE: Long-Term Evolution Project of Reinforcement Learning|FlashAttention|\n|ExecuTorch|TensorRT-LLM|\n|BPO（Black-Box Prompt Optimization）|S-LoRA|\n|SoRA|XuanCe(玄策): 开源的深度强化学习(DRL)库|\n|EasyLM（JAX/Flax）|FATE-LLM - Federated Learning for LLMs|\n|DeepSpeed-FastGen|NVIDIA NeMo-Aligner|\n|RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback|MLX|\n|OpenRLHF|CoLLiE: Collaborative Training of Large Language Models in an Efficient Way|\n|Superalignment|LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models|\n|Large Language Model Unlearning|PowerInfer|\n|m-LoRA|LASER|\n|StripedHyena-7B|SwiftInfer|\n|SPIN（Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models）|Self-Rewarding Language Models|\n|OPO（On-the-fly Preference Optimization）|ASPIRE|\n|The Impact of Reasoning Step Length on Large Language Models|SliceGPT|\n|FuseLLM|Tree of Thoughts|\n|CogGPT|KTO（Kahneman-Tversky Optimisation）|\n|Aligner|RPO（Robust Prompt Optimization）|\n|Inference-Time Training Helps Long Text Generation|LiPO|\n|ChatLLM.cpp|Self-Discover|\n|DoRA|GPO（Generalized Preference 
Optimization）|\n|CoT-decoding|FSDP&QLoRA（Answer）|\n|MindNLP|GaLore|\n|Mixture-of-LoRAs|LLaMA Factory|\n|InfLLM|MediaPipe|\n|OneBit|RWKV_Pytorch|\n|HQQ|Uni-RLHF|\n|LLMLingua-2|REST|\n|MetaAligner|DiJiang|\n|LISA（Layerwise Importance Sampled AdamW）|edge-infer|\n|NeFT|Aligning Large Language Models with Recommendation Knowledge|\n|llamafile|summarize_from_feedback_details|\n|EvoLLM|llm.c|\n|Mergoo|qwen-vllm|\n|SiLLM|How to Train Data-Efficient LLMs|\n|sDPO|PiSSA|\n|LongRoPE|ORPO|\n|How to Train Data-Efficient LLMs|Better & Faster Large Language Models via Multi-token Prediction|\n|Llama-3 70B Gradient Adapter|Unsloth|\n|RLHF Workflow|SimPO|\n|ODPO|ΨPO|\n|MoRA|LOFIT|\n|MEFT|PowerInfer-2|\n|Emulated Disalignment|Aligning Large Language Models with Representation Editing: A Control Perspective|\n|Q\\*|TDPO|\n|ExCP|MindStar|\n|LaMDA|MInference|\n|Instruction Pre-Training|PEER|\n|Step-DPO|Data, Data Everywhere|\n|Prover-Verifier Games|Mem0|\n|EAGLE-2|LoRA-GA|\n|Q-GaLore|*【rStar】|\n|*【T-MAC】|*【LLM-zero2hero】|\n|*【MobileQuant】|*【min-p sampling】|\n|*【Fast Best-of-N Decoding】|*【UNA: Unifying Alignments of RLHF/PPO, DPO and KTO】|\n|*【LongReward】|*【HybridFlow】|\n|*【The Surprising Effectiveness of Test-Time Training for Abstract Reasoning】|*【OpenR】|\n|*【A Theoretical Understanding of Self-Correction through In-context Alignment】|*【EfficientQAT】|\n|*【Cautious Optimizers】|*【Optimizing Large Language Model Training Using FP4 Quantization】|\n|*【Evolving Deeper LLM Thinking】|*【rStar-Math】|\n|*【Transformer²: Self-Adaptive LLMs】|*【test-time compute scaling】|\n|*【XGrammar】|*【Reverse Thinking Makes LLMs Stronger Reasoners】|\n|*【noise_step】||\n\n*评价*\n|  ||\n|---|---|\n| 天秤（FlagEval） |獬豸（Xiezhi）Benchmark |\n| C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models|\n| KoLA: Carefully Benchmarking World Knowledge of Large Language Models |LucyEval—中文大语言模型成熟度评测|\n|CMB: A 
Comprehensive Medical Benchmark in Chinese|Multiscale Positive-Unlabeled Detection of AI-Generated Texts |\n| PandaLM |Auto-J|\n|CLEVA: Chinese Language Models EVAluation Platform|ALCUNA: Large Language Models Meet New Knowledge|\n|HalluQA：Evaluating Hallucinations in Chinese Large Language Models|GLoRE: Evaluating Logical Reasoning of Large Language Models|\n|HelpSteer|AlignBench: 多维度中文对齐评测基准|\n|UHGEval|Purple Llama (Meta)|\n|OMGEval|SciGuard&SciMT-Safety|\n|HaluEval 2.0, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models|DebugBench: Evaluating Debugging Capability of Large Language Models|\n|GenMedicalEval||\n|R-Judge|TravelPlanner|\n|EasyJailbreak|AgentBench|\n|中文MT-Bench|E-EVAL|\n|ConflictingQA|Medical Information Retrieval-Augmented Generation Evaluation （MIRAGE）|\n|∞Bench|Red Teaming Resistance Benchmark|\n|Fin-Eva|Cappy|\n|BAMBOO|Fast-DetectGPT|\n|GAMA-Bench|FineMath|\n|ToolEmu|ClongEval|\n|Counting-Stars|InfiCoder-Eval|\n|MathVerse|CoderUJB|\n|LooGLE|McEval|\n|CRAG|BigCodeBench|\n|Prometheus 2|Open LLM Leaderboard|\n|CriticGPT|Test of Time|\n|WebCanvas|Lynx|\n|ComplexBench|Mr-Ben|\n|*【SimpleQA】|*【AppBench】|\n|*【CompassJudger/JudgerBench】|*【CMCOQA】|\n|*【CodevBench】|*【FrontierMath】|\n|*【GIFT-Eval】|*【LightEval】|\n|*【RMB-Reward-Model-Benchmark】|*【Chinese SimpleQA】|\n|*【Evalchemy】|*【WebWalker】|\n|*【Getting a Judge-LLM】|*【PRMBench】|\n|*【OmniDocBench】|*【CodeArena】|\n|*【HALoGEN】||\n\n*其它*\n|  |  |\n|---|---|\n| Alpaca-CoT | Self-Instruct |\n| ChatPiXiu | Wanda (Pruning by Weights and activations) |\n| Gorilla | Streaming LLM |\n| Sheared LLAMA (Structured Pruning) |gpu_poor|\n| LLMPruner：大语言模型裁剪工具 | QA-LoRA |\n| LLM-Pruner: On the Structural Pruning of Large Language Models |KnowPAT|\n|AuthentiGPT: Detecting Machine-Generated Text|Curiosity-driven Red-teaming for Large Language Models|\n|Language Models are Super Mario（DARE, Drop And REscale）||\n|TinyGSM|MathPile|\n|Blending Is All You Need: Cheaper, Better Alternative to 
Trillion-Parameters LLM|Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding|\n|QAnything|Meta-Prompting|\n|Lepton Search|Transformer Debugger|\n|Open-Source AI Cookbook|MaLA-500|\n|NVIDIA Chat with RTX|RAG vs Fine-tuning|\n|Chain of Abstraction|序列猴子开源数据集|\n|synthetic-data-save-costs|Data is Better Together|\n|Large Language Models in Finance|WanJuan-CC|\n|Larimar|Financial Datasets|\n|LLM-UM-Reading|so-large-lm|\n|Fine-tune Llama 3 with ORPO|COIG-CQIA|\n|tiny-universe|llmc|\n|LLMBox|MarkLLM|\n|MobileCPM|LLM-Select|\n|Transformer Architecture (LLMs: Zero-to-Hero)|Build a Large Language Model (From Scratch)|\n|*【SynthID Text】|*【Small Language Models: Survey, Measurements, and Insights】|\n|*【Multi-IF (Multi-turn and multilingual instruction following)】|*【LLM from scratch with Pytorch】|\n|*【A Survey on Data Synthesis and Augmentation for Large Language Models】|*【A Survey of Small Language Models】|\n|*【LLMForEverybody】|*【Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner】|\n|*【CCI3.0-HQ】|*【rlhfbook】|\n|*【Deepseek R1可能找到了超越人类的办法】|*【train-llm-from-scratch】|\n|*【The Big Book of LLMs】|*【Primers • DeepSeek-R1】|\n|*【A vision researcher’s guide to some RL stuff: PPO & GRPO】|*【group relative policy optimization (GRPO)】|\n|*【DeepSeek R1 and R1-Zero Explained】|*【DeepSeek R1 阅读清单】|\n|*【DeepSeek R1 Explained to your grandma】|*【Deepseek R1 for Everyone】|\n|*【llm-course】|*【O1-Journey】|\n|*【a reinforcement learning guide】|*【llm-universe】|\n|*【smol-course】|*【self-llm】|\n|*【Agents（Chip Huyen）】|*【Building effective agents】|\n|*【LLMInterviewQuestions】|*【Transformers Laid Out】|\n\n## 相关文章\n- 穷穷穷孩子如何体验ColossalAI SFT（[Kaggle篇](https://mp.weixin.qq.com/s/Q29uSNxvPMy0rC-QxHiGZA)，[Colab篇](https://mp.weixin.qq.com/s/NS4yySeYd7QUYb7CB9V0lA)）\n- [通俗理解文本生成的常用解码策略](https://mp.weixin.qq.com/s/sVZuEkYXQ9ZZYXJCQz7F4A)\n- [通俗理解P-tuning (GPT 
Understands)](https://mp.weixin.qq.com/s/EvD9OW115XMnrxOcC2BKDA)\n- [通俗理解Gradient Checkpoint（附代码）](https://mp.weixin.qq.com/s/IwcfUP_j6JYFXH_xhnWWJQ)\n- 千“垂”百炼：垂直领域与语言模型\n  - [导语](https://mp.weixin.qq.com/s/G24skuUbyrSatxWczVxEAg)\n  - 垂直领域应用\n    - 【不限领域】利用未标注文本改进遵循指令的语言模型（[1](https://mp.weixin.qq.com/s/50wtP--W_cy-682g8cOYww) [2](https://mp.weixin.qq.com/s/q7nKnwtEKPahABiLFLWuSw) [3](https://mp.weixin.qq.com/s/CE8YNx19dc0EyNfTK_HYHQ) [4](https://mp.weixin.qq.com/s/yj4gnoymNLFuLE1v94VJ9A) [5](https://mp.weixin.qq.com/s/N4mUe7hrvXGFArl20kKRCA)）\n    - 【医疗健康】ChatDoctor （解读 [上](https://mp.weixin.qq.com/s/zSeRKUZ2te1wxwpvByhcvg) [中](https://mp.weixin.qq.com/s/TcwiQoIex7SDY5Teri9xnw) [下](https://mp.weixin.qq.com/s/I1hXRS7gBMLUyOWMObfpBg) / PDF版PPT [上](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20ChatDoctor%EF%BC%88%E4%B8%8A%EF%BC%89.pdf) [中](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20ChatDoctor%EF%BC%88%E4%B8%AD%EF%BC%89.pdf) [下](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20ChatDoctor%EF%BC%88%E4%B8%8B%EF%BC%89.pdf)）\n    - 【医疗健康】MedicalGPT-zh ([解读](https://mp.weixin.qq.com/s/QJKZYKh16fqLTC367WhzdA) / [PDF版PPT](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20MedicalGPT-zh.pdf))\n    - 【医疗健康】明医(MING) ([解读](https://mp.weixin.qq.com/s/uM4FZeDhAc6JuMlW7NCvUA) / 
[PDF版PPT](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20MING.pdf))\n    - 【医疗健康】灵心(SoulChat) ([解读](https://mp.weixin.qq.com/s/0HOYSr-zQsGLFL_H9UZ2HA) / [PDF版PPT](https://github.com/createmomo/Open-Source-Language-Model-Pocket/blob/main/%E5%8D%83%E2%80%9C%E5%9E%82%E2%80%9D%E7%99%BE%E7%82%BC%20-%20%E3%80%90%E5%8C%BB%E7%96%97%26%E5%81%A5%E5%BA%B7%E3%80%91%20SoulChat.pdf))\n    - 【手机交互】ReALM ([1](https://mp.weixin.qq.com/s/gOmUi4_MGvU1Nx3KxXdxVQ) [2](https://mp.weixin.qq.com/s/wTPMwtRVWIrioile-rFzQA) [3](https://mp.weixin.qq.com/s/NgyZG0439UGFoVE7InrX9g) [4](https://mp.weixin.qq.com/s/v1NEovURZr4v8R4_v7TjdA))\n  - 自动评估模型\n    - 【不限领域】[用语言模型评估语言模型（1）导语](https://mp.weixin.qq.com/s/SUN_ywkI8ld1edXY7uq_1Q)\n    - 【不限领域】[用语言模型评估语言模型（2）PandaLM](https://mp.weixin.qq.com/s/NTFu53MdVD9NusFJaORHcw)\n    - 【不限领域】用语言模型评估语言模型（3）Shepherd（[1](https://mp.weixin.qq.com/s/pbK1Zsv9j_DVtOJaTm_tPw) [2](https://mp.weixin.qq.com/s/n4_kVw8j42ZQv6VjQ_P-Dw) [3](https://mp.weixin.qq.com/s/PeGJOmQPyAhwl7czJgKnQQ) [4](https://mp.weixin.qq.com/s/7_NX7S2AHabX-xU254sq5g)）\n    - 【医疗/健康】[使用BERT-Score比较ChatDoctor与ChatGPT3.5](https://mp.weixin.qq.com/s/I1hXRS7gBMLUyOWMObfpBg)\n\n## 所有文章 (ALL Articles)\n- 中文：[https://mp.weixin.qq.com/s/hAqDqqwIHrCVwz4PYSd72A](https://mp.weixin.qq.com/s/hAqDqqwIHrCVwz4PYSd72A)\n- English: [https://createmomo.github.io/](https://createmomo.github.io/)\n\n---\n\n## Chinese Open Source Language Models\n\n### 本草\n- https://zhuanlan.zhihu.com/p/626536996\n- https://github.com/scir-hi/huatuo-llama-med-chinese\n\n基于中文医学知识的LLaMa指令微调模型\n\n在生物医学领域，LLM模型（如LLaMa，ChatGLM）因为缺乏一定的医学专业知识语料而表现不佳。该项目通过医学知识图谱和GPT3.5API构建了中文医学指令数据集，并对LLaMa模型进行了指令微调得到了一个针对医学领域的智能问诊模型HuaTuo，相比于未经过医学数据指令微调的原LLaMa而言，HuaTuo模型在智能问诊层面表现出色，可生成一些更为可靠的医学知识回答；与此同时，基于相同医学数据，该项目还训练了医疗版本的ChatGLM模型: ChatGLM-6B-Med，\n\n该团队还即将发布扁鹊模型PienChueh(同为基于医学数据训练的大模型)，欢迎大家届时使用体验。\n\n### 百川 
Baichuan-7B\n- https://github.com/baichuan-inc/baichuan-7B\n- https://huggingface.co/baichuan-inc/baichuan-7B\n\nbaichuan-7B 是由百川智能开发的一个开源可商用的大规模预训练语言模型。基于 Transformer 结构，在大约1.2万亿 tokens 上训练的70亿参数模型，支持中英双语，上下文窗口长度为4096。在标准的中文和英文权威 benchmark（C-EVAL/MMLU）上均取得同尺寸最好的效果。\n\n原始数据包括开源的中英文数据和自行抓取的中文互联网数据，以及部分高质量知识性数据。\n\n参考相关数据工作，频率和质量是数据处理环节重点考虑的两个维度。 我们基于启发式规则和质量模型打分，对原始数据集进行篇章和句子粒度的过滤。在全量数据上，利用局部敏感哈希方法，对篇章和句子粒度做滤重。\n\n### 华佗\n- https://mp.weixin.qq.com/s/lwJb8N420xfMTvXJPM2gtg\n- https://arxiv.org/pdf/2305.15075.pdf\n- https://github.com/FreedomIntelligence/HuatuoGPT\n- https://www.huatuogpt.cn/\n\n该论文提出的语言模型训练方法可以结合医生和 ChatGPT 的数据，充分发挥它们的互补作用，既保留真实医疗数据的专业性和准确性，又借助 ChatGPT 的多样性和内容丰富性的特点。\n\n### 扁鹊\n- https://github.com/scutcyr/BianQue\n\n基于主动健康的主动性、预防性、精确性、个性化、共建共享、自律性六大特征，华南理工大学未来技术学院-广东省数字孪生人重点实验室开源了中文领域生活空间主动健康大模型基座ProactiveHealthGPT，包括：\n- 经过千万规模中文健康对话数据指令微调的生活空间健康大模型扁鹊（BianQue）\n- 经过百万规模心理咨询领域中文长文本指令与多轮共情对话数据联合指令微调的心理健康大模型灵心（SoulChat）\n\n我们期望，生活空间主动健康大模型基座ProactiveHealthGPT 可以帮助学术界加速大模型在慢性病、心理咨询等主动健康领域的研究与应用。本项目为 生活空间健康大模型扁鹊（BianQue） 。\n\n### 灵心（SoulChat）\n- https://github.com/scutcyr/SoulChat\n\n我们调研了当前常见的心理咨询平台，发现，用户寻求在线心理帮助时，通常需要进行较长篇幅的自我描述，然后提供帮助的心理咨询师同样地提供长篇幅的回复，缺失了一个渐进式的倾诉过程。但是，在实际的心理咨询过程当中，用户和心理咨询师之间会存在多轮次的沟通过程，在该过程当中，心理咨询师会引导用户进行倾诉，并且提供共情，例如：“非常棒”、“我理解你的感受”、“当然可以”等等。\n\n考虑到当前十分欠缺多轮共情对话数据集，我们一方面，构建了超过15万规模的 单轮长文本心理咨询指令与答案（SoulChatCorpus-single_turn） ，回答数量超过50万（指令数是当前的常见的心理咨询数据集 PsyQA 的6.7倍），并利用ChatGPT与GPT4，生成总共约100万轮次的 多轮回答数据（SoulChatCorpus-multi_turn） 。特别地，我们在预实验中发现，纯单轮长文本驱动的心理咨询模型会产生让用户感到厌烦的文本长度，而且不具备引导用户倾诉的能力，纯多轮心理咨询对话数据驱动的心理咨询模型则弱化了模型的建议能力，因此，我们混合SoulChatCorpus-single_turn和SoulChatCorpus-multi_turn构造成超过120万个样本的 单轮与多轮混合的共情对话数据集SoulChatCorpus 。所有数据采用“用户：xxx\n心理咨询师：xxx\n用户：xxx\n心理咨询师：”的形式统一为一种指令格式。\n\n我们选择了 ChatGLM-6B 作为初始化模型，进行了全量参数的指令微调，旨在提升模型的共情能力、引导用户倾诉能力以及提供合理建议的能力。更多训练细节请留意我们后续发布的论文。\n\n### 启真医学大模型\n- 
https://github.com/CMKRG/QiZhenGPT\n\n本项目利用启真医学知识库构建的中文医学指令数据集，并基于此在Chinese-LLaMA-Plus-7B、CaMA-13B、ChatGLM-6B模型上进行指令精调，大幅提高了模型在中文医疗场景下效果，首先针对药品知识问答发布了评测数据集，后续计划优化疾病、手术、检验等方面的问答效果，并针对医患问答、病历自动生成等应用展开拓展。\n\n### 貔貅FinMA & PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance\n- https://github.com/chancefocus/PIXIU\n- https://arxiv.org/abs/2306.05443\n- https://huggingface.co/spaces/ChanceFocus/FLARE\n\nThe advancement of Natural Language Processing (NLP) and machine learning (ML) techniques in financial technology (FinTech) has enabled a diverse set of capabilities from predicting stock price movements to advanced financial analytics. However, to effectively understand the complex financial language and concepts, domain-specific LLMs are necessary.\n\nDespite prior efforts, there is a lack of open-source financial LLMs and benchmarks to evaluate them. Additionally, these models are not fine-tuned to follow natural language instructions, limiting their performance in downstream financial tasks.\n\nTo address these gaps, we introduce PIXIU, providing:\n- Open-source LLMs tailored for finance called FinMA, by fine-tuning LLaMA with the dataset constructed in PIXIU.\n- Large-scale, high-quality multi-task and multi-modal financial instruction tuning data FIT.\n- Holistic financial evaluation benchmarks FLARE for assessing financial LLMs.\n\nKey Features\n- Open resources: PIXIU openly provides the financial LLM, instruction tuning data, and datasets included in the evaluation benchmark to encourage open research and transparency.\n- Multi-task: The instruction tuning data in PIXIU cover a diverse set of financial tasks, including four financial NLP tasks and one financial prediction task.\n- Multi-modality: PIXIU's instruction tuning data consist of multi-modality financial data, including time series data from the stock movement prediction task. 
It covers various types of financial texts, including reports, news articles, tweets, and regulatory filings.\n- Diversity: Unlike previous benchmarks focusing mainly on financial NLP tasks, PIXIU's evaluation benchmark includes critical financial prediction tasks aligned with real-world scenarios, making it more challenging.\n\n### 中文Alpaca模型Luotuo\n- https://sota.jiqizhixin.com/project/luotuo\n- https://github.com/LC1332/Luotuo-Chinese-LLM\n\nAlpaca 是斯坦福团队基于 LLaMA 7B 在 52k 指令上微调得到的模型，能出色适应多种自然语言应用场景。近日来自商汤科技和华中科技大学开源中文语言模型 Luotuo，基于 ChatGPT API 翻译 Alpaca 微调指令数据，并使用 lora 进行微调得到。目前该项目已公开训练的语料和模型权重文件（两个型号），供开发者可使用自己各种大小的语料，训练自己的语言模型，并适用到对应的垂直领域。\n\n### 中文LLaMA&Alpaca大模型\n- https://github.com/ymcui/Chinese-LLaMA-Alpaca\n\n以ChatGPT、GPT-4等为代表的大语言模型（Large Language Model, LLM）掀起了新一轮自然语言处理领域的研究浪潮，展现出了类通用人工智能（AGI）的能力，受到业界广泛关注。然而，由于大语言模型的训练和部署都极为昂贵，为构建透明且开放的学术研究造成了一定的阻碍。\n\n为了促进大模型在中文NLP社区的开放研究，本项目开源了中文LLaMA模型和经过指令精调的Alpaca大模型。这些模型在原版LLaMA的基础上扩充了中文词表并使用了中文数据进行二次预训练，进一步提升了中文基础语义理解能力。同时，在中文LLaMA的基础上，本项目使用了中文指令数据进行指令精调，显著提升了模型对指令的理解和执行能力。\n\n### 中文LLaMA&Alpaca大模型2\n- https://github.com/ymcui/Chinese-LLaMA-Alpaca-2\n- https://mp.weixin.qq.com/s/s8bOcwRYiRA88kPlJKeAKA\n- https://arxiv.org/abs/2304.08177v2\n\nChinese-LLaMA-Alpaca-2大模型项目正式发布v1.0版本，开源Chinese-LLaMA-2-7B（基座模型）和Chinese-Alpaca-2-7B（指令/chat模型）。这些模型在原版Llama-2的基础上扩充并优化了中文词表，使用了大规模中文数据进行增量预训练，进一步提升了中文基础语义和指令理解能力，相比一代相关模型获得了显著性能提升。相关模型支持4K上下文并可通过NTK方法最高扩展至18K+。\n\n### 中文对话式大语言模型Firefly\n- https://mp.weixin.qq.com/s/tyH9Ifcvw4DKqoIoYjT6Kg\n- https://github.com/yangjianxin1/Firefly\n\nFirefly（流萤） 是一个开源的中文对话式大语言模型，使用指令微调（Instruction Tuning）在中文数据集上进行调优。同时使用了词表裁剪、ZeRO、张量并行等技术，有效降低显存消耗和提高训练效率。 在训练中，我们使用了更小的模型参数量，以及更少的计算资源。\n\n我们构造了许多与中华文化相关的数据，以提升模型这方面的表现，如对联、作诗、文言文翻译、散文、金庸小说等。\n\n### 凤凰\n- https://mp.weixin.qq.com/s/beAAh_MdqssV8bEKsccElg\n- https://github.com/FreedomIntelligence/LLMZoo\n\nLLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.\n\n### 【复旦】MOSS\n- 
https://github.com/OpenLMLab/MOSS\n- https://mp.weixin.qq.com/s/LjToZVWjQ-ot5KJFCFtA3g\n\nMOSS是一个支持中英双语和多种插件的开源对话语言模型，moss-moon系列模型具有160亿参数，在FP16精度下可在单张A100/A800或两张3090显卡运行，在INT4/8精度下可在单张3090显卡运行。MOSS基座语言模型在约七千亿中英文以及代码单词上预训练得到，后续经过对话指令微调、插件增强学习和人类偏好训练具备多轮对话能力及使用多种插件的能力。\n\n### 【复旦】MOSS-RLHF\n- https://mp.weixin.qq.com/s/BjXtnEEVCQiPOy-_qCNM4g\n- https://openlmlab.github.io/MOSS-RLHF/paper/SecretsOfRLHFPart1.pdf\n- https://openlmlab.github.io/MOSS-RLHF/\n\nFudanNLP 团队通过大量、详实工作，设计实验充分探索了大模型 RLHF 的完整工作流程，仔细剖析了 RLHF 中的强化学习 PPO 算法的内部工作原理以及它在整个 RLHF 中的作用，并研究各种优化方法如何影响训练过程。通过这些努力，确定了使得 PPO 算法在大模型人类对齐方面行之有效的关键因素。\n\n综合上述发现，该团队进一步总结出在大模型上训练更稳定的 PPO 算法版本：PPO-max。并使用 Helpful 和 Harmless 数据集全面评估，结果显示经过 PPO-max 算法训练的模型展现出了出色的人类对齐性能！\n\n### 【度小满】轩辕-首个千亿级中文金融对话模型\n- https://arxiv.org/pdf/2305.12002.pdf\n- https://huggingface.co/xyz-nlp/XuanYuan2.0\n- https://github.com/Duxiaoman-DI/XuanYuan\n- https://zhuanlan.zhihu.com/p/632780608\n\n轩辕是国内首个开源的千亿级中文对话大模型，同时也是首个针对中文金融领域优化的千亿级开源对话大模型。轩辕在BLOOM-176B的基础上针对中文通用领域和金融领域进行了针对性的预训练与微调，它不仅可以应对通用领域的问题，也可以解答与金融相关的各类问题，为用户提供准确、全面的金融信息和建议。\n\n### 悟道·天鹰（Aquila）\n- https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila\n\n这是首个具备中英双语知识、支持商用许可协议、支持国内数据合规要求的开源语言大模型。悟道·天鹰（Aquila）系列模型包括 Aquila基础模型（7B、33B），AquilaChat对话模型（7B、33B）以及 AquilaCode “文本-代码”生成模型。\n\n- https://github.com/FlagAI-Open/Aquila2\n\nWe announce that our Aquila2 series is now open source, comprising Aquila2 (the base language models: Aquila2-7B and Aquila2-34B) and AquilaChat2 (the chat models, namely AquilaChat2-7B and AquilaChat2-34B, as well as the long-text chat models, namely AquilaChat2-7B-16k and AquilaChat2-34B-16k). You can find the links in the following table. 
Kindly click on them to access the model cards.\n\n### 桃李：国际中文教育大模型\n- https://github.com/blcuicall/taoli\n\n随着ChatGPT引起全社会的关注，及各类大语言模型（Large Language Model）争相亮相，通用领域自然语言处理任务已获得巨大成功，引起了国际中文教育领域的普遍关注。\n\n国际中文教育人士纷纷展开了对大模型的探讨： 大模型是否可以根据学习者的水平，提供合适的语言表达，或根据学习者的问题给出详细的解答，从而在一定程度上辅助甚至充当学习伙伴、语言教师？ 然而，目前通用领域的大模型在垂直领域的效果仍有限。\n\n为解决上述问题，我们全面推出适用于国际中文教育领域的大模型 “桃李”（Taoli）1.0 ，一个在国际中文教育领域数据上进行了额外训练的模型。\n\n我们基于目前国际中文教育领域流通的500余册国际中文教育教材与教辅书、汉语水平考试试题以及汉语学习者词典等，构建了国际中文教育资源库。 我们设置了多种形式的指令来充分利用知识，构造了共计 88000 条的高质量国际中文教育问答数据集，并利用收集到的数据对模型进行指令微调，让模型习得将国际中文教育知识应用到具体场景中的能力。\n\n### 情感大模型PICA\n- https://mp.weixin.qq.com/s/E37EFe10185THHa3pSqBig\n- https://github.com/NEU-DataMining/PICA\n- https://huggingface.co/NEUDM/PICA-V1\n\nPICA 以清华大学开源的ChatGLM2-6B为基础，采用Prompt tuning技术在4 卡 A6000 上训练大约15个小时得到。我们和SoulChat 进行了对比（最后部分），我们的模型在体验和安全上更有优势。我们只使用了2K的数据进行了p-tuning 微调，这充分说明了我们构造的数据质量比较高。模型权重可以在 HuggingFace 访问，欢迎各位使用并提出宝贵的意见。\n\n### 雅意大模型\n- https://github.com/wenge-research/YaYi\n- https://yayi.wenge.com/\n\n雅意大模型在百万级人工构造的高质量领域数据上进行指令微调得到，训练数据覆盖媒体宣传、舆情分析、公共安全、金融风控、城市治理等五大领域，上百种自然语言指令任务。雅意大模型从预训练初始化权重到领域模型的迭代过程中，我们逐步增强了它的中文基础能力和领域分析能力，并增加了多轮对话和部分插件能力。同时，经过数百名用户内测过程中持续不断的人工反馈优化，我们进一步提升了模型性能和安全性。\n\n我们希望通过雅意大模型的开源，为促进中文预训练大模型开源社区的发展贡献一份力量，并通过开源与每一位合作伙伴共建雅意大模型生态。\n\n### 儿童情感陪伴大模型“巧板”\n- https://github.com/HIT-SCIR-SC/QiaoBan\n\n巧板大模型是一个7B规模的大语言模型。“巧板”指七巧板，是一款承载着中国传统智慧的益智拼图玩具，更是一款教育益智工具。这次发布的儿童大模型正是希望通过陪伴、益智和教育功能，与儿童们建立更深厚的情感纽带。此外，为符合SCIR实验室发布大模型命名规范，故命名为“巧板”大模型。而这个特别的名称也蕴含着我们对儿童成长的悉心呵护，就像巧板一样，为他们拼出美好未来提供帮助。\n\n巧板大模型独具三大特点：\n1. 儿童心理学理论指导。基于情绪辅导理论的儿童情感陪伴对话数据构建，更有效地守护孩子的心理健康。\n\n2. 高质量的儿童对话数据构建。高质量对话数据由具有儿童心理学背景的志愿者与专家参与完成，确保数据的真实性与有效性。\n\n3. 
温暖的儿童陪伴体验。与儿童的交互方式更加贴心，能够真正与他们建立深入的情感连接，让儿童感受到温暖和认同，成为他们坚实成长道路上的得力伙伴。\n\n### 通义千问Qwen\n- https://github.com/QwenLM/Qwen-7B\n- https://qwenlm.github.io/\n\n我们在🤖 ModelScope以及🤗 Hugging Face均开源了Qwen-7B系列模型。请在本文档顶部点击相关链接查看仓库信息。本仓库主要包括Qwen-7B的简介、使用指南、技术备忘等内容。想了解更多关于模型的信息，请点击链接查看我们的技术备忘录。\n\n通义千问-7B（Qwen-7B） 是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样，覆盖广泛，包括大量网络文本、专业书籍、代码等。同时，在Qwen-7B的基础上，我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。Qwen-7B系列模型的特点包括：\n1. 大规模高质量预训练数据：我们使用了超过2.2万亿token的自建大规模预训练数据集进行语言模型的预训练。数据集包括文本和代码等多种数据类型，覆盖通用领域和专业领域。\n2. 优秀的模型性能：相比同规模的开源模型，Qwen-7B在多个评测数据集上具有显著优势，甚至超出12-13B等更大规模的模型。评测评估的能力范围包括自然语言理解与生成、数学运算解题、代码生成等。\n3. 更好地支持多语言：基于更大词表的分词器在分词上更高效，同时它对其他语言表现更加友好。用户可以在Qwen-7B的基础上更方便地训练特定语言的7B语言模型。\n4. 8K的上下文长度：Qwen-7B及Qwen-7B-Chat均能支持8K的上下文长度, 允许用户输入更长的prompt。\n5. 支持插件调用：Qwen-7B-Chat针对插件调用相关的对齐数据做了特定优化，当前模型能有效调用插件以及升级为Agent。\n\n### 活字\n- https://mp.weixin.qq.com/s/WEitgZjOxZpp7KIbRU0ewg\n- https://github.com/HIT-SCIR/huozi\n\n大规模语言模型（LLM）在自然语言处理的通用领域已取得了令人瞩目的成功。对于广泛的应用场景，这种技术展示了强大的潜力，学术界和工业界的兴趣也持续升温。哈工大自然语言处理研究所30余位老师和学生参与开发了通用对话大模型活字1.0，哈工大社会计算与信息检索研究中心(哈工大-SCIR)研发了活字2.0，致力于为自然语言处理的研究和实际应用提供更多可能性和选择。\n\n活字3.0是基于Chinese-Mixtral-8x7B，在大约30万行指令数据上微调得到的模型。该模型支持32K上下文，能够有效处理长文本。活字3.0继承了基座模型丰富的中英文知识，并在数学推理、代码生成等任务上具有强大性能。经过指令微调，活字3.0还在指令遵循能力和安全性方面实现了显著提升。\n\n### 韩非 HanFei\n- https://github.com/siat-nlp/HanFei\n\nHanFei-1.0(韩非)是国内首个全参数训练的法律大模型，参数量7b，主要功能包括：法律问答、多轮对话、撰写文章、检索（敬请期待）等。\n\n### 智海 录问\n- https://github.com/zhihaiLLM/wisdomInterrogatory\n\n智海-录问(wisdomInterrogatory)是由浙江大学、阿里巴巴达摩院以及华院计算三家单位共同设计研发的法律大模型。核心思想：以“普法共享和司法效能提升”为目标，从推动法律智能化体系入司法实践、数字化案例建设、虚拟法律咨询服务赋能等方面提供支持，形成数字化和智能化的司法基座能力。\n\n### Anima：基于QLoRA的33B中文大语言模型\n- https://github.com/lyogavin/Anima\n\nAI Community从来都是非常开放的，AI发展到今天，离不开很多以前的重要开源工作，开放共享的Paper，或者的开源数据和代码。我们相信AI的未来也一定是开放的。希望能为开源社区做一些贡献。\n\n为什么33B模型很重要？QLoRA是个Game 
Changer？\n\n之前大部分开源可finetune的模型大都是比较小的模型7B或者13B，虽然可以在一些简单的chatbot评测集上，通过finetune训练有不错的表现。但是由于这些模型规模还是有限，LLM核心的reasoning的能力还是相对比较弱。这就是为什么很多这种小规模的模型在实际应用的场景表现像是个玩具。如这个工作中的论述：chatbot评测集比较简单，真正比较考验模型能力的复杂逻辑推理及数学问题上小模型和大模型差距还是很明显的。\n\n因此我们认为QLoRA 的工作很重要，重要到可能是个Game Changer。通过QLoRA的优化方法，第一次让33B规模的模型可以比较民主化的，比较低成本的finetune训练，并且普及使用。我们认为33B模型既可以发挥大规模模型的比较强的reasoning能力，又可以针对私有业务领域数据进行灵活的finetune训练提升对于LLM的控制力。\n\n### BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models\n- https://github.com/ictnlp/BayLing\n- https://arxiv.org/abs/2306.10968\n\nBayLing (百聆, bǎi líng) is an instruction-following large language model equipped with advanced language alignment, showing superior capability in English/Chinese generation, instruction following and multi-turn interaction. BayLing can be effortlessly deployed on a consumer-grade GPU with 16GB of memory, and assists users with tasks such as translation, writing, creation, suggestion...\n\n### BBT-FinCUGE-Applications\n- https://github.com/ssymmetry/BBT-FinCUGE-Applications\n- https://arxiv.org/abs/2302.09432\n- 
https://bbt.ssymmetry.com/index.html\n\n1.目前最大规模的中文金融领域开源语料库BBT-FinCorpus。预训练语料库的规模与多样性对PLM的性能和泛化能力具有重要作用，所以为了更好的训练PLM，首先需要搜集大规模多样性的语料库。然而，目前中文金融领域缺乏大规模多样性开源语料库，已有的中文金融领域模型多数基于小规模的私有语料库，严重限制了中文金融PLM的能力提升。为此，我们构建了BBT-FinCorpus，一个包含有从四种异质性来源获取的约300GB文本的大规模多样性语料库。针对如何确定语料库的覆盖范围和语料来源集合的问题，我们首先搜集了中文互联网上可获取的所有中文金融NLP任务数据集，并根据其文本来源分布来确定所需要爬取的文本来源集合。在确认好需要爬取的文本来源集合之后，我们使用基于代理的分布式爬虫技术实现大规模爬取网页上的文本。\n\n2.目前最大规模的中文金融领域知识增强型预训练语言模型BBT-FinT5。PLM的架构与参数量对其性能有重要影响。现有的中文金融领域PLM都基于较为原始的BERT模型架构，参数量也相对较小，不能满足日益丰富的领域NLP需求。因此，我们基于T5模型架构构建了一个拥有十亿参数量的目前最大规模的中文金融领域预训练语言模型BBT-FinT5。为了在有限的硬件算力条件下，尽可能高效地利用好硬件算力，我们使用DeepSpeed加速框架对预训练过程进行效率优化。此外，我们还针对T5模型设计了独特的知识增强预训练方法，通过实验证明了该方法的有效性。\n\n3.首个中文金融领域自然语言处理评测基准CFLEB。现有的自然语言处理评估基准多是通用领域的，没有公开可用的中文金融领域评测基准。这导致中文金融领域现有的预训练语言模型在不同的任务集合上进行评测，难以相互比较，阻碍了中文金融领域PLM性能的快速提升。为此，我们首先构建了首个中文金融领域自然语言处理评测基准CFLEB，包含六种不同的任务，涵盖对PLM理解与生成能力的评估。针对评测基准任务的选择及其选择标准问题，我们认为领域评测基准应当着重强调任务的实用性，以更好的反映学术界改进PLM对现实世界的帮助。为此，我们首先邀请金融领域专家对所有可获取的中文金融任务进行了实用性评价，筛选出具有较高实用性评分的任务。之后，我们综合任务数据集的开源情况确定了六个任务数据集作为最终的评测基准。该评测基准的早期版本命名为FinCUGE，包含八个任务，该版本目前已舍弃。\n\n### BELLE: Bloom-Enhanced Large Language model Engine\n- https://huggingface.co/BelleGroup\n- https://github.com/LianjiaTech/BELLE\n- https://zhuanlan.zhihu.com/p/616079388\n\n本项目目标是促进中文对话大模型开源社区的发展，愿景做能帮到每一个人的LLM Engine。现阶段本项目基于一些开源预训练大语言模型（如BLOOM），针对中文做了优化，模型调优仅使用由ChatGPT生产的数据（不包含任何其他数据）。\n\n本项目基于 Stanford Alpaca ，Stanford Alpaca 的目标是构建和开源一个基于LLaMA的模型。 Stanford Alpaca 的种子任务都是英语，收集的数据也都是英文，因此训练出来的模型未对中文优化。\n\n### Bloom\n- https://huggingface.co/blog/bloom\n- https://huggingface.co/bigscience/bloom\n\nBLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. 
BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks.\n\n### BiLLa: A Bilingual LLaMA with Enhanced Reasoning Ability\n- https://zhuanlan.zhihu.com/p/628688680\n- https://github.com/Neutralzz/BiLLa\n\nBiLLa是开源的推理能力增强的中英双语LLaMA模型。模型的主要特性有：\n- 较大提升LLaMA的中文理解能力，并尽可能减少对原始LLaMA英文能力的损伤；\n- 训练过程增加较多的任务型数据，利用ChatGPT生成解析，强化模型理解任务求解逻辑；\n- 全量参数更新，追求更好的生成效果。\n\n### BLOOMChat176B\n- https://mp.weixin.qq.com/s/cY6ORD8CUyXRL0l20EjwqQ\n- https://sambanova.ai/blog/introducing-bloomchat-176b-the-multilingual-chat-based-llm/\n- https://huggingface.co/spaces/sambanovasystems/BLOOMChat\n- https://github.com/sambanova/bloomchat\n\n开源对话模型一直跟闭源模型在多语言能力上存在差距。SambaNova 和斯坦福 Together Computer 开源可商用的多语言聊天模型 BLOOMChat 176B，支持中文。BLOOMChat 在SambaNova 自研芯片 RDU 上完成训练，借助 SambaNova 的独特可重构数据流架构，利用 BLOOM 开源模型的核心能力，通过在 OpenChatKit、Dolly 2.0 和 OASST1 的 OIG 上进行微调。在基于六种语言的早期双盲测试中，BLOOMChat 在 66%的测评数据上产生的对话表现优于近期的开源对话模型。同时在与 GPT4 的基于六种语言的人工测评对比中，BLOOMChat 得到 45%对 55%的胜率，大大缩小开源和闭源模型的多语言对话能力差距。当前 BLOOMChat 开源模型文件，支持在 huggingface 在线推理试用。\n\n### ChatLaw 法律大模型\n- https://www.chatlaw.cloud/\n- https://github.com/PKU-YuanGroup/ChatLaw\n- https://arxiv.org/pdf/2306.16092.pdf\n\n但愿世间不纷争，何惜法典卷生尘\n\nChatGPT浪潮下，人工智能的不断扩展和发展为LLM的扩散提供了肥沃的土壤，目前医疗、教育、金融领域已逐渐有了各自的模型，但法律领域迟迟没有明显进展。\n\n为了促进LLM在法律甚至其他垂直应用落地的开放研究，本项目开源了中文法律大模型，并针对LLM和知识库的结合问题给出了法律场景下合理的解决方案。\n\nChatLaw法律大模型目前开源的仅供学术参考的版本底座为姜子牙-13B、Anima-33B，我们使用大量法律新闻、法律论坛、法条、司法解释、法律咨询、法考题、判决文书等原始文本来构造对话数据。\n\n基于姜子牙-13B的模型是第一版模型，得益于姜子牙的优秀中文能力和我们对数据清洗、数据增强过程的严格要求，我们在逻辑简单的法律任务上表现优异，但涉及到复杂逻辑的法律推理任务时往往表现不佳。\n\n随后基于Anima-33B，我们增加了训练数据，做成了ChatLaw-33B，发现逻辑推理能力大幅提升，由此可见，大参数的中文LLM是至关重要的。\n\n我们的技术报告在这里: arXiv: ChatLaw\n\n基于可商用的模型训练而成的版本会作为我们后续产品内部接入的版本，对外不开源，可以在这里进行开源版本模型的试用\n\n### Chinese-Llama-2-7b (LinkSoul-AI)\n- https://github.com/LinkSoul-AI/Chinese-Llama-2-7b\n- https://huggingface.co/spaces/LinkSoul/Chinese-Llama-2-7b\n\n全部开源，完全可商用的中文版 Llama2 模型及中英文 SFT 数据集，输入格式严格遵循 llama-2-chat 格式，兼容适配所有针对原版 
llama-2-chat 模型的优化。\n\n### Chinese-Vicuna-medical\n- https://github.com/Facico/Chinese-Vicuna/blob/master/docs/performance-medical.md\n\n在cMedQA2上使用我们的checkpoint-11600 continue finetune\n\n目前从2个epoch的Vicuna开始continue finetune，效果比3个epoch的在医疗问答数据更具有专业性，同时由于数据集构建的问题，会更加规范，比如经常性的加上“到正规医院检查”等等\n\n- 同时验证了指令微调的有效性\n- 使用单指令continue-finetune能保留原来更多的性能\n\n### Cornucopia-LLaMA-Fin-Chinese\n- https://github.com/jerry1993-tech/Cornucopia-LLaMA-Fin-Chinese\n\n聚宝盆(Cornucopia): 基于中文金融知识的LLaMA微调模型\n本项目开源了经过中文金融知识指令精调/指令微调(Instruct-tuning) 的LLaMA-7B模型。通过中文金融公开数据+爬取的金融数据构建指令数据集，并在此基础上对LLaMA进行了指令微调，提高了 LLaMA 在金融领域的问答效果。\n\n基于相同的数据，后期还会利用GPT3.5 API构建高质量的数据集，另在中文知识图谱-金融上进一步扩充高质量的指令数据集\n\n陆续会发布研发的新模型（next-pretrain、multi-task SFT、RLHF Optimize），欢迎大家届时使用体验。\n\n### chatglm-maths\n- https://github.com/yongzhuo/chatglm-maths\n\nchatglm-6b微调/LORA/PPO/推理, 样本为自动生成的整数/小数加减乘除运算, 可gpu/cpu。\n\n### Abel\n- https://github.com/GAIR-NLP/abel\n\nAbel is created as a tribute to Niels Henrik Abel for his groundbreaking work in algebra and analysis, at which our model is relatively better as well. There is still a long way for us to go, though 🏃‍♂️🏃‍♀️🏁🏃‍♂️🏃‍♀️.\n\nWe show that:\n- without tools\n- without continuing pretraining\n- without reward model\n- without RLHF\n- ONLY using SFT\n\nWe have established a new state-of-the-art performance across open-source LLMs (that do not use external tools) on the GSM8k (83.62) and MATH (28.26) benchmarks. Specifically\n\n### InternLM-Math\n- https://github.com/InternLM/InternLM-Math\n\nState-of-the-art bilingual open-sourced Math reasoning LLMs. A solver, prover, verifier, augmentor.\n\n### DeepSeekMath\n- https://arxiv.org/abs/2402.03300\n- https://github.com/deepseek-ai/DeepSeek-Math\n\nMathematical reasoning poses a significant challenge for language models due to its complex and structured nature. 
In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.\n\n### LeerooDedicated-Math-7b\n- https://huggingface.co/leeroo/LeerooDedicated-Math-7b\n- https://arxiv.org/abs/2401.13979\n\nIn this paper, we propose an architecture to harness the collective knowledge of multiple trained LLMs to create a new state-of-the-art. At the core of this framework is a LLM-based orchestrator that is adept at picking the right underlying LLM experts for optimal task execution. Inspired by self-play in reinforcement learning, we created a loop of query generation, orchestration, and evaluation to generate training data for the orchestrator. Our evaluation focused on the MMLU benchmark, employing models with 7B, 13B, and 34B parameters available on Hugging Face. The results demonstrate new state-of-the-art open-source models: Our Leeroo orchestrator achieves performance on par with the Mixtral model while incurring only two-thirds of its cost. Moreover, increasing the allowed cost surpasses Mixtral's accuracy by over 5% at the same cost level, reaching an accuracy of 75.9%. 
Further enhancements were observed when integrating GPT4 into the underlying model pool. The Leeroo orchestrator nearly matches GPT4's performance at half the cost and even exceeds GPT4's results with a 25% cost reduction. These findings illustrate the potential of our architecture in creating state-of-the-art and cost-effective LLMs by optimizing the synergy between multiple LLMs to achieve superior performance outcomes.\n\n### SimpleGeometry\n- https://huggingface.co/datasets/bethgelab/simplegeometry\n- https://arxiv.org/abs/2404.06405\n\nProving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 methods by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. 
Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.\n\n### Rho-1\n- https://arxiv.org/abs/2404.07965\n\nPrevious language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that \"Not all tokens in a corpus are equally important for language model training\". Our initial analysis delves into token-level training dynamics of language model, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that aligned with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continual pretraining on 15B OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves 6.8% average enhancement across 15 diverse tasks, increasing both efficiency and performance of the language model pre-training.\n\n### ChatGLM-Math\n- https://github.com/THUDM/ChatGLM-Math\n- https://arxiv.org/pdf/2404.02893.pdf\n\nLarge language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving. 
While many strategies and datasets to enhance LLMs' mathematics are developed, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger.\n\n### JiuZhang3.0\n- https://arxiv.org/abs/2405.14365\n- https://github.com/RUCAIBox/JiuZhang3.0\n- https://huggingface.co/ToheartZhang/JiuZhang3.0-8B\n- https://huggingface.co/datasets/ToheartZhang/JiuZhang3.0-Corpus-PT-CoT\n\nJiuZhang3.0 is a series of fine-tuned models for math reasoning continually pre-trained on corpus synthesized by our carefully trained small LLM.\n\n### InternLM2-WQX\n- https://github.com/InternLM/InternLM-WQX\n\nInternLM2-WQX与InternLM2-WQX-VL是InternLM团队于2024年高考前夕最新推出的文曲星系列模型。\n\n高考覆盖各类学科及题型，同时因其开考前的“绝密性”，被视作中国最具权威的考试之一，成为评估考生综合能力的“试金石”。这一面向人类设计的高难度综合性测试，目前普遍被研究者用于考察大模型的智能水平。InternLM2-WQX系列模型在2024年高考评测集GAOKAO-Eval上取得了优异的成绩，综合表现与GPT-4o相当，且超越了国内外一系列开源大模型，体现了InternLM2-WQX系列模型优秀的性能。\n\n### Math-Minos\n- https://arxiv.org/abs/2406.14024\n- https://github.com/KbsdJames/MATH-Minos\n\nMathematical verifiers achieve success in mathematical reasoning tasks by validating the correctness of solutions. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. 
To mitigate the aforementioned insufficiency of binary labels, we introduce step-wise natural language feedbacks as rationale labels (i.e., the correctness of the current step and the explanations). In this paper, we propose **Math-Minos**, a natural language feedback enhanced verifier by constructing automatically-generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set (30k) of natural language feedbacks can significantly boost the accuracy of the verifier by 1.6% (86.6% → 88.2%) on GSM8K and 0.8% (37.8% → 38.6%) on MATH.\n\n### NuminaMath 7B TIR\n- https://huggingface.co/AI-MO/NuminaMath-7B-TIR\n\nNuminaMath is a series of language models that are trained to solve math problems using tool-integrated reasoning (TIR). NuminaMath 7B TIR won the first progress prize of the AI Math Olympiad (AIMO), with a score of 29/50 on the public and private test sets.\n\n### MathΣtral\n- https://mistral.ai/news/mathstral/\n\nMathstral can achieve significantly better results with more inference-time computation: Mathstral 7B scores 68.37% on MATH with majority voting and 74.59% with a strong reward model among 64 candidates.\n\nMathstral is an instructed model – use it or fine-tune it as such, referring to our documentation. Weights are hosted on HuggingFace. You can try Mathstral now with mistral-inference and adapt it with mistral-finetune.\n\n### LLaMAX\n- https://arxiv.org/pdf/2407.05975\n- https://github.com/CONE-MT/LLaMAX/\n\nLarge Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. 
To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours in conducting extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs (by more than 10 spBLEU points) and performs on par with a specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model.\n\n### Qwen2-Math\n- https://huggingface.co/collections/Qwen/qwen2-math-66b4c9e072eda65b5ec7534d\n- https://github.com/QwenLM/Qwen2-Math\n\nOver the past year, we have dedicated significant effort to researching and enhancing the reasoning capabilities of large language models, with a particular focus on their ability to solve arithmetic and mathematical problems. Today, we are delighted to introduce a series of math-specific large language models of our Qwen2 series, Qwen2-Math and Qwen2-Math-Instruct-1.5B/7B/72B. Qwen2-Math is a series of specialized math language models built upon the Qwen2 LLMs, which significantly outperform the mathematical capabilities of open-source models and even closed-source models (e.g., GPT4o). We hope that Qwen2-Math can contribute to the scientific community by solving advanced mathematical problems that require complex, multi-step logical reasoning.\n\n### AIMO-CMU_MATH\n- https://github.com/AIMO-CMU-MATH/CMU_MATH-AIMO\n\nWe are the proud winners of 2nd place in the AI Mathematical Olympiad (AIMO).\n\nWe are pleased to share all the datasets and code used in our competition. 
This repository contains the resources needed to reproduce our models and solutions.\n\n### Qwen2.5-Math\n- https://qwenlm.github.io/zh/blog/qwen2.5-math/\n\nQwen2.5-Math主要被设计用于通过CoT或TIR的方式解中英数学题，我们不推荐在其他任务上使用该系列模型。\n\n### SocraticLM\n- https://openreview.net/pdf?id=qkoZgJhxsA\n- https://github.com/Ljyustc/SocraticLM\n\n### Open Thoughts\n- github.com/open-thoughts/open-thoughts  \n\nOur first goal is to curate a reasoning dataset to train state-of-the-art small reasoning models that surpass DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-7B on math and code reasoning benchmarks.\n\n### simpleRL-reason\n- https://github.com/hkust-nlp/simpleRL-reason\n\nThis repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only rule-based reward is used, the recipe is almost the same as the one used in DeepSeek-R1, except that the code currently uses PPO rather than GRPO. We have used this code to train small models (7B) on limited data (8K examples), achieving surprisingly strong results -- for example, starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the resultant model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-math-7B-instruct and being comparable to previous baselines that use >50x more data and more complicated components. You may check our Notion blog or the Introduction below for more details.\n\n### DRT-o1\n- https://github.com/krystalan/DRT-o1\n\nThis repository contains the resources for our paper \"DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought\"\n\n### ChatRWKV\n- https://github.com/BlinkDL/ChatRWKV\n\nChatRWKV is like ChatGPT but powered by my RWKV (100% RNN) language model, which is the only RNN (as of now) that can match transformers in quality and scaling, while being faster and saves VRAM. 
Training sponsored by Stability EleutherAI :)\n\n### ChatYuan\n- https://github.com/clue-ai/ChatYuan\n- https://modelscope.cn/models/ClueAI/ChatYuan-large\n\n元语功能型对话大模型, 这个模型可以用于问答、结合上下文做对话、做各种生成任务，包括创意性写作，也能回答一些像法律、新冠等领域问题。它基于PromptCLUE-large结合数亿条功能对话多轮对话数据进一步训练得到。\n\nPromptCLUE-large在1000亿token中文语料上预训练，累计学习1.5万亿中文token，并且在数百种任务上进行Prompt任务式训练。针对理解类任务，如分类、情感分析、抽取等，可以自定义标签体系；针对多种生成任务，可以进行采样自由生成。\n\n### ChatGLM-6B\n- https://github.com/THUDM/ChatGLM-6B\n- https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning\n\nChatGLM-6B 是一个开源的、支持中英双语的对话语言模型，基于 General Language Model (GLM) 架构，具有 62 亿参数。结合模型量化技术，用户可以在消费级的显卡上进行本地部署（INT4 量化级别下最低只需 6GB 显存）。 ChatGLM-6B 使用了和 ChatGPT 相似的技术，针对中文问答和对话进行了优化。经过约 1T 标识符的中英双语训练，辅以监督微调、反馈自助、人类反馈强化学习等技术的加持，62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答。更多信息请参考我们的博客。\n\n### ChatGLM2-6B\n- https://github.com/THUDM/ChatGLM2-6B\n\nChatGLM2-6B 是开源中英双语对话模型 ChatGLM-6B 的第二代版本，在保留了初代模型对话流畅、部署门槛较低等众多优秀特性的基础之上，ChatGLM2-6B 引入了如下新特性：\n\n- 更强大的性能：基于 ChatGLM 初代模型的开发经验，我们全面升级了 ChatGLM2-6B 的基座模型。ChatGLM2-6B 使用了 GLM 的混合目标函数，经过了 1.4T 中英标识符的预训练与人类偏好对齐训练，评测结果显示，相比于初代模型，ChatGLM2-6B 在 MMLU（+23%）、CEval（+33%）、GSM8K（+571%） 、BBH（+60%）等数据集上的性能取得了大幅度的提升，在同尺寸开源模型中具有较强的竞争力。\n- 更长的上下文：基于 FlashAttention 技术，我们将基座模型的上下文长度（Context Length）由 ChatGLM-6B 的 2K 扩展到了 32K，并在对话阶段使用 8K 的上下文长度训练，允许更多轮次的对话。但当前版本的 ChatGLM2-6B 对单轮超长文档的理解能力有限，我们会在后续迭代升级中着重进行优化。\n- 更高效的推理：基于 Multi-Query Attention 技术，ChatGLM2-6B 有更高效的推理速度和更低的显存占用：在官方的模型实现下，推理速度相比初代提升了 42%，INT4 量化下，6G 显存支持的对话长度由 1K 提升到了 8K。\n- 更开放的协议：ChatGLM2-6B 权重对学术研究完全开放，在获得官方的书面许可后，亦允许商业使用。如果您发现我们的开源模型对您的业务有用，我们欢迎您对下一代模型 ChatGLM3 研发的捐赠。\n\n### Chinese-Transformer-XL\n- https://github.com/THUDM/Chinese-Transformer-XL\n\n本项目提供了智源研究院\"文汇\" 预训练模型Chinese-Transformer-XL的预训练和文本生成代码。\n\n### ChatMed-TCM & ChatMed-Consult\n- https://github.com/michael-wzhu/ChatMed\n\n🚀 ChatMed-Consult : 基于中文医疗在线问诊数据集ChatMed_Consult_Dataset的50w+在线问诊+ChatGPT回复作为训练集。模型主干为LlaMA-7b,融合了Chinese-LlaMA-Alpaca的LoRA权重与中文扩展词表，然后再进行基于LoRA的参数高效微调。我们将全部代码都进行了公开。我们也将部署一个在线Gradio demo, 敬请关注。\n\n⏳ 
ChatMed-TCM: LLMs empowering the inheritance of traditional Chinese medicine (TCM). The training data for this model is the TCM instruction dataset ChatMed_TCM_Dataset. Based on our open-source TCM knowledge graph, we used an entity-centric self-instruct method, calling ChatGPT to obtain 26K+ TCM-centered instruction examples. The ChatMed-TCM model also uses LLaMA as the base and is fine-tuned with LoRA.

### ChatGLM-Med
- https://github.com/SCIR-HI/Med-ChatGLM

A ChatGLM model fine-tuned on Chinese medical knowledge. This project releases a ChatGLM-6B model that has been instruction-tuned on Chinese medical data. We built a Chinese medical instruction dataset from a medical knowledge graph and the GPT-3.5 API, then instruction-tuned ChatGLM-6B on it, improving ChatGLM's question-answering performance in the medical domain.

### CPM-Bee
- https://mp.weixin.qq.com/s/UCW1BT60Lr9x24Rj0cLuxw
- https://huggingface.co/openbmb/cpm-bee-10b
- https://github.com/OpenBMB/CPM-Bee

CPM-Bee is a fully open-source, commercially usable Chinese-English base model with ten billion parameters. It uses a Transformer auto-regressive architecture and is pre-trained on trillions of tokens of high-quality corpora, giving it strong foundational capabilities. CPM-Bee's characteristics can be summarized as follows:

Open-source and commercially usable: OpenBMB upholds the open-source spirit of "bringing large models to every household"; the CPM-Bee base model is fully open-source and commercially usable. For commercial use, an enterprise simply applies by email with verified identity and obtains an official authorization certificate.

Excellent bilingual performance: the CPM-Bee base model's pre-training corpora were strictly filtered and balanced, and the model performs impressively in both Chinese and English; see the evaluation tasks and results for details.

Massive high-quality corpora: the CPM-Bee base model was trained on trillions of tokens, making it one of the models trained on the most data in the open-source community. The pre-training corpora were strictly filtered, cleaned, and post-processed to ensure quality.

Support from the OpenBMB ecosystem: OpenBMB has developed a suite of tools for high-performance pre-training, adaptation, compression, and deployment; the CPM-Bee base model ships with all accompanying tool scripts, efficiently supporting advanced use by developers.

Strong dialogue and tool-use abilities: combining OpenBMB's explorations in instruction tuning and tool learning, we fine-tuned the CPM-Bee base model into instance models with strong dialogue and tool-use capabilities, now open for invitation-only beta testing and gradually opening to the public in the future.

### DISC-MedLLM (Fudan)
- https://med.fudan-disc.com
- https://github.com/FudanDISC/DISC-MedLLM
- https://arxiv.org/abs/2308.14346

DISC-MedLLM is a medical LLM trained on our high-quality dataset DISC-Med-SFT on top of the general-domain Chinese LLM Baichuan-13B. Notably, our training data and training method can be adapted to any base LLM.

DISC-MedLLM has three key features:

- Reliable and rich professional knowledge. We use a medical knowledge graph as the information source, sampling triples and using a general LLM's language ability to construct dialogue samples.
- Multi-turn inquiry capability. We use real consultation dialogue records as the information source and use LLMs to reconstruct the dialogues, requiring the model to fully align with the medical information in the dialogue during construction.
- Responses aligned with human preference. Patients want richer supporting information and background knowledge during consultations, but human doctors' answers are often terse; we manually curate a small set of high-quality instruction samples to align with patients' needs.

### Data-Copilot
- https://github.com/zwq2018/Data-Copilot
- https://arxiv.org/abs/2306.07209
- https://huggingface.co/spaces/zwq2018/Data-Copilot

Data-Copilot is an LLM-based system for data-related tasks, connecting billions of data points with diverse user demands. It autonomously designs interface tools to efficiently manage, invoke, process, and visualize data. Upon receiving a complex request, Data-Copilot
autonomously invokes these self-designed interfaces and builds a workflow to fulfill the user's intent. Without human assistance, it can skillfully transform raw data from different sources and in different formats into human-friendly output such as charts, tables, and text.

### Tabular LLM
- https://github.com/SpursGoZmy/Tabular-LLM

We propose the Tabular-LLM project, with the following core plan:

- Explore representations for different table types: training an LLM requires serializing a table into a text sequence. LLMs such as ChatGPT use Markdown to represent simple tables, but this cannot represent more complex tables well, such as hierarchical tables with merged cells, so we need to explore how to (uniformly) represent different table types; see the next section for more discussion.
- Collect and organize data covering multiple table types and table-intelligence tasks: considering the table-intelligence tasks most studied in academia, we collect open-source datasets and convert them into instruction-tuning format so that users can pick what they need.
- Release and analyze open table-intelligence LLMs: we fine-tune models such as Alpaca-CoT with the collected data to build the first batch of open-source LLMs for table-intelligence tasks, then test and analyze the trained models, e.g., measuring their performance on academic test sets; we will organize the experimental results into documentation, hoping to provide useful lessons for the community.

### Chain-of-table
- https://blog.research.google/2024/03/chain-of-table-evolving-tables-in.html
- https://arxiv.org/abs/2401.04398

Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions.
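The operate-then-update loop described above can be sketched in a few lines. Here a scripted plan stands in for the LLM planner, and the operation names (`select_rows`, `select_columns`) are illustrative stand-ins for the framework's actual operation set and prompting:

```python
# Chain-of-Table sketch: each operation's output table becomes the next
# operation's input, and the whole chain records the reasoning process.
# A scripted plan replaces the in-context LLM planner of the real framework.

def select_rows(table, predicate):
    """Keep only rows matching the predicate (an intermediate 'thought')."""
    header, rows = table
    return header, [r for r in rows if predicate(r)]

def select_columns(table, keep):
    """Project the table onto the named columns."""
    header, rows = table
    idx = [header.index(c) for c in keep]
    return [header[i] for i in idx], [[r[i] for i in idx] for r in rows]

def chain_of_table(table, plan):
    """Apply a chain of operations; return every intermediate table."""
    chain = [table]
    for op, arg in plan:
        table = op(table, arg)
        chain.append(table)
    return chain

# Toy question: "Which country hosted in 2024?"
table = (["year", "country"], [["2020", "Japan"], ["2024", "France"]])
plan = [
    (select_rows, lambda r: r[0] == "2024"),
    (select_columns, ["country"]),
]
final = chain_of_table(table, plan)[-1]  # the evolved table answers the question
```

In the real framework the plan is not fixed: the LLM sees the current table after each step and chooses the next operation, which is what lets the chain adapt to intermediate results.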
Chain-of-Table achieves new state-of-the-art performance on the WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.

### Data Interpreter
- https://arxiv.org/abs/2402.18679
- https://github.com/geekan/MetaGPT

Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness. However, their performance can be compromised in data science scenarios that require real-time data adjustment, expertise in optimization due to complex dependencies among various tasks, and the ability to identify logical errors for precise reasoning. In this study, we introduce the Data Interpreter, a solution designed to solve data science problems with code, emphasizing three pivotal techniques: 1) dynamic planning with hierarchical graph structures for real-time data adaptability; 2) dynamic tool integration to enhance code proficiency during execution, enriching the requisite expertise; 3) identification of logical inconsistencies in feedback, and efficiency enhancement through experience recording. We evaluate the Data Interpreter on various data science and real-world tasks. Compared to open-source baselines, it demonstrates superior performance, exhibiting significant improvements in machine learning tasks, increasing the score from 0.86 to 0.95. Additionally, it shows a 26% increase on the MATH dataset and a remarkable 112% improvement in open-ended tasks.

### TableLLM
- https://arxiv.org/abs/2403.19318
- https://github.com/TableLLM/TableLLM

We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for proficiently handling tabular data manipulation tasks, whether they are embedded within documents or spreadsheets, catering to real-world office scenarios.
We propose a distant supervision method for training, which comprises a reasoning-process extension strategy, aiding in training LLMs to understand reasoning patterns more effectively, as well as a cross-way validation strategy, ensuring the quality of the automatically generated data. To evaluate the performance of TableLLM, we have crafted a benchmark tailored to both document and spreadsheet formats, as well as a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM when compared to various existing general-purpose and tabular data-focused LLMs. We have publicly released the model checkpoint, source code, benchmarks, and a web application for user interaction.

### Lag-Llama
- https://github.com/time-series-foundation-models/lag-llama
- https://arxiv.org/abs/2310.08278

Lag-Llama is the first open-source foundation model for time series forecasting!

### TabuLa-8B
- https://github.com/mlfoundations/rtfm
- https://arxiv.org/abs/2406.12031
- https://huggingface.co/mlfoundations/tabula-8b

Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had a similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control.
Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more, data.

### Time-MoE
- https://arxiv.org/pdf/2409.16040
- https://github.com/Time-MoE/Time-MoE

1️⃣ Time-MoE is the first work to scale time series foundation models up to 2.4 billion parameters, trained from scratch.

2️⃣ Time-300B is the largest open-access time series data collection, comprising over 300 billion time points across more than 9 domains.

### DoctorGLM
- https://github.com/xionghonglin/DoctorGLM

DoctorGLM, a Chinese medical consultation model based on ChatGLM-6B.

### EduChat
- https://github.com/icalk-nlp/EduChat

Education is a social practice that shapes people's physical and mental development, aiming to draw out a person's inherent or latent qualities. It must therefore follow a "people-first" philosophy, focusing on personalized, guided, all-round development of body and mind. To better support people-first education, the EduNLP team at the School of Computer Science and Technology, East China Normal University, has been developing EduChat, a dialogue LLM for the education vertical. The project studies education-oriented dialogue techniques built on pre-trained large models, fusing diverse education-domain data and applying instruction tuning, value alignment, and related methods. It provides rich functions for education scenarios, such as automatic question generation, homework grading, emotional support, course tutoring, and Gaokao counseling, serving teachers, students, and parents, and helping realize personalized, fair, and warm intelligent education.

### EVA: A Large-Scale Chinese Open-Domain Dialogue System
- https://github.com/thu-coai/EVA

EVA is currently the largest open-source Chinese pre-trained dialogue model, with 2.8 billion parameters, and is mainly good at open-domain chitchat. There are two versions, 1.0 and 2.0: version 1.0 was trained on WudaoCorpus-Dialog, while version 2.0 was trained on higher-quality dialogue data cleaned from WudaoCorpus-Dialog, and its performance is clearly better than EVA 1.0.

### EcomGPT
- https://arxiv.org/abs/2308.06966
- https://github.com/Alibaba-NLP/EcomGPT

- We propose the first e-commerce instruction dataset, EcomInstruct, with a total of
2.5 million instruction examples.
- EcomInstruct scales up the data size and task diversity by constructing atomic tasks from basic e-commerce data types, such as product information and user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks.
- We developed EcomGPT by training the backbone model BLOOMZ on EcomInstruct. Benefiting from the fundamental semantic understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization capabilities.

### FinGLM
- https://github.com/MetaGLM/FinGLM/

📈 A conversational intelligent system aimed at deeply parsing the annual reports of listed companies. Facing the jargon and implicit information in financial text, we are committed to using AI to deliver expert-level financial analysis.

🚀 Although AI has made progress in text dialogue, real financial interaction scenarios remain a huge challenge. Multiple institutions jointly organized this competition to explore the frontier of AI in finance.

📘 Annual reports present a company's operations, finances, and future plans to investors. Professional knowledge is the key to interpreting them, and our goal is to make this process simpler and more accurate through AI.

### DISC-FinLLM
- https://fin.fudan-disc.com
- https://github.com/FudanDISC/DISC-FinLLM

DISC-FinLLM is a financial LLM that provides users with professional, intelligent, and comprehensive financial consulting services, developed and open-sourced by the Data Intelligence and Social Computing Lab of Fudan University (Fudan-DISC).

### Deepmoney
- https://sota.jiqizhixin.com/project/deepmoney
- https://huggingface.co/TriadParty

DeepMoney is an LLM project focused on investment in the financial domain. It is built on Yi-34B, DeepSeek 67B, and miqu-70b; the author has currently fine-tuned three versions: base and sft (based on Yi-34B), deepmoney-67b-chat (DeepSeek), and deepmoney-miqu-70b (miqu-70b). The base models were trained with full-parameter training. The training data consists of high-quality research reports covering 2019 through December 2023, mainly from traditional brokerages and professional research institutions; most are paid and available only to institutions. Unlike most financial models trained on public knowledge, DeepMoney can provide deep market interpretation, making up for the shortcomings of public knowledge in real-world finance. The project also integrates multimodal models to extract key information.

### GPT2 for Multiple Languages
- https://github.com/imcaspar/gpt2-ml

- Simplified and cleaned-up GPT-2 training code (based on Grover, supporting TPUs)
- Ported BERT tokenizer with multilingual support
- 1.5B-parameter Chinese GPT-2 pre-trained model (15 GB corpus, 100K training steps)
- Out-of-the-box demo of model generation quality
- 1.5B-parameter Chinese GPT-2 pre-trained model (30 GB corpus, 220K training steps)

### InternLM (书生・浦语)
- https://github.com/InternLM
- https://mp.weixin.qq.com/s/oTXnvWZJVdoOpFLHngbTYQ
- https://intern-ai.org.cn/home

InternLM has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios.
The model has the following characteristics:

- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
- It provides a versatile toolset for users to flexibly build their own workflows.

Additionally, a lightweight training framework is offered to support model pre-training without the need for extensive dependencies. With a single codebase, it supports pre-training on large-scale clusters with thousands of GPUs, and fine-tuning on a single GPU, while achieving remarkable performance optimizations. InternLM achieves nearly 90% acceleration efficiency during training on 1024 GPUs.

### Llama2-chat-Chinese-50W
- https://mp.weixin.qq.com/s/r_hKK5_cYm8ClqYVApkUYQ
- https://huggingface.co/RicardoLee/Llama2-chat-Chinese-50W

Because the current Llama2-chat models are hard to constrain to reply in Chinese, this model aims to provide a Llama2-chat 7B model that can do question answering in Chinese.

It uses Llama2-chat 7B as the base model and was trained with LoRA including the embedding and LM head. The parameters have been merged, so the model can be used directly; alternatively, you can manually merge sft_lora_model into Llama2-chat.

The training data is 500K SFT examples sampled from the BELLE project.

### Llama2-Chinese (FlagAlpha)
- https://github.com/FlagAlpha/Llama2-Chinese
- https://llama.family/

We are a technical community focused on optimizing Llama2 for Chinese and building on top of it. *Based on large-scale Chinese data, we continuously iterate on Llama2's Chinese capabilities, starting from pre-training.* We warmly welcome developers and researchers passionate about LLMs to join us.

### LaWGPT
- https://github.com/pengxiao-song/LaWGPT

LaWGPT is a series of open-source LLMs based on Chinese legal knowledge.

The series extends general Chinese base models (such as Chinese-LLaMA and ChatGLM) with a legal-domain vocabulary and large-scale Chinese legal corpus pre-training, strengthening the models' basic semantic understanding in the legal domain. On top of this, legal-domain dialogue QA datasets and a Chinese judicial examination dataset are used for instruction fine-tuning, improving the models' understanding and execution of legal content.

### Fuzi-Mingcha (夫子•明察) Judicial LLM
- https://github.com/irlab-sdu/fuzi.mingcha

Fuzi-Mingcha is a Chinese judicial LLM jointly developed by Shandong University, Inspur Cloud, and China University of Political Science and Law. Built on the ChatGLM base, it is trained on massive unsupervised Chinese judicial corpora (including various judgment documents, laws, and regulations) and supervised judicial fine-tuning data (including legal QA and similar-case retrieval). It supports statute retrieval, case analysis, syllogistic reasoning for judgments, and judicial dialogue, aiming to provide users with comprehensive and highly accurate legal consultation and answers.

### DISC-LawLLM
- https://law.fudan-disc.com
- https://github.com/FudanDISC/DISC-LawLLM
-
https://arxiv.org/abs/2309.11325

The Data Intelligence and Social Computing Lab of Fudan University (FudanDISC) has released DISC-LawLLM, a Chinese intelligent legal system driven by LLMs. The system can provide diverse legal services for different user groups. In addition, the lab built the evaluation benchmark DISC-Law-Eval to assess legal LLMs from both objective and subjective perspectives; in this evaluation the model shows clear advantages over existing legal LLMs.

The lab has also released DISC-Law-SFT, a high-quality supervised fine-tuning dataset of 300K examples, along with the model parameters and a technical report.

### LawBench
- https://github.com/open-compass/LawBench
- https://arxiv.org/abs/2309.16289

LawBench is carefully designed to precisely assess the legal capabilities of LLMs. In designing the test tasks, we simulated three dimensions of judicial cognition and selected 20 tasks to evaluate model abilities. Compared with some existing benchmarks that contain only multiple-choice questions, we include more task types closely related to real-world applications, such as legal entity recognition, reading comprehension, crime-amount calculation, and consultation. We recognize that the safety policies of current LLMs may refuse to respond to certain legal queries, or the models may struggle to understand instructions, leading to a lack of response. We therefore developed a separate evaluation metric, the "abstention rate", to measure how often a model refuses to answer or fails to understand the instruction correctly. We report the performance of 51 LLMs on LawBench, including 20 multilingual models, 22 Chinese models, and 9 law-specific LLMs.

### HK-O1aw
- https://github.com/HKAIR-Lab/HK-O1aw

HK-O1aw is a legal assistant designed to handle complex legal reasoning, specifically for the Hong Kong legal system. It is built using the Align-Anything framework and trained on the O1aw-Dataset, based on the LLaMA-3.1-8B model. The primary goal of HK-O1aw is to improve the reasoning and problem-solving abilities of large language models in the legal domain. Importantly, all training data, code, and prompts used for synthetic data generation have been open-sourced, facilitating research and collaboration within the community.

This model addresses the need for intelligent legal assistance in Hong Kong, where legal issues require in-depth analysis and precise reasoning. HK-O1aw integrates advanced O1-style reasoning capabilities, allowing it to perform complex legal analysis, understand context, identify precedents, and interpret statutes.
As the first complex reasoning model tailored for Hong Kong's common law system, it is particularly valuable for improving legal services and education.

### Lawyer LLaMA
- https://github.com/AndrewZhe/lawyer-llama

Lawyer LLaMA first undergoes continual pretraining on large-scale legal corpora, so that it systematically learns China's legal knowledge system. On that basis, we used ChatGPT to collect analyses of objective questions from China's National Unified Legal Professional Qualification Examination (the "legal exam") and answers to legal consultations, and used the collected data to instruction-tune the model, teaching it to apply legal knowledge to concrete scenarios.

Our model can:
- Master Chinese legal knowledge: it correctly understands legal concepts in common areas such as civil law, criminal law, administrative law, and procedural law. For example, having mastered the theory of the constitution of a crime in criminal law, it can identify the constitutive elements (subject, object, act, and mental state) from a factual description of a criminal case. Using the legal concepts and theories it has learned, the model can answer most questions in the legal exam fairly well.
- Apply them to Chinese legal practice: it can explain legal concepts in plain language and provide basic legal consultation covering marriage, lending, maritime, criminal, and other areas of law.
- To contribute to open research on Chinese legal LLMs, this project will release a series of legal-domain instruction-tuning data and the parameters of Chinese legal LLMs trained on LLaMA.

### LexiLaw
- https://github.com/CSHaitao/LexiLaw

LexiLaw is a fine-tuned Chinese legal LLM based on the ChatGLM-6B architecture. Through fine-tuning on legal-domain datasets, it achieves higher performance and professionalism in providing legal consultation and support.

The model aims to provide accurate and reliable legal consultation for legal practitioners, students, and ordinary users, whether you need advice on a specific legal problem or queries about legal provisions, case analysis, or interpretation of regulations.

We will also share our experience and best practices in fine-tuning on top of large models, to help the community develop more excellent Chinese legal LLMs and advance the intelligentization of Chinese law.

### LawGPT_zh: a Chinese Legal LLM (XieZhi, 獬豸)
- https://mp.weixin.qq.com/s/Pk4NdFQq5G6iZ3QmcyyFUg
- https://github.com/LiuHC0428/LAW-GPT

Our vision is that everyone facing a legal problem can get professional and reliable answers immediately. Only when professional legal services are truly within reach will people get used to using them, just like search engines twenty years ago and express delivery ten years ago. We hope to bring law into daily life and contribute to building a society under the rule of law. The project poster was generated by Midjourney.

This project releases a general Chinese legal model obtained by 16-bit LoRA instruction fine-tuning of ChatGLM-6B. The dataset includes existing legal QA datasets and high-quality legal QA constructed via self-instruct guided by statutes and real cases, improving general LLMs' performance in the legal domain as well as the reliability and professionalism of the model's answers.

### Linly (伶荔说): Chinese LLaMA 1-2, OpenLLaMA & Falcon LLMs
- https://github.com/CVI-SZU/Linly
- https://mp.weixin.qq.com/s/zSxsArP1pxYNubNDZua7iA
- https://mp.weixin.qq.com/s/AuAG3tw4JI8lHyLkSdM18g

This project provides the community with the Chinese dialogue model Linly-ChatFlow, the Chinese base models Chinese-LLaMA (1-2) and Chinese-Falcon, and their training data. The models are trained with full-tuning on the TencentPretrain framework. The Chinese base models use LLaMA and Falcon as the foundation and transfer their English language abilities to Chinese via incremental pre-training on Chinese and Chinese-English parallel corpora. Further, the project aggregates publicly available multilingual instruction data and performs large-scale instruction-following training on the Chinese models, producing the Linly-ChatFlow dialogue model.

In addition, the project releases Linly-OpenLLaMA models trained from scratch, in 3B, 7B, and 13B sizes, pre-trained on 1TB of Chinese and English corpora with a tokenizer optimized for Chinese (combining characters and words), released under the Apache 2.0 license.

### MediaGPT
-
https://github.com/IMOSR/MediaGPT

Although LLaMA models have shown impressive performance in general domains through instruction tuning, their capability in domains such as self-media content creation, livestreaming, and channel operations remains limited due to a lack of specialized training data. To address this, we propose MediaGPT, a model specially trained for the self-media domain.

MediaGPT (formerly Media LLaMA) first undergoes continual pre-training on large-scale self-media corpora to systematically learn the field's knowledge system. We then used ChatGPT to collect analyses and answers to domain questions about Douyin operations, short-video creation, Ocean Engine ad delivery, livestream operations, and livestream scripting techniques, and used this data to instruction-tune the model so that it learns to apply self-media knowledge to real scenarios.

Our model can:

1. Master self-media knowledge: it understands core concepts and strategies in Douyin operations, short-video creation, Ocean Engine ad delivery, livestream operations, and related areas.

2. Apply them in practice: it can explain self-media concepts in plain language and provide basic operational consulting covering content creation, platform operations, and ad delivery.

To promote open research on Chinese self-media LLMs, we will release a series of self-media instruction-tuning data and the parameters of Chinese self-media LLMs trained on LLaMA.

### CharacterGLM-6B
- https://github.com/thu-coai/CharacterGLM-6B
- https://arxiv.org/pdf/2311.16832.pdf

In this paper, we present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters. Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs. On top of CharacterGLM, we can customize various AI characters or social agents by configuring their attributes (identities, interests, viewpoints, experiences, achievements, social relationships, etc.) and behaviors (linguistic features, emotional expressions, interaction patterns, etc.). Our model outperforms most mainstream closed-source large language models, including the GPT series, especially in terms of consistency, human-likeness, and engagement according to manual evaluations.
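The attribute/behavior configuration described above amounts to turning a character profile into conditioning text. A hypothetical sketch (the field names and prompt template are illustrative, not CharacterGLM's actual input format):

```python
# Hypothetical sketch of assembling a character profile into a system prompt.
# CharacterGLM conditions on attributes and behaviors; the exact serialization
# it uses is not reproduced here.

def build_character_prompt(profile):
    attrs = "; ".join(f"{k}: {v}" for k, v in profile["attributes"].items())
    behav = "; ".join(f"{k}: {v}" for k, v in profile["behaviors"].items())
    return (f"You are {profile['name']}. "
            f"Attributes: {attrs}. Behaviors: {behav}. "
            f"Stay in character in every reply.")

profile = {
    "name": "Ming",
    "attributes": {"identity": "astronomy teacher", "interests": "stargazing"},
    "behaviors": {"linguistic features": "patient, uses analogies"},
}
prompt = build_character_prompt(profile)  # prepended to every conversation
```

The point of the design is that the same base model serves many characters: only this conditioning text changes, not the weights.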
We will release our 6B version of CharacterGLM and a subset of training data to facilitate further research in the direction of character-based dialogue generation.

### Haruhi-Zero
- https://github.com/LC1332/Zero-Haruhi
- https://huggingface.co/silk-road/Haruhi-Zero-7B-0_3

Haruhi-Zero (凉宫春日-Zero) is a role-playing model that supports both zero-shot character construction and RAG-based character construction (formerly ChatHaruhi).

### Translational-Style-ChatLLM
- https://github.com/Benson114/Translational-Style-ChatLLM

This is a small personal fine-tuning experiment producing a "translated-fiction" (western translationese) chat style: the instruction fine-tuning dataset was constructed entirely through prompt engineering and calls to the OpenAI API, the base model is Qwen1.5-7B-Chat, and fine-tuning uses the open-source framework LLaMA-Factory.

The project includes the full pipeline code for dataset construction and the fine-tuning scripts.

### StyleLLM
- https://github.com/stylellm/stylellm_models

stylellm is a text style transfer project based on LLMs. It uses LLMs to learn the writing style of specified literary works (habitual vocabulary, sentence structure, rhetorical devices, character dialogue, etc.), yielding a series of style-specific models.

A stylellm model can transplant the learned style onto other general text: given an input passage, the model rewrites it and outputs text with that style, achieving polishing, embellishment, or style imitation.

### Tianji (来事儿AI)
- https://github.com/SocialAI-tianji/Tianji

Tianji is a free, non-commercial AI system built by SocialAI. You can use it for tasks involving traditional Chinese social etiquette, such as how to propose toasts, how to say nice things, and how to handle social situations gracefully, to improve your emotional intelligence and core competitiveness. We firmly believe that social savvy is the core technology of future AI, and only socially adept AI has a chance to reach AGI. Let us witness the arrival of general artificial intelligence together. "The secrets of heaven must not be revealed."

### TinyStories
- https://github.com/Mxoder/TinyStories

This project uses the Hugging Face API to write pre-training code for a large (tiny) model, i.e., pre-training with the Trainer. Since it is just practice, it picks an extremely small model plus a small dataset. To stay close to the mainstream, it pre-trains a LLaMA 3, albeit an ultra-mini version under 20M parameters.

It reproduces Microsoft's TinyStories work, which explores how small a language model can be while still telling stories fluently; the work is direct and fun, and it fits this practice goal well.

### Higgs-Llama-3-70B
- https://huggingface.co/bosonai/Higgs-Llama-3-70B
- https://boson.ai/higgs-opensource/

We perform supervised fine-tuning with our in-house instruction-following and chat datasets. Afterwards, we construct preference pairs with a semi-automated pipeline that relies on both human labelers and our private LLMs. We conduct iterative preference optimization to align the model. During alignment, we adopted a special strategy to align the model's behavior with the system message.
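The blog does not specify the exact preference-optimization objective; as one concrete illustration of what optimizing over (chosen, rejected) pairs can look like, here is the standard DPO loss on a single pair, computed from policy and reference log-probabilities (DPO is an assumption here, not necessarily the Higgs recipe):

```python
# DPO loss for one preference pair: -log sigmoid(beta * margin), where the
# margin compares how much more the policy prefers the chosen response over
# the rejected one, relative to a frozen reference model.
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy sequence log-probs: the policy already prefers the chosen response more
# than the reference does, so the margin is positive and the loss is below
# log(2) (the value at zero margin).
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0)
```

In an iterative scheme, new pairs are sampled from the current policy each round and the optimization is repeated, which is what "iterative preference optimization" refers to.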
Compared with other instruct models, Higgs models follow their roles more closely.

### persona-hub
- https://github.com/tencent-ailab/persona-hub
- https://arxiv.org/pdf/2406.20094

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.

### Peach-9B-8k-Roleplay
- https://huggingface.co/ClosedCharacter/Peach-9B-8k-Roleplay

Peach-9B-8k-Roleplay is a chat large language model obtained by fine-tuning the 01-ai/Yi-1.5-9B model on more than 100K conversations created through our data synthesis approach.

### Hermes 3
- https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf
- https://nousresearch.com/hermes3/

Hermes 3 contains advanced long-term context retention and multi-turn conversation capability, complex roleplaying and internal monologue abilities, and enhanced agentic function-calling. Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner.
Hermes 3 was created by fine-tuning Llama 3.1 8B, 70B, and 405B, and training on a dataset of primarily synthetically generated responses. The model boasts comparable or superior performance to Llama 3.1 while unlocking deeper capabilities in reasoning and creativity.

### SkyReels (short drama)
- https://github.com/vaew/skyscript-100m
- https://skyreels.ai/beta

Generating high-quality shooting scripts containing information such as scene and shot language is essential for short drama script generation. We collect 6,660 popular short dramas from the Internet, each with an average of 100 short episodes, about 80,000 episodes in total, with a total duration of about 2,000 hours and totaling 10 terabytes (TB). We perform keyframe extraction and annotation on each episode to obtain about 10,000,000 shooting scripts. We perform 100 script restorations on the extracted shooting scripts based on our self-developed large short drama generation model, SkyReels. This leads to a dataset containing 1,000,000,000 pairs of scripts and shooting scripts for short dramas, called SkyScript-100M. We compare SkyScript-100M with existing datasets in detail and demonstrate some deeper insights that can be achieved based on SkyScript-100M.
Based on SkyScript-100M, researchers can achieve several deeper and more far-reaching script optimization goals, which may drive a paradigm shift in the entire field of text-to-video and significantly advance the field of short drama video generation.

### MeChat (Mental Health Support Chatbot)
- https://github.com/qiuhuachuan/smile
- https://huggingface.co/qiuhuachuan/MeChat
- https://mechat.fly.dev/

Our vision is that everyone facing mental health problems can receive timely and effective listening and support. We believe mental health is everyone's right, not a luxury. Our mission is to provide equal, comprehensive, and accessible mental health services, no matter where people are or what challenges they face. Our vision also includes raising society's awareness and understanding of mental health issues, breaking the stigma and discrimination around them, and contributing to a healthier, more inclusive, and more equal society. The project poster is taken from flaticon.

### MedicalGPT
- https://github.com/shibing624/MedicalGPT

MedicalGPT trains a medical LLM with a pipeline that includes continued pre-training, supervised fine-tuning, reward modeling, and reinforcement learning.

Following the ChatGPT training pipeline, this project implements four-stage training of a domain (medical) model:

Stage 1: PT (Continue PreTraining), incremental pre-training of the GPT model on massive domain documents to inject domain knowledge.

Stage 2: SFT (Supervised Fine-tuning), constructing an instruction fine-tuning dataset and instruction-tuning the pre-trained model to align with instruction intent.

Stage 3: RM (Reward Model), constructing a human-preference ranking dataset and training a reward model to align with human preferences, mainly the "HHH" principle: helpful, honest, harmless.

Stage 4: RL (Reinforcement Learning from Human Feedback, RLHF), using the reward model to train the SFT model; the generation model updates its policy using rewards or penalties so as to generate higher-quality text that better matches human preferences.

### MING (明医): Chinese Medical Consultation LLM (formerly MedicalGPT-zh)
- https://github.com/MediaBrain-SJTU/MING

**MedicalGPT-zh**
This project released a general Chinese medical model based on 16-bit LoRA instruction fine-tuning of ChatGLM-6B. From Chinese medical consensus and clinical guideline texts covering 28 departments, we generated a high-quality instruction dataset with broader medical knowledge coverage and more precise answers.

**MING (明医)**
This project releases MING, a Chinese medical consultation model fine-tuned on medical instructions.

### OpenKG-KnowLLM
- https://github.com/zjunlp/KnowLLM

Knowledgeable Large Language Model Series.

With the rapid development of deep learning technology, large language models such as ChatGPT have achieved significant success in the field of natural language processing. However, these large models still face some challenges and issues in learning and understanding knowledge, including the difficulty of knowledge updating, and issues with potential errors and biases within the model, known as knowledge fallacies.
The Deep Model series aims to release a series of open-source large models to mitigate these knowledge fallacy issues. The first phase of this project released a knowledge extraction large model based on LLaMA, named Zhishi. To provide Chinese capabilities without disrupting the original model's distribution, we first (1) use Chinese corpora for the full-scale pre-training of LLaMA (13B), in order to improve the model's understanding of Chinese and its knowledge reserve as much as possible while retaining its original English and code capabilities; then (2) we fine-tune the model from the first step using an instruction dataset, to enhance the language model's understanding of human extraction instructions.

### OpenMEDLab (浦医)
- https://github.com/OpenMEDLab
- https://github.com/openmedlab/PULSE
- https://stcsm.sh.gov.cn/xwzx/kjzl/20230630/c783c30d8e62494e83073535f841675f.html

OpenMEDLab is an open-source platform to share medical foundation models in multiple modalities, e.g., medical imaging, medical NLP, bioinformatics, protein, etc. It targets promoting novel approaches to long-tail problems in medicine, and meanwhile, it seeks solutions to achieve lower cost, higher efficiency, and better generalizability in training medical AI models. The new learning paradigm of adapting foundation models to downstream applications makes it possible to develop innovative solutions for cross-domain and cross-modality diagnostic tasks efficiently.
OpenMEDLab is distinguished by several features:
- World's first open-source platform for medical foundation models.
- 10+ medical data modalities targeting a variety of clinical and research problems.
- Pioneering works on the new learning paradigm using foundation models, including pre-trained models, code, and data.
- Multiple released sets of medical data for pre-training and downstream applications.
- Collaboration with top medical institutes and facilities.

### PromptCLUE
- https://github.com/clue-ai/PromptCLUE

PromptCLUE: a large-scale, multi-task, prompt-based pre-trained Chinese open-source model.

Three unifications for Chinese: a unified model framework, a unified task format, and a unified application method.

It supports dozens of different task types and has good zero-shot and few-shot learning capabilities. For understanding tasks such as classification, sentiment analysis, and extraction, the label scheme can be customized; for generation tasks, free-form sampling is supported.

Pre-trained at scale on 100 billion Chinese tokens, learning 1.5 trillion Chinese tokens in total, and trained on hundreds of millions of Chinese task examples covering more than 150 training tasks. Compared with the base version, it improves average task performance by 7+ points and has better understanding, generation, and extraction abilities, supporting text rewriting, error correction, and knowledge-graph QA.

### SkyText-Chinese-GPT3
- https://github.com/SkyWorkAIGC/SkyText-Chinese-GPT3

SkyText is a Chinese GPT-3 pre-trained large model released by Singularity AI, capable of chatting, QA, Chinese-English translation, and other tasks. Beyond basic chat, dialogue, and QA, it supports Chinese-English translation, content continuation, couplet writing, classical poetry, recipe generation, third-person paraphrasing, interview question generation, and more.

### ShenNong-TCM-LLM
- https://github.com/michael-wzhu/ShenNong-TCM-LLM

To advance LLMs in traditional Chinese medicine (TCM), improve their TCM knowledge and ability to answer medical consultations, and empower the inheritance of TCM with large models, we present the ShenNong TCM LLM:

🚀 ShenNong-TCM:
- The training data is the TCM instruction dataset ShenNong_TCM_Dataset.
- ShenNong_TCM_Dataset is based on our open-source TCM knowledge graph;
- It uses an entity-centric self-instruct method, calling ChatGPT to obtain 110K+ TCM-centered instruction examples;
- The ShenNong-TCM model also uses LLaMA as the base and is fine-tuned with LoRA (rank=16). The fine-tuning code is the same as in the ChatMed repository.

### TableGPT
- https://github.com/ZJU-M3/TableGPT-techreport

TableGPT is specifically designed for table analysis. By unifying tables, natural language, and commands into one model, TableGPT comprehends tabular data, understands user intent through natural language, dissects the desired actions, and executes external commands on the table. It subsequently returns the processed results in both tabular and textual explanations to the user.
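The "dissect the intent, then execute external commands on the table" pattern can be sketched with a toy executor. The command names and dispatch scheme below are hypothetical, not TableGPT's actual interface:

```python
# Toy command executor: an LLM would map "show city A's readings, lowest
# first" to a sequence of structured commands; here the commands are given
# directly, and each one transforms the (header, rows) table.

def execute(table, command, **kwargs):
    header, rows = table
    if command == "filter":
        col = header.index(kwargs["column"])
        rows = [r for r in rows if r[col] == kwargs["equals"]]
    elif command == "sort":
        col = header.index(kwargs["column"])
        rows = sorted(rows, key=lambda r: r[col])
    else:
        raise ValueError(f"unknown command: {command}")
    return header, rows

table = (["city", "pm25"], [["B", 40], ["A", 12], ["A", 30]])
table = execute(table, "filter", column="city", equals="A")
table = execute(table, "sort", column="pm25")
header, rows = table
```

The model's job in such a system is intent parsing and command planning; the deterministic executor keeps the actual table manipulation exact and auditable.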
This novel approach simplifies the way users engage with table data, bringing an intuitive feel to data analysis.

### TransGPT (致远)
- https://mp.weixin.qq.com/s/WvzyjHqI0lOGIyPlCIFNQg
- https://github.com/DUOMO/TransGPT

TransGPT is trained on about 346K transportation-domain text records (for in-domain pre-training) and 58K transportation dialogue records (for fine-tuning), and can support integration with real-time apps (maps, public transit, etc.). TransGPT has been open-sourced; the resources are fully open for academic research, and free commercial use is possible simply by applying via email and obtaining an official commercial license.

Unlike general-purpose multimodal transportation models, TransGPT focuses on delivering practical value in real traffic scenarios, including traffic prediction, intelligent consulting assistants, public transportation services, transportation planning and design, traffic safety education, management assistance, traffic accident reporting and analysis, and autonomous-driving assistance.

### UrbanGPT
- https://urban-gpt.github.io/
- https://github.com/HKUDS/UrbanGPT
- https://arxiv.org/abs/2403.00813
- https://sites.google.com/view/chaoh/home

In this work, we present a spatio-temporal large language model that can exhibit exceptional generalization capabilities across a wide range of downstream urban tasks. To achieve this objective, we present UrbanGPT, which seamlessly integrates a spatio-temporal dependency encoder with the instruction-tuning paradigm. This integration enables large language models (LLMs) to comprehend the complex inter-dependencies across time and space, facilitating more comprehensive and accurate predictions under data scarcity.
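At a shape level, coupling a spatio-temporal encoder with instruction tuning means encoding the series into vectors that sit in the LLM's input alongside the instruction tokens. A toy sketch (the featurization and dimensions below are illustrative stand-ins for UrbanGPT's learned encoder, not its actual architecture):

```python
# Shape-level sketch: the time series is mapped to one d_model-dim vector per
# step, and those vectors are prepended to the instruction-token embeddings.

def encode_series(series, d_model=4):
    """Toy 'encoder': value and first difference per step, zero-padded.
    A learned spatio-temporal encoder would replace this."""
    out, prev = [], series[0]
    for x in series:
        out.append([x, x - prev, 0.0, 0.0][:d_model])
        prev = x
    return out

def build_llm_input(instruction_embeddings, series):
    # spatio-temporal embeddings first, then the instruction tokens
    return encode_series(series) + instruction_embeddings

# 3 instruction-token embeddings (zeros as placeholders) + a 3-step series
inp = build_llm_input([[0.0, 0.0, 0.0, 0.0]] * 3, [10.0, 12.0, 11.0])
```

Feeding the series as continuous embeddings rather than digits in text is what lets the LLM attend over temporal structure directly, which matters most in the zero-shot, data-scarce settings the paper targets.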
Extensive experimental findings highlight the potential of building LLMs for spatio-temporal learning, particularly in zero-shot scenarios.

### TechGPT
- https://mp.weixin.qq.com/s/nF1He7jhAHfh7PzhjqHoZg
- https://huggingface.co/neukg/TechGPT-7B
- https://github.com/neukg/TechGPT

On June 26, 2023, the Knowledge Graph Group of Northeastern University officially released the large language model TechGPT.

The name TechGPT mainly comes from TechKG, the group's large-scale Chinese academic multi-domain knowledge base released in 2018.

Compared with other current large models, TechGPT mainly strengthens various information extraction tasks centered on knowledge graph construction, such as relation triple extraction; various intelligent QA tasks centered on logical reasoning, such as machine reading comprehension; and various sequence generation tasks centered on text understanding, such as keyword generation.

Within these three core NLP capabilities, TechGPT can also process natural-language text in more than ten vertical domains, including computer science, materials, machinery, metallurgy, finance, and aerospace.

### TigerBot
- https://github.com/TigerResearch/TigerBot

TigerBot is a multilingual, multi-task large language model (LLM). By automatic evaluation on public NLP datasets following the OpenAI InstructGPT paper, TigerBot-7B reaches 96% of the overall performance of the same-sized OpenAI model, and this is only our MVP. We open-source the following:

- Models: TigerBot-7B, TigerBot-7B-base, TigerBot-180B (research version).
- Code: basic training and inference code, including quantization and inference code for running the 180B model on two GPUs.
- Data: 100 GB of pre-training data, filtered, denoised, and deduplicated from 2 TB; 1 GB (about 1 million examples) of supervised fine-tuning data, proportionally covering 10 major categories and 120 subcategories of common user instructions.
- API: chat, plugin, and finetune APIs, so users can train and use their own large model and data without code within half an hour.
- Domain data: covering finance, law, and encyclopedias; we invite LLM application developers to build world-class applications together.

On top of BLOOM, we made the following optimizations to the architecture and algorithms:

- A novel algorithm for instruction-completion supervised fine-tuning, for better learnability.
- Ensemble and probabilistic modeling methods for more controllable factuality and generativeness.
- For parallel training, we resolved several memory and communication problems in mainstream frameworks such as DeepSpeed, enabling months of uninterrupted training on thousand-GPU clusters.
- For the more irregular distribution of the Chinese language, we made better-suited algorithmic optimizations from the tokenizer to the training procedure.

### XVERSE-13B
- https://github.com/xverse-ai/XVERSE-13B

XVERSE-13B is a multilingual large language model independently developed by Shenzhen Yuanxiang Technology. Its main features are:

- Model architecture: XVERSE-13B uses the mainstream decoder-only standard Transformer architecture and supports an 8K context length, the longest among models of the same size, enabling longer multi-turn dialogue, knowledge QA, and summarization, and broadening application scenarios.
- Training data: the model is thoroughly trained on 1.4 trillion tokens of high-quality, diverse data covering more than 40 languages, including Chinese, English, Russian, and Spanish. Fine-grained sampling ratios for different data types give it excellent performance in Chinese and English while still covering other languages.
- Tokenization: based on the BPE (Byte-Pair Encoding) algorithm, a tokenizer with a vocabulary of 100,278 was trained on hundreds of gigabytes of corpora, supporting multiple languages without additional vocabulary extension.
- Training framework: several key technologies were independently developed, including efficient operators, memory optimization, parallel scheduling strategies, data-computation-communication overlap, and platform-framework synergy, making training more efficient and the model more stable, with a peak compute utilization of 58.5% on a thousand-GPU cluster, among the best in the industry.

### YuLan-Chat & YuLan-Chat-2
- https://github.com/RUC-GSAI/YuLan-Chat
-
https://huggingface.co/yulan-team
- https://mp.weixin.qq.com/s/nPS4N3stAAG_51fnZANbMA

**YuLan-Chat**
A research team at the Gaoling School of Artificial Intelligence, Renmin University of China (jointly advised by several faculty members) has conducted a series of studies on instruction-tuning techniques and released the school's first large dialogue model, YuLan-Chat, aimed at exploring and improving the bilingual (Chinese-English) dialogue capabilities of LLMs.

We open-sourced the 13B and 65B YuLan-Chat model files and related code, using quantization so that they can be deployed on a single RTX 3090-24G and a single A800-80G GPU, respectively. YuLan-Chat is based on the LLaMA base model and fine-tuned on carefully optimized, high-quality mixed Chinese-English instructions; YuLan-Chat-65B currently significantly outperforms existing open-source models on Chinese and English evaluation datasets. We will continue to optimize the instruction-tuning method and the base model and keep updating YuLan-Chat.

**YuLan-Chat-2**
After releasing the first version of YuLan-Chat in June 2023, the team continued to explore pre-training and instruction tuning, and trained on top of LLaMA-2 a new base model, YuLan-LLaMA-2-13B, and a dialogue model, YuLan-Chat-2-13B. These models extend the Chinese vocabulary and the context length (to 8k) of the original LLaMA-2 and use large-scale Chinese and English data for incremental pre-training and instruction tuning, improving basic Chinese and English semantics and understanding. Compared with the previous generation, the new models achieve significant improvements, and they also have clear performance advantages over other contemporaneous LLaMA-2-based models. After quantization, the model can be deployed on a single RTX 3090-24G GPU.

### Ziya-LLaMA
- https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1
- https://github.com/IDEA-CCNL/Fengshenbang-LM
- https://mp.weixin.qq.com/s/IeXgq8blGoeVbpIlAUCAjA

Ziya-LLaMA-13B-v1 (姜子牙 V1) is a large-scale pre-trained model based on LLaMA with 13 billion parameters. It can perform tasks such as translation, programming, text classification, information extraction, summarization, copywriting, common-sense Q&A, and mathematical calculation. Ziya-LLaMA-13B-v1 has undergone three stages of training: large-scale continual pre-training (PT), multi-task supervised fine-tuning (SFT), and human feedback learning (RM, PPO).

### FLM-101B
- https://arxiv.org/pdf/2309.03852.pdf
- https://huggingface.co/CofeAI/FLM-101B

FLM-101B is an open-source decoder-only LLM with 101 billion parameters. During the training process, a model-growth technique was employed: the model rapidly acquires knowledge at a small scale (16B) in the early stages of training and gradually grows to 101B, resulting in cost-effective 100B-scale LLM training (costing approximately $100,000).
FLM-101B supports both Chinese and English. It has a context window length of 2048 during training; thanks to xPos rotary position embedding, the window can be efficiently extended during inference.\n\nTo advance the development of 100B-scale Large Language Models (LLMs), FLM-101B has now been fully open-sourced.\n\n### MindChat (漫谈): Mental Health LLM\n- https://github.com/X-D-Lab/MindChat\n\nMindChat (漫谈) is a mental-health large model that aims to help people relieve psychological stress and resolve psychological confusion along four dimensions: counseling, assessment, diagnosis, and treatment, improving overall mental health. By creating a relaxed, open conversational environment for unwinding, sharing feelings, or exchanging experiences, MindChat builds a relationship of trust and understanding with its users. MindChat hopes to offer a private, warm, safe, timely, and convenient conversational environment that helps users overcome difficulties and challenges and achieve self-growth and development.\n\nWhether at work or in personal life, MindChat hopes to combine professional psychological knowledge with large-model technology to provide comprehensive, around-the-clock psychological support and treatment assistance under strict privacy protection, contributing to a healthier, more inclusive, and more equal society.\n\n### WiNGPT\n- https://github.com/winninghealth/WiNGPT2\n\nWiNGPT is a GPT-based large model for the medical vertical domain. It aims to integrate professional medical knowledge, medical information, and data, providing intelligent medical Q&A, diagnostic support, and medical-knowledge services for the healthcare industry to improve diagnostic efficiency and the quality of medical services.\n\n### CareGPT\n- https://github.com/WangRongsheng/CareGPT\n\nCareGPT is a medical large language model that also aggregates dozens of publicly available medical fine-tuning datasets and openly available medical LLMs, covering LLM training, evaluation, and deployment to accelerate the development of medical LLMs.\n\n### Sunsimiao (孙思邈)\n- https://github.com/thomas-yanxin/Sunsimiao\n\nThe Chinese medical model Sunsimiao (孙思邈) hopes to follow the life path of its namesake physician: valuing folk medical experience, continuously accumulating Chinese medical data, and imparting that data to the model, committed to providing a safe, reliable, and inclusive Chinese medical LLM.\n\nCurrently, Sunsimiao is fine-tuned from the baichuan-7B and ChatGLM-6B series on hundreds of thousands of high-quality Chinese medical data points; we will collect more data, expand the model's capabilities, and keep iterating. Details of this work are being written up; stay tuned.\n\n### MolGen (Drug Discovery)\n- https://github.com/zjunlp/Mol-Instructions\n\nMol-Instructions comprises three cardinal components:\n\n🔬 Molecule-oriented instructions: This component delves into the world of small molecules, emphasizing their inherent properties and behaviors. It sheds light on the fundamental challenges of diverse chemical reactions and molecular design, with 148.4K instructions across six tasks.\n\n🧬 Protein-oriented instructions: Rooted in the biosciences, this component presents 505K instructions across five distinct categories of tasks. 
These tasks aim to predict the structure, function, and activity of proteins, and facilitate protein design based on textual directives.\n\n🥼 Biomolecular text instructions: Predominantly designed to cater to NLP tasks within the fields of bioinformatics and chemoinformatics, this part encapsulates six information extraction and Q&A tasks represented through 53K instructions.\n\n### Multilingual Medicine\n- https://github.com/FreedomIntelligence/Apollo/tree/main\n\nDespite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. In particular, Apollo-7B is state-of-the-art among multilingual medical LLMs of up to 70B parameters. Additionally, these lite models can be used to improve the multilingual medical capabilities of larger models without fine-tuning, in a proxy-tuning fashion. We will open-source the training corpora, code, model weights, and evaluation benchmark.\n\n### Sequel\n- https://github.com/SequelHQ/Sequel\n\nSequel is an open-source software application meticulously designed to be your ultimate companion in taking control of your health through personalized nutrition. By leveraging our cutting-edge platform, users can effortlessly analyze lab reports, track supplement and nutrient intake, and access a comprehensive library of evidence-based information. 
Our mission is to empower you with the tools and knowledge necessary to make informed decisions about your well-being, guiding you towards a healthier, longer life.\n\n### Gene editing\n- https://www.biorxiv.org/content/10.1101/2024.04.22.590591v1.full.pdf\n\nGene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI) enabled design provides a powerful alternative with potential to bypass evolutionary constraints and generate editors with optimal properties. Here, using large language models (LLMs) trained on biological diversity at scale, we demonstrate the first successful precision editing of the human genome with a programmable gene editor designed with AI. To achieve this goal, we curated a dataset of over one million CRISPR operons through systematic mining of 26 terabases of assembled genomes and metagenomes. We demonstrate the capacity of our models by generating 4.8x the number of protein clusters across CRISPR-Cas families found in nature and tailoring single-guide RNA sequences for Cas9-like effector proteins. Several of the generated gene editors show comparable or improved activity and specificity relative to SpCas9, the prototypical gene editing effector, while being 400 mutations away in sequence. Finally, we demonstrate that an AI-generated gene editor, denoted OpenCRISPR-1, exhibits compatibility with base editing. We release OpenCRISPR-1 publicly to facilitate broad, ethical usage across research and commercial applications.\n\n### Llama-3-8B-UltraMedical\n- https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical\n\nLlama-3-8B-UltraMedical is an open-access large language model (LLM) specialized in biomedicine. 
Developed by the Tsinghua C3I Lab, this model aims to enhance medical examination access, literature comprehension, and clinical knowledge.\n\n### PH-LLM\n- https://research.google/blog/advancing-personal-health-and-wellness-insights-with-ai/\n- https://arxiv.org/abs/2406.06474\n- https://arxiv.org/abs/2406.06464\n\nThe Personal Health Large Language Model (PH-LLM) is a fine-tuned version of Gemini, designed to generate insights and recommendations to improve personal health behaviors related to sleep and fitness patterns. By using a multimodal encoder, PH-LLM is optimized for both textual understanding and reasoning as well as interpretation of raw time-series sensor data such as heart rate variability and respiratory rate from wearables.\n\n### ProLLM\n- https://github.com/MingyuJ666/ProLLM\n- https://arxiv.org/html/2405.06649v1\n\nThe prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases. Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions, ignoring the broader context of nonphysical connections through intermediate proteins, thus limiting their effectiveness. The emergence of Large Language Models (LLMs) provides a new opportunity for addressing this complex biological challenge. By transforming structured data into natural language prompts, we can map the relationships between proteins into texts. This approach allows LLMs to identify indirect connections between proteins, tracing the path from upstream to downstream. Therefore, we propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time. Specifically, we propose Protein Chain of Thought (ProCoT), which replicates the biological mechanism of signaling pathways as natural language prompts. 
ProCoT considers a signaling pathway as a protein reasoning process, which starts from upstream proteins and passes through several intermediate proteins to transmit biological signals to downstream proteins. Thus, we can use ProCoT to predict the interaction between upstream proteins and downstream proteins. The training of ProLLM employs the ProCoT format, which enhances the model’s understanding of complex biological problems. In addition to ProCoT, this paper also contributes to the exploration of embedding replacement of protein sites in natural language prompts, and instruction fine-tuning in protein knowledge datasets. We demonstrate the efficacy of ProLLM through rigorous validation against benchmark datasets, showing significant improvement over existing methods in terms of prediction accuracy and generalizability. Our results highlight the potential of LLMs to transform the field of PPI, serving as a robust potential tool for various categories of biological and medical research. \n\n### MolecularGPT\n- https://arxiv.org/abs/2406.12950\n- https://github.com/NYUSHCS/MolecularGPT\n\nMolecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize for unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From a perspective on instruction tuning, we fine-tune large language models (LLMs) based on curated molecular instructions spanning over 1000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning through zero- and few-shot in-context learning (ICL). 
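Zero- and few-shot in-context learning of this kind comes down to prompt construction: labeled exemplars are prepended to the query molecule and the model completes the answer. A minimal sketch under assumed formatting (the task wording, SMILES strings, and `build_fewshot_prompt` helper are illustrative, not MolecularGPT's actual template):

```python
# Assemble a k-shot prompt for a molecular property prediction task.
# Each exemplar pairs a SMILES string with its label; the query molecule
# gets no label, and the language model completes it in context.

def build_fewshot_prompt(task, exemplars, query_smiles):
    lines = [f"Task: {task}"]
    for smiles, label in exemplars:
        lines.append(f"SMILES: {smiles}\nAnswer: {label}")
    lines.append(f"SMILES: {query_smiles}\nAnswer:")  # model fills this in
    return "\n\n".join(lines)

prompt = build_fewshot_prompt(
    task="Is the molecule blood-brain-barrier permeable? Answer Yes or No.",
    exemplars=[("CCO", "Yes"), ("CC(=O)Oc1ccccc1C(=O)O", "Yes")],  # 2-shot
    query_smiles="C1=CC=CC=C1",
)
print(prompt)
```

With zero exemplars the same template degenerates to a zero-shot prompt, which is what makes a single instruction-tuned model adaptable to new tasks without weight updates.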
MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also surpasses state-of-the-art LLM baselines by up to a 16.6% increase in classification accuracy and a 199.17 decrease in regression metrics (e.g., RMSE) in the zero-shot setting. This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. \n\n### CHIEF (Clinical Histopathology Imaging Evaluation Foundation)\n- https://www.nature.com/articles/s41586-024-07894-z\n- https://scitechdaily.com/96-accuracy-harvard-scientists-unveil-revolutionary-chatgpt-like-ai-for-cancer-diagnosis/\n- https://hms.harvard.edu/news/new-artificial-intelligence-tool-cancer\n\nHistopathology image evaluation is indispensable for cancer diagnoses and subtype classification. Standard artificial intelligence methods for histopathology image analyses have focused on optimizing specialized models for each diagnostic task. Although such methods have achieved some success, they often have limited generalizability to images generated by different digitization protocols or samples collected from different populations. Here, to address this challenge, we devised the Clinical Histopathology Imaging Evaluation Foundation (CHIEF) model, a general-purpose weakly supervised machine learning framework to extract pathology imaging features for systematic cancer evaluation. CHIEF leverages two complementary pretraining methods to extract diverse pathology representations: unsupervised pretraining for tile-level feature identification and weakly supervised pretraining for whole-slide pattern recognition. We developed CHIEF using 60,530 whole-slide images spanning 19 anatomical sites. 
Through pretraining on 44 terabytes of high-resolution pathology imaging datasets, CHIEF extracted microscopic representations useful for cancer cell detection, tumour origin identification, molecular profile characterization and prognostic prediction. We successfully validated CHIEF using 19,491 whole-slide images from 32 independent slide sets collected from 24 hospitals and cohorts internationally. Overall, CHIEF outperformed the state-of-the-art deep learning methods by up to 36.1%, showing its ability to address domain shifts observed in samples from diverse populations and processed by different slide preparation methods. CHIEF provides a generalizable foundation for efficient digital pathology evaluation for patients with cancer.\n\n### HuatuoGPT-o1\n- https://github.com/FreedomIntelligence/HuatuoGPT-o1\n\nHuatuoGPT-o1 is a medical LLM designed for advanced medical reasoning. It can identify mistakes, explore alternative strategies, and refine its answers. \n\n### Baichuan-14B-M1\n- https://github.com/baichuan-inc/Baichuan-M1-14B\n\nBaichuan-14B-M1, developed by Baichuan Intelligence, is the industry's first open-source large language model built from scratch and optimized for medical scenarios. Alongside excellent general capabilities, it delivers strong performance in the medical domain: it matches same-sized models on most general benchmarks, while in medical scenarios it matches models five times its size or even larger.\n\n### MedFound\n- https://github.com/medfound/medfound\n\nAccurate diagnosis is crucial in healthcare. Here, we introduce MedFound, a medical large language model (Medical LLM) pretrained on medical text and real-world clinical records. We fine-tuned MedFound using a self-bootstrapping strategy to learn diagnostic reasoning and incorporated a preference alignment framework to align with standard clinical practice. Our approach results in MedFound-DX-PA, an LLM-based diagnostic system that aligns with clinical requirements. 
This repository contains the code used for data preprocessing, model development, and evaluation in our study (A Generalist Medical Language Model for Disease Diagnosis Assistance).\n\n### Taiyi (太一)\n- https://github.com/DUTIR-BioNLP/Taiyi-LLM\n- https://arxiv.org/abs/2311.11608\n\nWith the rapid development of deep learning, large language models like ChatGPT have made remarkable progress in natural language processing. In biomedicine, large language models can facilitate communication between doctors and patients, provide useful medical information, and hold great potential for assisted diagnosis and treatment, biomedical knowledge discovery, drug development, and personalized care. However, relatively few open-source biomedical large models exist in the AI community, and most focus on monolingual (Chinese or English) medical Q&A dialogue. This project therefore studies large models for the biomedical domain and releases an initial bilingual Chinese-English biomedical model, Taiyi (太一), aiming to explore the bilingual multi-task capabilities of large models in biomedicine.\n\n### MedAgents\n- https://github.com/gersteinlab/MedAgents\n- https://arxiv.org/pdf/2311.10537.pdf\n\nWe propose a Multi-disciplinary Collaboration (MC) framework. The framework works in five stages: (i) expert gathering: gather experts from distinct disciplines according to the clinical question; (ii) analysis proposition: domain experts put forward their own analysis with their expertise; (iii) report summarization: compose a summarized report on the basis of a previous series of analyses; (iv) collaborative consultation: engage the experts in discussions over the summarized report. The report will be revised iteratively until an agreement from all the experts is reached; (v) decision making: derive a final decision from the unanimous report.\n\n### Molecule Optimization\n- https://github.com/blazerye/DrugAssist\n- https://arxiv.org/abs/2401.10334\n\nRecently, the impressive performance of large language models (LLMs) on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. 
These non-interactive approaches overlook the fact that the drug discovery process is actually one that requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model which performs optimization through human-machine dialogue by leveraging the LLM's strong interactivity and generalizability. DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called MolOpt-Instructions for fine-tuning language models on molecule optimization tasks. \n\n### MolTC\n- https://github.com/MangoKiller/MolTC\n- https://arxiv.org/abs/2402.03781\n\nMolecular Relational Learning (MRL), aiming to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, the adoption of large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, has emerged as a promising way for efficient and effective MRL. Despite their potential, these methods predominantly rely on textual data, thus not fully harnessing the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates the issue of information underutilization, as it hinders the sharing of interaction mechanisms learned across diverse datasets. To address these challenges, this work proposes a novel LLM-based multi-modal framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory, termed MolTC, which effectively integrates the graphical information of a molecule pair. To achieve unified MRL, MolTC innovatively develops a dynamic parameter-sharing strategy for cross-dataset information sharing. 
Moreover, to train MolTC efficiently, we introduce a Multi-hierarchical CoT concept to refine its training paradigm, and construct a comprehensive Molecular Interactive Instructions dataset for the development of biochemical LLMs involving MRL. Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines.\n\n### Mol-Instructions\n- https://arxiv.org/pdf/2306.08018.pdf\n- https://github.com/zjunlp/Mol-Instructions\n\nLarge Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. 
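Instruction-tuning datasets of this kind are typically stored as (instruction, input, output) records that are flattened into prompt/target pairs at training time. A sketch of what one molecule-oriented entry might look like (the field names and example values are illustrative, not Mol-Instructions' exact schema):

```python
import json

# A hypothetical instruction-tuning record for a small-molecule task.
record = {
    "instruction": "Describe one notable property of the given molecule.",
    "input": "CCO",  # ethanol, as a SMILES string
    "output": "It is a simple alcohol that is miscible with water.",
}

# During fine-tuning, each record is usually flattened into a single
# prompt (what the model sees) and target (what it must generate).
prompt = f"{record['instruction']}\n{record['input']}\n"
target = record["output"]
print(json.dumps(record, indent=2))
```

Keeping instruction, input, and output as separate fields lets the same corpus be re-templated for different base models without touching the data.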
Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.\n\n### TinyLlama\n- https://github.com/jzhang38/TinyLlama\n\nThe TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. With some careful optimization, we can do this in \"just\" 90 days on 16 A100-40G GPUs 🚀🚀. Training started on 2023-09-01.\n\nWe adopt exactly the same architecture and tokenizer as Llama 2, which means TinyLlama can be plugged into many open-source projects built on Llama. With only 1.1B parameters, TinyLlama is also compact enough for applications that must limit compute and memory footprint.\n\n### Nous-Hermes-2 Mixtral 8x7B\n- https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT\n- https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\n- https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-adapter\n\nNous Hermes 2 Mixtral 8x7B SFT is the supervised finetune only version of our new flagship Nous Research model trained over the Mixtral 8x7B MoE LLM.\n\nNous Hermes 2 Mixtral 8x7B DPO is the new flagship Nous Research model trained over the Mixtral 8x7B MoE LLM.\n\nQLoRA Adapter for the DPO Phase of Nous-Hermes-2 Mixtral 8x7B Model.\n\n### AlphaGeometry\n- https://www.nature.com/articles/s41586-023-06747-5\n- https://github.com/google-deepmind/alphageometry\n\nProving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world's best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. 
AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 recent olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.\n\n### MoE-Mamba\n- https://arxiv.org/abs/2401.04081\n\nState Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based LLMs, including recent state-of-the-art open-source models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable, Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba against the Transformer.\n\n### StarCoder\n- https://huggingface.co/bigcode/starcoder\n- https://github.com/bigcode-project/starcoder/tree/main\n- https://arxiv.org/abs/2305.06161\n\nThe BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. 
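Multi-query attention, which StarCoder uses for fast large-batch inference, keeps a separate query projection per head but shares a single key/value head across all of them, shrinking the KV cache that dominates memory at serving time. A rough numpy sketch of one attention step under that sharing (shapes and names are illustrative, not the StarCoder implementation):

```python
import numpy as np

def multi_query_attention(q, k, v):
    """q: (heads, seq, d)  -- one query projection per head.
    k, v: (seq, d)         -- a single key/value head shared by every query head."""
    d = q.shape[-1]
    # Every query head attends over the same shared keys.
    scores = q @ k.T / np.sqrt(d)                   # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over key positions
    return weights @ v                              # (heads, seq, d)

heads, seq, d = 8, 4, 16
out = multi_query_attention(
    np.random.randn(heads, seq, d),
    np.random.randn(seq, d),
    np.random.randn(seq, d),
)
print(out.shape)  # one output per query head, but only one K/V set to cache
```

Relative to standard multi-head attention, the cached K/V tensors shrink by a factor equal to the number of heads, which is what enables the fast large-batch inference the text mentions.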
StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.\n\n### OLMo\n- https://allenai.org/olmo/olmo-paper.pdf\n- https://huggingface.co/allenai/OLMo-7B\n- https://github.com/allenai/OLMo\n- https://github.com/allenai/OLMo-Eval\n- https://github.com/allenai/open-instruct\n\nOLMo is a repository for training and using AI2's state-of-the-art open language models. It is built by scientists, for scientists.\n\n### H2O-Danube-1.8B\n- https://arxiv.org/abs/2401.16818\n\nWe present H2O-Danube-1.8B, a 1.8B language model trained on 1T tokens following the core principles of LLama 2 and Mistral. We leverage and refine various techniques for pre-training large language models. Although our model is trained on significantly fewer total tokens compared to reference models of similar size, it exhibits highly competitive metrics across a multitude of benchmarks. We additionally release a chat model trained with supervised fine-tuning followed by direct preference optimization. 
We make H2O-Danube-1.8B openly available under the Apache 2.0 license, further democratizing LLMs economically to a wider audience.\n\n### OpenMathInstruct-1\n- https://arxiv.org/abs/2402.10176\n- https://huggingface.co/collections/nvidia/openmath-65c5619de2ba059be0775014\n\nRecent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and MATH, two popular math reasoning benchmarks, using the recently released and permissively licensed Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best gpt-distilled models. We release our code, models, and the OpenMathInstruct-1 dataset under a commercially permissive license.\n\n### Smaug-72B\n- https://huggingface.co/abacusai/Smaug-72B-v0.1\n\nWe recently released Smaug-72B-v0.1, which has taken first place on the Open LLM Leaderboard by HuggingFace. 
It is the first open-source model with an average score of more than 80.\n\nSmaug-72B is finetuned directly from moreh/MoMo-72B-lora-1.8.7-DPO and is ultimately based on Qwen-72B.\n\n### Gemma\n- https://ai.google.dev/gemma/\n\nA family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.\n\n### Aya Model\n- https://arxiv.org/abs/2402.07827\n- https://hf.co/CohereForAI/aya-101\n\nRecent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state-of-the-art for multilingual eval across 99 languages -- including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.\n\n### MobiLlama\n- https://github.com/mbzuai-oryx/MobiLlama\n- https://arxiv.org/abs/2402.16840\n\n\"Bigger the better\" has been the predominant trend in recent Large Language Models (LLMs) development. However, LLMs are not well suited to scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the \"less is more\" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource-constrained devices. 
Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands. MobiLlama is an SLM design that starts from a larger model and applies a careful parameter-sharing scheme to reduce both the pre-training and the deployment cost.\n\n### StarCoder2\n- https://huggingface.co/blog/starcoder2\n\nStarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.\n\n### SmallLanguageModel-project\n- https://github.com/shivendrra/SmallLanguageModel-project\n\nThis repository contains all the necessary items needed to build your own LLM from scratch. Just follow the instructions. Inspired by Karpathy's nanoGPT and Shakespeare generator, I made this repository to build my own LLM. It has everything from data collection for the model to the architecture file, tokenizer, and train file.\n\n### Command-R\n- https://txt.cohere.com/command-r/\n- https://huggingface.co/CohereForAI/c4ai-command-r-v01\n\nCommand-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with our industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. 
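The RAG workflow such models target is simple in outline: retrieve the passages most similar to the query, pack them into the prompt as grounding context, then generate an answer constrained to that context. A toy sketch with a bag-of-words retriever standing in for a real embedding/rerank model (the documents, scoring, and prompt wording are all illustrative, not Cohere's API):

```python
from collections import Counter

DOCS = [
    "Command-R supports a 128k context window.",
    "Rerank models order retrieved passages by relevance.",
    "Embeddings map text to vectors for similarity search.",
]

def score(query, doc):
    # Bag-of-words overlap as a crude stand-in for embedding similarity.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query, docs, k=2):
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_rag_prompt(query, docs):
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_rag_prompt("How large is the Command-R context window?", DOCS))
```

In production the overlap score is replaced by an embedding model, a reranker reorders the candidates, and the assembled prompt is sent to the generator; the overall shape stays the same.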
As a model built for companies to implement at scale, Command-R boasts: \n- Strong accuracy on RAG and Tool Use\n- Low latency and high throughput\n- Longer 128k context and lower pricing\n- Strong capabilities across 10 key languages\n- Model weights available on HuggingFace for research and evaluation\n\n### Grok\n- https://github.com/xai-org/grok-1\n\nThis repository contains JAX example code for loading and running the Grok-1 open-weights model.\n\n### DBRX\n- https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm\n- https://huggingface.co/databricks/dbrx-base\n- https://huggingface.co/databricks/dbrx-instruct\n\nDBRX is an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a new state-of-the-art for established open LLMs. Moreover, it provides the open community and enterprises building their own LLMs with capabilities that were previously limited to closed model APIs; according to our measurements, it surpasses GPT-3.5, and it is competitive with Gemini 1.0 Pro. It is an especially capable code model, surpassing specialized models like CodeLLaMA-70B on programming, in addition to its strength as a general-purpose LLM.\n\n### Jamba\n- https://huggingface.co/ai21labs/Jamba-v0.1\n\nJamba is a state-of-the-art, hybrid SSM-Transformer LLM. It delivers throughput gains over traditional Transformer-based models, while outperforming or matching the leading models of its size class on most common benchmarks.\n\n### BioMedLM\n- https://arxiv.org/abs/2403.18421\n- https://huggingface.co/stanford-crfm/BioMedLM\n\nModels such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks. However, these models have hundreds of billions of parameters, are computationally expensive to run, require users to send their input data over the internet, and are trained on unknown data sources. Can smaller, more targeted models compete? 
To address this question, we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics. This demonstrates that smaller models can potentially serve as transparent, privacy-preserving, economical and environmentally friendly foundations for particular NLP applications, such as in biomedicine.\n\n### JetMoE\n- https://github.com/myshell-ai/JetMoE\n- https://research.myshell.ai/jetmoe\n- https://huggingface.co/jetmoe/jetmoe-8b\n- https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat\n\nJetMoE-8B was trained at a cost of less than $0.1 million but outperforms LLaMA2-7B from Meta AI, which has multi-billion-dollar training resources. LLM training can be much cheaper than people generally thought.\n\nJetMoE-8B is very open and academia-friendly because:\n- It only uses public datasets for training, and the code is open-sourced. No proprietary resource is needed.\n- It can be finetuned with a very limited compute budget (e.g., a consumer-grade GPU) that most labs can afford.\n\nJetMoE-8B has only 2.2B active parameters during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves consistently better performance.\n\n### Colossal-LLaMA-2\n- https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2\n\nThe Colossal-AI team has introduced the open-source model Colossal-LLaMA-2-7B-base. This model, a derivation of LLaMA-2, has undergone continual pre-training involving approximately 8.5 billion tokens over a duration of 15 hours with 64 A800 GPUs. 
At a cost of less than $1,000, you can achieve results similar to those that cost millions of dollars to pretrain from scratch. It is licensed under the LLaMA-2 license and Apache 2.0 License without any additional commercial use restrictions. This solution can also be used to build models of specific domain knowledge or tasks.\n\nColossal-LLaMA-2-7B-base is designed to accommodate both the Chinese and English languages, featuring an expansive context window spanning 4096 tokens. Remarkably, it has exhibited exceptional performance when benchmarked against models of equivalent scale in standard Chinese and English evaluation metrics, including C-Eval and MMLU, among others.\n\n### OpenBA (Encoder-Decoder)\n- https://github.com/OpenNLG/OpenBA\n\nWe are excited to unveil two distinguished versions of our model, with another on the horizon:\n\n- OpenBA-LM: The backbone language model was pre-trained on 340B English, Chinese, and code tokens.\n- OpenBA-Flan: We continually perform supervised fine-tuning with 40B tokens of the constructed BiFlan dataset.\n- OpenBA-Chat: We will release the Chat model soon.\n\n### 尔雅 Erya\n- https://huggingface.co/RUCAIBox/Erya\n- https://github.com/RUCAIBox/Erya\n\nTranslating and understanding Ancient Chinese is essential for connecting with five millennia of Chinese classics and civilization. To enable efficient Ancient Chinese translation, we present the toolkit Erya, which comprises: (1) the largest cleaned and classified Ancient Chinese translation dataset to date; (2) a training method for Ancient Chinese translation, consisting of disyllable-aligned substitution (DAS) and a bidirectional masked language model (DMLM), together with models trained with it; and (3) an Ancient Chinese translation benchmark covering every era and genre of Chinese. Erya's zero-shot performance exceeds the GPT-3.5 series by +12.0 BLEU, and it outperforms 文心一言 (ERNIE Bot) in human evaluation; further fine-tuning improves the model by another +6.2 BLEU. All resources are available at https://github.com/RUCAIBox/Erya\n\n### 荀子\n- https://github.com/Xunzi-LLM-of-Chinese-classics/XunziALLM\n\nWith the rapid development of technology, artificial intelligence has reached into every field. In response to the call to revitalize ancient texts, and to promote the deep integration of large language models with ancient-text processing for the purpose of intelligent ancient-text research, the research group of the National Social Science Fund major project "Construction and Application of a Cross-Lingual Knowledge Base of Ancient Chinese Classics" at Nanjing Agricultural University, together with Gulian, the digital company of Zhonghua Book Company, has released a series of large language models for ancient-text processing: the 荀子 (Xunzi) ancient-text LLMs. Xunzi was not only a great materialist thinker of China's pre-Qin period but also a master of prose, as well as a pioneer and founder of linguistic theory. The Xunzi series is designed specifically for intelligent processing of ancient texts; its release is expected to drive new progress in the study and preservation of ancient texts and to improve the efficiency and quality of transmitting traditional Chinese culture.\n\nThis open-source release consists of two parts: the base model XunziALLM, the focus of the release and a fully open ancient-text domain model; and, to help people outside the AI field get a better sense of the release, the dialogue model XunziChat, built from part of the data and invoked in the same way as Alibaba Cloud's Qwen series of models.\n\n### CodeShell\n- https://github.com/WisdomShell/codeshell\n\nCodeShell is a multilingual code foundation LLM jointly developed by the Knowledge Computing Lab of Peking University and the AI team of Sichuan Tianfu Bank. CodeShell has 7 billion parameters, was trained on 500 billion tokens, and has a context window length of 8192. On authoritative code-evaluation benchmarks (HumanEval and MBPP), CodeShell achieves the best performance among models of the same scale. We also provide a deployment solution and IDE plugins to accompany CodeShell; see the CodeShell repository.\n\n### CODEFUSION-75M\n- https://arxiv.org/pdf/2310.14820.pdf\n- https://github.com/microsoft/prose-benchmarks/tree/main/CodeFusion\n\nImagine a developer who can only change their last line of code: how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.\n\n### DeepSeek Coder\n- https://github.com/deepseek-ai/DeepSeek-Coder\n\nDeepseek Coder comprises a series of code language models trained on 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2T tokens. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. 
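The extra fill-in-the-blank (fill-in-the-middle) objective mentioned above trains the model to generate a missing middle given a prefix and a suffix. A minimal sketch of how such a prompt is laid out; the sentinel strings here are illustrative placeholders, not DeepSeek Coder's actual special tokens:

```python
# Sketch of fill-in-the-middle (FIM) prompt assembly.
# The sentinel strings below are illustrative placeholders; a real model
# defines its own special tokens in its tokenizer configuration.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so the model generates the missing middle."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

prompt = build_fim_prompt(
    prefix="def quicksort(xs):\n    if len(xs) <= 1:\n        return xs\n",
    suffix="\n    return quicksort(lo) + [p] + quicksort(hi)\n",
)
```

At inference time an editor sends the code before and after the cursor as prefix and suffix, and splices the model's completion into the hole, which is what enables infilling rather than left-to-right completion only.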
For coding capabilities, Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.\n\n### DevOps-Model（运维）\n- https://github.com/codefuse-ai/CodeFuse-DevOps-Model\n\nDevOps-Model is the industry's first series of open-source Chinese DevOps LLMs, dedicated to delivering practical value in the DevOps domain. Currently, DevOps-Model can help engineers answer questions encountered throughout the DevOps life cycle.\n\nStarting from the Qwen series of models, we produced the Base models through additional training on a high-quality Chinese DevOps corpus, and then produced the Chat models through alignment on DevOps QA data. Our Base and Chat models achieve the best results among models of the same size on open, DevOps-related evaluation datasets.\nWe are also building DevOpsEval, an evaluation benchmark dedicated to the DevOps domain, to better measure the effectiveness of DevOps-domain models.\n\n### Magicoder\n- https://github.com/ise-uiuc/magicoder\n- https://arxiv.org/pdf/2312.02120.pdf\n\n🎩Magicoder is a model family empowered by 🪄OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets for generating low-bias and high-quality instruction data for code.\n\n🪄OSS-Instruct mitigates the inherent bias of LLM-synthesized instruction data by empowering LLMs with a wealth of open-source references to produce more diverse, realistic, and controllable data.\n\n### KwaiAgents\n- https://github.com/KwaiKEG/KwaiAgents\n\nKwaiAgents is a series of Agent-related works open-sourced by KwaiKEG from Kuaishou Technology. The open-sourced content includes:\n\n- KAgentSys-Lite: a lite version of the KAgentSys in the paper. While retaining some of the original system's functionality, KAgentSys-Lite has certain differences and limitations when compared to its full-featured counterpart, such as: (1) a more limited set of tools; (2) a lack of memory mechanisms; (3) slightly reduced performance capabilities; and (4) a different codebase, as it evolves from open-source projects like BabyAGI and Auto-GPT. 
Despite these modifications, KAgentSys-Lite still delivers performance comparable to that of the many open-source Agent systems available.\n- KAgentLMs: a series of large language models with agent capabilities such as planning, reflection, and tool-use, acquired through the Meta-agent tuning proposed in the paper.\n- KAgentInstruct: over 200k Agent-related instruction-tuning examples (partially human-edited) proposed in the paper.\n- KAgentBench: over 3,000 human-edited, automated evaluation items for testing Agent capabilities, with evaluation dimensions including planning, tool-use, reflection, concluding, and profiling.\n\n### LLaMA-Pro\n- https://huggingface.co/TencentARC/LLaMA-Pro-8B\n\nLLaMA-Pro is a progressive version of the original LLaMA model, enhanced by the addition of Transformer blocks. It specializes in integrating both general language understanding and domain-specific knowledge, particularly in programming and mathematics.\n\n### HuixiangDou\n- https://github.com/InternLM/HuixiangDou\n- https://arxiv.org/abs/2401.08772\n\nIn this work, we present HuixiangDou, a technical assistant powered by Large Language Models (LLM). This system is designed to assist algorithm developers by providing insightful responses to questions related to open-source algorithm projects, such as computer vision and deep learning projects from OpenMMLab. We further explore the integration of this assistant into the group chats of instant messaging (IM) tools such as WeChat and Lark. Through several iterative improvements and trials, we have developed a sophisticated technical chat assistant capable of effectively answering users' technical questions without causing message flooding. 
This paper's contributions include: 1) Designing an algorithm pipeline specifically for group chat scenarios; 2) Verifying the reliable performance of text2vec in task rejection; 3) Identifying three critical requirements for LLMs in technical-assistant-like products, namely scoring ability, In-Context Learning (ICL), and Long Context.\n\n### CodeAct\n- https://arxiv.org/pdf/2402.01030.pdf\n- https://github.com/xingyaoww/code-act\n\nLarge Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. 
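The CodeAct loop described above (emit Python, execute it, observe the result, revise) can be sketched with a toy stand-in for the LLM; `toy_policy` and the transcript format are illustrative assumptions, not the authors' implementation:

```python
import io, contextlib

def execute(code: str, env: dict) -> str:
    """Run a code action in a persistent namespace and capture its output."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception as e:  # errors become observations the agent can react to
        return f"Error: {e}"
    return buf.getvalue()

def codeact_loop(policy, max_turns: int = 4) -> list:
    """Multi-turn loop: the policy sees past steps and emits the next code action."""
    env, transcript = {}, []
    for _ in range(max_turns):
        action = policy(transcript)
        if action is None:  # the policy decides it is done
            break
        transcript.append(("act", action))
        transcript.append(("obs", execute(action, env)))
    return transcript

# Toy stand-in for the LLM: compute something, print it, then stop.
def toy_policy(transcript):
    steps = ["total = sum(range(10))", "print(total)"]
    n = sum(1 for kind, _ in transcript if kind == "act")
    return steps[n] if n < len(steps) else None

log = codeact_loop(toy_policy)
```

Because `env` persists across turns, later code actions can build on variables defined by earlier ones, which is what lets the agent dynamically revise prior actions as new observations arrive.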
CodeActAgent, finetuned from Llama2 and Mistral, is integrated with Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomously self-debug.\n\n### Design2Code\n- https://arxiv.org/abs/2403.03163\n- https://github.com/NoviScl/Design2Code\n- https://salt-nlp.github.io/Design2Code/\n\nGenerative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in which multimodal LLMs might directly convert visual designs into code implementations. In this work, we formalize this as a Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models. Moreover, annotators think GPT-4V generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages are considered better than the original reference webpages. 
Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.\n\n### bGPT\n- https://byte-gpt.github.io/\n- https://huggingface.co/sander-wood/bgpt\n- https://github.com/sanderwood/bgpt\n- https://arxiv.org/abs/2402.19155\n\nIn this paper, we introduce bGPT, a model designed for binary data processing and digital world modelling by next byte prediction. The digital world includes not only digital media files, traditionally the focus of deep learning models, but also extends to the intricate realm of digital systems, ranging from hardware architectures to complex algorithms. bGPT transcends traditional deep learning boundaries by directly interpreting and manipulating binary data, enabling a more intrinsic and holistic understanding of the digital world. Its advantages are two-fold: 1) Interpreting Digital System: By training on byte sequences, bGPT can learn the patterns of digital systems, enabling it to predict, simulate, and diagnose algorithm or hardware behaviour. This ability allows for the reconstruction of complex systems from binary data. 2) Unified Modelling: bGPT integrates various data types into a single framework, treating everything as a byte sequence. This simplifies modelling and allows for easy integration of various data sources.\n\n### MobileLLM\n- https://arxiv.org/abs/2402.14905\n- https://github.com/facebookresearch/MobileLLM\n\nThis paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. 
Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates correctness close to that of LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.\n\n### Stable Code Instruct 3B\n- https://huggingface.co/stabilityai/stable-code-instruct-3b\n- https://huggingface.co/spaces/stabilityai/stable-code-instruct-3b\n- https://static1.squarespace.com/static/6213c340453c3f502425776e/t/6601c5713150412edcd56f8e/1711392114564/Stable_Code_TechReport_release.pdf\n\nstable-code-3b is a 2.7 billion parameter decoder-only language model pre-trained on 1.3 trillion tokens of diverse textual and code datasets. stable-code-3b is trained on 18 programming languages (selected based on the 2023 StackOverflow Developer Survey) and demonstrates state-of-the-art performance (compared to models of similar size) on the MultiPL-E metrics across multiple programming languages tested using BigCode's Evaluation Harness.\n\n### ReALM\n- https://arxiv.org/abs/2403.20329\n\nReference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. 
This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.\n\n### aiXcoder\n- https://github.com/aixcoder-plugin/aiXcoder-7B\n- https://gitee.com/aixcoder-model/aixcoder-7b\n- https://www.gitlink.org.cn/aixcoder/aixcoder-7b-model\n\nWelcome to the official repository of aiXcoder-7B Code Large Language Model. This model is designed to understand and generate code across multiple programming languages, offering state-of-the-art performance in code completion, comprehension, generation, and more tasks about programming languages.\n\n### CodeQwen1.5\n- https://github.com/QwenLM/CodeQwen1.5\n\nCodeQwen1.5 is the Code-Specific version of Qwen1.5. 
It is a transformer-based decoder-only language model pre-trained on a large amount of code data.\n\n### AutoCodeRover\n- https://github.com/nus-apr/auto-code-rover\n- https://arxiv.org/abs/2404.05427\n\nAutoCodeRover is a fully automated approach for resolving GitHub issues (bug fixing and feature addition) where LLMs are combined with analysis and debugging capabilities to prioritize patch locations, ultimately leading to a patch.\n\nAutoCodeRover resolves ~16% of issues of SWE-bench (total 2294 GitHub issues) and ~22% of issues of SWE-bench lite (total 300 GitHub issues), improving over the current state-of-the-art efficacy of AI software engineers.\n\n### CodeGemma\n- https://developers.googleblog.com/2024/04/gemma-family-expands.html\n- https://www.kaggle.com/models/google/codegemma\n\nCodeGemma: Code completion, generation, and chat for developers and businesses\n\nHarnessing the foundation of our Gemma models, CodeGemma brings powerful yet lightweight coding capabilities to the community. CodeGemma models are available as a 7B pretrained variant that specializes in code completion and code generation tasks, a 7B instruction-tuned variant for code chat and instruction-following, and a 2B pretrained variant for fast code completion that fits on your local computer.\n\n### Snowflake Arctic\n- https://github.com/Snowflake-Labs/snowflake-arctic\n\nAt Snowflake, we see a consistent pattern in AI needs and use cases from our enterprise customers. Enterprises want to use LLMs to build conversational SQL data copilots, code copilots and RAG chat bots. From a metrics perspective, this translates to LLMs that excel at SQL, code, complex instruction following and the ability to produce grounded answers. 
We capture these abilities into a single metric we call enterprise intelligence by taking an average of Coding (HumanEval+ and MBPP+), SQL Generation (Spider), and Instruction following (IFEval).\n\n### dolphin-2.9-llama3-70b\n- https://huggingface.co/cognitivecomputations/dolphin-2.9-llama3-70b\n\n### Granite\n- https://arxiv.org/abs/2405.04324\n- https://github.com/ibm-granite/granite-code-models\n\nLarge Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g. code generation, fixing and explanation), making it a versatile all-around code model. 
We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.\n\n### StarCoder2-15B-Instruct-v0.1\n- https://hf.co/bigcode/starcoder2-15b-instruct-v0.1\n- https://github.com/bigcode-project/starcoder2-self-align\n- https://hf.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k/\n- https://hf.co/blog/sc2-instruct\n\nWe introduce StarCoder2-15B-Instruct-v0.1, the very first entirely self-aligned code LLM trained with a fully permissive and transparent pipeline. Our open-source pipeline uses StarCoder2-15B to generate thousands of instruction-response pairs, which are then used to fine-tune StarCoder2-15B itself without any human annotations or distilled data from huge and proprietary LLMs.\n\n### AutoCoder\n- https://arxiv.org/pdf/2405.14906\n- https://github.com/bin123apple/AutoCoder\n\nWe introduced a new model designed for the code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024): 90.9% vs. 90.2%.\n\nAdditionally, compared to previous open-source models, AutoCoder offers a new feature: it can automatically install the required packages and attempt to run the code until it deems there are no issues, whenever the user wishes to execute the code.\n\n### CodeGeeX4\n- https://github.com/THUDM/CodeGeeX4\n- https://huggingface.co/THUDM/codegeex4-all-9b\n- https://modelscope.cn/models/ZhipuAI/codegeex4-all-9b\n- https://wisemodel.cn/models/ZhipuAI/codegeex4-all-9b\n\nWe introduce CodeGeeX4-ALL-9B, the open-source version of the latest CodeGeeX4 model series. It is a multilingual code generation model continually trained on the GLM-4-9B, significantly enhancing its code generation capabilities. A single CodeGeeX4-ALL-9B model can support comprehensive functions such as code completion and generation, code interpreter, web search, function calling, and repository-level code Q&A, covering various scenarios of software development. 
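The automatic install-and-retry behaviour described for AutoCoder above amounts to an execute-repair loop: run the generated code, and when an import fails, install the missing package and try again. A minimal sketch (the recovery strategy here is an illustrative assumption, not AutoCoder's actual implementation):

```python
import subprocess, sys

def run_with_auto_install(code: str, max_attempts: int = 3) -> str:
    """Execute code; on ModuleNotFoundError, pip-install the module and retry."""
    for _ in range(max_attempts):
        try:
            env = {}
            exec(code, env)  # fresh namespace per attempt
            return "ok"
        except ModuleNotFoundError as e:
            missing = e.name  # the module that failed to import
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", missing]
            )
    return "failed"
```

A production version would also map import names to package names (they often differ, e.g. `cv2` vs `opencv-python`) and sandbox the execution.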
CodeGeeX4-ALL-9B has achieved highly competitive performance on public benchmarks, such as BigCodeBench and NaturalCodeBench. It is currently the most powerful code generation model with less than 10B parameters, even surpassing much larger general-purpose models, achieving the best balance in terms of inference speed and model performance.\n\n### xLAM\n- https://github.com/SalesforceAIResearch/xLAM\n\nWe are excited to announce the release of our two function-calling models: xLAM-1b-fc-r and xLAM-7b-fc-r. These models have achieved impressive rankings, placing #3 and #25 on the Berkeley Function-Calling Leaderboard, outperforming many significantly larger models. Stay tuned for more powerful models coming soon.\n\n### deepin V23\n- https://www.deepin.org/en/deepin-os-deepin-v23-beta-is-officially-released/\n\ndeepin is an open-source desktop operating system based on Linux. As the first desktop operating system in China rooted in its community, the deepin community is thrilled to announce the official release of deepin V23 beta! Come and experience it, and contribute your thoughts and feedback.\n\n### WaveCoder\n- https://arxiv.org/pdf/2312.14187\n- https://github.com/microsoft/WaveCoder\n\nWaveCoder 🌊 is a series of large language models (LLMs) for the coding domain, designed to solve relevant problems in the field of code through instruction-following learning. 
Its training dataset was generated from a subset of code-search-net data using an LLM-based generator-discriminator framework that we proposed, covering four general code-related tasks: code generation, code summarization, code translation, and code repair.\n\n### Llama-3.1-Storm-8B\n- https://huggingface.co/akjindal53244/Llama-3.1-Storm-8B\n\nLlama-3.1-Storm-8B builds upon the foundation of Llama-3.1-8B-Instruct, aiming to enhance both conversational and function calling capabilities within the 8B parameter model class.\n\n### OpenCoder\n- https://opencoder-llm.github.io/\n\nOpenCoder is an open and reproducible code LLM family that matches the performance of top-tier code LLMs. We provide not just the final models, but also the reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research.\n\n### Qwen2.5-Coder\n- https://github.com/QwenLM/Qwen2.5-Coder\n\nWe open-source the “Powerful”, “Diverse”, and “Practical” Qwen2.5-Coder series (formerly known as CodeQwen1.5), dedicated to continuously promoting the development of open code LLMs.\n\n### Ministraux\n- https://mistral.ai/news/ministraux/\n- https://huggingface.co/mistralai/Ministral-8B-Instruct-2410\n\nIntroducing the world’s best edge models\nOn the first anniversary of the release of Mistral 7B, the model that revolutionized independent frontier AI innovation for millions, we are proud to introduce two new state-of-the-art models for on-device computing and at-the-edge use cases. We call them les Ministraux: Ministral 3B and Ministral 8B.\n\nThese models set a new frontier in knowledge, commonsense, reasoning, function-calling, and efficiency in the sub-10B category, and can be used or tuned to a variety of uses, from orchestrating agentic workflows to creating specialist task workers. 
Both models support up to 128k context length (currently 32k on vLLM) and Ministral 8B has a special interleaved sliding-window attention pattern for faster and memory-efficient inference.\n\nUse cases\nOur most innovative customers and partners have increasingly been asking for local, privacy-first inference for critical applications such as on-device translation, internet-less smart assistants, local analytics, and autonomous robotics. Les Ministraux were built to provide a compute-efficient and low-latency solution for these scenarios. From independent hobbyists to global manufacturing teams, les Ministraux deliver for a wide variety of use cases.\n\nUsed in conjunction with larger language models such as Mistral Large, les Ministraux are also efficient intermediaries for function-calling in multi-step agentic workflows. They can be tuned to handle input parsing, task routing, and calling APIs based on user intent across multiple contexts at extremely low latency and cost.\n\n### Reader-LM\n- https://huggingface.co/jinaai/reader-lm-0.5b\n- https://huggingface.co/jinaai/reader-lm-1.5b\n- https://colab.research.google.com/drive/1wXWyj5hOxEHY6WeHbOwEzYAC0WB1I5uA\n\nA small language model for converting raw HTML into clean Markdown.\n\n### 珠算\n- https://github.com/HIT-SCIR/Abacus\n\nEver since AlphaCode, released by DeepMind, surpassed the average human level in competitive programming, code LLMs have drawn wide attention. Meanwhile, Codex, released by OpenAI, demonstrated that code LLMs possess higher-order capabilities beyond traditional programming, such as numerical reasoning, logical inference, and tool invocation, further igniting research and discussion in this field. Open-source projects represented by BigCode StarCoder have contributed greatly to the research ecosystem. However, current open-source code LLMs tend to severely degrade general language ability while improving coding ability. To address this, the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology (HIT-SCIR) released the 珠算 (Abacus) code LLM: with 2.7B parameters, it surpasses code LLMs with 3B parameters or fewer, such as DeepSeek-Coder-1.3B, Yi-Coder-1.5B, Stable Code-3B, and Granite-3B-Code, in average performance on both code and general language tasks. By opening the weights and training details, along with a companion fine-tuning platform and plugin, we hope to support the development of the open-source community.\n\n### Lingma SWE-GPT\n- https://arxiv.org/abs/2411.00622\n\nAn Open Development-Process-Centric Language Model for Automated Software Improvement\n\nRecent advancements in LLM-based agents have led to significant progress in automatic software engineering, particularly in software maintenance and evolution. 
Despite these encouraging advances, current research faces two major challenges. First, SOTA performance primarily depends on closed-source models, which significantly limits the technology's accessibility and potential for customization in diverse SE tasks. Second, these models are predominantly trained on static code data, lacking a deep understanding of the dynamic interactions, iterative problem-solving processes, and evolutionary characteristics inherent in software development. To address these challenges, our study adopts a software engineering perspective. We recognize that real-world software maintenance and evolution processes encompass not only static code data but also developers' thought processes, utilization of external tools, and the interaction between different functional personnel. Consequently, we introduce the Lingma SWE-GPT series, comprising Lingma SWE-GPT 7B and 72B. By learning from and simulating real-world code submission activities, Lingma SWE-GPT systematically incorporates the dynamic interactions and iterative problem-solving inherent in the software development process, thereby achieving a more comprehensive understanding of software improvement processes. We conducted experimental evaluations using the SWE-bench Verified benchmark. The results demonstrate that Lingma SWE-GPT 72B successfully resolves 30.20% of the GitHub issues, marking a significant improvement in automatic issue resolution (22.76% relative improvement compared to Llama 3.1 405B), approaching the performance of closed-source models (GPT-4o resolves 31.80% of issues). 
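The figures above combine an absolute resolution rate with a relative improvement; the two are related by the usual formula relative gain = (new − baseline) / baseline, which lets one back out the implied Llama 3.1 405B baseline:

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

# Lingma SWE-GPT 72B resolves 30.20% of issues, a 22.76% relative
# improvement over Llama 3.1 405B, implying a baseline of roughly:
implied_baseline = 30.20 / (1 + 0.2276)  # about 24.6% of issues resolved
```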
Notably, Lingma SWE-GPT 7B resolves 18.20% of the issues, highlighting the potential for applying smaller models to ASE tasks.\n\n### GLM-Edge\n- https://github.com/THUDM/GLM-Edge\n\nThe GLM-Edge series is our attempt at models for real-world on-device deployment. It consists of large-language chat models and multimodal understanding models in two sizes each (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models mainly target platforms such as phones and in-car systems, while the 4B / 5B models mainly target platforms such as PCs.\n\n### SEMIKONG（半导体）\n- https://www.semikong.ai/\n\nSEMIKONG is a collaborative open-source project born from the AI Alliance.\n\nSpecial thanks to our supporters for helping make it happen.\n\n### ReaderLM-v2\n- https://jina.ai/models/ReaderLM-v2/\n\nA small language model for converting raw HTML into Markdown or JSON.\n\n### O1-CODER\n- https://github.com/ADaM-BJTU/O1-CODER\n\nO1-CODER is an attempt to replicate OpenAI's O1 model, focused on coding tasks. The approach combines Reinforcement Learning (RL) and Monte Carlo Tree Search (MCTS) to enhance the model’s System-2 thinking capabilities, aiming to generate more efficient and logical code.\n\n### 星语StarWhisper\n- https://github.com/Yu-Yang-Li/StarWhisper\n\nWith the support of the Astronomy Science Education Alliance, the 集思谱 (Jisipu) literature platform, and the 司天 (Sitian) project, and building on experience developing the astronomy LLM StarGLM, we further trained the 星语 StarWhisper series of models (including 6B, 7B, 13B, 14B, and 20B).\n\nThe goal is to further alleviate the hallucination of large models on general astronomy knowledge, laying the groundwork for the Sitian Brain: a scientific embodied intelligence that can handle astronomical multimodal tasks and be deployed on telescope arrays.\n\n### OceanGPT\n- https://www.zjukg.org/project/OceanGPT\n- https://huggingface.co/zjunlp/oceangpt-7b\n- https://arxiv.org/abs/2310.02031\n- https://huggingface.co/zjunlp/OceanGPT-14B-v0.1\n- http://oceangpt.zjukg.cn/\n- https://huggingface.co/datasets/zjunlp/OceanInstruct\n\n(Warning: The model in this paper might produce hallucinations and reader discretion is recommended) Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. 
Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reason may be the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean domain, which is an expert in various ocean science tasks. We propose DoInstruct, a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for ocean science tasks but also gains preliminary embodied intelligence capabilities in ocean technology.\n\n### K2&GeoGalactica\n- https://github.com/davendw49/k2\n- https://arxiv.org/abs/2401.00434\n- https://github.com/geobrain-ai/geogalactica\n\nK2: We introduce K2 (7B), an open-source language model trained by firstly further pretraining LLaMA on collected and cleaned geoscience literature, including geoscience open-access papers and Wikipedia pages, and secondly fine-tuning with knowledge-intensive instruction tuning data (GeoSignal). As for preliminary evaluation, we use GeoBench (consisting of NPEE and AP Test on Geology, Geography, and Environmental Science) as the benchmark. K2 outperforms several baseline models with similar parameter counts on both objective and subjective tasks. We share the corresponding code and data in the repository.\n\nGeoGalactica: GeoGalactica results from further pre-training of Galactica -- a top-performing LLM trained with a large number of scientific documents. 
In this work, we take the initial step to leverage LLMs for science through a rather straightforward approach: we specialize an open-sourced LLM for geoscience by further pre-training the model with a vast amount of geoscience texts, as well as supervised fine-tuning (SFT) the resulting model with our custom-collected instruction tuning dataset. These efforts result in GeoGalactica, a model consisting of 30 billion parameters. To the best of our knowledge, it is the largest language model for the geoscience domain.

### SciGLM
- https://arxiv.org/abs/2401.07950
- https://github.com/THUDM/SciGLM

Large Language Models (LLMs) have shown promise in assisting scientific discovery. However, such applications are currently limited by LLMs' deficiencies in understanding intricate scientific concepts, deriving symbolic equations, and solving advanced numerical calculations. To bridge these gaps, we introduce SciGLM, a suite of scientific language models able to conduct college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated SciInstruct, a diverse and high-quality dataset encompassing mathematics, physics, chemistry, and formal proofs. We fine-tuned the ChatGLM family of language models with SciInstruct, enhancing their capabilities in scientific and mathematical reasoning. Remarkably, SciGLM consistently improves both the base model (ChatGLM3-6B-Base) and larger-scale models (12B and 32B), without sacrificing the language understanding capabilities of the base model.
This makes SciGLM a suitable foundational model to facilitate diverse scientific discovery tasks.

### KAN 2.0
- https://arxiv.org/abs/2408.10205

A major challenge of AI + Science lies in their inherent incompatibility: today's AI is primarily based on connectionism, while science depends on symbolism. To bridge the two worlds, we propose a framework to seamlessly synergize Kolmogorov-Arnold Networks (KANs) and science. The framework highlights KANs' usage for three aspects of scientific discovery: identifying relevant features, revealing modular structures, and discovering symbolic formulas. The synergy is bidirectional: science to KAN (incorporating scientific knowledge into KANs), and KAN to science (extracting scientific insights from KANs). We highlight major new functionalities in the pykan package: (1) MultKAN: KANs with multiplication nodes. (2) kanpiler: a KAN compiler that compiles symbolic formulas into KANs. (3) tree converter: converts KANs (or any neural networks) to tree graphs.
Based on these tools, we demonstrate KANs' capability to discover various types of physical laws, including conserved quantities, Lagrangians, symmetries, and constitutive laws.

### Ziya-Reader-13B
- https://huggingface.co/IDEA-CCNL/Ziya-Reader-13B-v1.0

Ziya-Reader-13B-v1.0 is a knowledge question-answering model: given a question and knowledge documents, it answers accurately, for multi-document or single-document QA. The model has an 8k context window; compared with other models that have longer windows, it wins on several long-text evaluations, including multi-document QA, synthetic tasks (document retrieval), and long-text summarization.

The model mainly targets knowledge-base QA, retrieval QA, and e-commerce customer service scenarios, performs well on private-domain knowledge QA, and can be widely applied in vertical domains such as law, finance, and healthcare. This is because it solves the problem in multi-document QA where answer accuracy drops sharply when the correct information is not in the first or last document.

In addition, the model's general abilities are also strong and it can handle general QA. On our general-capability evaluation set, it outperforms Ziya-Llama-13B-v1.1.

### Firefly-LLaMA2-Chinese
- https://github.com/yangjianxin1/Firefly-LLaMA2-Chinese

This project is in the same lineage as Firefly and focuses on low-resource incremental pre-training. It supports incremental pre-training of native Chinese models such as Baichuan2, Qwen, and InternLM, as well as Chinese vocabulary extension followed by incremental pre-training for English models such as LLaMA2 and Falcon.

We open-source the Firefly-LLaMA2-Chinese models, a bilingual Chinese-English series. Taking LLaMA2 🦙 as the base model, we extend its vocabulary for Chinese and incrementally pre-train it on 22GB of Chinese and English corpus. Finally, we train the model on large-scale bilingual multi-turn dialogue instructions. We evaluated the models on leaderboards and with human evaluation; compared with existing open-source work, they are quite competitive.

On the Open LLM Leaderboard and CMMLU, our models surpass Linly, Yayi, FlagAlpha, and others; we surpass Ziya on the Open LLM Leaderboard and score 0.43 points below Ziya on CMMLU. In human evaluation, our model beats Linly with 33.08% wins, 60.77% ties, and 6.15% losses. We also open-source the firefly-baichuan2-13b model, which ranks 8th on OpenCompass's CMMLU leaderboard with a score of 56.83, about 1.57 points below Baichuan's official model.

More importantly, throughout the entire incremental pre-training and instruction fine-tuning stages, we used at most 4×V100 GPUs, making training low-resource and efficient. Compared with Ziya's 160×A100, Linly's 32×A100, and Chinese-LLaMA-Alpaca's 48×A40, we used far fewer training resources.

Give a man a fish 🐟 and you feed him for a day; teach him to fish 🎣 and you feed him for a lifetime. We open-source not only the model weights but also the full training code, training data, and training details.

### MindLLM
- https://arxiv.org/abs/2310.15777

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language tasks, marking significant strides towards general artificial intelligence. While general artificial intelligence is pursued by developing increasingly large-scale models, there could be another branch: developing lightweight custom models that better serve certain domains, taking into account the high cost of training and deploying LLMs and the scarcity of resources.
In this paper, we present MindLLM, a novel series of bilingual lightweight large language models, trained from scratch, alleviating such burdens by offering models with 1.3 billion and 3 billion parameters. A thorough account of experiences accrued during large model development is given, covering every step of the process, including data construction, model architecture, evaluation, and applications. Such insights are hopefully valuable for fellow academics and developers. MindLLM consistently matches or surpasses the performance of other open-source larger models on some public benchmarks. We also introduce an innovative instruction tuning framework tailored for smaller models to enhance their capabilities efficiently. Moreover, we explore the application of MindLLM in specific vertical domains such as law and finance, underscoring the agility and adaptability of our lightweight models.

### ChatGLM3
- https://github.com/THUDM/ChatGLM3

ChatGLM3 is a new-generation conversational pre-trained model jointly released by Zhipu AI and Tsinghua University's KEG lab. ChatGLM3-6B is the open-source model in the ChatGLM3 series. While retaining many excellent features of the previous two generations, such as fluent dialogue and a low deployment barrier, ChatGLM3-6B introduces the following features:

- A stronger base model: ChatGLM3-6B-Base, the base model of ChatGLM3-6B, uses more diverse training data, more training steps, and a more reasonable training strategy. Evaluations on datasets covering semantics, mathematics, reasoning, code, and knowledge show that ChatGLM3-6B-Base has the strongest performance among base models under 10B.
- More complete function support: ChatGLM3-6B adopts a newly designed prompt format that, beyond normal multi-turn dialogue, natively supports complex scenarios such as tool calling (Function Call), code execution (Code Interpreter), and Agent tasks.
- A more comprehensive open-source series: besides the dialogue model ChatGLM3-6B, the base model ChatGLM3-6B-Base and the long-text dialogue model ChatGLM3-6B-32K are also open-sourced. All of these weights are fully open for academic research, and free commercial use is also allowed after registering via a questionnaire.

### Skywork大模型
- https://github.com/SkyworkAI/Skywork

Skywork is a series of large models developed by Kunlun Tech's Tiangong team. This release open-sources the Skywork-13B-Base, Skywork-13B-Chat, Skywork-13B-Math, and Skywork-13B-MM models, along with quantized versions of each model to support deployment and inference on consumer GPUs.

Our open-source Skywork series models can be used commercially, provided our license is followed and no harmful activities are conducted.

### Yi-6B/34B（零一万物）
- https://github.com/01-ai/Yi
- https://arxiv.org/abs/2403.04652

The Yi series models are large language models trained from scratch by developers at 01.AI.
The first public release contains two bilingual (English/Chinese) base models with the parameter sizes of 6B and 34B. Both of them are trained with 4K sequence length and can be extended to 32K during inference time.

### Nanbeige-16B（南北阁-16B）
- https://github.com/Nanbeige/Nanbeige

Nanbeige-16B is a 16-billion-parameter large language model developed by the Nanbeige LLM Lab. It was pre-trained on 2.5T tokens, with data including a large amount of high-quality internet corpus, various books, code, and other de-identified domain text, achieving good results on authoritative benchmark datasets. This release includes Base and Chat versions, as well as Base-32k and Chat-32k versions with extended context length.

The Base-32k version is based on the Nanbeige-16B-Base model: it modifies the positional encoding with the YaRN interpolation method and was incrementally pre-trained with full parameters on 20B tokens of Chinese, English, and code corpus at a maximum length of 32k.

The Chat and Chat-32k versions are based on the Nanbeige-16B-Base and Nanbeige-16B-Base-32k models respectively, and went through extensive human-alignment training to answer users' questions better and more safely.

### OrionStar-Yi-34B-Chat
- https://modelscope.cn/studios/OrionStarAI/OrionStar-Yi-34B-Chat/summary
- https://github.com/OrionStarAI/OrionStar-Yi-34B-Chat

OrionStar-Yi-34B-Chat is a large model fine-tuned by OrionStar from 01.AI's open-source Yi-34B model on 150K+ high-quality corpus entries, aiming to provide the large-model community with an excellent interactive experience.

### 源2.0
- https://github.com/IEIT-Yuan/Yuan-2.0
- https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/docs/Yuan2_llama-factory.md

Yuan 2.0 is a new-generation foundation language model released by IEIT Systems. We open-source all three models: Yuan 2.0-102B, Yuan 2.0-51B, and Yuan 2.0-2B, and provide scripts for pre-training, fine-tuning, and inference services for further development. Building on Yuan 1.0, Yuan 2.0 uses more diverse, high-quality pre-training data and instruction fine-tuning datasets, giving the model stronger understanding across semantics, mathematics, reasoning, code, and knowledge.

### TechGPT2.0
- https://github.com/neukg/TechGPT-2.0

Compared with TechGPT-1.0, TechGPT-2.0 adds knowledge in many new domains. Beyond TechGPT-1.0's capabilities in a dozen vertical domains such as computer science, materials, machinery, metallurgy, finance, and aerospace, TechGPT-2.0 also shows excellent text-processing ability in domains such as medicine and law, and extends coverage to text about geographic regions, transportation, organizations, works, biology, natural science, astronomical objects, architecture, and more. TechGPT-2.0 also strengthens its handling of hallucination, unanswerable questions, and long text. Meanwhile, TechGPT-2.0 has lower hardware requirements for deployment: a single NVIDIA 4090 or a single Ascend 910A is enough to deploy the model.

### SUS-Chat-34B
- https://hf.co/SUSTech/SUS-Chat-34B
- https://github.com/SUSTech-IDEA/SUS-Chat

SUS-Chat-34B is a general-purpose large model open-sourced by Southern University of Science and Technology together with the IDEA Research Institute's CCNL team. On 2023-12-05 it achieved the best score among models of its size on Huggingface's authoritative open_llm_leaderboard.

SUS-Chat-34B is a 34-billion-parameter bilingual model fine-tuned from the 01-ai/Yi-34B pre-trained model on millions of high-quality, multilingual instruction data. While keeping the base model's strong language abilities, it improves the model's responses to human instructions through high-quality instruction fine-tuning and excels at imitating human thinking through chain-of-thought reasoning. Compared with Yi-34B and Yi-34B-Chat, it not only improves performance on almost all benchmarks but also better meets the practical needs of complex multilingual tasks.
During instruction fine-tuning, we added a large amount of high-quality long-text and multi-turn dialogue instruction data, extending the context window from the base model's 4K to 8K. This extension helps the model follow instructions in multi-turn dialogues more effectively and significantly reduces context loss in extended conversations and long-text understanding. To this end, we also developed a more efficient training framework, which will be open-sourced soon.

### Alaya 元识
- https://github.com/DataCanvasIO/Alaya

Alaya is the large model released by DataCanvas, trained on 1.5T+ tokens of a self-curated, high-quality multilingual dataset.

The 7B-Base and 7B-Chat versions were first open-sourced on Hugging Face. The models perform among the best in the industry and are knowledge-rich and up to date, with the latest data covering content through October 2023. Alaya-7B-Chat supports multi-turn dialogue, self-awareness, and declining biased questions, and can complete language tasks such as knowledge QA, code writing, information extraction, reading comprehension, and creative writing.

### OpenBuddy
- https://github.com/OpenBuddy/OpenBuddy
- https://huggingface.co/OpenBuddy
- https://openbuddy.ai

OpenBuddy is a powerful open multilingual chatbot model aimed at global users, emphasizing conversational AI and seamless multilingual support for English, Chinese, and other languages.

Built upon TII's Falcon model and Facebook's LLaMA model, OpenBuddy is fine-tuned to include an extended vocabulary, additional common characters, and enhanced token embeddings. By leveraging these improvements and multi-turn dialogue datasets, OpenBuddy offers a robust model capable of answering questions and performing translation tasks across various languages.

Our mission with OpenBuddy is to provide a free, open, and offline-capable AI model that operates on users' devices, irrespective of their language or cultural background. We strive to empower individuals worldwide to access and benefit from AI technology.

### MiniGPT4Qwen
- https://github.com/Coobiw/MiniGPT4Qwen

Cleaned Lavis + DeepSpeed support! Aligns MiniGPT4 with the Qwen-Chat LLM, using just 18.8k high-quality bilingual instruction-tuning examples (from minigpt4 and llava).
Only the projection layer is fine-tuned.

### ChatLM-Chinese-0.2B
- https://github.com/charent/ChatLM-mini-Chinese

Today's large language models tend to have large parameter counts, and consumer-grade computers are slow even at pure inference, let alone training a model from scratch. The goal of this project is to walk through the full training pipeline of a generative language model, including data cleaning, tokenizer training, model pre-training, SFT instruction fine-tuning, and RLHF optimization.

ChatLM-mini-Chinese is a small Chinese dialogue model with only 0.2B parameters (about 210M including shared weights). It can be pre-trained on a machine with as little as 4GB of GPU memory (batch_size=1, fp16 or bf16), and float16 loading and inference require as little as 512MB of GPU memory.

### YAYI 2
- https://github.com/wenge-research/YAYI2/tree/main
- https://arxiv.org/abs/2312.14862
- https://yayi.wenge.com/

YAYI 2 is a new generation of open-source large language models developed by Wenge Research, including Base and Chat versions at a 30B parameter scale. YAYI2-30B is a Transformer-based large language model pre-trained on a high-quality, multilingual corpus of over 2 trillion tokens. For general and domain-specific application scenarios, it was fine-tuned with millions of instructions and aligned with human values using reinforcement learning from human feedback.

This release open-sources the YAYI2-30B Base model. We hope that open-sourcing the YAYI models will promote the development of the Chinese pre-trained large model open-source community and actively contribute to it. Through open source, we build the YAYI model ecosystem together with every partner.

### DeepSeek LLM&MoE
- https://huggingface.co/deepseek-ai
- https://arxiv.org/abs/2401.02954
- https://github.com/deepseek-ai/DeepSeek-LLM
- https://github.com/deepseek-ai/DeepSeek-MoE

Introducing DeepSeek LLM, an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. In order to foster research, we have made DeepSeek LLM 7B/67B Base and DeepSeek LLM 7B/67B Chat open source for the research community.

### MachineMindset(MBTI)
- https://github.com/PKU-YuanGroup/Machine-Mindset

MM (Machine_Mindset) series models are developed through a collaboration between FarReel AI Lab (formerly known as the ChatLaw project) and Peking University's Deep Research Institute. These models are large-scale language models for various MBTI types in both Chinese and English, built on the Baichuan and LLaMA2 platforms. 🤖🌐

Our core asset is a self-constructed extensive MBTI dataset consisting of hundreds of thousands of entries. Our models are crafted through multiple stages of pre-training, fine-tuning, and DPO training.
We are committed to continuously updating the models to offer superior performance and will consistently supplement them with experimental test results. 📊📈

In contrast to merely using prompts to alter a model's personality, we have found that this method is highly unstable. It's akin to a controlling parent's dissatisfaction with their introverted child, attempting to force them to become outgoing through simple and coercive commands – a rather ludicrous approach. 🙅‍♂️😄

We have successfully achieved personality alignment for various MBTI types using models such as Baichuan, Qwen, LLaMA, and Mistral. This means we can obtain 16 different versions of MBTI personality models by combining different base models with our dataset and training methods, tailoring each model for specific tasks. 🛠🧩

Due to resource constraints, we are initially releasing 16 Chinese models based on Baichuan-7b-chat and several English models based on LLaMA2-7b. However, rest assured that we can quickly add different versions of models if needed.
🌍📦

### 星辰语义（电信）
- https://gitee.com/Tele-AI/tele-chat
- https://github.com/Tele-AI/Telechat

TeleChat (星辰语义大模型):
- TeleChat is a large language model developed and trained by China Telecom Artificial Intelligence Technology Co., Ltd., trained on 1.5 trillion tokens of high-quality Chinese and English corpus.
- This release open-sources the dialogue model TeleChat-7B-bot and its weight files in huggingface format, as well as int8 and int4 quantized versions of the 7B model.

### Chinese-Mixtral-8x7B
- https://github.com/HIT-SCIR/Chinese-Mixtral-8x7B

Based on Mistral's Mixtral-8x7B model, this project performs Chinese vocabulary extension and incremental pre-training, hoping to further promote research on MoE models in the Chinese NLP community. The extended vocabulary significantly improves the model's encoding and decoding efficiency for Chinese, and incremental pre-training on a large-scale open-source corpus gives the extended model strong Chinese generation and understanding abilities.

### Baby-Llama2-Chinese
- https://github.com/DLLXW/baby-llama2-chinese

This project is dedicated to building a small-parameter Chinese Llama2 repository.

It covers the complete pipeline of pre-training, SFT instruction fine-tuning, reward modeling, and reinforcement learning (to be done).

Beyond that, the project is also compiling a complete set of LLM learning materials (in progress).

We hope this open-source project helps LLM beginners get started as quickly as possible!

### XVERSE-13B-256K
- https://huggingface.co/xverse/XVERSE-13B-256K

XVERSE-13B-256K is the XVERSE-13B-2 model after ABF plus continued pre-training and NTK plus SFT fine-tuning.

### Eagle 7B（RWKV-v5）
- https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers

A brand new era for the RWKV-v5 architecture and linear transformers has arrived - with the strongest multi-lingual model in open source today.

### iFlytekSpark-13B
- https://gitee.com/iflytekopensource/iFlytekSpark-13B
- https://openi.pcl.ac.cn/iflytek/iFlytekSpark-13B
- https://xihe.mindspore.cn/modelzoo/iflytek/introduce

This open-source release includes not only the base model iFlytekSpark-13B-base and the fine-tuned model iFlytekSpark-13B-chat, but also the fine-tuning tool iFlytekSpark-13B-Lora and the persona customization tool iFlytekSpark-13B-Charater.

### MiniCPM
- https://github.com/OpenBMB/MiniCPM

MiniCPM is a series of on-device large models jointly open-sourced by ModelBest (面壁智能) and Tsinghua University's Natural Language Processing Lab. The main language model, MiniCPM-2B, has only 2.4B non-embedding parameters.

### 通义千问Qwen1.5
- https://github.com/QwenLM/Qwen1.5
- https://huggingface.co/spaces/Qwen/Qwen1.5-72B-Chat
- https://qwenlm.github.io/

With Qwen1.5, we are open-sourcing base and chat models across six sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. In line with tradition, we're also providing quantized models, including Int4 and Int8 GPTQ models, as well as AWQ and GGUF quantized models.
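The point of the Int4/Int8 quantized releases mentioned above is mainly memory: weight footprint scales with bits per parameter. A rough back-of-envelope sketch (weights only; it ignores the KV cache, activations, and per-group quantization metadata such as scales, so real usage is higher):

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Ignores KV cache, activations, and quantization overhead
    (scales/zero-points), so actual usage is somewhat higher.
    """
    return n_params * bits_per_param / 8 / 2**30

# Illustrative numbers for a 72B-parameter checkpoint:
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_memory_gib(72e9, bits):.0f} GiB")
```

This is why a 72B model that needs multiple 80GB GPUs in fp16 becomes feasible on a much smaller setup at int4, at some cost in accuracy.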

### RethinkTinyLM
- https://github.com/YuchuanTian/RethinkTinyLM
- https://arxiv.org/pdf/2312.17276.pdf
- https://arxiv.org/pdf/2402.02791.pdf

The power of large language models (LLMs) has been demonstrated through numerous data and computing resources. However, the application of language models on mobile devices faces huge challenges in computation and memory costs; that is, tiny language models with high performance are urgently required. Limited by the highly complex training process, many details of optimizing language models are seldom studied carefully. In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical studies to analyze the effect of each component. Three perspectives are mainly discussed, i.e., neural architecture, parameter initialization, and optimization strategy. Several design formulas are empirically proved especially effective for tiny language models, including tokenizer compression, architecture tweaking, parameter inheritance, and multiple-round training. Then we train PanGu-π-1B Pro and PanGu-π-1.5B Pro on 1.6T multilingual corpora, following the established formulas. Experimental results demonstrate that the improved optimization and architecture yield a notable average improvement of 8.87 on benchmark evaluation sets for PanGu-π-1B Pro. Besides, PanGu-π-1.5B Pro surpasses a range of SOTA models with larger model sizes, validating its superior performance.

### Chinese-Mixtral
- https://github.com/ymcui/Chinese-Mixtral

This project is developed on top of the Mixtral model released by Mistral.ai, which uses a sparse Mixture-of-Experts (Sparse MoE) architecture. We performed Chinese incremental training with large-scale unlabeled Chinese data to obtain the Chinese Mixtral base model, and further instruction fine-tuning to obtain the Chinese Mixtral-Instruct model. The model natively supports a 32K context (up to 128K in our tests), handles long text effectively, and gains significant performance improvements in mathematical reasoning, code generation, and more. Quantized inference with llama.cpp requires as little as 16GB of RAM (or VRAM).

### RWKV_Pytorch
- https://github.com/yuunnn-w/RWKV_Pytorch

This is an inference framework for the RWKV large language model implemented purely in native PyTorch.
The official native implementation is overly complex and lacks extensibility. Let's join the flexible PyTorch ecosystem and open-source it together!

### Qwen1.5-MoE-A2.7B
- https://qwenlm.github.io/blog/qwen-moe/
- https://github.com/QwenLM/Qwen1.5
- https://huggingface.co/Qwen

Compared to Qwen1.5-7B, which contains 6.5 billion non-embedding parameters, Qwen1.5-MoE-A2.7B contains only 2.0 billion non-embedding parameters, approximately one-third of Qwen1.5-7B's size. Notably, it achieves a 75% decrease in training expenses and accelerates inference speed by a factor of 1.74, offering substantial improvements in resource utilization without compromising performance.

### Symbol-LLM
- https://arxiv.org/abs/2311.09278
- https://huggingface.co/Symbol-LLM/Symbol-LLM-7B-Instruct
- https://xufangzhi.github.io/symbol-llm-page/

Although Large Language Models (LLMs) demonstrate remarkable ability in processing and generating human-like text, they do have limitations when it comes to comprehending and expressing world knowledge that extends beyond the boundaries of natural language (e.g., chemical molecular formulas). Injecting a collection of symbolic data directly into the training of LLMs can be problematic, as it disregards the synergies among different symbolic families and overlooks the need for a balanced mixture of natural and symbolic data. In this work, we tackle these challenges from both a data and framework perspective and introduce the Symbol-LLM series of models. First, we curated a data collection consisting of 34 tasks and incorporating approximately 20 distinct symbolic families, intending to capture the interrelations and foster synergies between symbols. Then, a two-stage tuning framework succeeds in injecting symbolic knowledge without loss of the generality ability.
Extensive experiments on both symbol- and NL-centric tasks demonstrate the balanced and superior performance of the Symbol-LLM series models.

### Qwen1.5-32b
- https://qwenlm.github.io/zh/blog/qwen1.5-32b/
- https://huggingface.co/spaces/Qwen/Qwen1.5-32B-Chat-demo

The open-source community has long sought a model that strikes an ideal balance between performance, efficiency, and memory footprint. Despite the emergence of SOTA models such as Qwen1.5-72B and DBRX, these models continue to face problems such as huge memory consumption, slow inference, and significant fine-tuning costs. Models with around 30B parameters are currently favored in this regard by many users. Following this trend, we introduce the newest members of the Qwen1.5 language model series: Qwen1.5-32B and Qwen1.5-32B-Chat.

Over the past months, we carefully developed the Qwen1.5-32B base model, aiming to match or even surpass the performance benchmarks set by current state-of-the-art 30B models. Meanwhile, we made progress on alignment, especially RLHF, to improve the conversational ability of Qwen1.5-32B-Chat.

### build_MiniLLM_from_scratch
- https://github.com/Tongjilibo/build_MiniLLM_from_scratch

This project aims to build a small-parameter LLM through the four stages of pre-training -> instruction fine-tuning -> reward modeling -> reinforcement learning, producing, at a controllable cost, a chat model capable of simple conversation. The first two stages are complete so far.

### RWKV-6 World
- https://huggingface.co/BlinkDL/rwkv-6-world

RWKV-6 trained on 100+ world languages (70% English, 15% multilang, 15% code).

World = Some_Pile + Some_SlimPajama + Some_StarCoder + Some_OSCAR + All_Wikipedia + All_ChatGPT_Data_I_can_find

### Mengzi3
- https://github.com/Langboat/Mengzi3

The Mengzi3-13B model is based on the Llama architecture, with its corpus carefully selected from web pages, encyclopedias, social media, news media, and high-quality open-source datasets. Through continued training on trillions of tokens of multilingual corpus, the model has outstanding Chinese ability while remaining multilingual.

### Eurus
- https://arxiv.org/abs/2404.02078
- https://github.com/OpenBMB/Eurus

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%.
The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.

### Chinese Tiny LLM
- https://huggingface.co/collections/m-a-p/chinese-tiny-llm-660d0133dff6856f94ce0fc6
- https://arxiv.org/abs/2404.04167

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT.
This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

### HammerLLM
- https://github.com/Academic-Hammer/HammerLLM

Welcome to the pre-training repository for our small-size Large Language Model (sLLM) with 1.4 billion parameters, built on the Llama 2 architecture.

### 360智脑
- https://github.com/Qihoo360/360zhinao
- https://ai.360.com/

We released 360Zhinao-7B v1.0, including the base model and three chat models with context lengths 4K, 32K and 360K.

### Steel-LLM
- https://github.com/zhanshijinwat/Steel-LLM/tree/main

Steel-LLM is a project to pre-train a Chinese large model from scratch. Our goal is to pre-train a roughly 1B-parameter Chinese LLM on 1T+ tokens of data, benchmarking against TinyLlama. The project will be continuously updated over 3+ months. We share the full process of data collection, data processing, pre-training framework selection, model design, and more, and open-source all of the code, so that anyone with 8 to a few dozen GPUs can reproduce our work.

### XVERSE-MoE-A4.2B
- https://hf.co/xverse/XVERSE-MoE-A4.2B
- https://modelscope.cn/models/xverse/XVERSE-MoE-A4.2B
- https://github.com/xverse-ai/XVERSE-MoE-A4.2B

XVERSE-MoE-A4.2B is a multilingual large language model independently developed by Shenzhen-based XVERSE Technology. It uses a Mixture-of-Experts (MoE) architecture with 25.8 billion total parameters and 4.2 billion activated parameters. This release open-sources the base model XVERSE-MoE-A4.2B.

### llama3-Chinese-chat
- https://github.com/CrazyBoyM/llama3-Chinese-chat

The first Chinese version of Llama 3 (首个llama3中文版).

### Llama3-Chinese-Chat（ORPO）
- https://github.com/Shenzhi-Wang/Llama3-Chinese-Chat

We introduce the first Chinese chat model specifically fine-tuned for Chinese through ORPO based on the Meta-Llama-3-8B-Instruct model.

### DeepSeek-V2
- https://github.com/deepseek-ai/DeepSeek-V2/tree/main

We're introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.

### PanGu-π
- https://arxiv.org/abs/2402.02791
- https://github.com/YuchuanTian/RethinkTinyLM

The power of large language models (LLMs) has been demonstrated through numerous data and computing resources. However, the application of language models on mobile devices faces huge challenges in computation and memory costs; that is, tiny language models with high performance are urgently required. Limited by the highly complex training process, many details of optimizing language models are seldom studied carefully. In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical studies to analyze the effect of each component. Three perspectives are mainly discussed, i.e., neural architecture, parameter initialization, and optimization strategy. Several design formulas are empirically proved especially effective for tiny language models, including tokenizer compression, architecture tweaking, parameter inheritance, and multiple-round training. Then we train PanGu-π-1B Pro and PanGu-π-1.5B Pro on 1.6T multilingual corpora, following the established formulas. Experimental results demonstrate that the improved optimization and architecture yield a notable average improvement of 8.87 on benchmark evaluation sets for PanGu-π-1B Pro. Besides, PanGu-π-1.5B Pro surpasses a range of SOTA models with larger model sizes, validating its superior performance.
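Of the design formulas the PanGu-π entry names, tokenizer compression is the most self-contained: rarely-used entries of an oversized vocabulary are dropped so the embedding and output-projection tables shrink, which matters because those tables hold a large share of a tiny model's parameters. A toy illustration of the idea only; the token counts, `keep` budget, and `<unk>` fallback below are made up and this is not the paper's actual procedure:

```python
from collections import Counter

def compress_vocab(token_freqs: Counter, keep: int, fallback: str = "<unk>"):
    """Keep the `keep` most frequent tokens; everything else maps to `fallback`.

    Returns a token -> new-id mapping. A smaller vocabulary means
    proportionally smaller embedding and output-projection matrices.
    """
    kept = [tok for tok, _ in token_freqs.most_common(keep)]
    if fallback not in kept:
        kept[-1] = fallback  # reserve the last slot for the fallback token
    return {tok: i for i, tok in enumerate(kept)}

# Made-up corpus statistics for illustration:
freqs = Counter({"the": 100, "cat": 40, "sat": 30, "on": 25, "mat": 5, "zyx": 1})
vocab = compress_vocab(freqs, keep=4)
ids = [vocab.get(tok, vocab[fallback0]) if (fallback0 := "<unk>") else None
       for tok in ["the", "cat", "zyx"]]
```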

### Eurux-8x22B
- https://github.com/OpenBMB/Eurus
- https://huggingface.co/openbmb/Eurux-8x22b-nca

We release a suite of LLMs and a reward model. Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. Besides, Eurux-8x22B's performance further improves, achieving superb reasoning performance as well as excellent chat and instruction-following capabilities. We also train a reward model that demonstrates especially strong preference modeling performance on reasoning tasks.

### Chinese-LLaMA-Alpaca-3
- https://github.com/ymcui/Chinese-LLaMA-Alpaca-3

This project is developed on top of Meta's newly released next-generation open-source model Llama-3, and is phase three of the Chinese-LLaMA-Alpaca open-source model series (following phases one and two). It open-sources the Chinese Llama-3 base model and the Chinese Llama-3-Instruct instruction-tuned model. Starting from the original Llama-3, these models were incrementally pre-trained on large-scale Chinese data and fine-tuned with curated instruction data, further improving basic Chinese semantic understanding and instruction-following ability, with significant performance gains over the second-generation models.

### OpenBuddy-Llama3-70B-v21.1-8k
- https://github.com/OpenBuddy/OpenBuddy

OpenBuddy is a powerful open multilingual chatbot model aimed at global users, emphasizing conversational AI and seamless multilingual support for English, Chinese, and other languages.

### MAP-NEO
- https://github.com/multimodal-art-projection/MAP-NEO

MAP-NEO is a fully open-sourced Large Language Model that includes the pretraining data, a data processing pipeline (Matrix), pretraining scripts, and alignment code. It is trained from scratch on 4.5T English and Chinese tokens, exhibiting performance comparable to LLaMA2 7B. The MAP-Neo model delivers proprietary-model-like performance in challenging tasks such as reasoning, mathematics, and coding, outperforming its peers of similar size. For research purposes, we aim to achieve full transparency in the LLM training process.
To this end, we have made a comprehensive release of MAP-Neo, including the final and intermediate checkpoints, a self-trained tokenizer, the pre-training corpus, and an efficient, stable, optimized pre-training codebase.

### llms-from-scratch-cn
- https://github.com/datawhalechina/llms-from-scratch-cn

A detailed tutorial on implementing a ChatGPT-like large language model (LLM) from scratch.

### Yi-1.5
- https://github.com/01-ai/Yi-1.5

Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples.

Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension.

Yi-1.5 comes in 3 model sizes: 34B, 9B, and 6B.

### Yuan2.0-M32
- https://github.com/IEIT-Yuan/Yuan2.0-M32

Yuan2.0-M32 is a Mixture-of-Experts (MoE) language model with 32 experts, of which 2 are active. A new router network, Attention Router, is proposed and has been adopted for more efficient expert selection, boosting accuracy by 3.8% over models using a classical router network. Yuan 2.0-M32 is trained from scratch with 2000B tokens, and its training computation is only 9.25% of that required by a dense model of the same parameter scale. Demonstrating competitive capabilities in coding, math, and various specialized fields, Yuan2.0-M32 operates with only 3.7B active parameters out of a total of 40B, and a forward computation of 7.4 GFLOPS per token, which is just 1/19th of Llama3-70B's requirement. Yuan 2.0-M32 has surpassed Llama3-70B on the MATH and ARC-Challenge benchmarks, achieving accuracies of 55.9% and 95.8%, respectively.
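The "classical router network" that Yuan2.0-M32's Attention Router improves on is typically a linear gate followed by softmax and top-k selection. For reference, here is a minimal top-2 router of that classical kind in plain NumPy; the gating matrix is random and this is an illustration of the baseline, not the Attention Router itself:

```python
import numpy as np

def top2_route(x: np.ndarray, w_gate: np.ndarray):
    """Classical MoE routing: linear gate -> softmax -> keep top-2 experts.

    x:      (tokens, d_model) token activations
    w_gate: (d_model, n_experts) gating weights
    Returns (indices, weights): for each token, its 2 chosen experts and
    their renormalized mixture weights (each row sums to 1).
    """
    logits = x @ w_gate                               # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)             # softmax over experts
    idx = np.argsort(-probs, axis=-1)[:, :2]          # top-2 experts per token
    top = np.take_along_axis(probs, idx, axis=-1)
    return idx, top / top.sum(-1, keepdims=True)      # renormalize the pair

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))    # 5 tokens, toy d_model=16
w = rng.normal(size=(16, 32))   # 32 experts, 2 active, as in Yuan2.0-M32
idx, wts = top2_route(x, w)
```

The "2 of 32 active" scheme is where the compute savings come from: each token's forward pass touches only the two selected expert FFNs, so active parameters (3.7B) stay far below total parameters (40B).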

### Skywork-MoE
- https://huggingface.co/Skywork/Skywork-MoE-base
- https://huggingface.co/Skywork/Skywork-MoE-Base-FP8
- https://github.com/SkyworkAI/Skywork-MoE

Skywork-MoE is a high-performance mixture-of-experts (MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model.

### Index-1.9B
- https://github.com/bilibili/Index-1.9B

The Index-1.9B series is the lightweight branch of the Index model family and includes the following models:

Index-1.9B base: the base model, with 1.9 billion non-embedding parameters, pre-trained on a 2.8T corpus of mainly Chinese and English; it leads models of the same size on multiple evaluation benchmarks.

Index-1.9B pure: a control group for the base model, with the same parameters and training strategy, except that all instruction-related data was strictly filtered from its corpus, to verify the impact of instructions on benchmarks.

Index-1.9B chat: a dialogue model aligned from Index-1.9B base via SFT and DPO. Because we introduced a large amount of internet community corpus during pre-training, the model's chat is notably more entertaining, and it has strong multilingual translation ability (especially for East Asian languages) among models of its size.

Index-1.9B character: on top of SFT and DPO, introduces RAG to enable few-shot role-playing customization.

### Qwen2
- https://github.com/QwenLM/Qwen2

The Qwen2 series includes base and instruction-tuned models in 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.

### Gemma-2-9B-Chinese-Chat
- https://huggingface.co/shenzhi-wang/Gemma-2-9B-Chinese-Chat

We now introduce Gemma-2-9B-Chinese-Chat, the first instruction-tuned language model built upon google/gemma-2-9b-it for Chinese & English users, with various abilities such as roleplaying and tool-using.

### Gemma-2-27B-Chinese-Chat
- https://huggingface.co/shenzhi-wang/Gemma-2-27B-Chinese-Chat

### RWKV-6-World 14B
- https://huggingface.co/BlinkDL/rwkv-6-world

RWKV-6-World 14B is currently the most powerful dense RNN Large Language Model, with outstanding performance in multilingual tasks.

### Tele-FLM-1T
- https://huggingface.co/CofeAI/Tele-FLM-1T

Tele-FLM-1T (aka FLM-2-1T) is a 1T open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgement capabilities.
Built upon the decoder-only transformer architecture, it has been trained on approximately 2T tokens. The Tele-FLM series demonstrates superior performance at its scale and sometimes surpasses larger models. In addition to sharing the model weights, we provide the core designs, engineering practices, and training details, anticipating their benefits for both academic and industrial communities.\n\n### Llama3.1-Chinese-Chat\n- https://huggingface.co/collections/shenzhi-wang/llama31-chinese-chat-66a33d570b32f2cdb42512ac\n\nWe now introduce shenzhi-wang/Llama3.1-8B-Chinese-Chat! Compared to the original Meta-Llama-3.1-8B-Instruct model, our llama3.1-8B-Chinese-Chat model significantly reduces the issues of \"Chinese questions with English answers\" and the mixing of Chinese and English in responses. The training dataset contains >100K preference pairs, and it exhibits significant enhancements, especially in roleplay, function calling, and math capabilities!\n\n### INF-34B\n- https://github.com/infly-ai/INF-LLM/\n\nINF-34B has 34 billion parameters with a context window length of 32K, and is trained on about 3.5T well-processed tokens from an English and Chinese bilingual corpus. Compared with open-source models of comparable size, INF-34B not only provides competitive performance in the OpenCompass evaluation, but also has impressive potential in both the finance and healthcare domains. Besides, the quantized INF-34B runs on graphics cards with 24GB of VRAM with negligible accuracy loss, which facilitates commercial applications, especially in low-resource scenarios.\n\n### InternLM2.5\n- https://internlm.intern-ai.org.cn\n- https://hf.co/collections/internlm/internlm25-66853f32717072d17581bc13\n- https://github.com/InternLM/InternLM\n\nWe release InternLM2.5-1.8B, InternLM2.5-1.8B-Chat, InternLM2.5-20B and InternLM2.5-20B-Chat.
See the model zoo below for downloads, or the model cards for more details.\n\n### LongWriter\n- https://github.com/THUDM/LongWriter\n- https://huggingface.co/THUDM/LongWriter-glm4-9b\n- https://arxiv.org/abs/2408.07055\n\nWe open-source two models: LongWriter-glm4-9b and LongWriter-llama3.1-8b, trained based on GLM-4-9B and Meta-Llama-3.1-8B, respectively. These two models correspond to the \"LongWriter-9B-DPO\" and \"LongWriter-8B\" models in our paper. \n\nCurrent long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model's effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, the output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT examples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models.
In general, our work demonstrates that existing long-context LLMs already possess the potential for a larger output window -- all you need is data with extended outputs during model alignment to unlock this capability.\n\n\n### Hunyuan-Large\n- https://llm.hunyuan.tencent.com/ \n- https://github.com/Tencent/Hunyuan-Large \n\nWith the rapid development of artificial intelligence technology, large language models (LLMs) have made significant progress in fields such as natural language processing, computer vision, and scientific tasks. However, as the scale of these models increases, optimizing resource consumption while maintaining high performance has become a key challenge. To address this challenge, we have explored Mixture of Experts (MoE) models. The newly unveiled Hunyuan-Large (Hunyuan-MoE-A52B) model is currently the largest open-source Transformer-based MoE model in the industry, featuring a total of 389 billion parameters and 52 billion active parameters.\n\n### Qwen2.5\n- https://github.com/QwenLM/Qwen2.5\n\nIn the past three months since Qwen2's release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models.
Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5.\n\n### TeleChat2\n- https://github.com/Tele-AI/TeleChat2\n\nTeleChat2 is a large language model developed and trained by the Institute of Artificial Intelligence of China Telecom; the whole series is trained entirely on domestic Chinese compute.\nThe newly open-sourced TeleChat2-3B, TeleChat2-7B, and TeleChat2-35B models support tool calling. We specifically optimized Function Call performance, and on the relevant leaderboards the models compare favorably with others of the same size.\nTeleChat2-115B is trained on 10 trillion tokens of high-quality Chinese and English text, and multi-format, multi-platform weight files for the TeleChat2-115B chat model are open-sourced alongside it.\n\n### Marco-o1\n- https://arxiv.org/abs/2411.14405\n- https://github.com/AIDC-AI/Marco-o1\n- https://huggingface.co/AIDC-AI/Marco-o1\n\nMarco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding—which are well-suited for reinforcement learning (RL)—but also places greater emphasis on open-ended resolutions. We aim to address the question: \"Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?\"\n\n### Skywork-o1\n- https://tinyurl.com/skywork-o1\n\nSkywork o1 open model collections\n\n### YuLan-Mini\n- https://huggingface.co/yulan-team/YuLan-Mini\n\nYuLan-Mini is a lightweight language model with 2.4 billion parameters. It achieves performance comparable to industry-leading models trained on significantly more data, despite being pre-trained on only 1.08T tokens. \n\n### DeepSeek-R1\n- https://github.com/deepseek-ai/DeepSeek-R1\n\nWe introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable performance on reasoning.\n\n### simpleRL-reason\n- https://github.com/hkust-nlp/simpleRL-reason\n\nThis repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only a rule-based reward is used; the recipe is almost the same as the one used in DeepSeek-R1, except that the code currently uses PPO rather than GRPO.
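A rule-based reward of this kind is simple enough to sketch: no learned reward model, just string rules on the final answer. The toy below is illustrative only (the repo defines its own answer formats and penalties):

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward for math RL: require the model to put its
    final answer inside \\boxed{...}, then compare it with the gold answer.
    Illustrative only; not simpleRL-reason's actual reward function.
    """
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m is None:
        return -1.0   # format penalty: no parseable final answer
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

# correct / wrong / unformatted completions score 1.0 / 0.0 / -1.0
```

Because the reward is fully verifiable, the PPO (or GRPO) loop needs nothing beyond the policy model and a batch of problems with known answers.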
We have used this code to train small models (7B) on limited data (8K examples), achieving surprisingly strong results -- for example, starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification; the resulting model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-Math-7B-Instruct and being comparable to previous baselines that use >50x more data and more complicated components. You may check our Notion blog or the Introduction below for more details.\n\n### TinyZero\n- https://github.com/Jiayi-Pan/TinyZero\n\nTinyZero is a reproduction of DeepSeek R1 Zero on countdown and multiplication tasks. We built upon veRL.\n\nThrough RL, the 3B base LM develops self-verification and search abilities all on its own.\n\n### STILL-3-1.5B-Preview\n- https://github.com/RUCAIBox/Slow_Thinking_with_LLMs\n- https://huggingface.co/RUC-AIBOX/STILL-3-1.5B-preview\n- https://huggingface.co/datasets/RUC-AIBOX/STILL-3-Preview-RL-Data\n\nSTILL-3-1.5B-preview: We release STILL-3-1.5B-preview, a 1.5B slow-thinking reasoning model that achieves 39.33% accuracy on the AIME benchmark! We apply reinforcement learning with 30k queries to a 1.5B model (DeepSeek-R1-Distill-Qwen-1.5B) and observe continuous performance improvement as the number of training steps increases. To make our work easier to reproduce and to advance research progress, we open-source our code, model, and data.\n\n### MiniMax-01\n- https://github.com/MiniMax-AI/MiniMax-01\n\nWe are delighted to introduce two remarkable models, MiniMax-Text-01 and MiniMax-VL-01. MiniMax-Text-01 is a powerful language model boasting 456 billion total parameters, with 45.9 billion activated per token. To unlock its long-context capabilities, it adopts a hybrid architecture integrating Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE).
Leveraging advanced parallel strategies like Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), its training context length extends to 1 million tokens, and it can handle up to 4 million tokens during inference. Consequently, MiniMax-Text-01 showcases top-tier performance on various academic benchmarks. Building on MiniMax-Text-01's prowess, we developed MiniMax-VL-01 for enhanced visual capabilities. It uses the “ViT-MLP-LLM” framework common in multimodal LLMs. It is initialized and trained using three key components: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM. This model features a dynamic resolution mechanism. Input images are resized according to a pre-set grid, with resolutions ranging from 336×336 to 2016×2016, while maintaining a 336×336 thumbnail. The resized images are split into non-overlapping patches of the same size. These patches and the thumbnail are encoded separately and then combined to form a full image representation. As a result, MiniMax-VL-01 has achieved top-level performance on multimodal leaderboards, demonstrating its edge in complex multimodal tasks.\n\n### SmallThinker-3B-preview\n- https://huggingface.co/PowerInfer/SmallThinker-3B-Preview\n\nSmallThinker is designed for the following use cases:\n\nEdge Deployment: Its small size makes it ideal for deployment on resource-constrained devices.\n\nDraft Model for QwQ-32B-Preview: SmallThinker can serve as a fast and efficient draft model for the larger QwQ-32B-Preview model. In my tests with llama.cpp, it gives roughly a 75% speedup (from 40 tokens/s to 70 tokens/s).\n\n### DeepSeek-V3\n- https://github.com/deepseek-ai/DeepSeek-V3\n\nWe present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token.
To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.\n\n### RWKV-7\n- https://github.com/BlinkDL/RWKV-LM\n- https://github.com/JL-er/RWKV-PEFT\n\nRWKV-7 is a meta-in-context learner, test-time-training its state on the context via in-context gradient descent at every token.\n\n### FOX-1\n- https://arxiv.org/abs/2411.05281\n\nWe present Fox-1, a series of small language models (SLMs) consisting of Fox-1-1.6B and Fox-1-1.6B-Instruct-v0.1. These models are pre-trained on 3 trillion tokens of web-scraped document data and fine-tuned with 5 billion tokens of instruction-following and multi-turn conversation data. Aiming to improve the pre-training efficiency, Fox-1-1.6B model introduces a novel 3-stage data curriculum across all the training data with 2K-8K sequence length. In architecture design, Fox-1 features a deeper layer structure, an expanded vocabulary, and utilizes Grouped Query Attention (GQA), offering a performant and efficient architecture compared to other SLMs. 
Fox-1 achieves better or on-par performance in various benchmarks compared to StableLM-2-1.6B, Gemma-2B, Qwen1.5-1.8B, and OpenELM-1.1B, with competitive inference speed and throughput. The model weights have been released under the Apache 2.0 license, where we aim to promote the democratization of LLMs and make them fully accessible to the whole open-source community.\n\n### mini_qwen\n- github.com/qiufengqijun/mini_qwen\n\nA 1B-parameter large language model (LLM) project that includes pre-training, fine-tuning, and direct preference optimization, supporting both Chinese and English.\n\n### Qwen 0.5b on GRPO\n- https://colab.research.google.com/drive/1bfhs1FMLW3FGa8ydvkOZyBNxLYOu0Hev?usp=sharing\n\nThis notebook is an alternate version of the GRPO demo by Will Brown, training llama-1b on the gsm8k math dataset.\n\n### Qwen2.5-Max\n- https://qwen2.org/qwen2-5-max/\n- https://qwenlm.github.io/blog/qwen2.5-max/\n\nQwen2.5-Max is a large-scale MoE model, pretrained on more than 20 trillion tokens and further refined through curated Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). \n\n### minimind\n- https://github.com/jingyaogong/minimind\n\nThis open-source project aims to train MiniMind, a tiny 26.88M-parameter language model, completely from scratch in as little as 3 hours.\n\n### Nano\n- https://github.com/bd4sur/Nano\n\nNano is an autoregressive language model with a Transformer architecture, meant for personal amusement, study, hacking, and burning in your training rig.\n\n### Transformer Architecture (LLMs: Zero-to-Hero)\n- https://medium.com/@waylandzhang/transformer-architecture-llms-zero-to-hero-98b1ee51a838\n\nThis is the 3rd article in my Zero-to-Hero series. In this article we will walk through and explain each step of a Transformer-based Large Language Model.\n\n### Build a Large Language Model (From Scratch)\n- https://github.com/rasbt/LLMs-from-scratch/tree/main\n\nIn Build a Large Language Model (From Scratch), you'll learn and understand how large language models (LLMs) work from the inside out by coding them from the ground up, step by step.
In this book, I'll guide you through creating your own LLM, explaining each stage with clear text, diagrams, and examples.\n\n### SynthID Text\n- https://deepmind.google/technologies/synthid/\n- https://www.nature.com/articles/s41586-024-08025-4\n- https://ai.google.dev/responsible/docs/safeguards/synthid\n\nLarge language models (LLMs) have enabled the generation of high-quality synthetic text, often indistinguishable from human-written content, at a scale that can markedly affect the nature of the information ecosystem. Watermarking can help identify synthetic text and limit accidental or deliberate misuse, but has not been adopted in production systems owing to stringent quality, detectability and computational efficiency requirements. Here we describe SynthID-Text, a production-ready text watermarking scheme that preserves text quality and enables high detection accuracy, with minimal latency overhead. SynthID-Text does not affect LLM training and modifies only the sampling procedure; watermark detection is computationally efficient, without using the underlying LLM. To enable watermarking at scale, we develop an algorithm integrating watermarking with speculative sampling, an efficiency technique frequently used in production systems. Evaluations across multiple LLMs empirically show that SynthID-Text provides improved detectability over comparable methods, and standard benchmarks and human side-by-side ratings indicate no change in LLM capabilities. To demonstrate the feasibility of watermarking in large-scale-production systems, we conducted a live experiment that assessed feedback from nearly 20 million Gemini responses, again confirming the preservation of text quality.
We hope that the availability of SynthID-Text will facilitate further development of watermarking and responsible use of LLM systems.\n\n### Small Language Models: Survey, Measurements, and Insights\n- https://arxiv.org/abs/2409.15790\n\nSmall language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 59 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.\n\n### Multi-IF (Multi-turn and multilingual instruction following)\n- https://huggingface.co/datasets/facebook/Multi-IF\n- https://arxiv.org/pdf/2410.15553v2\n- https://github.com/facebookresearch/Multi-IF\n\nLarge Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including instruction following, which is crucial for aligning model outputs with user expectations. However, evaluating LLMs' ability to follow instructions remains challenging due to the complexity and subjectivity of human language. 
Current benchmarks primarily focus on single-turn, monolingual instructions, which do not adequately reflect the complexities of real-world applications that require handling multi-turn and multilingual interactions. To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon IFEval by incorporating multi-turn sequences and translating the English prompts into 7 other languages, resulting in a dataset of 4,501 multilingual conversations, each with three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities. We release Multi-IF prompts and the evaluation code base to encourage further research in this critical area.\n\n### LLM from scratch with Pytorch\n- https://medium.com/@msouza.os/llm-from-scratch-with-pytorch-9f21808c6319\n\nGenerative models are currently one of the most intriguing fields in AI, more specifically, those text-to-text models that generate text based on an initial user prompt. One famous example is ChatGPT by OpenAI, an assistant model capable of responding to user questions about multiple topics.\n\nIn this article, we cover what an LLM is, how it works, and how to train one from scratch.
I'll try to be clear on every topic, and I hope most of you can understand and learn something from it 😁.\n\nIf you are going to run all of the code examples, make sure that you have imported the libraries first.\n\n### A Survey on Data Synthesis and Augmentation for Large Language Models\n- https://arxiv.org/abs/2410.12896\n\nThe success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.\n\n### A Survey of Small Language Models\n- https://arxiv.org/abs/2410.20011\n\nSmall Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources, making them ideal for various settings including on-device, mobile, and edge devices, among many others.
In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the benchmark datasets that are useful for benchmarking SLMs along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.\n\n### LLMForEverybody\n- https://github.com/luhengshiwo/LLMForEverybody\n\nLarge-model knowledge that everyone can understand: a must-read before LLM job interviews, so you can hold your own with the interviewer.\n\n### Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner\n- https://arxiv.org/abs/2406.11978\n- https://github.com/likenneth/dialogue_action_token\n- https://thegradient.pub/dialog/\n\nWe present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4’s performance.
We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.\n\n### CCI3.0-HQ\n- https://hf.co/datasets/BAAI/CCI3-HQ\n- http://open.flopsera.com/flopsera-open/data-details/BAAI-CCI3-HQ\n- https://data.baai.ac.cn/details/BAAI-CCI3-HQ\n- https://arxiv.org/abs/2410.18505\n\nIn recent years, large language foundation models (LLMs) have made remarkable progress, and scaling up training data and improving its quality are key drivers of model performance. Quality filtering of open English corpora has already moved from basic rule-based methods to model-driven ones. Open Chinese corpora, by contrast, remain relatively scarce, and there has been little research on quality classification for Chinese web data, so data quality has not yet reached an ideal level, which in turn limits models' Chinese performance.\n\nTo address these problems and further narrow the gap in the scale and quality of Chinese pre-training corpora, on September 20, 2024, BAAI released and open-sourced the Chinese pre-training dataset CCI3.0 and its high-quality subset CCI3.0-HQ. On October 25, 2024, BAAI published the CCI3.0-HQ technical report, a thorough account of how the dataset was built.\n\n### rlhfbook\n- https://rlhfbook.com/\n\nReinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF – both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. We detail the popular algorithms and future frontiers of RLHF.\n\n### DeepSeek R1 May Have Found a Way to Surpass Humans\n- https://mazzzystar.com/2025/01/30/chatgpt-to-deepseek-r1-zh/?continueFlag=ed95b1adbe6ed21a4c466209fa20489d\n\nI originally meant to write a popular-science piece about DeepSeek R1, but found that many people see it merely as an OpenAI copycat and overlook the \"astonishing leap\" its paper reveals. So I decided to write a new piece covering the breakthroughs in underlying principles from AlphaGo to ChatGPT to the recent DeepSeek R1, and why they matter for so-called AGI/ASI. As an ordinary AI algorithm engineer I may not be able to go very deep; corrections are welcome.\n\n### train-llm-from-scratch\n- github.com/FareedKhan-dev/train-llm-from-scratch\n\nI implemented a transformer model from scratch using PyTorch, based on the paper Attention is All You Need.
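The heart of any such from-scratch implementation is the scaled dot-product attention from that paper; a minimal NumPy sketch (not the repo's code) looks like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in
    "Attention Is All You Need". Shapes: Q, K are (seq, d_k), V is (seq, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:                      # causal mask: hide future positions
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

seq, d = 4, 8
rng = np.random.default_rng(0)
causal = np.tril(np.ones((seq, seq), dtype=bool))    # lower-triangular mask
out, w = scaled_dot_product_attention(rng.normal(size=(seq, d)),
                                      rng.normal(size=(seq, d)),
                                      rng.normal(size=(seq, d)), mask=causal)
```

A full decoder block then wraps this in multiple heads, residual connections, layer norm, and a feed-forward network, which is what the repo's scripts train end to end.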
You can use my scripts to train your own billion- or million-parameter LLM using a single GPU.\n\n### The Big Book of LLMs\n- https://book.theaiedge.io/?continueFlag=3b71e815bfd484170b234a52a15adc73\n\n### Primers • DeepSeek-R1\n- https://aman.ai/primers/ai/deepseek-R1/\n\nThis primer explores its architecture, multi-stage training pipeline, GRPO mechanics, and emergent reasoning behaviors, alongside how distillation propagates reasoning capabilities to smaller models.\n\n### A vision researcher’s guide to some RL stuff: PPO & GRPO\n- https://yugeten.github.io/posts/2025/01/ppogrpo/\n\nIt has been a while since I last wrote a blog post. Life has been hectic since I started work, and the machine learning world is also not what it was since I graduated in early 2023. Your average parents having LLM apps installed on their phones is already yesterday’s news – I took two weeks off work to spend Lunar New Year in China, which only serves to give me plenty of time to scroll on twitter and witness DeepSeek’s (quite well-deserved) hype peak on Lunar New Year’s eve while getting completely overwhelmed.\n\n### DeepSeek R1 and R1-Zero Explained\n- https://thelmbook.com/articles/#!./DeepSeek-R1.md?continueFlag=3b71e815bfd484170b234a52a15adc73\n\n### DeepSeek R1 Reading List\n- https://ninehills.tech/articles/121.html\n\nWith DeepSeek R1 released, if you want to reproduce R1 or practice RFT (Reinforcement Fine-Tuning) in a particular domain, have a look at the reading list I put together; it will be kept up to date, and I will also post the results of my own experiments.\n\n### DeepSeek R1 Explained to your grandma\n- https://www.youtube.com/watch?v=kv8frWeKoeo\n\nDescribing the key insights from the DeepSeek R1 paper in a way even your grandma could understand.
I focus on the key concepts of chain of thought reasoning, reinforcement learning, and model distillation.\n\n### Deepseek R1 for Everyone\n- https://trite-song-d6a.notion.site/Deepseek-R1-for-Everyone-1860af77bef3806c9db5e5c2a256577d\n\nWe’re gonna discuss how the DeepSeek R1 model actually works in detail, but with very little math!\n\n### llm-course\n- https://github.com/mlabonne/llm-course\n\n### O1-Journey\n- https://github.com/GAIR-NLP/O1-Journey\n\nThe core development team of this project mainly consists of third- and fourth-year undergraduate students, as well as first-year PhD students from the GAIR research group at Shanghai Jiao Tong University. The project has been guided by leading research scientists in the field of large language models, including those from NYU and MBZUAI.\n\n### a reinforcement learning guide\n- naklecha.notion.site/a-reinforcement-learning-guide\n\nHi! I’m @naklecha & I love learning through examples and jumping right into things. It works well for me, it’s fun, and imo it’s the best way to learn anything. So, that’s what I’m going to do. :) Let’s start by trying to solve chess.\n\n### llm-universe\n- datawhalechina.github.io/llm-universe/\n\n### smol-course\n- github.com/huggingface/smol-course\n\nThis is a practical course on aligning language models for your specific use case. It's a handy way to get started with aligning language models, because everything runs on most local machines. There are minimal GPU requirements and no paid services.
The course is based on the SmolLM2 series of models, but you can transfer the skills you learn here to larger models or other small language models.\n\n### self-llm\n- github.com/datawhalechina/self-llm\n\nThis project is a Linux-based tutorial on open-source large models, tailored to beginners in China. For a wide range of open models it provides end-to-end guidance covering environment configuration, local deployment, efficient fine-tuning, and related skills, simplifying the deployment, use, and application of open-source large models so that more students and researchers can make good use of them, and helping open, free large models become part of ordinary learners' lives sooner.\n\n### Agents (Chip Huyen)\n- https://huyenchip.com//2025/01/07/agents.html\n\nThis post is adapted from the Agents section of AI Engineering (2025) with minor edits to make it a standalone post.\n\n### Building effective agents\n- https://www.anthropic.com/research/building-effective-agents\n\nIn this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.\n\n### LLMInterviewQuestions\n- https://github.com/llmgenai/LLMInterviewQuestions\n\nThis repository contains more than 100 interview questions for Large Language Models (LLM) used by top companies like Google, NVIDIA, Meta, Microsoft, and Fortune 500 companies.
Explore questions curated with insights from real-world scenarios, organized into 15 categories to facilitate learning and preparation.\n\n### Transformers Laid Out\n- https://goyalpramod.github.io/blogs/Transformers_laid_out/\n\nHere I aim to:\n\nGive an intuition of how transformers work\n\nExplain what each section of the paper means and how you can understand and implement it\n\nCode it in PyTorch from a beginner's perspective\n\nAll in one place.\n\n### group relative policy optimization (GRPO)\n- https://superb-makemake-3a4.notion.site/group-relative-policy-optimization-GRPO-18c41736f0fd806eb39dc35031758885\n\nHere, I will explain and implement GRPO in an intuitive way.\n\n### HQQ\n- https://mobiusml.github.io/hqq_blog/\n- https://github.com/mobiusml/hqq\n- https://mobiusml.github.io/1bit_blog\n\nHalf-Quadratic Quantization\n\nBasic quantization often results in a loss of model accuracy. This is because the weights in these models can have a wide range of values that can be significantly altered after the quantization process. Weights that deviate from the distribution, notably known as outliers, pose a particular challenge. GPTQ and Activation-aware Weight Quantization (AWQ) are algorithms that try to overcome this issue by relying on calibration data to minimize the error on layer outputs.\n\n### Uni-RLHF\n- https://uni-rlhf.github.io/\n- https://github.com/pickxiguapi/Uni-RLHF-Platform\n- https://github.com/pickxiguapi/Clean-Offline-RLHF\n- https://arxiv.org/abs/2402.02423\n\nReinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments.
However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow from real human feedback, fostering progress in the development of practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. Through extensive experiments, the results in the collected datasets demonstrate competitive performance compared to those from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We wish to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback.\n\n### LLMLingua-2\n- https://arxiv.org/abs/2403.12968\n\nThis paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. 
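This entropy-based baseline can be sketched in a few lines: score each token by its self-information under a causal LM and keep only the most informative fraction. In the toy below, hand-written per-token probabilities stand in for real LM outputs; this is illustrative only, not LLMLingua's implementation:

```python
import math

def compress_by_entropy(tokens, probs, keep_ratio=0.5):
    """Entropy-style prompt compression: score each token by its surprisal
    -log p(token | prefix) and keep the most informative keep_ratio fraction,
    preserving the original token order. `probs` stands in for a causal LM.
    """
    info = [-math.log(p) for p in probs]     # surprisal (self-information) per token
    k = max(1, int(len(tokens) * keep_ratio))
    # indices of the k most surprising tokens, restored to original order
    keep = sorted(sorted(range(len(tokens)), key=lambda i: info[i])[-k:])
    return [tokens[i] for i in keep]

tokens = ["please", "kindly", "compute", "7", "times", "6"]
probs  = [0.9,      0.95,     0.2,       0.05, 0.6,    0.04]  # toy LM probabilities
compressed = compress_by_entropy(tokens, probs, keep_ratio=0.5)
# predictable filler drops out; low-probability (high-information) tokens survive
```

LLMLingua-2's criticisms of this metric (unidirectional context, misalignment with the compression objective) are what motivate its token-classification reformulation.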
The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective.

To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meanwhile introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.

We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.

## 2 Training / Inference
### RAFT ("Raft"), an Efficient Alignment Algorithm
- https://github.com/OptimalScale/LMFlow
- https://arxiv.org/abs/2304.06767
- https://optimalscale.github.io/LMFlow/examples/raft.html

An extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community.

### Alpaca-LoRA
- https://github.com/tloen/alpaca-lora

Low-Rank LLaMA Instruct-Tuning

This repository contains code for reproducing the Stanford Alpaca results using low-rank adaptation (LoRA).
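The LoRA idea this repository builds on can be sketched framework-free. A minimal numeric illustration with hypothetical toy matrices, not the repository's code: the frozen weight W0 is augmented with a trainable low-rank update BA, and B starts at zero so training begins exactly at the base model's behavior.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha=16, r=2):
    """y = W0 x + (alpha / r) * B (A x)  -- LoRA's adapted linear layer."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))  # rank-r detour: (d_out x r) after (r x d_in)
    s = alpha / r
    return [b + s * u for b, u in zip(base, update)]

W0 = [[1.0, 0.0], [0.0, 1.0]]        # frozen pretrained weight (2x2)
A = [[0.1, 0.2], [0.3, 0.4]]         # trainable, r x d_in
B = [[0.0, 0.0], [0.0, 0.0]]         # trainable, d_out x r, zero-initialized
x = [1.0, 2.0]

# With B = 0 the adapter is inert: output equals the base model's output.
assert lora_forward(W0, A, B, x) == matvec(W0, x)
```

Only A and B receive gradients (2·r·d parameters per layer instead of d²), which is why LoRA fine-tuning fits in small GPU memory budgets.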
We provide an Instruct model of similar quality to text-davinci-003 that can run on a Raspberry Pi (for research), and the code can be easily extended to the 13b, 30b, and 65b models.

In addition to the training code, which runs within five hours on a single RTX 4090, we publish a script for downloading and inference on the foundation model and LoRA, as well as the resulting LoRA weights themselves. To fine-tune cheaply and efficiently, we use Hugging Face's PEFT as well as Tim Dettmers' bitsandbytes.

Without hyperparameter tuning or validation-based checkpointing, the LoRA model produces outputs comparable to the Stanford Alpaca model. Further tuning might be able to achieve better performance; I invite interested users to give it a try and report their results.

### AlpacaFarm
- https://mp.weixin.qq.com/s/CIF2F5Vx_RSN1-LwU_ppOQ
- https://tatsu-lab.github.io/alpaca_farm_paper.pdf
- https://github.com/tatsu-lab/alpaca_farm

Mainstream large language model training is inseparable from RLHF (reinforcement learning from human feedback), whose main idea is to use feedback examples provided by human experts to guide the model's learning process. It can accelerate reinforcement learning and improve large-model performance, but at present the RLHF process is both complex and expensive.

Academia currently has two main answers to this problem: 1) avoid RLHF, as in Meta's recent LIMA-65B work, which showed that a small amount of carefully curated labeled data can achieve strong results without RLHF; 2) simplify RLHF, the subject of this entry: Stanford released a simulator named AlpacaFarm, designed to lower the cost of training language models. It is 45x cheaper than human annotation, shows high agreement with human feedback, and opens a new path for RLHF research.

### ColossalAI
- https://github.com/hpcaitech/ColossalAI

Colossal-AI: Making large AI models cheaper, faster and more accessible

Colossal-AI provides a collection of parallel components for you. We aim to support you to write your distributed deep learning models just like how you write your model on your laptop.
We provide user-friendly tools to kickstart distributed training and inference in a few lines.

### ChatLLaMA
- https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/chatllama

ChatLLaMA 🦙 has been designed to help developers with various use cases, all related to RLHF training and optimized inference.

ChatLLaMA is a library that allows you to create hyper-personalized ChatGPT-like assistants using your own data and the least amount of compute possible. Instead of depending on one large assistant that "rules us all", we envision a future where each of us can create our own personalized version of ChatGPT-like assistants. Imagine a future where many ChatLLaMAs at the "edge" will support a variety of human needs. But creating a personalized assistant at the "edge" requires huge optimization efforts on many fronts: dataset creation, efficient training with RLHF, and inference optimization.

### Chinese-Guanaco
- https://github.com/jianzhnie/Chinese-Guanaco

This is the repo for the Chinese-Guanaco project, which aims to build and share instruction-following Chinese LLaMA/Pythia/GLM tuning methods that can be trained on a single Nvidia RTX-2080Ti, and a multi-round chatbot that can be trained on a single Nvidia RTX-3090 with a context length of 2048.

Chinese-Guanaco uses bitsandbytes for quantization and is integrated with Hugging Face's PEFT and transformers libraries.

### DPO (Direct Preference Optimization)
- https://arxiv.org/abs/2305.18290
- https://zhuanlan.zhihu.com/p/641045324
- https://huggingface.co/lyogavin/Anima33B-DPO-Belle-1k-merged
- https://github.com/lyogavin/Anima/tree/main/rlhf

The core idea of DPO: the difficulty of PPO training stems from having to express preferences through a reward model and then run reinforcement learning against it.

To drop the dependence on a reward model, the authors apply a series of mathematical transformations to derive a probability expression for the annotated preferences directly in terms of the policy language model itself, so that alignment reduces to maximum-likelihood estimation of a language model. The complex and cumbersome reward model and reinforcement learning stage are no longer needed.

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of
their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. 
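The single training stage described above amounts to a logistic loss on log-probability ratios. A minimal sketch in plain Python, not the authors' reference implementation; each argument is assumed to be a log-probability summed over the tokens of the chosen or rejected response, under the policy or the frozen reference model:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO objective for one preference pair.

    L = -log sigmoid( beta * [ (log pi(y_w|x) - log pi_ref(y_w|x))
                             - (log pi(y_l|x) - log pi_ref(y_l|x)) ] )
    """
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# If the policy already prefers the chosen response more than the
# reference does, the margin is positive and the loss is below log(2).
assert dpo_loss(-10.0, -14.0, -12.0, -13.0) < math.log(2.0)
# A policy identical to the reference sits exactly at log(2).
assert abs(dpo_loss(-12.0, -13.0, -12.0, -13.0) - math.log(2.0)) < 1e-9
```

Minimizing this pushes the chosen response's probability up and the rejected one's down relative to the reference, with beta controlling how far the policy may drift; no reward model and no sampling loop are involved.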
Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

### DialogADV: Evaluate What You Can't Evaluate: Unassessable Generated Responses Quality
- https://github.com/misonsky/DialogADV
- https://mp.weixin.qq.com/s/Ga0a6a1L6CmCXgk6WDz0Xg
- https://arxiv.org/abs/2305.14658

We construct two challenging meta-evaluation dialogue datasets. Our experimental analysis shows that using large language models as evaluators of dialogue generation quality still has many problems: 1) LLMs cannot recognize factually inconsistent, fabricated responses, and still rate unreasonable responses highly; 2) LLMs' own knowledge is limited, and for knowledge-dependent examples they cannot rely on that knowledge to make sound judgments; 3) LLMs' ability to exploit external knowledge needs improvement: even when external knowledge is provided, they still rate unreasonable responses highly.

### DeepSpeed-Chat
- https://mp.weixin.qq.com/s/t3HA4Hu61LLDC3h2Njmo_Q
- https://github.com/microsoft/DeepSpeed

Microsoft has open-sourced DeepSpeed-Chat, which helps users easily train ChatGPT-like large language models.

DeepSpeed-Chat is built on Microsoft's DeepSpeed deep-learning optimization library, provides training and reinforced-inference capabilities, and uses RLHF (reinforcement learning from human feedback); it can raise training speed by more than 15x while greatly reducing cost.

### FlexGen
- https://github.com/FMInference/FlexGen

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.

Limitation. As an offloading-based system running on weak GPUs, FlexGen also has its limitations. FlexGen can be significantly slower than the case when you have enough powerful GPUs to hold the whole model, especially for small-batch cases. FlexGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches), on single GPUs.

### FlagAI and FlagData

- https://github.com/FlagAI-Open/FlagAI

FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models.
Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality.\n\n- https://github.com/FlagOpen/FlagData\n\nFlagData, a data processing toolkit that is easy to use and expand. FlagData integrates the tools and algorithms of multi-step data processing, including cleaning, condensation, annotation and analysis, providing powerful data processing support for model training and deployment in multiple fields, including natural language processing and computer vision. \n\n### Guanaco & QloRA\n- https://mp.weixin.qq.com/s/SGJQHsEJTNB6hiVqdc87sg\n- https://arxiv.org/abs/2305.14314\n- https://github.com/artidoro/qlora\n- https://huggingface.co/blog/hf-bitsandbytes-integration\n- Integration: https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing\n- Training: https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing\n\nWe present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. 
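The blockwise idea behind 4-bit finetuning can be illustrated with plain absmax quantization. This is a simplified stand-in: QLoRA's NF4 uses a normal-quantile grid rather than the uniform int4 grid shown here.

```python
def quantize_block(block):
    """Symmetric 4-bit absmax quantization of one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0
    q = [round(w / absmax * 7) for w in block]   # int4 grid: -7..7
    return q, absmax                             # absmax is the per-block scale

def dequantize_block(q, absmax):
    return [v / 7 * absmax for v in q]

weights = [0.31, -0.12, 0.05, -0.44]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)

# Each weight is stored in 4 bits plus one shared scale per block;
# the reconstruction error is bounded by half a grid step per weight.
step = scale / 7
assert all(abs(w - r) <= step / 2 + 1e-9 for w, r in zip(weights, restored))
```

Double Quantization then applies the same trick again to the per-block `absmax` scales themselves, shaving the average bits per parameter a little further.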
We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. We release all of our models and code, including CUDA kernels for 4-bit training.

### GPT4All
- https://github.com/nomic-ai/gpt4all

Demo, data and code to train an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa

### HugNLP
- https://mp.weixin.qq.com/s/IpgOQJ8vrIvnjdrmGCT2FA
- https://github.com/HugAILab/HugNLP
- https://arxiv.org/abs/2302.14286

The HugAILab team at East China Normal University has developed HugNLP, a comprehensive, unified NLP training framework for researchers and developers. It supports building and training models for many NLP tasks, including text classification, text matching, question answering, information extraction, text generation, and few-shot learning.

HugNLP also integrates many of the latest prompt techniques, such as Prompt-Tuning, In-Context Learning, and Instruction-Tuning; Chain-of-Thought support is planned.

The HugAILab team has also built a series of applications on top of it, such as CLUE & GLUE leaderboard tools, HugChat (supporting training and deployment of ChatGPT-like models), and HugIE (a unified information-extraction product).

HugNLP is a layered framework that follows a "high cohesion, low coupling" development pattern. Its core consists of four parts: the model layer (Models), the processor layer (Processors), the evaluator layer (Evaluators), and the application layer (Applications).

### INSTRUCTEVAL
- https://mp.weixin.qq.com/s/E6hq0AUy_hItA5HGo2tCAQ
- https://github.com/declare-lab/instruct-eval
- https://arxiv.org/abs/2306.04757

This paper introduces a new evaluation suite named INSTRUCTEVAL, dedicated to the comprehensive evaluation of instruction-tuned large language models. Compared with previous methods for evaluating LLMs, this strategy not only assesses models' problem-solving and writing abilities in detail, but also rigorously evaluates how well models align with human values.

### LOw-Memory Optimization (LOMO)
- https://arxiv.org/abs/2306.09782
- https://github.com/OpenLMLab/LOMO

Large Language
Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers, benefiting both academia and society. While existing approaches have focused on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs with limited resources. In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. By integrating LOMO with existing memory saving techniques, we reduce memory usage to 10.8% compared to the standard approach (DeepSpeed solution). Consequently, our approach enables the full parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090, each with 24GB memory.\n\n### llama.cpp\n- https://github.com/ggerganov/llama.cpp\n\nInference of LLaMA model in pure C/C++\n\nThe main goal is to run the model using 4-bit quantization on a MacBook\n- Plain C/C++ implementation without dependencies\n- Apple silicon first-class citizen - optimized via ARM NEON\n- AVX2 support for x86 architectures\n- Mixed F16 / F32 precision\n- 4-bit quantization support\n- Runs on the CPU\n\n### llama2.c\n- https://github.com/karpathy/llama2.c\n- https://mp.weixin.qq.com/s/RFo6B5yfEhv4mihkBiOH4Q\n\nWith the code in this repo you can train the Llama 2 LLM architecture from scratch in PyTorch, then export the weights to a binary file, and load that into one ~simple 500-line C file (run.c) that inferences the model. Alternatively, you can load, finetune, and inference Meta's Llama 2 (but this is still being actively fleshed out). Hence, this repo is a \"fullstack\" train + inference solution for Llama 2 LLM, with a focus on minimalism and simplicity. 
You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough. I recommend looking at the TinyStories paper for inspiration.\n\nPlease note that this started recently as just a fun weekend project: I took my earlier nanoGPT, tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in run.c. So the project is young and moving quickly. Hat tip to the awesome llama.cpp for inspiring this project. I wanted something super minimal so I chose to hard-code the Llama 2 architecture, stick to fp32, and just roll one inference file of pure C with no dependencies.\n\n### LongLoRA\n- https://github.com/dvlab-research/longlora\n- https://arxiv.org/pdf/2309.12307v1.pdf\n\nWe present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. On the other hand, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. LongLoRA adopts LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k on a single 8x A100 machine. 
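The shift short attention pattern can be sketched as an index computation. This illustrates the token grouping only; in the paper the shift is applied per attention head, with half of the heads using the displaced windows so information flows between neighboring groups.

```python
def s2_attn_groups(seq_len, group_size, shifted):
    """Assign each token position to a local attention group.

    Unshifted heads attend within [0..g), [g..2g), ...
    Shifted heads use the same windows displaced by g // 2, so tokens
    near a group boundary of one pattern sit mid-group in the other.
    """
    offset = group_size // 2 if shifted else 0
    return [((i + offset) % seq_len) // group_size for i in range(seq_len)]

g = 4
plain = s2_attn_groups(16, g, shifted=False)
shift = s2_attn_groups(16, g, shifted=True)

# Positions 3 and 4 are split across groups in the plain pattern
# but share a group in the shifted one, stitching the windows together.
assert plain[3] != plain[4] and shift[3] == shift[4]
```

Because each token attends only within its group, fine-tuning cost scales with the group size rather than the full context length, while the two interleaved patterns keep the groups from becoming isolated islands.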
LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long context question-answer pairs.

### RLLTE: Long-Term Evolution Project of Reinforcement Learning
- https://github.com/RLE-Foundation/rllte

Inspired by the Long-Term Evolution (LTE) standard project in telecommunications, RLLTE aims to provide development components and engineering standards for advancing RL research and applications. Beyond first-class algorithm implementations, RLLTE can also serve as a toolkit for developing new algorithms.

### FlashAttention
- https://github.com/Dao-AILab/flash-attention

This repository provides the official implementation of FlashAttention and FlashAttention-2.

### ExecuTorch
- https://github.com/pytorch/executorch
- https://pytorch.org/executorch/stable/index.html

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices.

Key value propositions of ExecuTorch are:
- Portability: Compatibility with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers.
- Productivity: Enabling developers to use the same toolchains and SDK from PyTorch model authoring and conversion, to debugging and deployment to a wide variety of platforms.
- Performance: Providing end users with a seamless and high-performance experience due to a lightweight runtime and utilizing full hardware capabilities such as CPUs, NPUs, and DSPs.

### TensorRT-LLM
- https://github.com/NVIDIA/TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations going from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).

The Python API of TensorRT-LLM is designed to look similar to the PyTorch API. It provides users with a functional module containing functions like einsum, softmax, matmul or view. The layers module bundles useful building blocks to assemble LLMs, like an Attention block, an MLP or the entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the models module.

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs.

To maximize performance and reduce memory footprint, TensorRT-LLM allows the models to be executed using different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.

### BPO (Black-Box Prompt Optimization)
- https://github.com/thu-coai/BPO
- https://arxiv.org/abs/2311.04155

Black-box Prompt Optimization (BPO) offers a conceptually new perspective to bridge the gap between humans and LLMs. On Vicuna Eval's pairwise evaluation, we show that BPO further aligns gpt-3.5-turbo and claude-2 without training. It also outperforms both PPO & DPO and presents orthogonal improvements.

### S-LoRA
- https://arxiv.org/pdf/2311.03285.pdf
- https://github.com/S-LoRA/S-LoRA

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models.
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.\n\n### SoRA\n- https://github.com/TsinghuaC3I/SoRA\n- https://arxiv.org/abs/2311.11696\n\nSparse Low-rank Adaptation of Pre-trained Language Models\n\nFine-tuning pre-trained large language models in a parameter-efficient manner is widely studied for its effectiveness and efficiency. The popular method of low-rank adaptation (LoRA) offers a notable approach, hypothesizing that the adaptation process is intrinsically low-dimensional. 
Although LoRA has demonstrated commendable performance, it is implemented with a fixed and unalterable intrinsic rank that might not always be the ideal choice. Recognizing the need for more flexible adaptation, we extend the methodology of LoRA to an innovative approach we call sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. We achieve this through the incorporation of a gate unit optimized with proximal gradient method in the training stage, controlling the cardinality of rank under the sparsity of the gate. In the subsequent inference stage, we eliminate the parameter blocks corresponding to the zeroed-out ranks, to reduce each SoRA module back to a concise yet rank-optimal LoRA. Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters via updating in a sparse way. We further introduce a sparsifying scheduler for SoRA, aiming to examine the impact of the number of non-zero parameters on the model's memorization and generalization. Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.

### XuanCe (玄策): An Open-Source Deep Reinforcement Learning (DRL) Library
- https://github.com/agi-brain/xuance

XuanCe is an open-source ensemble of Deep Reinforcement Learning (DRL) algorithm implementations.

We call it Xuan-Ce (玄策) in Chinese. "Xuan (玄)" means incredible and magic box, "Ce (策)" means policy.

DRL algorithms are sensitive to hyper-parameter tuning, vary in performance with different tricks, and suffer from unstable training processes; as a result, DRL algorithms can sometimes seem elusive and "Xuan".
This project gives a thorough, high-quality and easy-to-understand implementation of DRL algorithms, and we hope this implementation can shed light on the magic of reinforcement learning.

We expect it to be compatible with multiple deep learning toolboxes (PyTorch, TensorFlow, and MindSpore), and hope it can really become a zoo full of DRL algorithms.

### EasyLM (JAX/Flax)
- https://github.com/hamishivi/EasyLM

Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax. EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality.

Building on top of Hugging Face's transformers and datasets, this repo provides an easy-to-use and easy-to-customize codebase for training large language models without the complexity of many other frameworks.

EasyLM is built with JAX/Flax. By leveraging JAX's pjit utility, EasyLM is able to train large models that don't fit on a single accelerator by sharding the model weights and training data across multiple accelerators. Currently, EasyLM supports multiple TPU/GPU training in a single host as well as multi-host training on Google Cloud TPU Pods.

### FATE-LLM - Federated Learning for LLMs
- https://github.com/FederatedAI/FATE-LLM

FATE-LLM is a framework to support federated learning for large language models (LLMs).

### DeepSpeed-FastGen
- https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md

Large language models (LLMs) like GPT-4 and LLaMA have emerged as a dominant workload in serving a wide range of applications infused with AI at every level. From general chat models to document summarization, and from autonomous driving to copilots at every layer of the software stack, the demand to deploy and serve these models at scale has skyrocketed.
While frameworks like DeepSpeed, PyTorch, and several others can regularly achieve good hardware utilization during LLM training, the interactive nature of these applications and the poor arithmetic intensity of tasks like open-ended text generation have become the bottleneck for inference throughput in existing systems.\n\nTo this end, frameworks like vLLM powered by PagedAttention and research systems like Orca have significantly improved the performance of inference for LLMs. However, these systems still struggle to provide consistent quality of service, particularly for workloads with longer prompts. These long prompt workloads are becoming increasingly important as more and more models, like MPT-StoryWriter, and systems, such as DeepSpeed Ulysses, support context windows stretching to tens of thousands of tokens. To better understand the problem space, we provide detailed examples of how text generation works for LLMs in two distinct phases called prompt processing and generation. When systems treat them as distinct phases, generation will be preempted by prompt processing that risks breaking the service level agreements (SLAs).\n\nToday, we are glad to present DeepSpeed-FastGen, a system that overcomes these limitations by leveraging the proposed Dynamic SplitFuse technique and offers up to 2.3x higher effective throughput compared to state-of-the-art systems like vLLM. DeepSpeed-FastGen leverages the combination of DeepSpeed-MII and DeepSpeed-Inference to provide an easy-to-use serving system.\n\n### NVIDIA NeMo-Aligner\n- https://github.com/NVIDIA/NeMo-Aligner\n\nNeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit has support for state of the art model alignment algorithms such as SteerLM, DPO and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more safe, harmless and helpful. 
Users can do end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource-efficient manner.

The NeMo-Aligner toolkit is built on the NeMo Toolkit, which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization.

The toolkit is currently in its early stages, and we are committed to improving it to make it easier for developers to pick and choose different alignment algorithms to build safe, helpful and reliable models.

### RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
- https://arxiv.org/abs/2309.00267

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels can be a time-consuming and expensive endeavor. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. Furthermore, RLAIF demonstrates the ability to outperform a supervised fine-tuned baseline even when the LLM preference labeler is the same size as the policy. In another experiment, directly prompting the LLM for reward scores achieves superior performance to the canonical RLAIF setup, where LLM preference labels are first distilled into a reward model. Finally, we conduct extensive studies on techniques for generating aligned AI preferences.
Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF.\n\n### MLX\n- https://github.com/ml-explore/mlx\n\nMLX is designed by machine learning researchers for machine learning researchers. The framework is intended to be user-friendly, but still efficient to train and deploy models. The design of the framework itself is also conceptually simple. We intend to make it easy for researchers to extend and improve MLX with the goal of quickly exploring new ideas.\n\n### OpenRLHF\n- https://github.com/OpenLLMAI/OpenRLHF\n\nOpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HuggingFace Transformers:\n- Simple and easy to use: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, enabling 34B model RLHF training with just a single DGXA100 node (see the training script).\n- Distributed RLHF: The key idea behind OpenRLHF is to distribute the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. 
This enables full-scale fine-tuning of 7B models across multiple 24GB RTX 4090 GPUs (or 34B models with multiple A100 80G GPUs).
- High performance: Thanks to the ability to use a large inference batch size with Ray and DeepSpeed's CPUAdam, the performance of OpenRLHF with the 13B LLaMA2 model is 4x that of DeepSpeedChat.

### CoLLiE: Collaborative Training of Large Language Models in an Efficient Way
- https://github.com/OpenLMLab/collie
- https://arxiv.org/abs/2312.00407

CoLLiE is a complete toolbox that helps you train large models from scratch, providing data preprocessing, model fine-tuning, model saving, and monitoring of training metrics. CoLLiE integrates existing parallelism strategies, parameter-efficient fine-tuning methods, and efficient optimizers to speed up training, improve its quality, and reduce its cost. CoLLiE supports a range of mainstream models (such as MOSS, InternLM, LLaMA, and ChatGLM), and you can easily switch between them. It also provides rich documentation so that beginners can get started quickly, along with highly customizable features and flexible configuration options so that experienced users can tailor it to their needs. Whether you are a beginner or an experienced professional, CoLLiE offers a solution that meets your needs.

### Superalignment
- https://github.com/openai/weak-to-strong
- https://cdn.openai.com/papers/weak-to-strong-generalization.pdf
- https://openai.com/research/weak-to-strong-generalization

A core challenge for aligning future superhuman AI systems (superalignment) is that humans will need to supervise AI systems much smarter than them. We study a simple analogy: can small models supervise large models? We show that we can use a GPT-2-level model to elicit most of GPT-4’s capabilities—close to GPT-3.5-level performance—generalizing correctly even to hard problems where the small model failed. This opens up a new research direction that allows us to directly tackle a central challenge of aligning future superhuman models while making iterative empirical progress today.

### LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
- https://aka.ms/LLMLingua
- https://github.com/microsoft/LLMLingua
- https://huggingface.co/spaces/microsoft/LLMLingua
- https://arxiv.org/abs/2310.05736

Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. \n\n### REST\n- https://arxiv.org/pdf/2311.08252.pdf\n- https://sites.google.com/view/rest-llm\n- https://github.com/FasterDecoding/REST\n\nREST is a retrieval-based speculative decoding method designed to boost generation speed of LLMs. Instead of relying on a draft language model like speculative decoding, REST utilizes a datastore to retrieve and employ draft tokens. Moreover, REST differs from blockwise parallel decoding and Medusa in that it doesn't require extra training steps. It functions as a plug-and-play solution capable of accelerating any pre-existing language model.\n\n### MetaAligner\n- https://arxiv.org/abs/2403.17141\n\nRecent advancements in large language models (LLMs) aim to tackle heterogeneous human expectations and values via multi-objective preference alignment. However, existing methods are parameter-adherent to the policy model, leading to two key limitations: (1) the high-cost repetition of their alignment algorithms for each new target model; (2) they cannot expand to unseen objectives due to their static alignment objectives. 
In this work, we propose Meta-Objective Aligner (MetaAligner), a model that performs conditional weak-to-strong correction for weak responses to approach strong responses. MetaAligner is the first policy-agnostic and generalizable method for multi-objective preference alignment, which enables plug-and-play alignment by decoupling parameter updates from the policy models and facilitates zero-shot preference alignment for unseen objectives via in-context learning. Experimental results show that MetaAligner achieves significant and balanced improvements in multi-objective alignments on 11 policy models with up to 63x more parameters, and outperforms previous alignment methods with down to 22.27x less computational resources. The model also accurately aligns with unseen objectives, marking the first step towards generalizable multi-objective preference alignment.\n\n### DiJiang\n- https://arxiv.org/abs/2403.19928\n- https://github.com/YuchuanTian/DiJiang\n\nIn an effort to reduce the computational load of Transformers, research on linear attention has gained significant momentum. However, the improvement strategies for attention mechanisms typically necessitate extensive retraining, which is impractical for large language models with a vast array of parameters. In this paper, we present DiJiang, a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear complexity model with little training costs. By employing a weighted Quasi-Monte Carlo method for sampling, the proposed approach theoretically offers superior approximation efficiency. To further reduce the training computational complexity, our kernelization is based on Discrete Cosine Transform (DCT) operations. Extensive experiments demonstrate that the proposed method achieves comparable performance to the original Transformer, but with significantly reduced training costs and much faster inference speeds. 
Our DiJiang-7B achieves comparable performance to LLaMA2-7B on various benchmarks while requiring only about 1/50 of the training cost. \n\n### LISA (Layerwise Importance Sampled AdamW)\n- https://arxiv.org/abs/2403.17919\n- https://github.com/OptimalScale/LMFlow\n\nThe machine learning community has witnessed impressive advancements since the first appearance of large language models (LLMs), yet their huge memory consumption has become a major roadblock to large-scale training. Parameter Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem, but their performance still fails to match full parameter training in most large-scale fine-tuning settings. To address this deficiency, we investigate layerwise properties of LoRA on fine-tuning tasks and observe an uncommon skewness of weight norms across different layers. Utilizing this key observation, a surprisingly simple training strategy is discovered, which outperforms both LoRA and full parameter training in a wide range of settings with memory costs as low as LoRA. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA, which applies the idea of importance sampling to different layers in LLMs and randomly freezes most middle layers during optimization. Experimental results show that with similar or less GPU memory consumption, LISA surpasses LoRA or even full parameter tuning in downstream fine-tuning tasks, where LISA consistently outperforms LoRA by 11%-37% in terms of MT-Bench scores. 
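The layer-freezing idea behind LISA can be illustrated with a small, hypothetical sketch: every sampling period, keep the embedding and output head trainable and unfreeze only a few randomly chosen middle blocks. The function name and the `gamma` parameter below are illustrative assumptions, not LMFlow's actual API:

```python
import random

def lisa_active_layers(num_layers, gamma=2, rng=random):
    """Choose which transformer blocks to unfreeze for the next period.

    The embedding and output head stay trainable throughout, while only
    `gamma` randomly sampled blocks are unfrozen, so most middle layers
    remain frozen at any given time.
    """
    blocks = sorted(rng.sample(range(num_layers), k=min(gamma, num_layers)))
    return {"embedding": True, "lm_head": True, "blocks": blocks}
```

Resampling the active blocks every few hundred optimizer steps spreads updates across the whole network over the course of training, while keeping per-step optimizer state (and hence memory) close to LoRA's.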
On large models, specifically LLaMA-2-70B, LISA achieves on-par or better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.\n\n### edge-infer\n- https://github.com/unit-mesh/edge-infer\n\nEdgeInfer enables efficient edge intelligence by running small AI models, including embeddings and OnnxModels, on resource-constrained devices like Android, iOS, or MCUs for real-time decision-making.\n\n### NeFT\n- https://arxiv.org/abs/2403.11621\n\nLarge Language Models (LLMs) are composed of neurons that exhibit various behaviors and roles, which become increasingly diversified as models scale. Recent studies have revealed that not all neurons are active across different datasets, and this sparsity correlates positively with the task-specific ability, leading to advancements in model pruning and training efficiency. Traditional fine-tuning methods engage all parameters of LLMs, which is computationally expensive and may not be necessary. In contrast, Parameter-Efficient Fine-Tuning (PEFT) approaches aim to minimize the number of trainable parameters, yet they still operate at a relatively macro scale (e.g., layer-level). We introduce Neuron-Level Fine-Tuning (NeFT), a novel approach that refines the granularity of parameter training down to the individual neuron, enabling more precise and computationally efficient model updates. The experimental results show that NeFT not only exceeded the performance of full-parameter fine-tuning and PEFT but also provided insights into the analysis of neurons.\n\n### Aligning Large Language Models with Recommendation Knowledge\n- https://arxiv.org/abs/2404.00245\n\nLarge language models (LLMs) have recently been used as backbones for recommender systems. However, their performance often lags behind conventional methods in standard tasks like retrieval. We attribute this to a mismatch between LLMs' knowledge and the knowledge crucial for effective recommendations. 
While LLMs excel at natural language reasoning, they cannot model complex user-item interactions inherent in recommendation tasks. We propose bridging the knowledge gap and equipping LLMs with recommendation-specific knowledge to address this. Operations such as Masked Item Modeling (MIM) and Bayesian Personalized Ranking (BPR) have found success in conventional recommender systems. Inspired by this, we simulate these operations through natural language to generate auxiliary-task data samples that encode item correlations and user preferences. Fine-tuning LLMs on such auxiliary-task data samples and incorporating more informative recommendation-task data samples facilitates the injection of recommendation-specific knowledge into LLMs. Extensive experiments across retrieval, ranking, and rating prediction tasks on LLMs such as FLAN-T5-Base and FLAN-T5-XL show the effectiveness of our technique in domains such as Amazon Toys & Games, Beauty, and Sports & Outdoors. Notably, our method outperforms conventional and LLM-based baselines, including the current SOTA, by significant margins in retrieval, showcasing its potential for enhancing recommendation quality.\n\n### Q\\*\n- https://arxiv.org/abs/2406.14283\n\nLarge Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. 
By learning a plug-and-play Q-value model as a heuristic function for estimating expected future rewards, our Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the current task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP demonstrate the superiority of our method, contributing to improving the reasoning performance of existing open-source LLMs.\n\n### TDPO\n- https://arxiv.org/abs/2404.11999\n- https://github.com/Vance0124/Token-level-Direct-Preference-Optimization\n\nFine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. 
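As a rough illustration of the quantities TDPO trades off, the following pure-Python sketch combines a DPO-style sequence margin (summed per-token log-ratios) with per-token forward-KL penalties. The function name, the aggregation, and the `alpha`/`beta` weighting are illustrative assumptions, not the paper's exact objective (see the repository for that):

```python
import math

def tdpo_style_loss(chosen_logratios, rejected_logratios,
                    chosen_kl, rejected_kl, beta=0.1, alpha=0.5):
    """Schematic token-level preference loss (illustrative only).

    chosen_logratios / rejected_logratios: per-token log pi - log pi_ref
    chosen_kl / rejected_kl: per-token forward-KL penalties per response
    """
    # DPO-style implicit rewards: beta times the summed per-token log-ratios.
    reward_margin = beta * (sum(chosen_logratios) - sum(rejected_logratios))
    # Token-level KL constraint, aggregated over each sequence.
    kl_margin = alpha * (sum(chosen_kl) - sum(rejected_kl))
    margin = reward_margin - kl_margin
    # Negative log-sigmoid of the margin, as in DPO-family objectives.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a margin of zero the loss is log 2; it shrinks as the chosen response's per-token log-ratios grow relative to the rejected one's, while the KL terms penalize drifting far from the reference at any individual token.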
\n\n### ExCP\n- https://arxiv.org/abs/2406.11257\n- https://github.com/Gaffey/ExCP\n\nWe propose a novel Extreme Checkpoint Compression (ExCP) framework, which significantly reduces the required storage of training checkpoints while achieving nearly lossless performance. We first calculate the residuals of adjacent checkpoints to obtain the essential but sparse information for a higher compression ratio. To further exploit the redundant parameters in checkpoints, we then propose a weight-momentum joint shrinking method that utilizes another important source of information in model optimization, i.e., momentum. In particular, we exploit the information of both model and optimizer to discard as many parameters as possible while preserving critical information to ensure optimal performance. Furthermore, we utilize non-uniform quantization to further compress the storage of checkpoints.\n\n### MindStar\n- https://arxiv.org/pdf/2405.16265v4\n\nAlthough Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method -- MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. 
Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.\n\n### LaMDA\n- https://arxiv.org/abs/2406.12832\n\nLow-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a \"lite-weight\" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. 
Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. \n\n### MInference\n- https://arxiv.org/pdf/2407.02490\n- https://hqjiang.com/minference.html\n\nThe computational challenges of LLM inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference, a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices—the A-shape, Vertical-Slash, and Block-Sparse—that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. 
By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, Yi-200K, GLM-4-1M, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.\n\n### Instruction Pre-Training\n- https://arxiv.org/pdf/2406.14491\n- https://github.com/microsoft/LMOps\n\nUnsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B.\n\n### PEER\n- https://arxiv.org/abs/2407.04153\n\nThe feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. 
However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.\n\n### Step-DPO\n- https://arxiv.org/html/2406.18629v1\n- https://github.com/dvlab-research/Step-DPO\n\nMathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter’s out-of-distribution nature. 
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.\n\n### Data, Data Everywhere\n- https://arxiv.org/abs/2407.06380\n\nThe impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high-quality pretraining sets.\n\n### Prover-Verifier Games\n- https://openai.com/index/prover-verifier-games-improve-legibility/\n- https://arxiv.org/abs/2407.13692\n\nOne way to increase confidence in the outputs of Large Language Models (LLMs) is to support them with reasoning that is clear and easy to check -- a property we call legibility. 
We study legibility in the context of solving grade-school math problems and show that optimizing chain-of-thought solutions only for answer correctness can make them less legible. To mitigate the loss in legibility, we propose a training algorithm inspired by the Prover-Verifier Game from Anil et al. (2021). Our algorithm iteratively trains small verifiers to predict solution correctness, \"helpful\" provers to produce correct solutions that the verifier accepts, and \"sneaky\" provers to produce incorrect solutions that fool the verifier. We find that the helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training. Furthermore, we show that legibility training transfers to time-constrained humans tasked with verifying solution correctness. Over the course of LLM training, human accuracy increases when checking the helpful prover's solutions, and decreases when checking the sneaky prover's solutions. Hence, training for checkability by small verifiers is a plausible technique for increasing output legibility. Our results suggest legibility training against small verifiers as a practical avenue for increasing legibility of large LLMs to humans, and thus could help with alignment of superhuman models.\n\n### Mem0\n- https://github.com/mem0ai/mem0\n\nMem0 provides an intelligent, adaptive memory layer for Large Language Models (LLMs), enhancing personalized AI experiences by retaining and utilizing contextual information across diverse applications. 
This enhanced memory capability is crucial for applications ranging from customer support and healthcare diagnostics to autonomous systems and personalized content recommendations, allowing AI to remember user preferences, adapt to individual needs, and continuously improve over time.\n\n### EAGLE-2\n- https://arxiv.org/pdf/2406.16858\n- https://github.com/SafeAILab/EAGLE\n- https://huggingface.co/spaces/yuhuili/EAGLE-2\n\nEAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) with provable performance maintenance. This approach involves extrapolating the second-top-layer contextual feature vectors of LLMs, enabling a significant boost in generation efficiency.\n\n### LoRA-GA\n- https://arxiv.org/abs/2407.05000\n- https://github.com/Outsider565/LoRA-GA\n\nFine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, as one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters. Although LoRA reduces the computational and memory requirements significantly at each iteration, extensive empirical evidence indicates that it converges at a considerably slower rate compared to full fine-tuning, ultimately leading to increased overall compute and often worse test performance. In our paper, we perform an in-depth investigation of the initialization method of LoRA and show that careful initialization (without any change of the architecture and the training algorithm) can significantly enhance both efficiency and performance. In particular, we introduce a novel initialization method, LoRA-GA (Low Rank Adaptation with Gradient Approximation), which aligns the gradients of low-rank matrix product with those of full fine-tuning at the first step. 
Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. For example, on the subset of the GLUE dataset with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA shows performance improvements of 0.34, 11.52%, and 5.05% on MT-bench, GSM8K, and Human-eval, respectively. Additionally, we observe up to 2-4 times convergence speed improvement compared to vanilla LoRA, validating its effectiveness in accelerating convergence and enhancing model performance. \n\n### Q-GaLore\n- https://arxiv.org/pdf/2407.08296\n- https://github.com/VITA-Group/Q-GaLore\n\nQ-GaLore is a memory-efficient training methodology effective in both pre-training and fine-tuning scenarios. Q-GaLore incorporates two main components: (i) low precision training utilizing low-rank gradients, and (ii) lazy layer-wise subspace exploration. It enables full-parameter learning while requiring less memory, such as training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16GB of memory.\n\n### rStar\n- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers\n- https://github.com/zhentingqi/rStar\n- https://arxiv.org/pdf/2408.06195\n\nThis paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. 
The mutually agreed reasoning trajectories are considered mutually consistent and are thus more likely to be correct. Extensive experiments across five SLMs demonstrate that rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct. \n\n### T-MAC\n- https://github.com/microsoft/T-MAC\n- https://www.arxiv.org/pdf/2407.00088\n\nT-MAC is a kernel library to directly support mixed-precision matrix multiplication (int1/2/3/4 x int8/fp16/fp32) without the need for dequantization by utilizing lookup tables. T-MAC aims to boost low-bit LLM inference on CPUs. T-MAC already offers support for various low-bit models, including W4A16 from GPTQ/gguf, W2A16 from BitDistiller/EfficientQAT and W1(.58)A8 from BitNet on OSX/Linux/Windows equipped with ARM/Intel CPUs.\n\n### LLM-zero2hero\n- https://github.com/wjmZZZ/LLM-zero2hero\n\nLLM-zero2hero is a highly decoupled large language model (LLM) fine-tuning project that supports customized training, validation, and inference workflows, implementing both full-parameter fine-tuning and LoRA fine-tuning.\n\n### MobileQuant\n- https://arxiv.org/abs/2408.13933\n- https://github.com/saic-fi/MobileQuant\n\nLarge language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. 
Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, 4) being compatible with mobile-friendly compute units, e.g. NPU.\n\n### min-p sampling\n- https://arxiv.org/abs/2407.01082\n- https://github.com/menhguin/minp_paper/\n- https://x.com/menhguin/status/1826132708508213629\n\nLarge Language Models (LLMs) generate longform text by successively sampling the next token based on the probability distribution of the token vocabulary at each decoding step. Current popular truncation sampling methods such as top-p sampling, also known as nucleus sampling, often struggle to balance coherence and creativity in generating text, particularly when using higher temperatures. To address this issue, we propose min-p, a dynamic truncation sampling method that establishes a minimum base percentage threshold for tokens, which scales according to the probability of the top candidate token. 
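The thresholding rule just described can be sketched in a few lines of Python. This is a simplified illustration over a toy token-to-probability table (real implementations operate on logits inside the decoding loop); the helper names are ours, not the paper's:

```python
import random

def min_p_filter(probs, p_base=0.1):
    """Keep tokens whose probability is at least p_base * max(probs)."""
    threshold = p_base * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= threshold}

def min_p_sample(probs, p_base=0.1, rng=random):
    kept = min_p_filter(probs, p_base)
    toks = list(kept)
    # random.choices normalizes the weights, so the surviving
    # probability mass is implicitly renormalized before sampling.
    return rng.choices(toks, weights=[kept[t] for t in toks], k=1)[0]
```

When the top token is confident (say 0.6 with p_base = 0.1), the cutoff is 0.06 and only strong candidates survive; with a flat distribution the cutoff drops and more tokens stay. This dynamic cutoff is what distinguishes min-p from a fixed top-p mass threshold.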
Through experiments on several benchmarks, such as GPQA, GSM8K and AlpacaEval Creative Writing, we demonstrate that min-p improves the coherence and quality of generated text even at high temperatures, while also facilitating more creative and diverse outputs compared to top-p and other sampling methods. As of writing, min-p has been adopted by multiple open-source LLM implementations, and has been independently assessed by members of the open-source LLM community, further validating its practical utility and potential.\n\n### Fast Best-of-N Decoding\n- https://arxiv.org/abs/2410.20290\n\nThe safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it not computationally viable. In this work, we introduce Speculative Rejection, a computationally-viable inference-time alignment algorithm. It generates high-scoring responses according to a given reward model, like Best-of-N does, while being between 16 and 32 times more computationally efficient.\n\n### UNA: Unifying Alignments of RLHF/PPO, DPO and KTO\n- https://arxiv.org/abs/2408.15339\n\nAn LLM is pretrained on trillions of tokens, but the pretrained LLM may still generate undesired responses. 
To solve this problem, alignment techniques such as RLHF, DPO and KTO are proposed. However, these alignment techniques have limitations. For example, RLHF requires training the reward model and policy separately, which is complex, time-consuming, memory intensive and unstable during training processes. DPO proposes a mapping between an optimal policy and a reward, greatly simplifying the training process of RLHF. However, it cannot take full advantage of a reward model and it is limited to pairwise preference data.\nIn this paper, we propose **UN**ified **A**lignment (UNA) which unifies RLHF/PPO, DPO and KTO. Firstly, we mathematically prove that given the classical RLHF objective, the optimal policy is induced by a generalized implicit reward function. With this novel mapping between a reward model and an optimal policy, UNA can 1. unify RLHF/PPO, DPO and KTO into a supervised learning problem of minimizing the difference between an implicit reward and an explicit reward; 2. outperform RLHF/PPO while simplifying, stabilizing, speeding up and reducing the memory burden of the RL fine-tuning process; 3. accommodate different feedback types including pairwise, binary and scalar feedback. Downstream experiments show UNA outperforms DPO, KTO and RLHF.\n\n### LongReward\n- https://arxiv.org/abs/2410.21252\n- https://github.com/THUDM/LongReward\n- https://huggingface.co/datasets/THUDM/LongReward-10k\n\nWe open-source LongReward under long_reward/auto_scorer.py, a novel method that utilizes an off-the-shelf LLM to automatically provide rewards for model responses in long-context scenarios, considering four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness.
Given a long-context-based model response, LongReward assigns a score ranging from 0 to 10 for each dimension, and takes their average as the final reward.\n\n### HybridFlow\n- https://team.doubao.com/zh/publication/hybridflow-a-flexible-and-efficient-rlhf-framework?view_from=research\n- https://github.com/volcengine/veRL\n\nveRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs). veRL is the open-source version of the HybridFlow paper.\n\n### The Surprising Effectiveness of Test-Time Training for Abstract Reasoning\n- https://arxiv.org/abs/2411.07279\n\nLanguage models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT) -- updating model parameters temporarily during inference using a loss derived from input data -- as a mechanism for improving models' reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial finetuning on similar tasks, (2) auxiliary task format and augmentations, (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on the ARC's public validation set, improving the state-of-the-art by nearly 25% for public and purely neural approaches. By ensembling our method with recent program generation approaches, we get SoTA public validation accuracy of 61.9%, matching the average human score.
Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.\n\n### OpenR\n- https://github.com/openreasoner/openr\n\nAn Open Source Framework for Advanced Reasoning with Large Language Models\n\n### A Theoretical Understanding of Self-Correction through In-context Alignment\n- https://arxiv.org/abs/2405.18634\n- https://github.com/yifeiwang77/Self-Correction\n\nGoing beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference.
We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.\n\n### EfficientQAT\n- https://arxiv.org/abs/2407.11062\n- https://github.com/OpenGVLab/EfficientQAT\n\nLarge language models (LLMs) are crucial in modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it is impractical due to substantial training resources. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enhancing the solution space during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by considering interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points accuracy degradation compared to the full precision (69.48 vs. 72.41).\n\n### Cautious Optimizers\n- https://arxiv.org/abs/2411.16085\n- https://github.com/kyleliang919/C-Optim\n\nAdamW has been the default optimizer for transformer pretraining. 
For many years, our community has searched for faster and more stable optimizers, with only constrained positive outcomes. In this work, we propose a **single-line modification in PyTorch** to any momentum-based optimizer, which we rename Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing speed-up on Llama and MAE pretraining up to 1.47×.\n\n### Optimizing Large Language Model Training Using FP4 Quantization\n- https://arxiv.org/abs/2501.17116\n\nThe growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens.
With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.\n\n### Evolving Deeper LLM Thinking\n- https://arxiv.org/abs/2501.09891\n\nWe explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.\n\n### rStar-Math\n- https://arxiv.org/abs/2501.04519\n- https://github.com/ai-in-pm/rStar-Math\n\nAn AI Agent that demonstrates the principles and performance of the rStar-Math framework, with capabilities to generate integration code for other chatbots and AI agents.\n\n### Transformer²: Self-Adaptive LLMs\n- https://sakana.ai/transformer-squared/\n\nTransformer² is a machine learning system that dynamically adjusts its weights for various tasks. Adaptation is a remarkable natural phenomenon, like how the octopus can blend its color in with its environment, or how the brain rewires itself after injury. 
We believe our new system paves the way for a new generation of adaptive AI models, modifying their own weights and architecture to adapt to the nature of the tasks they encounter, embodying living intelligence capable of continuous change and lifelong learning.\n\n### test-time compute scaling\n- https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute\n\nOver the last few years, the scaling of train-time compute has dominated the progress of large language models (LLMs). Although this paradigm has proven to be remarkably effective, the resources needed to pretrain ever larger models are becoming prohibitively expensive, with billion-dollar clusters already on the horizon. This trend has sparked significant interest in a complementary approach: test-time compute scaling. Rather than relying on ever-larger pretraining budgets, test-time methods use dynamic inference strategies that allow models to “think longer” on harder problems.\n\n### XGrammar\n- https://arxiv.org/pdf/2411.15100\n- https://github.com/mlc-ai/xgrammar\n\nXGrammar is an open-source library for efficient, flexible, and portable structured generation. It supports general context-free grammar to enable a broad range of structures while bringing careful system optimizations to enable fast executions. XGrammar features a minimal and portable C++ backend that can be easily integrated into multiple environments and frameworks, and is co-designed with the LLM inference engine and enables zero-overhead structured generation in LLM inference.\n\n### Reverse Thinking Makes LLMs Stronger Reasoners\n- https://arxiv.org/pdf/2411.19865v1\n\nReverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking.
To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model's zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency -- using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.\n\n### noise_step\n- https://github.com/wbrickner/noise_step\n\nnoise_step: Training in 1.58b With No Gradient Memory\n\n### llamafile\n- https://github.com/Mozilla-Ocho/llamafile/releases\n\nllamafile lets you distribute and run LLMs with a single file\n\n### summarize_from_feedback_details\n- https://github.com/vwxyzjn/summarize_from_feedback_details\n- https://arxiv.org/abs/2403.17031\n\nThis work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work. We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights during the reproduction. 
Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B, 6.9B models outperforming OpenAI's released 1.3B checkpoint.\n\n### EvoLLM\n- https://arxiv.org/abs/2403.13187\n\nWe present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.\n\n### llm.c\n- https://github.com/karpathy/llm.c\n\nLLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. 
For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation. I chose GPT-2 as the first working example because it is the grand-daddy of LLMs, the first time the modern stack was put together.\n\n### Mergoo\n- https://github.com/Leeroo-AI/mergoo\n\nmergoo is a library for easily merging multiple LLM experts and efficiently training the merged LLM. With mergoo, you can efficiently integrate the knowledge of different generic or domain-based LLM experts.\n\n### qwen-vllm\n- https://github.com/owenliang/qwen-vllm\n\nThis project explores how to build a high-concurrency inference service for production environments. The core work is clear and focused; not much effort went into peripheral details. We hope it is helpful.\n\n### SiLLM\n- https://github.com/armbues/SiLLM\n\nSiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework. Building upon the foundation provided by MLX Examples, this project introduces additional features specifically designed to enhance LLM operations with MLX in a streamlined package.\n\n### How to Train Data-Efficient LLMs\n- https://arxiv.org/abs/2402.09668\n\nThe training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample.
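A minimal sketch of the Ask-LLM scoring loop; `yes_prob` and the prompt wording are illustrative stand-ins, not the paper's exact judge model or template:

```python
def ask_llm_rank(examples, yes_prob):
    """Rank training examples by a judge model's P('yes') to a quality prompt.
    `yes_prob` is a stand-in for a real instruction-tuned LLM call."""
    prompt = ("###\n{doc}\n###\nDoes the previous paragraph contain informative "
              "signal for pre-training a large language model? Answer yes or no.")
    return sorted(examples, key=lambda doc: yes_prob(prompt.format(doc=doc)),
                  reverse=True)

# Toy judge for illustration only: longer documents get a higher score.
docs = ["short", "a longer, more informative paragraph about transformers"]
ranked = ask_llm_rank(docs, yes_prob=lambda p: float(len(p)))
```

In practice the score would come from the judge LLM's probability of answering "yes", and the top-ranked fraction of the corpus is kept for pre-training.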
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.\n\n### Better & Faster Large Language Models via Multi-token Prediction\n- https://arxiv.org/abs/2404.19737\n\nLarge language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities.
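The n-head design just described can be sketched as a toy NumPy forward pass; the `tanh` trunk and random weights are placeholders for the shared transformer trunk, not the paper's architecture:

```python
import numpy as np

def multi_token_logits(x, trunk, heads):
    """n-token prediction: one shared trunk, n independent output heads.
    Head i predicts the token at offset i+1 from the current position."""
    h = np.tanh(trunk @ x)                    # shared latent representation
    return [head @ h for head in heads]       # one logit vector per future token

rng = np.random.default_rng(0)
d_model, vocab, n = 16, 32, 4
trunk = rng.standard_normal((d_model, d_model))
heads = [rng.standard_normal((vocab, d_model)) for _ in range(n)]
logits = multi_token_logits(rng.standard_normal(d_model), trunk, heads)
```

At inference, the extra heads can be dropped (standard next-token decoding) or reused for self-speculative decoding, which is the source of the speedups mentioned below.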
As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.\n\n### Llama-3 70B Gradient Adapter\n- https://huggingface.co/cognitivecomputations/Llama-3-70B-Gradient-524k-adapter\n- https://huggingface.co/cognitivecomputations/Llama-3-70B-Gradient-1048k-adapter\n\n### Unsloth\n- https://github.com/unslothai/unsloth\n\nFinetune Llama 3, Mistral & Gemma 2-5x faster with 80% less memory!\n\n### RLHF Workflow\n- https://arxiv.org/pdf/2405.07863\n- https://github.com/RLHFlow/RLHF-Reward-Modeling\n- https://github.com/RLHFlow/Online-RLHF\n- https://huggingface.co/RLHFlow\n\nWe present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. 
Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.\n\n### SimPO\n- https://arxiv.org/pdf/2405.14734\n- https://github.com/princeton-nlp/SimPO\n\nDirect Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3. We evaluated on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. 
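The SimPO objective described above (length-normalized average log-probability as the implicit reward, plus a target margin) can be sketched as follows; the `beta` and `gamma` defaults here are illustrative, not the paper's tuned values:

```python
import math

def simpo_loss(logp_win, len_win, logp_lose, len_lose, beta=2.0, gamma=1.0):
    """-log sigmoid of the length-normalized reward gap minus the margin."""
    r_win = beta * logp_win / len_win        # average log-prob reward, winner
    r_lose = beta * logp_lose / len_lose     # average log-prob reward, loser
    gap = r_win - r_lose - gamma             # target reward margin gamma
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# A larger winner/loser gap yields a smaller loss.
easy = simpo_loss(-10.0, 10, -30.0, 10)
hard = simpo_loss(-20.0, 10, -30.0, 10)
```

Note there is no reference-model term anywhere in the loss, which is what makes SimPO cheaper in compute and memory than DPO.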
Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2 -- surpassing Claude 3 Opus on the leaderboard, and a 33.8 win rate on Arena-Hard -- making it the strongest 8B open-source model.\n\n### ODPO\n- https://arxiv.org/abs/2402.10571\n- https://github.com/rycolab/odpo\n\nDirect preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal. Sometimes, the preferred response is only slightly better than the dispreferred one. In other cases, the preference is much stronger. For instance, if a response contains harmful or toxic content, the annotator will have a strong preference against that response. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over another. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning language models, especially when the number of preference pairs is limited.\n\n### ΨPO\n- https://arxiv.org/abs/2310.12036\n\nThe prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards.
The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation.\nIn this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called ΨPO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of ΨPO) and to identify their potential pitfalls. We then consider another special case for ΨPO by setting Ψ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.\n\n### MoRA\n- https://arxiv.org/pdf/2405.12130\n\nLow-rank adaptation is a popular parameter-efficient fine-tuning method for large language models. In this paper, we analyze the impact of low-rank updating, as implemented in LoRA. Our findings suggest that the low-rank updating mechanism may limit the ability of LLMs to effectively learn and memorize new knowledge. Inspired by this observation, we propose a new method called MoRA, which employs a square matrix to achieve high-rank updating while maintaining the same number of trainable parameters. To achieve this, we introduce the corresponding non-parameter operators to reduce the input dimension and increase the output dimension for the square matrix. Furthermore, these operators ensure that the weight can be merged back into LLMs, which allows our method to be deployed like LoRA.
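A sketch of the MoRA shape bookkeeping, assuming one simple, illustrative choice of non-parameter operators (group-sum to compress d → r, repeat to decompress r → d); the paper explores several variants. Because both operators are linear, the whole update still merges back into the base weight as an equivalent ΔW:

```python
import numpy as np

def mora_delta(x, M):
    """MoRA update on input x (dim d): compress -> r x r square matrix -> decompress.
    Compress/decompress here are parameter-free stand-ins; M is the only
    trainable matrix."""
    r = M.shape[0]
    d = x.shape[0]
    g = d // r                                 # assume r divides d for simplicity
    compressed = x.reshape(r, g).sum(axis=1)   # d -> r, no parameters
    out = M @ compressed                       # the trainable square matrix
    return np.repeat(out / g, g)               # r -> d, no parameters

delta = mora_delta(np.arange(4.0), np.eye(2))  # d=4, r=2 toy example
```

With r² trainable entries in `M`, the parameter count matches a LoRA pair of rank r²/(2d) on the same layer, but the resulting update is no longer constrained to low rank.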
We perform a comprehensive evaluation of our method across five tasks: instruction tuning, mathematical reasoning, continual pretraining, memory and pretraining. Our method outperforms LoRA on memory-intensive tasks and achieves comparable performance on other tasks.\n\n### LOFIT\n- https://arxiv.org/pdf/2406.01563\n- https://github.com/fc2869/lo-fit\n\nRecent work in interpretability shows that large language models (LLMs) can be adapted for new tasks in a learning-free way: it is possible to intervene on LLM representations to elicit desired behaviors for alignment. For instance, adding certain bias vectors to the outputs of certain attention heads is reported to boost the truthfulness of models. In this work, we show that localized fine-tuning serves as an effective alternative to such representation intervention methods. We introduce a framework called Localized Fine-Tuning on LLM Representations (LoFiT), which identifies a subset of attention heads that are most important for learning a specific task, then trains offset vectors to add to the model's hidden representations at those selected heads. LoFiT localizes to a sparse set of heads (3%) and learns the offset vectors from limited training data, comparable to the settings used for representation intervention. For truthfulness and reasoning tasks, we find that LoFiT's intervention vectors are more effective for LLM adaptation than vectors from representation intervention methods such as Inference-time Intervention. We also find that the localization step is important: selecting a task-specific set of attention heads can lead to higher performance than intervening on heads selected for a different task.
Finally, for the tasks we study, LoFiT achieves comparable performance to other parameter-efficient fine-tuning methods such as LoRA, despite modifying 20x-200x fewer parameters than these methods.\n\n### MEFT\n- https://arxiv.org/pdf/2406.04984\n- https://github.com/CURRENTF/MEFT\n\nParameter-Efficient Fine-tuning (PEFT) facilitates the fine-tuning of Large Language Models (LLMs) under limited resources. However, the fine-tuning performance with PEFT on complex, knowledge-intensive tasks is limited due to the constrained model capacity, which originates from the limited number of additional trainable parameters. To overcome this limitation, we introduce a novel mechanism that fine-tunes LLMs with adapters of larger size yet memory-efficient. This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs and utilizing the larger capacity of Central Processing Unit (CPU) memory compared to Graphics Processing Unit (GPU). We store and update the parameters of larger adapters on the CPU. Moreover, we employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU. This is particularly beneficial over the limited bandwidth of PCI Express (PCIe). Our method can achieve fine-tuning results comparable to those obtained with larger memory capacities, even when operating under more limited resources such as a 24GB memory single GPU setup, with acceptable loss in training efficiency.\n\n### PowerInfer-2\n- https://arxiv.org/abs/2406.06282\n- http://www.powerinfer.ai/v2\n\nThis paper introduces PowerInfer-2, a framework designed for high-speed inference of Large Language Models (LLMs) on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. 
The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations. The implementation and evaluation of PowerInfer-2 demonstrate its capability to support a wide array of LLM models on two smartphones, achieving up to a 29.2x speed increase compared with state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model with a generation rate of 11.68 tokens per second on a smartphone. For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.\n\n### Emulated Disalignment\n- https://arxiv.org/abs/2402.12343\n- https://github.com/ZHZisZZ/emulated-disalignment\n\nLarge language models (LLMs) undergo safety alignment to ensure safe conversations with humans. However, this paper introduces a training-free attack method capable of reversing safety alignment, converting the outcomes of stronger alignment into greater potential for harm by accessing only LLM output token distributions. Specifically, our method achieves this reversal by contrasting the output token distribution of a safety-aligned language model (e.g., Llama-2-chat) against its pre-trained version (e.g., Llama-2), so that the token predictions are shifted towards the opposite direction of safety alignment. We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward. 
Our experiments with ED across three evaluation datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rates in 43 out of 48 evaluation subsets by a large margin. Finally, given ED's reliance on language model output token distributions, which particularly compromises open-source models, our findings highlight the need to reassess the open accessibility of language models, even if they have been safety-aligned. \n\n### Aligning Large Language Models with Representation Editing: A Control Perspective\n- https://arxiv.org/abs/2406.05954\n\nAligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabilities. To address these challenges, we propose aligning LLMs through representation editing. The core of our method is to view a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system. To achieve alignment for specific objectives, we introduce external control signals into the state space of this language dynamical system. We train a value function directly on the hidden states according to the Bellman equation, enabling gradient-based optimization to obtain the optimal control signals at test time. Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources compared to fine-tuning methods.\n\n### sDPO\n- https://arxiv.org/abs/2403.19270\n\nAs the development of large language models (LLMs) progresses, aligning them with human preferences has become increasingly important. 
We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO) for alignment tuning. This approach involves dividing the available preference datasets and utilizing them in a stepwise manner, rather than employing them all at once. We demonstrate that this method facilitates the use of more precisely aligned reference models within the DPO training framework. Furthermore, sDPO trains the final model to be more performant, even outperforming other popular LLMs with more parameters.\n\n### PiSSA\n- https://github.com/GraphPKU/PiSSA\n- https://arxiv.org/abs/2404.02948\n\nAs the parameters of LLMs expand, the computational cost of fine-tuning the entire model becomes prohibitive. To address this challenge, we introduce a PEFT method, Principal Singular values and Singular vectors Adaptation (PiSSA), which optimizes a significantly reduced parameter space while achieving or surpassing the performance of full-parameter fine-tuning. PiSSA is inspired by Intrinsic SAID, which suggests that pre-trained, over-parametrized models inhabit a space of low intrinsic dimension. Consequently, PiSSA represents a matrix W within the model by the product of two trainable matrices A and B, plus a residual matrix Wres for error correction. SVD is employed to factorize W, and the principal singular values and vectors of W are utilized to initialize A and B. The residual singular values and vectors initialize the residual matrix Wres, which remains frozen during fine-tuning. Notably, PiSSA shares the same architecture as LoRA. However, LoRA approximates ΔW through the product of two matrices, A, initialized with Gaussian noise, and B, initialized with zeros, while PiSSA initializes A and B with the principal singular values and vectors of the original matrix W. PiSSA can better approximate the outcomes of full-parameter fine-tuning at the beginning by changing the essential parts while freezing the \"noisy\" parts. 
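The SVD-based initialization just described can be sketched in a few lines (a sketch only; the shapes and the symmetric `√S` scaling split are standard conventions and may differ in detail from the official repo):

```python
import numpy as np

def pissa_init(W, r):
    """Initialize PiSSA adapters from a pre-trained weight W.
    A, B carry the top-r principal singular directions (trainable);
    W_res carries the residual spectrum (frozen during fine-tuning)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * np.sqrt(S[:r])            # (m, r), column-scaled
    B = np.sqrt(S[:r])[:, None] * Vt[:r]     # (r, n), row-scaled
    W_res = W - A @ B                        # residual, kept frozen
    return A, B, W_res

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 5))
A, B, W_res = pissa_init(W, r=2)
# Forward pass is W_res @ x + A @ (B @ x); at initialization this
# reproduces W @ x exactly, so training starts from the pre-trained model.
```

The contrast with LoRA is visible here: LoRA would start from A Gaussian, B zero (so the adapter contributes nothing at step 0), whereas PiSSA's adapter already holds the principal directions of W.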
In comparison, LoRA freezes the original matrix and updates the \"noise\". This distinction enables PiSSA to converge much faster than LoRA and also to achieve better performance in the end. Due to the same architecture, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and compatibility with quantization. Leveraging a fast SVD method, the initialization of PiSSA takes only a few seconds, incurring negligible cost when switching from LoRA to PiSSA.\n\n### LongRoPE\n- https://arxiv.org/abs/2402.13753\n\nA large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE that, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with at most 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformities in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. 
Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.\n\n### ORPO\n- https://arxiv.org/abs/2403.07691\n\nWhile recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).\n\n### How to Train Data-Efficient LLMs\n- https://arxiv.org/abs/2402.09668\n\nThe training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. 
We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.\n\n### Large Language Model Unlearning\n- https://arxiv.org/abs/2310.10683\n- https://github.com/kevinyaobytedance/llm_unlearn\n\nWe study how to perform unlearning, i.e. forgetting undesirable (mis)behaviors, on large language models (LLMs). We show at least three scenarios of aligning LLMs with human preferences can benefit from unlearning: (1) removing harmful responses, (2) erasing copyright-protected content as requested, and (3) eliminating hallucinations. Unlearning, as an alignment technique, has three advantages. (1) It only requires negative (e.g. harmful) examples, which are much easier and cheaper to collect (e.g. via red teaming or user reporting) than positive (e.g. helpful and often human-written) examples required in RLHF (RL from human feedback). (2) It is computationally efficient. (3) It is especially effective when we know which training samples cause the misbehavior. To the best of our knowledge, our work is among the first to explore LLM unlearning. We are also among the first to formulate the settings, goals, and evaluations in LLM unlearning. 
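As a toy illustration of the gradient-ascent flavor of unlearning (not the paper's exact objective): take a tiny softmax "next-token" model and ascend the cross-entropy loss on a harmful target so its probability drops. All names, shapes, and values here are made up for the sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy next-token model: logits = W @ x. We "unlearn" a harmful target token
# by gradient *ascent* on its cross-entropy loss, using only the negative
# example (no positive/helpful data needed).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # 4-token vocabulary, 3-dim context
x = rng.normal(size=3)             # features of a harmful prompt
harmful_token = 2
lr = 0.5

p_before = softmax(W @ x)[harmful_token]
for _ in range(20):
    p = softmax(W @ x)
    grad = np.outer(p - np.eye(4)[harmful_token], x)  # d(CE)/dW
    W += lr * grad                 # ascend the loss on the negative example
p_after = softmax(W @ x)[harmful_token]
```

The direction of the update is the whole point: the same gradient that fine-tuning would descend is ascended here, which is why only negative examples are required.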
We show that if practitioners only have limited resources, and therefore the priority is to stop generating undesirable outputs rather than to try to generate desirable outputs, unlearning is particularly appealing. Despite only having negative samples, our ablation study shows that unlearning can still achieve better alignment performance than RLHF with just 2% of its computational time.\n\n### PowerInfer\n- https://github.com/SJTU-IPADS/PowerInfer\n- https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf\n\nPowerInfer is a CPU/GPU LLM inference engine leveraging activation locality for your device.\n\nWe introduce PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation.\n\nThis distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity.\n\nEvaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. 
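The hot/cold split described above can be sketched as a simple frequency-based partition (illustrative only: `hot_fraction` and the counting scheme are assumptions, and the real system uses offline profiling plus online activation predictors):

```python
import numpy as np

def partition_neurons(activation_counts, hot_fraction=0.2):
    """Split neurons by activation frequency: the most frequently activated
    ('hot') neurons would be preloaded onto the GPU, the rest ('cold')
    computed on the CPU. hot_fraction is an illustrative knob, not a
    value from the paper."""
    counts = np.asarray(activation_counts)
    k = max(1, int(len(counts) * hot_fraction))
    order = np.argsort(counts)[::-1]          # most-activated first
    return np.sort(order[:k]), np.sort(order[k:])

# Power-law-ish toy profile: a few neurons dominate the activations.
hot, cold = partition_neurons([100, 1, 50, 2, 3], hot_fraction=0.4)
```

Under a power-law activation profile, a small `hot_fraction` covers most activations, which is what makes the GPU-resident hot set effective despite its size.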
This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.\n\n### m-LoRA\n- https://arxiv.org/abs/2312.02515\n- https://github.com/TUDB-Labs/multi-lora-fine-tune\n\nm-LoRA (a.k.a. Multi-Lora Fine-Tune) is an open-source framework for fine-tuning Large Language Models (LLMs) using the efficient multiple LoRA/QLoRA methods. Key features of m-LoRA include:\n- Efficient LoRA/QLoRA: Optimizes the fine-tuning process, significantly reducing GPU memory usage by leveraging a shared frozen base model.\n- Multiple LoRA Adapters: Support for concurrent fine-tuning of multiple LoRA/QLoRA adapters.\n\n### LASER\n- https://github.com/pratyushasharma/laser\n- https://pratyushasharma.github.io/laser/\n- https://arxiv.org/pdf/2312.13558.pdf\n\nLASER stands for LAyer SElective Rank-Reduction, and is an intervention where we replace a selected weight matrix in the transformer architecture of an LLM with its low-rank approximation. A single LASER transformation consists of 3 hyperparameters: the layer number to modify (ℓ), such as the 16th layer; the parameter type (τ), such as the first MLP layer; and the fraction of the maximum rank to retain (ρ), such as a 0.01 fraction of the rank. We can write this transformation as (ℓ, τ, ρ), and we can stack these transformations and apply them in parallel. The low-rank approximation is performed using SVD. See the figure in the paper for an illustration.\n\n### StripedHyena-7B\n- https://github.com/togethercomputer/stripedhyena\n- https://www.together.ai/blog/stripedhyena-7b\n\nOne of the focus areas at Together Research is new architectures for long context, improved training, and inference performance over the Transformer architecture. 
Spinning out of a research program from our team and academic collaborators, with roots in signal processing-inspired sequence models, we are excited to introduce the StripedHyena models.\n\nStripedHyena is the first alternative model competitive with the best open-source Transformers of similar sizes in short and long-context evaluations.\n\nStripedHyena-Nous-7B (SH-N 7B) is our chat model for this release, and was developed with our collaborators at Nous Research.\n\n### SwiftInfer\n- https://github.com/hpcaitech/SwiftInfer\n\nThe Colossal-AI team has open-sourced SwiftInfer, a TensorRT-based implementation of StreamingLLM that further improves large-model inference performance by 46%, providing an efficient and reliable production solution for multi-turn dialogue inference.\n\n### SPIN（Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models）\n- https://arxiv.org/abs/2401.01335\n\nHarnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. 
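The self-play mechanism sketched above has a DPO-like shape: the human-annotated response plays "chosen", the previous iteration's self-generated response plays "rejected", and the previous iteration itself serves as the reference model. A hedged sketch of such a per-example loss (`beta` is an illustrative regularization strength, not a value from the paper):

```python
import numpy as np

def spin_loss(logp_new_human, logp_old_human, logp_new_self, logp_old_self, beta=0.1):
    """Logistic loss over log-probability ratios: push the current model
    (new) toward the human response and away from its own previous-iteration
    (old) generation. A sketch of the self-play objective, not the official
    implementation."""
    margin = beta * ((logp_new_human - logp_old_human)
                     - (logp_new_self - logp_old_self))
    return float(np.log1p(np.exp(-margin)))  # -log(sigmoid(margin))

# The loss falls as the model separates human data from its own generations:
improving = spin_loss(-1.0, -2.0, -3.0, -2.0)   # human up, self down
neutral = spin_loss(-2.0, -2.0, -2.0, -2.0)     # no separation yet
```

Iterating this (the freshly trained model becomes the next round's generator and reference) is what drives the progressive strengthening the abstract describes.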
Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.\n\n### Self-Rewarding Language Models\n- https://arxiv.org/pdf/2401.10020.pdf\n\nWe posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.\n\n### OPO（On-the-fly Preference Optimization）\n- https://arxiv.org/abs/2312.15907\n- https://gair-nlp.github.io/OPO/\n- https://github.com/GAIR-NLP/OPO\n\nIn this paper, we aim to align large language models with the ever-changing, complex, and diverse human values (e.g., social norms) across time and locations. This presents a challenge to existing alignment techniques, such as supervised fine-tuning, which internalize values within model parameters. 
To overcome this, we propose an On-the-fly Preference Optimization (OPO) method, which is a real-time alignment that works in a streaming way. It employs an external memory to store established rules for alignment, which can constrain LLMs' behaviors without further training, allowing for convenient updates and customization of human values. We also introduce a scalable evaluation to assess the proposed method more effectively. Experimental results on both human-annotated and auto-generated questions from legal and moral domains indicate the effectiveness of the proposed OPO method. \n\n### ASPIRE\n- https://aclanthology.org/2023.findings-emnlp.345.pdf\n\nLarge language models (LLMs) have recently shown great advances in a variety of tasks, including natural language understanding and generation. However, their use in high-stakes decision-making scenarios is still limited due to the potential for errors. Selective prediction is a technique that can be used to improve the reliability of the LLMs by allowing them to abstain from making predictions when they are unsure of the answer. In this work, we propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of LLMs. Our framework is based on the idea of using parameter-efficient tuning to adapt the LLM to the specific task at hand while improving its ability to perform self-evaluation. We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods. For example, on the CoQA benchmark, our method improves the AUACC from 91.23% to 92.63% and improves the AUROC from 74.61% to 80.25%.\n\n### The Impact of Reasoning Step Length on Large Language Models\n- https://arxiv.org/abs/2401.04925\n- https://github.com/jmyissb/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models\n\nChain of Thought (CoT) is significant in improving the reasoning abilities of large language models (LLMs). 
However, the correlation between the effectiveness of CoT and the length of reasoning steps in prompts remains largely unknown. To shed light on this, we have conducted several empirical experiments to explore the relations. Specifically, we design experiments that expand and compress the rationale reasoning steps within CoT demonstrations, while keeping all other factors constant. We have the following key findings. First, the results indicate that lengthening the reasoning steps in prompts, even without adding new information into the prompt, considerably enhances LLMs' reasoning abilities across multiple datasets. Alternatively, shortening the reasoning steps, even while preserving the key information, significantly diminishes the reasoning abilities of models. This finding highlights the importance of the number of steps in CoT prompts and provides practical guidance to make better use of LLMs' potential in complex problem-solving scenarios. Second, we also investigated the relationship between the performance of CoT and the rationales used in demonstrations. Surprisingly, the result shows that even incorrect rationales can yield favorable outcomes if they maintain the requisite length of inference. Third, we observed that the advantages of increasing reasoning steps are task-dependent: simpler tasks require fewer steps, whereas complex tasks gain significantly from longer inference sequences.\n\n### SliceGPT\n- https://arxiv.org/abs/2401.15024\n- https://github.com/microsoft/TransformerCompression\n\nSliceGPT is a new post-training sparsification scheme that makes transformer networks (including LLMs) smaller by first applying orthogonal transformations to each transformer layer that leave the model unchanged, and then slicing off the least-significant rows and columns (chosen by the eigenvalue decay) of the weight matrices. 
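The slicing idea can be illustrated on a two-layer linear chain: insert an orthogonal `Q` (here taken from a PCA of calibration activations, standing in for the paper's computational-invariance transform), which leaves the function unchanged, then keep only the top-k directions so both adjacent weight matrices shrink. A sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 8, 6, 500            # hidden dim, sliced dim, calibration samples

W1 = rng.normal(size=(d, d))   # layer producing hidden states
W2 = rng.normal(size=(d, d))   # layer consuming them
X = rng.normal(size=(n, d))    # calibration inputs

H = X @ W1.T                   # hidden activations
# Orthogonal Q from the eigendecomposition of the activation covariance,
# sorted by decreasing eigenvalue (a PCA-style stand-in).
eigvals, Q = np.linalg.eigh(H.T @ H)
Q = Q[:, ::-1]

# Inserting Q Q^T = I between the layers changes nothing; truncating Q to
# its top-k columns replaces both matrices with smaller dense ones.
W1_sliced = Q[:, :k].T @ W1    # (k, d)
W2_sliced = W2 @ Q[:, :k]      # (d, k)

Y_full = H @ W2.T
Y_sliced = (X @ W1_sliced.T) @ W2_sliced.T   # approximates Y_full
```

With the full `Q` the output is exact; slicing drops only the low-eigenvalue directions, which is why the error tracks the eigenvalue decay of the activations.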
The model structure is left unchanged, but each weight matrix is replaced by a smaller (dense) weight matrix, reducing the embedding dimension of the model. This results in speedups (without any additional code optimization) and a reduced memory footprint.\n\n### FuseLLM\n- https://github.com/fanqiwan/FuseLLM\n- https://arxiv.org/abs/2401.10491\n\nWhile training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.\n\n### Tree of Thoughts\n- https://arxiv.org/abs/2305.10601\n- https://github.com/princeton-nlp/tree-of-thought-llm\n\nLanguage models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. 
To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.\n\n### CogGPT\n- https://github.com/KwaiKEG/CogGPT\n- https://arxiv.org/abs/2401.08438\n\nCognitive dynamics are pivotal to advance human understanding of the world. Recent advancements in large language models (LLMs) reveal their potential for cognitive simulation. However, these LLM-based cognitive studies primarily focus on static modeling, overlooking the dynamic nature of cognition. To bridge this gap, we propose the concept of the cognitive dynamics of LLMs and present a corresponding task with the inspiration of longitudinal studies. Towards the task, we develop CogBench, a novel benchmark to assess the cognitive dynamics of LLMs and validate it through participant surveys. We also design two evaluation metrics for CogBench, including Authenticity and Rationality. Recognizing the inherent static nature of LLMs, we introduce CogGPT for the task, which features an innovative iterative cognitive mechanism aimed at enhancing lifelong cognitive dynamics. 
Empirical results demonstrate the superiority of CogGPT over existing methods, particularly in its ability to facilitate role-specific cognitive dynamics under continuous information flows.\n\n### KTO（Kahneman-Tversky Optimisation）\n- https://github.com/ContextualAI/HALOs/blob/main/assets/report.pdf\n- https://github.com/ContextualAI/HALOs\n\nThis repo draws from the excellently written DPO repo and has preserved many design choices from the original. Some of the key changes we introduced are:\n\n- making data loading more modular, so that you can easily write your own dataloader\n- making trainers more modular, so that each HALO has its own trainer subclass\n- adding code for doing open-ended evaluation with GPT-4 as a judge\n- supporting losses beyond SFT and DPO (including KTO, PPO (offline, off-policy variant), and SLiC)\n\n### Aligner\n- https://aligner2024.github.io\n- https://arxiv.org/abs/2402.02416\n\nEfforts to align Large Language Models (LLMs) are mainly conducted via Reinforcement Learning from Human Feedback (RLHF) methods. However, RLHF encounters major challenges including training reward models, actor-critic engineering, and importantly, it requires access to LLM parameters. Here we introduce Aligner, a new efficient alignment paradigm that bypasses the whole RLHF process by learning the correctional residuals between the aligned and the unaligned answers. Our Aligner offers several key advantages. Firstly, it is an autoregressive seq2seq model that is trained on the query-answer-correction dataset via supervised learning; this offers a parameter-efficient alignment solution with minimal resources. Secondly, the Aligner facilitates weak-to-strong generalization; finetuning large pretrained models by Aligner's supervisory signals demonstrates strong performance boost. Thirdly, Aligner functions as a model-agnostic plug-and-play module, allowing for its direct application on different open-source and API-based models. 
Remarkably, Aligner-7B improves 11 different LLMs by 21.9% in helpfulness and 23.8% in harmlessness on average (GPT-4 by 17.5% and 26.9%). When finetuning (strong) Llama2-70B with (weak) Aligner-13B's supervision, we can improve Llama2 by 8.2% in helpfulness and 61.6% in harmlessness. \n\n### RPO（Robust Prompt Optimization）\n- https://arxiv.org/abs/2401.17263\n\nDespite advances in AI alignment, language models (LM) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries modify input prompts to induce harmful behavior. While some defenses have been proposed, they focus on narrow threat models and fall short of a strong defense, which we posit should be effective, universal, and practical. To achieve this, we propose the first adversarial objective for defending LMs against jailbreaking attacks and an algorithm, robust prompt optimization (RPO), that uses gradient-based token optimization to enforce harmless outputs. This results in an easily accessible suffix that significantly improves robustness to both jailbreaks seen during optimization and unknown, held-out jailbreaks, reducing the attack success rate on Starling-7B from 84% to 8.66% across 20 jailbreaks. In addition, we find that RPO has a minor effect on benign use, is successful under adaptive attacks, and can transfer to black-box models, reducing the success rate of the strongest attack on GPT-4, GUARD, from 92% to 6%.\n\n### Inference-Time Training Helps Long Text Generation\n- https://arxiv.org/abs/2401.11504\n- https://github.com/TemporaryLoRA/Temp-LoRA/tree/main\n\nLong text generation, such as novel writing or discourse-level translation with extremely long contexts, presents significant challenges to current language models. Existing methods mainly focus on extending the model's context window through strategies like length extrapolation. However, these approaches demand substantial hardware resources during the training and/or inference phases. 
Our proposed method, Temp-Lora, introduces an alternative concept. Instead of relying on the KV cache to store all context information, Temp-Lora embeds this information directly into the model's parameters. In the process of long text generation, we use a temporary Lora module, progressively trained with text generated previously. This approach not only efficiently preserves contextual knowledge but also prevents any permanent alteration to the model's parameters given that the module is discarded post-generation. Extensive experiments on the PG19 language modeling benchmark and the GuoFeng discourse-level translation benchmark validate the effectiveness of Temp-Lora. Our results show that: 1) Temp-Lora substantially enhances generation quality for long texts, as indicated by a 13.2% decrease in perplexity on a subset of PG19, and a 29.6% decrease in perplexity along with a 53.2% increase in BLEU score on GuoFeng, 2) Temp-Lora is compatible with and enhances most existing long text generation methods, and 3) Temp-Lora can greatly reduce computational costs by shortening the context window. While ensuring a slight improvement in generation quality (a decrease of 3.8% in PPL), it enables a reduction of 70.5% in the FLOPs required for inference and a 51.5% decrease in latency.\n\n### LiPO\n- https://arxiv.org/abs/2402.01878\n\nAligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the format of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, such a study on directly fitting a list of responses is lacking. 
In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.\n\n### ChatLLM.cpp\n- https://github.com/foldl/chatllm.cpp\n\nInference of a bunch of models from less than 3B to more than 45B, for real-time chatting on your computer (CPU), pure C++ implementation based on @ggerganov's ggml.\n\n### Self-Discover\n- https://arxiv.org/abs/2402.03620\n\nWe introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). 
Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x less inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.\n\n### DoRA\n- https://arxiv.org/abs/2402.09353\n- https://github.com/catid/dora\n\nAmong the widely used parameter-efficient finetuning (PEFT) methods, LoRA and its variants have gained considerable popularity because they avoid additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to match the learning capacity of FT based on these findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.\n\n### GPO (Generalized Preference Optimization)\n- https://arxiv.org/abs/2402.05749\n\nOffline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. 
GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.\n\n### CoT-decoding\n- https://arxiv.org/abs/2402.10200\n\nIn enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) prompting. These methods, while effective, often involve manually intensive prompt engineering. Our study takes a novel approach by asking: Can LLMs reason effectively without prompting? Our findings reveal that, intriguingly, CoT reasoning paths can be elicited from pre-trained LLMs by simply altering the *decoding* process. Rather than conventional greedy decoding, we investigate the top-k alternative tokens, uncovering that CoT paths are frequently inherent in these sequences. This approach not only bypasses the confounders of prompting but also allows us to assess the LLMs' *intrinsic* reasoning abilities. Moreover, we observe that the presence of a CoT in the decoding path correlates with a higher confidence in the model's decoded answer. This confidence metric effectively differentiates between CoT and non-CoT paths. 
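As an illustration, the CoT-decoding idea just described can be reduced to a toy sketch: branch on the top-k first tokens instead of committing to the greedy one, decode each branch greedily, and keep the branch whose answer token has the highest confidence (probability margin between the top-1 and top-2 candidates). The "language model" below is a hand-built lookup table; its vocabulary and probabilities are invented for illustration and are not from the paper.

```python
# Toy stand-in for an LLM: maps a token prefix to a next-token distribution.
# (Invented for illustration -- not the paper's model or code.)
TOY_LM = {
    (): {"60": 0.55, "I": 0.45},                      # greedy jumps straight to an answer
    ("60",): {"<eos>": 1.0},
    ("I",): {"count",}.__class__ and {"count": 1.0},  # CoT-style continuation
    ("I", "count"): {"step-by-step:": 1.0},
    ("I", "count", "step-by-step:"): {"5": 0.98, "6": 0.02},  # confident answer
    ("I", "count", "step-by-step:", "5"): {"<eos>": 1.0},
}

def greedy_continue(prefix):
    """Greedily extend a prefix until <eos>; return the full token list."""
    tokens = list(prefix)
    while tokens[-1] != "<eos>":
        dist = TOY_LM[tuple(tokens)]
        tokens.append(max(dist, key=dist.get))
    return tokens

def answer_confidence(tokens):
    """Margin between top-1 and top-2 probability at the answer token
    (here: the last token before <eos>)."""
    answer_pos = len(tokens) - 2
    dist = TOY_LM[tuple(tokens[:answer_pos])]
    ranked = sorted(dist.values(), reverse=True)
    top2 = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - top2

def cot_decode(k=2):
    """Branch on the top-k first tokens, then pick the most confident path."""
    first_dist = TOY_LM[()]
    branches = sorted(first_dist, key=first_dist.get, reverse=True)[:k]
    paths = [greedy_continue([t]) for t in branches]
    return max(paths, key=answer_confidence)
```

Here plain greedy decoding emits the low-confidence direct answer, while CoT-decoding with k=2 surfaces the step-by-step branch whose answer token is far more confident.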
Extensive empirical studies on various reasoning benchmarks show that the proposed CoT-decoding substantially outperforms the standard greedy decoding.\n\n### FSDP&QLoRA (Answer.AI)\n- https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html\n- https://github.com/AnswerDotAI/fsdp_qlora/tree/main\n\nWe’re releasing Answer.AI’s first project: a fully open source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.\n\n### MindNLP\n- https://github.com/mindspore-lab/mindnlp\n\nMindNLP is an open source NLP library based on MindSpore. It provides a platform for solving natural language processing tasks and contains many common NLP approaches. It can help researchers and developers to construct and train models more conveniently and rapidly.\n\n### GaLore\n- https://arxiv.org/abs/2403.03507\n- https://github.com/jiaweizzhao/GaLore\n\nTraining Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. 
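A rank-1, pure-Python sketch of GaLore's core move (illustrative only; the real method uses rank-r SVD projections, per-layer subspace refresh, and PyTorch optimizers): project the gradient onto its dominant singular direction, keep optimizer state only in that low-rank space, and project the update back to the full weight.

```python
import math

def matvec(M, v):    # M (m x n) times v (n,)
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matvec_T(M, u):  # M^T times u (m,)
    return [sum(M[i][j] * u[i] for i in range(len(M))) for j in range(len(M[0]))]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def top_left_singular_vector(G, iters=50):
    """Power iteration for the dominant left singular vector of G."""
    u = normalize([1.0] * len(G))
    for _ in range(iters):
        u = normalize(matvec(G, normalize(matvec_T(G, u))))
    return u

def galore_step(W, G, lr=0.1):
    """One GaLore-style step: low-rank gradient R = P^T G, update W <- W - lr * P R.
    An optimizer such as Adam would keep its state on R (size n), not on the
    full m x n gradient."""
    u = top_left_singular_vector(G)   # projection P, here rank 1
    r = matvec_T(G, u)                # low-rank gradient R = P^T G
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] -= lr * u[i] * r[j]
    return W, r
```

The memory saving is visible in the shapes: optimizer state lives on `r` (length n) rather than on the full m-by-n gradient.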
Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.\n\n### Mixture-of-LoRAs\n- https://arxiv.org/abs/2403.03432\n\nInstruction Tuning has the potential to stimulate or enhance specific capabilities of large language models (LLMs). However, achieving the right balance of data is crucial to prevent catastrophic forgetting and interference between tasks. To address these limitations and enhance training flexibility, we propose the Mixture-of-LoRAs (MoA) architecture, a novel and parameter-efficient tuning method designed for multi-task learning with LLMs. In this paper, we start by individually training multiple domain-specific LoRA modules using corresponding supervised corpus data. These LoRA modules can be aligned with the expert design principles observed in Mixture-of-Experts (MoE). Subsequently, we combine the multiple LoRAs using an explicit routing strategy and introduce domain labels to facilitate multi-task learning, which helps prevent interference between tasks and ultimately enhances the performance of each individual task. Furthermore, each LoRA model can be iteratively adapted to a new domain, allowing for quick domain-specific adaptation. 
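The explicit-routing idea can be sketched in a few lines (a toy, not the paper's code): several domain-specific LoRA pairs share one frozen base weight, and the domain label routes each request to exactly one low-rank delta.

```python
def matvec(M, x):
    """Dense matrix-vector product over nested lists."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

class MoALinear:
    """Mixture-of-LoRAs-style linear layer (illustrative sketch).

    W is the frozen base weight (m x n); `loras` maps a domain label to a
    low-rank pair (B, A) with B: m x r and A: r x n, so delta W = B @ A."""

    def __init__(self, W, loras):
        self.W = W
        self.loras = loras

    def forward(self, x, domain):
        B, A = self.loras[domain]        # explicit routing by domain label
        base = matvec(self.W, x)
        delta = matvec(B, matvec(A, x))  # cheap low-rank path: B (A x)
        return [b + d for b, d in zip(base, delta)]
```

Because routing is explicit, adapting to a new domain only means training and registering one more (B, A) pair; the base weight and other domains are untouched.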
Experiments on diverse tasks demonstrate superior and robust performance, which can further promote the wide application of domain-specific LLMs.\n\n### LLaMA Factory\n- https://github.com/hiyouga/LLaMA-Factory\n\nLLaMA Factory is an efficient, easy-to-use, and extensible open-source full-stack framework for fine-tuning large models. Within half a year it gained 10,000 stars in the GitHub open-source community, and it has drawn attention or been adopted in production by companies at home and abroad such as Hugging Face, Avalon Labs, and Meituan. This talk dissects the motivation and building blocks of LLaMA Factory from the perspective of efficient large-model training, including the full-stack fine-tuning adaptation principles for over a hundred large models, LoRA operator optimization and acceleration methods, and ideas for integrating various fine-tuning tricks.\n\n### InfLLM\n- https://arxiv.org/abs/2402.04617\n- https://github.com/thunlp/InfLLM\n\nLarge language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs, such as LLM-driven agents. However, existing LLMs, pre-trained on sequences with restricted maximum length, cannot generalize to longer sequences due to the out-of-domain and distraction issues. To alleviate these issues, existing efforts employ sliding attention windows and discard distant tokens to process extremely long sequences. Unfortunately, these approaches inevitably fail to capture long-distance dependencies within sequences to deeply understand semantics. This paper introduces a training-free memory-based method, InfLLM, to unveil the intrinsic ability of LLMs to process streaming long sequences. Specifically, InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to look up token-relevant units for attention computation. Thereby, InfLLM allows LLMs to efficiently process long sequences while maintaining the ability to capture long-distance dependencies. Without any training, InfLLM enables LLMs pre-trained on sequences of a few thousand tokens to achieve performance superior to competitive baselines that continually train these LLMs on long sequences. 
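The memory-unit lookup can be illustrated with a toy sketch (not the repo's API): evicted context blocks are stored with a representative vector each, and at generation time only the blocks most similar to the current query rejoin the attention computation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class ContextMemory:
    """Toy InfLLM-style memory: distant context blocks summarized by a
    representative vector, retrieved by similarity to the current query."""

    def __init__(self, top_k=1):
        self.units = []     # list of (representative_vector, tokens)
        self.top_k = top_k

    def evict(self, rep_vec, tokens):
        """Move an old context block out of the local window into memory."""
        self.units.append((rep_vec, tokens))

    def lookup(self, query):
        """Return the tokens of the top-k units most similar to the query;
        in the real method these would be attended to alongside the local window."""
        ranked = sorted(self.units, key=lambda u: dot(u[0], query), reverse=True)
        return [tokens for _, tokens in ranked[:self.top_k]]
```

Because only top-k units (not the whole history) enter attention, the per-step cost stays bounded no matter how long the stream grows.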
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.\n\n### MediaPipe\n- https://github.com/googlesamples/mediapipe/tree/main\n\nMediaPipe Solutions streamlines on-device ML development and deployment with flexible low-code / no-code tools that provide the modular building blocks for creating custom high-performance solutions for cross-platform deployment.\n\n### OneBit\n- https://arxiv.org/abs/2402.11295\n\nModel quantization uses low bit-width values to represent the weight matrices of models, which is a promising approach to reduce both storage and computational overheads of deploying highly anticipated LLMs. However, existing quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus focus on utilizing 4-bit or 8-bit values to quantize models. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs. To this end, we introduce a 1-bit quantization-aware training (QAT) framework named OneBit, including a novel 1-bit parameter representation method to better quantize LLMs as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the QAT framework. Extensive experimental results indicate that OneBit achieves good performance (at least 83% of the non-quantized performance) with robust training processes when only using 1-bit weight matrices.\n\n### RWKV_Pytorch\n- https://github.com/yuunnn-w/RWKV_Pytorch\n\nThis is an inference framework for the RWKV large language model implemented purely in native PyTorch. The official native implementation is overly complex and lacks extensibility. 
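Looking back at OneBit above: its 1-bit representation keeps a sign matrix plus two floating-point value vectors, so that W_ij ≈ sign(W_ij) * a_i * b_j. The sketch below is illustrative only; a and b come from a crude rank-1 fit of |W| via row/column means, not the paper's matrix-decomposition-based initialization.

```python
def onebit_compress(W):
    """Toy OneBit-style compression: 1-bit signs plus two value vectors a, b.
    The rank-1 fit of |W| here (row/column means) is a stand-in for the
    paper's initialization; it is exact only when |W| is rank-1."""
    m, n = len(W), len(W[0])
    absW = [[abs(x) for x in row] for row in W]
    overall = sum(sum(row) for row in absW) / (m * n)
    a = [sum(row) / n for row in absW]                                  # row means
    b = [sum(absW[i][j] for i in range(m)) / m / overall for j in range(n)]
    signs = [[1 if x >= 0 else -1 for x in row] for row in W]           # the 1-bit part
    return signs, a, b

def onebit_decompress(signs, a, b):
    """Reconstruct W_ij ≈ signs_ij * a_i * b_j."""
    return [[signs[i][j] * a[i] * b[j] for j in range(len(b))]
            for i in range(len(a))]
```

Storage drops from one float per weight to one bit per weight plus m + n floats, which is where the "extremely low bit-width" claim comes from.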
Let's join the flexible PyTorch ecosystem and open-source it together!\n\n### llama2.mojo\n- https://mp.weixin.qq.com/s/NpIUReKV-9hb05HXzu7Pdg\n- https://github.com/tairov/llama2.mojo\n\nThis repository serves as a port that provides a Mojo-based implementation of llama2.c.\n\nWith the release of Mojo, I was inspired to take my Python port of llama2.py and transition it to Mojo. The result? A version that leverages Mojo's SIMD & vectorization primitives, boosting the Python performance by nearly 250x. Impressively, the Mojo version now outperforms the original llama2.c, even in runfast mode, by 15-20%. This showcases the potential of hardware-level optimizations through Mojo's advanced features. I think this can also help us see how far we can go with hardware-level optimizations of the original llama2.c.\n\n### LightLLM\n- https://github.com/ModelTC/lightllm\n\nLightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. 
LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.\n\n**Features**\n- Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.\n- Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.\n- Dynamic Batch: enables dynamic batch scheduling of requests.\n- FlashAttention: incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.\n- Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.\n- Token Attention: implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference.\n- High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.\n\n### Megatron-LLaMA\n- https://mp.weixin.qq.com/s/9yEWvqR5QtCPQVxJHw-wCA\n- https://github.com/alibaba/Megatron-LLaMA\n\nTo facilitate the training of LLaMA-based models and reduce the cost of hardware resources, Alibaba has decided to release its internally optimized Megatron-LLaMA training framework to the community. Megatron-LLaMA makes the following contributions:\n\n(i) A standard implementation of LLaMA in Megatron-LLaMA: It is easy to obtain the LLaMA code from Huggingface, which does not involve the various parallel methods provided by Megatron-LM. Megatron-LLaMA offers a standard implementation of LLaMA in Megatron-LM, allowing developers to configure the optimization techniques on demand. 
We will continue to release features such as Alibi and FlashAttention2 in the future.\n\n(ii) Efficient communication-computation parallelism: Similar to DeepSpeed ZeRO Stage 2, Megatron-LM implements a DistributedOptimizer that partitions the gradient and optimizer state, significantly reducing GPU memory usage. However, the solution provided by Megatron-LM does not fully overlap GPU computation with communication, resulting in underutilization of hardware resources. Building upon the original DistributedOptimizer and ZeRO-Stage-2, Megatron-LLaMA proposes a novel approach for gradient and optimizer state sharding, achieving the following benefits without compromising precision: a) extremely high parallelism between communication and computation; b) highly efficient utilization of communication bandwidth; c) lower GPU memory usage. Consequently, Megatron-LLaMA enables higher training throughput on the same hardware configuration than the vanilla Megatron-LM.\n\n(iii) Utilities: Megatron-LLaMA supplements several utilities and improves the checkpoint mechanism in Megatron-LM, including: a) Distributed checkpoint saving/restoring for faster save and restore. This also provides abstract filesystem interfaces for easily integrating distributed file systems such as HDFS; b) Convenient interface for weight conversion from/to the HuggingFace format, facilitating delivery to downstream tasks after pretraining; c) Support for Tokenizers in the HuggingFace transformers library.\n\n### MeZO: Fine-Tuning Language Models with Just Forward Passes\n- https://github.com/princeton-nlp/MeZO\n- https://arxiv.org/abs/2305.17333\n- https://mp.weixin.qq.com/s/3RLCVQg2QJGSiDUtx9DgPg\n\nThis is the implementation for the paper Fine-Tuning Language Models with Just Forward Passes. 
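MeZO's core trick, a two-point zeroth-order gradient estimate applied in place, can be shown with a minimal toy (not the repo's API): perturb the parameters with noise z, evaluate the loss at θ+εz and θ-εz, restore θ, and step along z scaled by the estimated directional derivative. Regenerating z from a saved random seed is what keeps memory at inference level: no per-parameter gradients or optimizer state are stored.

```python
import random

def mezo_step(theta, loss_fn, eps=1e-3, lr=0.05, seed=0):
    """One in-place zeroth-order SGD step on a parameter list (toy sketch)."""
    def perturb(sign):
        rng = random.Random(seed)         # same seed -> same z every time
        for i in range(len(theta)):
            theta[i] += sign * eps * rng.gauss(0.0, 1.0)

    perturb(+1)
    loss_plus = loss_fn(theta)            # loss at theta + eps*z
    perturb(-2)
    loss_minus = loss_fn(theta)           # loss at theta - eps*z
    perturb(+1)                           # restore theta in place
    grad_est = (loss_plus - loss_minus) / (2 * eps)

    rng = random.Random(seed)             # regenerate z instead of storing it
    for i in range(len(theta)):
        theta[i] -= lr * grad_est * rng.gauss(0.0, 1.0)
    return theta

# Usage: minimize a simple quadratic with forward passes only.
theta = [5.0, -3.0]
loss = lambda t: sum(x * x for x in t)
for step in range(200):
    mezo_step(theta, loss, seed=step)
```

The only extra memory beyond the parameters themselves is one integer seed per step, which is the essence of the paper's "same memory footprint as inference" claim.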
In this paper, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical zeroth-order SGD method to operate in-place, thereby fine-tuning language models (LMs) with the same memory footprint as inference.\n\nWith a single A100 80GB GPU, MeZO can train a 30-billion parameter OPT model, whereas fine-tuning with Adam can train only a 2.7B LM. MeZO demonstrates comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12× memory reduction. MeZO is also compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning. We also show that MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1).\n\n### MLC LLM\n- https://github.com/mlc-ai/mlc-llm\n\nMLC LLM is a universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications, plus a productive framework for everyone to further optimize model performance for their own use cases.\n\nOur mission is to enable everyone to develop, optimize and deploy AI models natively on everyone's devices.\n\nEverything runs locally with no server support, accelerated by local GPUs on your phones and laptops. 
Supported platforms include:\n- iPhone, iPad\n- Metal GPUs and Intel/ARM MacBooks\n- AMD, Intel and NVIDIA GPUs via Vulkan on Windows and Linux\n- NVIDIA GPUs via CUDA on Windows and Linux\n- WebGPU on browsers (through the companion project WebLLM)\n\n### PKU-Beaver (Safe RLHF)\n- https://github.com/PKU-Alignment/safe-rlhf\n- https://mp.weixin.qq.com/s/ZpkgszXbisl5xf63EfTNjQ\n\nThe Peking University team has open-sourced the PKU-Beaver ("Beaver") project at https://github.com/PKU-Alignment/safe-rlhf. The project is the first to publicly release the datasets and the training and validation code required for RLHF, making it the first open-source, reproducible RLHF baseline. To address unsafe factors such as bias and discrimination introduced by human annotation, the team also proposed Constrained Value Alignment (CVA), a value-alignment technique with constraints. By partitioning annotation information at a fine granularity and combining it with constrained safe reinforcement learning, CVA significantly reduces the model's bias and discrimination and improves its safety. Beaver was evaluated with GPT-4, and the results show that, with its original performance unchanged, the safety of Beaver's responses improved substantially.\n\n### PaLM + RLHF (Pytorch)\n- https://github.com/lucidrains/PaLM-rlhf-pytorch\n\nImplementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Maybe I'll add retrieval functionality too, à la RETRO\n\n### RL4LMs\n- https://github.com/allenai/RL4LMs\n- https://rl4lms.apps.allenai.org/\n\nA modular RL library to fine-tune language models to human preferences.\n\nWe provide easily customizable building blocks for training language models, including implementations of on-policy algorithms, reward functions, metrics, datasets, and LM-based actor-critic policies.\n\n### Reinforcement Learning with Language Model\n- https://github.com/HarderThenHarder/transformers_tasks/tree/main/RLHF\n\nIn this project, we use the open-source trl library to build several examples of updating a language model (GPT-2) with a reinforcement learning algorithm (PPO), including:\n- A positive-review generation bot based on a Chinese sentiment classification model (No Human Reward)\n- A positive-review generation bot based on human scoring (With Human Reward)\n- Training a reward model from ranked sequences (Rank List)\n- An annotation platform for ranked sequences (Rank List)\n\n### SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression\n- https://github.com/Vahe1994/SpQR\n- https://arxiv.org/pdf/2306.03078.pdf\n- https://mp.weixin.qq.com/s/819L-dY54BaVM1vub9OSpQ\n\nSpQR works by identifying and isolating outlier weights that cause particularly large quantization errors. These outliers are stored at higher precision while all other weights are compressed to 3-4 bits, achieving a relative perplexity loss of less than 1% for LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24GB consumer GPU without any performance degradation, while also delivering a 15% speedup.\n\n### Scikit-LLM: Sklearn Meets Large Language Models\n- https://github.com/iryna-kondr/scikit-llm\n\nSeamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.\n\n### Transformer Reinforcement Learning\n- https://github.com/lvwerra/trl\n\nWith trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by 🤗 Hugging Face. Therefore, pre-trained language models can be directly loaded via transformers. At this point most decoder and encoder-decoder architectures are supported.\n\n### Train_Transformers_with_INT4\n- https://mp.weixin.qq.com/s/pyEJJ5AvQqfyncO7CA8eNA\n- https://arxiv.org/abs/2306.11987\n- https://github.com/xijiu9/Train_Transformers_with_INT4\n\nQuantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with INT4 arithmetic. Training with an ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activation and gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress the outliers. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. 
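For intuition, the basic building block of such schemes, symmetric low-bit quantization of a matrix, can be sketched in a few lines (illustrative only; the paper's method additionally uses a Hadamard transform to tame outliers and bit splitting / leverage score sampling for gradients, all omitted here, and the symmetric range [-7, 7] is chosen for simplicity over full INT4 [-8, 7]).

```python
def quantize_int4(M):
    """Map floats to integers in [-7, 7] with a single per-matrix scale."""
    absmax = max(abs(x) for row in M for x in row) or 1.0
    scale = absmax / 7.0
    q = [[max(-7, min(7, round(x / scale))) for x in row] for row in M]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate floats from the integer codes."""
    return [[v * scale for v in row] for row in q]
```

With one scale per matrix, the worst-case rounding error is half a quantization step (scale / 2); the outlier problem the paper targets arises exactly because a single large entry inflates `absmax` and thus that step size for every other entry.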
Our prototypical linear operator implementation is up to 2.2 times faster than the FP16 counterparts and speeds up the training by up to 35.1%.\n\n### Transformer Reinforcement Learning X\n- https://github.com/CarperAI/trlx\n\ntrlX is a distributed training framework designed from the ground up to focus on fine-tuning large language models with reinforcement learning using either a provided reward function or a reward-labeled dataset.\n\nTraining support for 🤗 Hugging Face models is provided by Accelerate-backed trainers, allowing users to fine-tune causal and T5-based language models of up to 20B parameters, such as facebook/opt-6.7b, EleutherAI/gpt-neox-20b, and google/flan-t5-xxl. For models beyond 20B parameters, trlX provides NVIDIA NeMo-backed trainers that leverage efficient parallelism techniques to scale effectively.\n\n### vLLM\n- https://github.com/vllm-project/vllm\n\nvLLM is a fast and easy-to-use library for LLM inference and serving.\n\nvLLM is fast with:\n- State-of-the-art serving throughput\n- Efficient management of attention key and value memory with PagedAttention\n- Dynamic batching of incoming requests\n- Optimized CUDA kernels\n\nvLLM is flexible and easy to use with:\n- Seamless integration with popular HuggingFace models\n- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more\n- Tensor parallelism support for distributed inference\n- Streaming outputs\n- OpenAI-compatible API server\n\n## 3 Other Open-Source Models for Reference\n### Cerebras (commercial use permitted)\n- https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/\n- https://huggingface.co/cerebras\n\nCerebras has open-sourced seven GPT models, all usable commercially, together with the datasets and directly downloadable pre-trained weights. The parameter counts are 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B; the largest, at 13B parameters, is comparable to Meta's recently open-sourced LLaMA-13B. The project releases both the datasets and the pre-trained model weights (the weight files, nearly 50GB, can be downloaded directly) for commercial and research use. Compared with the earlier GPT-3 models, the Cerebras open-source models offer greater usability and transparency: researchers and developers can fine-tune them with a small amount of data to build high-quality natural language processing applications.\n\n### ChatDoctor\n- 
https://github.com/Kent0n-Li/ChatDoctor\n- https://arxiv.org/abs/2303.14070\n\nRecent large language models (LLMs) in the general domain, such as ChatGPT, have shown remarkable success in following instructions and producing human-like responses. However, such language models have yet to be adapted for the medical domain, resulting in poor accuracy of responses and an inability to provide sound advice on medical diagnoses, medications, etc. To address this problem, we fine-tuned our ChatDoctor model based on 100k real-world patient-physician conversations from an online medical consultation site. Besides, we add autonomous knowledge retrieval capabilities to our ChatDoctor, for example, Wikipedia or a disease database as a knowledge brain. By fine-tuning the LLMs using these 100k patient-physician conversations, our model showed significant improvements in understanding patients' needs and providing informed advice. The autonomous ChatDoctor model based on Wikipedia and Database Brain can access real-time and authoritative information and answer patient questions based on this information, significantly improving the accuracy of the model's responses, which shows extraordinary potential for the medical field with a low tolerance for error.\n\n### Code Llama (Meta AI)\n- https://github.com/facebookresearch/codellama\n- https://ai.meta.com/blog/code-llama-large-language-model-coding/\n\nCode Llama is a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. 
All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama was developed by fine-tuning Llama 2 using a higher sampling of code. As with Llama 2, we applied considerable safety mitigations to the fine-tuned versions of the model. For detailed information on model training, architecture and parameters, evaluations, responsible AI and safety refer to our research paper. Output generated by code generation features of the Llama Materials, including Code Llama, may be subject to third party licenses, including, without limitation, open source licenses.\n\nWe are unlocking the power of large language models and our latest version of Code Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly. This release includes model weights and starting code for pretrained and fine-tuned Llama language models — ranging from 7B to 34B parameters.\n\n### Dolly 1&2 (commercial use permitted)\n- https://github.com/databrickslabs/dolly\n- https://huggingface.co/databricks/dolly-v2-12b\n- https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html\n\nWe show that anyone can take a dated off-the-shelf open source large language model (LLM) and give it magical ChatGPT-like instruction following ability by training it in 30 minutes on one machine, using high-quality training data. Surprisingly, instruction-following does not seem to require the latest or largest models: our model is only 6 billion parameters, compared to 175 billion for GPT-3. We open source the code for our model (Dolly) and show how it can be re-created on Databricks. 
We believe models like Dolly will help democratize LLMs, transforming them from something very few companies can afford into a commodity every company can own and customize to improve their products.\n\n### FinGPT\n- https://github.com/ai4finance-foundation/fingpt\n- https://arxiv.org/pdf/2306.06031v1.pdf\n- https://mp.weixin.qq.com/s/A9euFin675nxGGciiX6rJQ\n\nLarge language models (LLMs) have shown the potential to revolutionize natural language processing tasks in diverse domains, sparking great interest in finance. Accessing high-quality financial data is the first challenge for financial LLMs (FinLLMs). While proprietary models like BloombergGPT have taken advantage of their unique data accumulation, such privileged access calls for an open-source alternative to democratize Internet-scale financial data.\n\nIn this paper, we present an open-source large language model, FinGPT, for the finance sector. Unlike proprietary models, FinGPT takes a data-centric approach, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. We highlight the importance of an automatic data curation pipeline and the lightweight low-rank adaptation technique in building FinGPT. Furthermore, we showcase several potential applications as stepping stones for users, such as robo-advising, algorithmic trading, and low-code development. Through collaborative efforts within the open-source AI4Finance community, FinGPT aims to stimulate innovation, democratize FinLLMs, and unlock new opportunities in open finance.\n\n### Falcon (commercial use permitted)\n- https://mp.weixin.qq.com/s/mKx0ZiTB28khj4U7EVJiVw\n- https://falconllm.tii.ae/\n- https://huggingface.co/tiiuae/falcon-40b\n\nFalcon LLM is a foundational large language model (LLM) with 40 billion parameters trained on one trillion tokens. 
TII has now released Falcon LLM – a 40B model.\n\nThe model uses only 75 percent of GPT-3’s training compute, 40 percent of Chinchilla’s, and 80 percent of PaLM-62B’s.\n\n### Facebook/Meta LLaMA/LLaMA2\n- https://github.com/facebookresearch/llama\n- https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/\n\n**LLaMA1**\n\nLLaMA: Open and Efficient Foundation Language Models\n\nWe introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.\n\n**LLaMA2**\n\nWe are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly.\n\nThis release includes model weights and starting code for pretrained and fine-tuned Llama language models — ranging from 7B to 70B parameters.\n\nThis repository is intended as a minimal example to load Llama 2 models and run inference. 
For more detailed examples leveraging HuggingFace, see llama-recipes.\n\n### Giraffe\n- https://huggingface.co/abacusai/Giraffe-v2-13b-32k\n- https://github.com/abacusai/long-context\n\nThe choice of how to encode positional information for transformers has been one of the key components of LLM architectures.\n\nAn area that has been interesting to us and others in the community recently is whether LLMs can be extended to longer contexts.\n\nWe have conducted a range of experiments with different schemes for extending context length capabilities of Llama, which has been pretrained on 2048 context length with the RoPE (Rotary Position Embedding) encoding. Here we share some of the results as well as the training and evaluation scripts in the hope that it will be useful to the community. For our best performing models - linear scaling with IFT at scales 4 and 16 - we are also sharing the weights in case others wish to use them, or to conduct their own tests. We believe the scale 16 model should perform well on real world tasks up to 16k context lengths, and potentially even up to about 20-24k context lengths.\n\n### GALACTICA\n- https://github.com/paperswithcode/galai\n- https://arxiv.org/pdf/2211.09085.pdf\n- https://galactica.org/\n\nGALACTICA is a general-purpose scientific language model. It is trained on a large corpus of scientific text and data. It can perform scientific NLP tasks at a high level, as well as tasks such as citation prediction, mathematical reasoning, molecular property prediction and protein annotation. 
More information is available at galactica.org.\n\n### Goat-7B for Arithmetic Tasks\n- https://mp.weixin.qq.com/s/_haINkHNV4bMszm9F41yXA\n- https://arxiv.org/pdf/2305.14201.pdf\n- https://github.com/liutiedong/goat\n\nThis paper introduces Goat, a fine-tuned language model. Unlike previous work on arithmetic computation, the model applies an end-to-end supervised instruction fine-tuning paradigm on LLaMA, trained on a synthetically generated dataset of about one million samples, and it excels at arithmetic tasks. Goat achieves state-of-the-art performance on elementary arithmetic (addition, subtraction, multiplication, and division of integers). Experimental results show that, with supervised fine-tuning alone and without any special techniques, the Goat model can generate answers for large-number addition and subtraction with near-perfect accuracy in a zero-shot setting. This remarkable arithmetic ability is attributed to LLaMA's consistent tokenization of numbers, and was nearly impossible to achieve for previous LLMs such as Bloom, OPT, GPT-NeoX, and Pythia.\n\nThe model struggles, however, with multiplication and division. To overcome this, the paper proposes classifying arithmetic tasks into learnable and unlearnable tasks, and then decomposing unlearnable tasks (such as multi-digit multiplication and division) into a series of learnable tasks by leveraging basic arithmetic principles. The approach ensures that the intermediate supervision that facilitates the model's learning is also easily understood by humans, i.e., the model is fine-tuned to generate a suitable chain of thought (CoT) before producing the final answer, and it substantially outperforms GPT-4 on long multiplication and long division. Model performance is finally evaluated on the arithmetic subtasks of BIG-bench (Srivastava et al., 2022), with a comprehensive assessment of the method's effectiveness. The results show that the model learns computation patterns and generalizes them to unseen data, rather than purely memorizing computations in its weights. In addition, Goat-7B can be trained with LoRA low-rank adaptation on a 24GB VRAM GPU, making the paper's results easy to reproduce.\n\n### HuggingChat\n- https://huggingface.co/chat/\n\nMaking the community's best AI chat models available to everyone.\n\n### Koala: A Dialogue Model for Academic Research\n- https://bair.berkeley.edu/blog/2023/04/03/koala/\n\nIn this post, we introduce Koala, a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web. We describe the dataset curation and training process of our model, and also present the results of a user study that compares our model to ChatGPT and Stanford’s Alpaca. 
Our results show that Koala can effectively respond to a variety of user queries, generating responses that are often preferred over Alpaca, and at least tied with ChatGPT in over half of the cases.\n\n### LongLLaMA\n- https://mp.weixin.qq.com/s/XzaET7WfrNpOf-zdiSxrig\n- https://arxiv.org/pdf/2307.03170.pdf\n- https://github.com/CStanKonrad/long_llama\n- https://huggingface.co/syzymon/long_llama_3b\n\nThis repository contains the research preview of LongLLaMA, a large language model capable of handling long contexts of 256k tokens or even more.\n\nLongLLaMA is built upon the foundation of OpenLLaMA and fine-tuned using the Focused Transformer (FoT) method. We release a smaller 3B variant of the LongLLaMA model under a permissive license (Apache 2.0), along with inference code supporting longer contexts, on Hugging Face. Our model weights can serve as a drop-in replacement for LLaMA in existing implementations (for short contexts up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models. Stay tuned for further updates.\n\n### OpenLLaMA (an open reproduction of LLaMA)\n- https://github.com/openlm-research/open_llama\n\nIn this repo, we release a permissively licensed open source reproduction of Meta AI's LLaMA large language model. In this release, we're releasing a public preview of the 7B OpenLLaMA model that has been trained with 200 billion tokens. We provide PyTorch and Jax weights of pre-trained OpenLLaMA models, as well as evaluation results and comparison against the original LLaMA models.
Stay tuned for our updates.\n\n### Llama-X: Open Academic Research on Improving LLaMA to SOTA LLM\n- https://github.com/AetherCortex/Llama-X\n\nThis is the repo for Llama-X, which aims to:\n- Progressively improve the performance of LLaMA to a SOTA LLM with the open-source community.\n- Conduct Llama-X as open academic research that is long-term, systematic, and rigorous.\n- Save the community from repetitive work by working together to create more and faster improvements.\n\n### Lit-LLaMA\n- https://github.com/Lightning-AI/lit-llama\n\nLit-LLaMA is:\n- Simple: Single-file implementation without boilerplate.\n- Correct: Numerically equivalent to the original model.\n- Optimized: Runs on consumer hardware or at scale.\n- Open-source: No strings attached.\n\n### MammoTH\n- https://github.com/TIGER-AI-Lab/MAmmoTH\n\nWe introduce MAmmoTH 🦣, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, a meticulously curated instruction tuning dataset that is lightweight yet generalizable. MathInstruct is compiled from 13 math rationale datasets, six of which are newly curated by this work. It uniquely focuses on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and ensures extensive coverage of diverse mathematical fields.\n\n### MPT-7B (commercial use permitted)\n- https://www.mosaicml.com/blog/mpt-7b\n- https://huggingface.co/mosaicml/mpt-7b\n\nMPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code. This model was trained by MosaicML.\n\nMPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.\n\nIntroducing MPT-7B, the latest entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code.
It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Starting today, you can train, finetune, and deploy your own private MPT models, either starting from one of our checkpoints or training from scratch. For inspiration, we are also releasing three finetuned models in addition to the base MPT-7B: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+, the last of which uses a context length of 65k tokens!\n\n### OpenGPT\n- https://github.com/CogStack/OpenGPT\n\nA framework for creating grounded, instruction-based datasets and training conversational domain-expert Large Language Models (LLMs).\n\nNHS-LLM: A conversational model for healthcare trained using OpenGPT. All the medical datasets used to train this model were created using OpenGPT and are available below.\n\n### Orca\n- https://aka.ms/orca-lm\n- https://arxiv.org/pdf/2306.02707.pdf\n- https://mp.weixin.qq.com/s/RRdrSeI2ux5QE6MqJ8opSg\n\nRecent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model's capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca (We are working with our legal team to publicly release a diff of the model weights in accordance with LLaMA's release policy to be published at this https URL), a 13-billion parameter model that learns to imitate the reasoning process of LFMs.
Orca learns from rich signals from GPT-4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. To promote this progressive learning, we tap into large-scale and diverse imitation data with judicious sampling and selection. Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% in complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and 42% on AGIEval. Moreover, Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4. Our research indicates that learning from step-by-step explanations, whether these are generated by humans or more advanced AI models, is a promising direction to improve model capabilities and skills.\n\n### OpenChatKit\n- https://www.together.xyz/blog/openchatkit \n- https://huggingface.co/spaces/togethercomputer/OpenChatKit\n- https://github.com/togethercomputer/OpenChatKit\n\nOpenChatKit uses a 20 billion parameter chat model trained on 43 million instructions and supports reasoning, multi-turn conversation, knowledge and generative answers.\n\nOpenChatKit provides a powerful, open-source base to create both specialized and general purpose chatbots for various applications. The kit includes an instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between Together, LAION, and Ontocord.ai. Much more than a model release, this is the beginning of an open source project. 
We are releasing a set of tools and processes for ongoing improvement with community contributions.\n\n### Open-Assistant\n- https://github.com/LAION-AI/Open-Assistant\n- https://open-assistant.io/zh\n\nOpen Assistant is a project meant to give everyone access to a great chat-based large language model.\n\nWe believe that by doing this we will create a revolution in innovation in language. In the same way that stable-diffusion helped the world make art and images in new ways, we hope Open Assistant can help improve the world by improving language itself.\n\n### Platypus\n- https://platypus-llm.github.io/\n- https://github.com/arielnlee/Platypus\n\nWe present Platypus, a family of fine-tuned and merged Large Language Models (LLMs) that achieves the strongest performance and currently stands at first place on HuggingFace's Open LLM Leaderboard as of the release date of this work. In this work we describe (1) our curated dataset Open-Platypus, which is a subset of other open datasets and which we release to the public; (2) our process of fine-tuning and merging LoRA modules in order to conserve the strong prior of pretrained LLMs, while bringing specific domain knowledge to the surface; (3) our efforts in checking for test data leaks and contamination in the training data, which can inform future research. Specifically, the Platypus family achieves strong performance in quantitative LLM metrics across model sizes, topping the global Open LLM leaderboard while using just a fraction of the fine-tuning data and overall compute required for other state-of-the-art fine-tuned LLMs. In particular, a 13B Platypus model can be trained on a single A100 GPU using 25k questions in 5 hours.
This is a testament to the quality of our Open-Platypus dataset, and opens opportunities for more improvements in the field.\n\n### MedLLaMA-13B & PMC-LLaMA: Continue Training LLaMA on Medical Papers\n- https://github.com/chaoyi-wu/PMC-LLaMA\n- https://huggingface.co/chaoyi-wu/PMC_LLAMA_7B\n- https://arxiv.org/abs/2304.14454\n\nWe have released a new model, MedLLaMA-13B, fine-tuned from LLaMA-13B on medical corpora. It has proven more powerful than both LLaMA-13B and PMC-LLaMA; refer to our benchmark for a detailed comparison.\n\n### RedPajama (commercial use permitted)\n- https://www.together.xyz/blog/redpajama\n- https://github.com/togethercomputer/RedPajama-Data\n\nRedPajama, a project to create leading open-source models, starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens.\n\n### SQLCoder (Defog)\n- https://github.com/defog-ai/sqlcoder\n- https://huggingface.co/defog/sqlcoder\n\nSQLCoder is a 15B parameter model that outperforms gpt-3.5-turbo for natural language to SQL generation tasks on our sql-eval framework, and significantly outperforms all popular open-source models. It also significantly outperforms text-davinci-003, a model that's more than 10 times its size.\n\nSQLCoder is fine-tuned on a base StarCoder model.\n\n### StableLM\n- https://zhuanlan.zhihu.com/p/623542189\n- https://github.com/Stability-AI/StableLM\n\nStableLM: Stability AI Language Models\n\nThis repository contains Stability AI's ongoing development of the StableLM series of language models and will be continuously updated with new checkpoints. The following provides an overview of all currently available models.
More coming soon.\n\n### StableVicuna\n- https://github.com/Stability-AI/StableLM\n\nStableVicuna is a further instruction-tuned and RLHF-trained version of Vicuna-13B, which is itself an instruction-tuned model based on LLaMA-13B.\n\n### Stanford Alpaca\n- https://crfm.stanford.edu/2023/03/13/alpaca.html\n- https://alpaca-ai.ngrok.io/\n- https://github.com/tatsu-lab/stanford_alpaca\n\nAlpaca: A Strong, Replicable Instruction-Following Model\n\nWe introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On our preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<$600).\n\n### UltraLM-13B\n- https://github.com/thunlp/UltraChat\n\nUltraLM is a series of chat language models trained on UltraChat. Currently, we have released the 13B version, which ranks #1 among open-source models and #4 among all models on the AlpacaEval Leaderboard. UltraLM-13B is based upon LLaMA-13B.\n\nThis project aims to construct open-source, large-scale, and multi-round dialogue data powered by Turbo APIs to facilitate the construction of powerful language models with general conversational capability. In consideration of factors such as safeguarding privacy, we do not directly use any data available on the Internet as prompts. To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. We instruct the user model with carefully designed prompts to mimic human user behavior and call the two APIs iteratively.
The generated dialogues undergo further post-processing and filtering.\n\n### Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality\n- https://chat.lmsys.org/\n- https://vicuna.lmsys.org/\n- https://github.com/lm-sys/FastChat\n\nAn open platform for training, serving, and evaluating large language model based chatbots.\n\n### Wombat\n- https://mp.weixin.qq.com/s/xoPKmOzjlNZ2qGdcKeGARw\n- https://mp.weixin.qq.com/s/UI-ij5o43ct1efYoNVdQDg\n- https://arxiv.org/abs/2304.05302v1\n- https://github.com/GanjinZero/RRHF\n\nThis is the repository for RRHF (Rank Responses to align Human Feedback) and the open-sourced language models Wombat. RRHF makes it easier to align large language models with human preference.\n\nReinforcement Learning from Human Feedback (RLHF) enables the alignment of large language models with human preference, improving the quality of interactions between humans and language models. Recent RLHF practice uses PPO to optimize large language models for such alignment. However, implementing PPO is non-trivial (the training procedure requires interaction among the policy, behavior policy, reward, and value models), and tuning its many hyper-parameters is tedious. Our motivation is to simplify the alignment of language models with human preference, and our proposed paradigm RRHF (Rank Responses from Human Feedback) can achieve such alignment as easily as conventional fine-tuning. It is simpler than PPO in terms of coding, model counts, and hyperparameters.\n\n### WizardMath\n- https://github.com/nlpxucan/WizardLM/tree/main/WizardMath\n- https://huggingface.co/WizardLM/WizardMath-70B-V1.0\n\nWizardMath enhances the mathematical reasoning abilities of Llama-2 through Reinforcement Learning from Evol-Instruct Feedback (RLEIF).\n\n### XGen-7B\n- https://blog.salesforceairesearch.com/xgen/\n- https://github.com/salesforce/xgen\n\nWe trained a series of 7B LLMs named XGen-7B with standard dense attention on up to 8K sequence length for up to 1.5T tokens. We also fine-tune the models on public-domain instructional data.
The main take-aways are:\n- On standard NLP benchmarks, XGen achieves comparable or better results when compared with state-of-the-art open-source LLMs (e.g. MPT, Falcon, LLaMA, Redpajama, OpenLLaMA) of similar model size.\n- Our targeted evaluation on long sequence modeling benchmarks shows the benefits of our 8K-seq models over 2K- and 4K-seq models.\n- XGen-7B achieves equally strong results both in text (e.g., MMLU, QA) and code (HumanEval) tasks.\n- Training cost of $150K on 1T tokens under Google Cloud pricing for TPU-v4.\n\n### Xwin-LM\n- https://github.com/Xwin-LM/Xwin-LM\n\nXwin-LM aims to develop and open-source alignment technologies for large language models, including supervised fine-tuning (SFT), reward models (RM), rejection sampling, reinforcement learning from human feedback (RLHF), etc. Our first release, built upon the Llama2 base models, ranked TOP-1 on AlpacaEval. Notably, it's the first to surpass GPT-4 on this benchmark. The project will be continuously updated.\n\n### LLaMA 2 Long\n- https://arxiv.org/pdf/2309.16039.pdf\n\nWe present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method.
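As background for the position-encoding discussion, the linear RoPE scaling that several of the long-context efforts above use (Giraffe releases models at scales 4 and 16) simply divides position indices by a fixed factor before computing rotary angles, so an extended context maps back into the position range seen during pretraining. A minimal sketch, not any project's actual code:

```python
def rope_angles(pos, dim=8, base=10000.0, scale=1.0):
    """Rotary position embedding angles for a single position.

    With scale > 1 this is linear position interpolation: indices are
    compressed by `scale`, so a context `scale` times longer than the
    pretraining window reuses the angle range the model was trained on.
    """
    return [(pos / scale) / base ** (2 * i / dim) for i in range(dim // 2)]

# At scale 4, position 8192 in the extended context yields exactly the
# same rotation angles as position 2048 did during 2048-token pretraining.
extended = rope_angles(8192, scale=4.0)
original = rope_angles(2048)
```

Note that LLaMA 2 Long itself modifies the RoPE base frequency rather than scaling positions; the sketch only illustrates the linear-scaling family of methods.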
We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.\n\n### Mistral 7B\n- https://mistral.ai/news/announcing-mistral-7b/\n\nMistral 7B is a 7.3B parameter model that:\n- Outperforms Llama 2 13B on all benchmarks\n- Outperforms Llama 1 34B on many benchmarks\n- Approaches CodeLlama 7B performance on code, while remaining good at English tasks\n- Uses Grouped-query attention (GQA) for faster inference\n- Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost\n\n### UltraLM-13B (UltraFeedback)\n- https://github.com/OpenBMB/UltraFeedback\n\nUltraRM unleashes the power of UltraLM-13B-v2.0 and UltraLM-13B! A simple best-of-16 sampling achieves 92.30% (UltraLM2, 🥇 in 13B results) and 91.54% (UltraLM, 🥇 in LLaMA-1 results) win rates against text-davinci-003 on AlpacaEval benchmark!\n\nUltraFeedback is a large-scale, fine-grained, diverse preference dataset, used for training powerful reward models and critic models. We collect about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN, see here for dataset statistics). We then use these prompts to query multiple LLMs (see here for model lists) and generate 4 different responses for each prompt, resulting in a total of 256k samples.\n\nTo collect high-quality preference and textual feedback, we design a fine-grained annotation instruction, which contains 4 different aspects, namely instruction-following, truthfulness, honesty and helpfulness. 
We then ask GPT-4 to annotate the collected samples based on the instruction.\n\n### Llemma: An Open Language Model For Mathematics\n- https://arxiv.org/abs/2310.10631\n- https://github.com/EleutherAI/math-lm\n\nWe present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.\n\n### Mistral-Trismegistus-7B (esoterica / occult / spirituality)\n- https://huggingface.co/teknium/Mistral-Trismegistus-7B\n\nTranscendence is All You Need! Mistral Trismegistus is a model made for people interested in the esoteric, occult, and spiritual.\n\n### Memory-GPT (MemGPT)\n- https://github.com/cpacker/MemGPT\n- https://arxiv.org/abs/2310.08560\n\nLarge language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user.
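The OS-style tiering MemGPT describes can be illustrated with a toy two-tier store (hypothetical names, not MemGPT's actual interface): a bounded main context backed by an unbounded archival tier, with the oldest messages evicted on overflow and searched again on demand.

```python
class TieredContext:
    """Toy sketch of virtual context management in the spirit of MemGPT.

    `main` stands in for what fits in the LLM's context window;
    `archival` is the slower, unbounded tier searched only on demand.
    """

    def __init__(self, max_main_items=3):
        self.max_main_items = max_main_items
        self.main = []
        self.archival = []

    def append(self, message):
        self.main.append(message)
        # On overflow, evict the oldest messages to the archival tier,
        # the way an OS moves cold pages from RAM to disk.
        while len(self.main) > self.max_main_items:
            self.archival.append(self.main.pop(0))

    def recall(self, keyword):
        # Paging data back in, analogous to servicing a page fault.
        return [m for m in self.archival if keyword in m]

ctx = TieredContext(max_main_items=2)
for message in ["I live in Lisbon", "my name is Ada", "what's the weather?"]:
    ctx.append(message)
# "I live in Lisbon" has been evicted from main but remains recallable:
found = ctx.recall("Lisbon")
```

A real system would summarize rather than evict verbatim, and would let the model itself trigger recall via function calls; the sketch only shows the tiering mechanics.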
We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users.\n\n### MetaMath\n- https://github.com/meta-math/MetaMath\n\nLarge language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far from satisfactory for solving mathematical problems due to the complex reasoning procedures involved. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models in different model sizes, and the training code for public use.\n\n### ChipNeMo (chip design)\n- https://arxiv.org/abs/2311.00176\n\nChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design.
Instead of directly deploying off-the-shelf commercial or open-source LLMs, we instead adopt the following domain adaptation techniques: custom tokenizers, domain-adaptive continued pretraining, supervised fine-tuning (SFT) with domain-specific instructions, and domain-adapted retrieval models. We evaluate these methods on three selected LLM applications for chip design: an engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our results show that these domain adaptation techniques enable significant LLM performance improvements over general-purpose base models across the three evaluated applications, enabling up to 5x model size reduction with similar or better performance on a range of design tasks. Our findings also indicate that there's still room for improvement between our current results and ideal outcomes. We believe that further investigation of domain-adapted LLM approaches will help close this gap in the future.\n\n### Zephyr\n- https://github.com/huggingface/alignment-handbook\n- https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66\n\nJust one year ago, chatbots were out of fashion and most people hadn't heard about techniques like Reinforcement Learning from Human Feedback (RLHF) to align language models with human preferences. Then, OpenAI broke the internet with ChatGPT and Meta followed suit by releasing the Llama series of language models which enabled the ML community to build their very own capable chatbots. This has led to a rich ecosystem of datasets and models that have mostly focused on teaching language models to follow instructions through supervised fine-tuning (SFT).\n\nHowever, we know from the InstructGPT and Llama2 papers that significant gains in helpfulness and safety can be had by augmenting SFT with human (or AI) preferences. 
At the same time, aligning language models to a set of preferences is a fairly novel idea and there are few public resources available on how to train these models, what data to collect, and what metrics to measure for best downstream performance.\n\nThe Alignment Handbook aims to fill that gap by providing the community with a series of robust training recipes that span the whole pipeline.\n\n### neural-chat-7b-v3-1 (Intel)\n- https://huggingface.co/Intel/neural-chat-7b-v3-1\n\nThis model is a fine-tuned model based on mistralai/Mistral-7B-v0.1 on the open source dataset Open-Orca/SlimOrca, further aligned with the DPO algorithm. For more details, you can refer to our blog: The Practice of Supervised Fine-tuning and Direct Preference Optimization on Habana Gaudi2.\n\n### SteerLM\n- https://huggingface.co/datasets/nvidia/HelpSteer\n- http://arxiv.org/abs/2311.09528\n- https://arxiv.org/abs/2310.05344\n\nHelpSteer is an open-source Helpfulness Dataset (CC-BY-4.0) that supports aligning models to become more helpful, factually correct and coherent, while being adjustable in terms of the complexity and verbosity of its responses.\n\nLeveraging this dataset and SteerLM, we train a Llama 2 70B to reach 7.54 on MT Bench, the highest among models trained on open-source datasets based on the MT Bench Leaderboard as of 15 Nov 2023.\n\n### Llama Coder\n- https://github.com/ex3ndr/llama-coder\n- https://marketplace.visualstudio.com/items?itemName=ex3ndr.llama-coder\n\nLlama Coder is a better, self-hosted GitHub Copilot replacement for VS Code. Llama Coder uses Ollama and codellama to provide autocomplete that runs on your hardware.
Works best with Mac M1/M2/M3 or with RTX 4090.\n\n### Meditron\n- https://github.com/epfLLM/meditron\n- https://arxiv.org/abs/2311.16079\n\nMeditron is a suite of open-source medical Large Language Models (LLMs).\n\nWe release Meditron-7B and Meditron-70B, which are adapted to the medical domain from Llama-2 through continued pretraining on a comprehensively curated medical corpus, including selected PubMed papers and abstracts, a new dataset of internationally-recognized medical guidelines, and a general domain corpus.\n\nMeditron-70B, finetuned on relevant data, outperforms Llama-2-70B, GPT-3.5 and Flan-PaLM on multiple medical reasoning tasks.\n\n### RankZephyr\n- https://arxiv.org/abs/2312.02724\n- https://github.com/castorini/rank_llm\n\nIn information retrieval, proprietary large language models (LLMs) such as GPT-4 and open-source counterparts such as LLaMA and Vicuna have played a vital role in reranking. However, the gap between open-source and closed models persists, with reliance on proprietary, non-transparent models constraining reproducibility. Addressing this gap, we introduce RankZephyr, a state-of-the-art, open-source LLM for listwise zero-shot reranking. RankZephyr not only bridges the effectiveness gap with GPT-4 but in some cases surpasses the proprietary model. Our comprehensive evaluations across several datasets (TREC Deep Learning Tracks; NEWS and COVID from BEIR) showcase this ability. RankZephyr benefits from strategic training choices and is resilient against variations in initial document ordering and the number of documents reranked. 
Additionally, our model outperforms GPT-4 on the NovelEval test set, comprising queries and passages past its training period, which addresses concerns about data contamination.\n\n### StableLM Zephyr 3B\n- https://huggingface.co/stabilityai/stablelm-zephyr-3b\n- https://huggingface.co/stabilityai/stable-zephyr-3b-dpo\n- https://github.com/eaidova/openvino_notebooks/blob/ea/stateful_chatbot/notebooks/273-stable-zephyr-3b-chatbot/273-stable-zephyr-3b-chatbot.ipynb\n\nStableLM Zephyr 3B is a 3 billion parameter Large Language Model (LLM), 60% smaller than 7B models, allowing accurate and responsive output on a variety of devices without requiring high-end hardware.\n\n### Orca 2\n- https://arxiv.org/pdf/2311.11045.pdf\n- https://huggingface.co/microsoft/Orca-2-13b\n\nOrca 2 is built for research purposes only and provides a single-turn response in tasks such as reasoning over user-given data, reading comprehension, math problem solving, and text summarization. The model is designed to excel particularly in reasoning.\n\n### Mixtral 7b 8 Expert\n- https://huggingface.co/DiscoResearch/mixtral-7b-8expert\n- https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1\n- https://replicate.com/nateraw/mixtral-8x7b-32kseqlen\n- https://mistral.ai/news/mixtral-of-experts/\n\nMistral AI continues its mission to deliver the best open models to the developer community. Moving forward in AI requires taking new technological turns beyond reusing well-known architectures and training paradigms. Most importantly, it requires making the community benefit from original models to foster new inventions and usages.\n\nToday, the team is proud to release Mixtral 8x7B, a high-quality sparse mixture-of-experts model (SMoE) with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs.
In particular, it matches or outperforms GPT-3.5 on most standard benchmarks.\n\nMixtral has the following capabilities.\n- It gracefully handles a context of 32k tokens.\n- It handles English, French, Italian, German and Spanish.\n- It shows strong performance in code generation.\n- It can be finetuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.\n\n### Phi\n- https://huggingface.co/microsoft/phi-1_5\n- https://arxiv.org/abs/2309.05463\n- https://huggingface.co/microsoft/phi-1\n- https://huggingface.co/microsoft/phi-2\n- https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/\n\nWe are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.\n\n### LLM360 (Amber, CrystalCoder, Diamond)\n- https://www.llm360.ai/\n- https://arxiv.org/pdf/2312.06550.pdf\n\nThe recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community.
The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at this https URL). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.\n\n### Mamba\n- https://github.com/state-spaces/mamba\n- https://arxiv.org/abs/2312.00752\n\nMamba is a new state space model architecture showing promising performance on information-dense data such as language modeling, where previous subquadratic models fall short of Transformers. It is based on the line of progress on structured state space models, with an efficient hardware-aware design and implementation in the spirit of FlashAttention.\n\n### SOLAR\n- https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0\n\nWe introduce the first 10.7 billion (B) parameter model, SOLAR-10.7B. It's compact, yet remarkably powerful, and demonstrates unparalleled state-of-the-art performance in models with parameters under 30B.\n\nWe developed the Depth Up-Scaling technique. Built on the Llama2 architecture, SOLAR-10.7B incorporates the innovative Upstage Depth Up-Scaling. We then integrated Mistral 7B weights into the upscaled layers, and finally, continued pre-training for the entire model.\n\nDepth-Upscaled SOLAR-10.7B has remarkable performance. It outperforms models with up to 30B parameters, even surpassing the recent Mixtral 8X7B model. For detailed information, please refer to the experimental table. Solar 10.7B is an ideal choice for fine-tuning. SOLAR-10.7B offers robustness and adaptability for your fine-tuning needs. 
Our simple instruction fine-tuning using the SOLAR-10.7B pre-trained model yields significant performance improvements.\n\n### NexusRaven（function calling LLM）\n- https://huggingface.co/Nexusflow/NexusRaven-V2-13B\n\nNexusRaven is an open-source and commercially viable function calling LLM that surpasses the state-of-the-art in function calling capabilities.\n\n### LLaMA-MoE\n- https://github.com/pjlab-sys4nlp/llama-moe\n\nLLaMA-MoE is a series of open-sourced Mixture-of-Expert (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE with the following two steps:\n- Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts.\n- Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.\n\n### TinyLlama\n- https://github.com/jzhang38/TinyLlama/\n- https://arxiv.org/pdf/2401.02385.pdf\n\nThe TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. With careful optimization, \"only\" 16 A100-40G GPUs are needed to finish the job within 90 days 🚀🚀. Training started on 2023-09-01.\n\nTinyLlama adopts exactly the same architecture and tokenizer as Llama 2, so it can be used as a drop-in replacement in many open-source projects built on Llama. With only 1.1B parameters, it is also compact enough for applications with tight compute and memory budgets.\n\n## 4 Evaluation\n\n### 天秤（FlagEval）\n- https://flageval.baai.ac.cn/#/home\n\nAn evaluation system and open platform for large language models: it builds a three-dimensional \"ability-task-metric\" evaluation framework to characterize the boundaries of a model's cognitive abilities at fine granularity.\n\n### 獬豸（Xiezhi）Benchmark\n- https://arxiv.org/abs/2306.05783\n- https://github.com/MikeGu721/XiezhiBenchmark\n\nXiezhi is a comprehensive, multi-disciplinary, automatically updatable benchmark for domain knowledge evaluation. It covers 13 broad disciplinary categories (philosophy, economics, law, education, literature, history, natural science, engineering, agronomy, medicine, military science, management, and art) and 516 specific disciplines, with 249,587 questions in total. The 516 disciplines and their taxonomy come from the subject classification issued by the Chinese Ministry of Education. The authors manually selected and annotated 20,000 multiple-choice questions from the Chinese graduate school entrance examinations, covering all 516 labels, to form the Xiezhi-Meta dataset. Xiezhi-Meta was then used to train an annotation model that scores the relevance between a question and discipline labels. The authors subsequently collected 150,000 multiple-choice questions from various examinations and another 70,000 from academic surveys, and annotated all of them with this model.\n\nTo make experiments convenient and to effectively evaluate how LLMs handle cross-disciplinary knowledge, the authors also propose Xiezhi-Specialty and Xiezhi-Interdiscipline. Both come in Chinese and English versions and consist of 15,000 multiple-choice questions that are more balanced, less sensitive, and less China-centric. Xiezhi-Specialty contains questions solvable with knowledge from a single domain, while Xiezhi-Interdiscipline contains questions that require knowledge from multiple domains.\n\n### C-Eval: A Multi-Level Multi-Discipline
Chinese Evaluation Suite for Foundation Models\n- https://arxiv.org/abs/2305.08322\n- https://cevalbenchmark.com/\n- https://github.com/SJTU-LIT/ceval\n\nC-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels.\n\n### HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models\n- https://mp.weixin.qq.com/s/cuoO2V4X-GQOuWyA-e9BeQ\n- https://arxiv.org/abs/2305.11747\n- https://github.com/RUCAIBox/HaluEval\n\nTo further study what kinds of content LLMs hallucinate and why they do so, this paper proposes HaluEval, a benchmark for evaluating hallucination in large language models. Building on existing datasets, the authors construct a large collection of hallucinated samples through automatic generation and manual annotation, comprising 30,000 samples specific to question answering, dialogue, and text summarization, plus 5,000 general user queries. The paper details the construction of the HaluEval dataset, analyzes its contents, and offers a preliminary exploration of strategies for LLMs to recognize and reduce hallucination.\n\n### KoLA: Carefully Benchmarking World Knowledge of Large Language Models\n- https://mp.weixin.qq.com/s/xVj1blhRtpO-Y1HgQ8Wl-A\n- https://arxiv.org/pdf/2306.09296.pdf\n- https://kola.xlore.cn\n\nKoLA is built on 19 tasks centered on entities, concepts, and events. Drawing on Bloom's taxonomy, it measures an LLM's command of world knowledge in depth rather than breadth, across four levels: knowledge memorization, understanding, application, and creation. Experiments show that GPT-4, though strong, does not dominate the leaderboard and ranks only third at the knowledge-creation level.\n\n### LucyEval: A Maturity Evaluation for Chinese Large Language Models\n- https://mp.weixin.qq.com/s/-K7wmnaTdEexlAoTB47mFw\n- http://lucyeval.besteasy.com/\n- https://arxiv.org/abs/2308.04823\n\nThis benchmark contains 11,000 questions of various types, covering 55 sub-subjects under science and engineering, humanities and social sciences, mathematical computation, the medical licensing examination, the judicial examination, the certified public accountant examination, and more, all manually curated and annotated by Besteasy's AI research institute. Questions come in three types: term explanation, short answer, and calculation. The institute also designed a composite scoring scheme to make grading more reasonable and scientific.\n\n### CMB: A Comprehensive Medical Benchmark in Chinese\n- https://mp.weixin.qq.com/s/M8V-XaCRuk-UkAhqBkPgGg\n- https://arxiv.org/abs/2308.08833\n- https://github.com/FreedomIntelligence/CMB\n- https://cmedbenchmark.llmzoo.com/\n\nWe propose CMB, a Chinese medical evaluation benchmark that includes multiple-choice questions from examinations across different clinical professions and career stages (CMB-Exam) and complex clinical diagnosis questions based on real cases (CMB-Clin). From our evaluation experiments we find that: (1) GPT-4 shows clear superiority in the medical domain, while general-purpose Chinese LLMs also perform remarkably well; (2) medical LLMs still lag behind general models in performance, leaving considerable room for improvement; (3) automatic evaluation of consultations with reference answers and scoring rubrics aligns closely with expert evaluation, offering a preliminary attempt at super alignment in the medical domain.\n\n### Multiscale Positive-Unlabeled Detection of AI-Generated Texts\n- https://mp.weixin.qq.com/s/KBN8TMwXD1bcE2X_dImXVg\n- 
https://arxiv.org/abs/2305.18149\n- https://github.com/mindspore-lab/mindone/tree/master/examples/detect_chatgpt\n- https://github.com/YuchuanTian/AIGC_text_detector\n\nRecent releases of Large Language Models (LLMs), e.g. ChatGPT, are astonishing at generating human-like texts, but they may get misused for fake scholarly texts, fake news, fake tweets, et cetera. Previous works have proposed methods to detect these multiscale AI-generated texts, including simple ML classifiers, pretrained-model-based training-agnostic methods, and finetuned language classification models. However, mainstream detectors are formulated without considering the factor of corpus length: shorter corpuses are harder to detect compared with longer ones for shortage of informative features. In this paper, a Multiscale Positive-Unlabeled (MPU) training framework is proposed to address the challenge of multiscale text detection. Firstly, we acknowledge the human-resemblance property of short machine texts, and rephrase text classification as a Positive-Unlabeled (PU) problem by marking these short machine texts as \"unlabeled\" during training. In this PU context, we propose the length-sensitive Multiscale PU Loss, where we use a recurrent model in abstraction to estimate positive priors of scale-variant corpuses. Additionally, we introduce a Text Multiscaling module to enrich training corpuses. Experiments show that our MPU method augments detection performance on long AI-generated text, and significantly improves short-corpus detection of language model detectors. Language Models trained with MPU could outcompete existing detectors by large margins on multiscale AI-generated texts. 
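The positive-unlabeled reformulation described above can be sketched with a generic non-negative PU risk estimator (nnPU, in the style of Kiryo et al.); the paper's length-sensitive Multiscale PU Loss additionally makes the positive prior a function of text length, estimated with a recurrent model. The function names below are illustrative, not taken from the MPU codebase:

```python
import numpy as np

def sigmoid_loss(scores, y):
    # Smooth surrogate loss: l(s, y) = sigmoid(-y * s).
    return 1.0 / (1.0 + np.exp(y * scores))

def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk, a generic stand-in for the paper's
    length-sensitive Multiscale PU Loss.

    scores_pos: classifier scores for known machine-generated (positive) texts
    scores_unl: scores for texts treated as unlabeled (e.g. short machine texts)
    prior:      estimated fraction of positives among the unlabeled texts
    """
    # Risk on the labeled positives, weighted by the class prior.
    risk_pos = prior * np.mean(sigmoid_loss(scores_pos, +1))
    # Negative-class risk estimated from unlabeled data, corrected by the
    # positive contribution; clipped at zero to keep the estimator non-negative.
    risk_neg = np.mean(sigmoid_loss(scores_unl, -1)) \
        - prior * np.mean(sigmoid_loss(scores_pos, -1))
    return risk_pos + max(0.0, risk_neg)
```

In the multiscale setting, `prior` would be recomputed per corpus-length bucket rather than held as a single fixed scalar.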
\n\n### PandaLM\n- https://github.com/WeOpenML/PandaLM\n- https://zhuanlan.zhihu.com/p/630173415\n- https://mp.weixin.qq.com/s/HE6jez3G9aEO5qLkvwtKXg\n\nThis is the official repository for PandaLM: ReProducible and Automated Language Model Assessment.\n\nPandaLM aims to provide reproducible and automated comparisons between different large language models (LLMs). By giving PandaLM the same context, it can compare the responses of different LLMs and provide a reason for the decision, along with a reference answer. The target audience for PandaLM may be organizations that have confidential data and research labs with limited funds that seek reproducibility. These organizations may not want to disclose their data to third parties or may not be able to afford the high costs of secret data leakage using third-party APIs or hiring human annotators. With PandaLM, they can perform evaluations without compromising data security or incurring high costs, and obtain reproducible results. To demonstrate the reliability and consistency of our tool, we have created a diverse human-annotated test dataset of approximately 1,000 samples, where the contexts and the labels are all created by humans. On our test dataset, PandaLM-7B has achieved 94% of ChatGPT's evaluation ability in terms of accuracy. The papers and more features are coming soon.\n\n### Auto-J\n- https://gair-nlp.github.io/auto-j/\n- https://github.com/GAIR-NLP/auto-j\n- https://arxiv.org/abs/2310.05470\n\nWe develop Auto-J, a new open-source generative judge that can effectively evaluate different LLMs on how they align to human preference.
It is featured with:\n- Generality: Auto-J is trained on data from real-world user queries and responses from various LLMs, covering a wide range of 58 real-world scenarios.\n- Flexibility: Auto-J supports both pairwise response comparison and single-response evaluation by just switching to corresponding prompts.\n- Interpretability: Auto-J provides detailed critiques that enhance the reliability of its evaluation outcomes and facilitate humans' involvement in the evaluation loop.\n\n### CLEVA: Chinese Language Models EVAluation Platform\n- https://arxiv.org/abs/2308.04813\n- https://github.com/LaVi-Lab/CLEVA\n\nCLEVA is a Chinese Language Models EVAluation Platform developed by CUHK LaVi Lab. CLEVA would like to thank Shanghai AI Lab for the great collaboration in the process. The main features of CLEVA include:\n\n- A comprehensive Chinese Benchmark, featuring 31 tasks (11 application assessments + 20 ability evaluation tasks), with a total of 370K Chinese test samples (33.98% are newly collected, mitigating data contamination issues);\n- A standardized Prompt-Based Evaluation Methodology, incorporating unified pre-processing for all data and using a consistent set of Chinese prompt templates for evaluation.\n- A trustworthy Leaderboard, as CLEVA uses a significant amount of new data to minimize data contamination and regularly organizes evaluations.\n\nThe leaderboard is evaluated and maintained by CLEVA using new test data. Past leaderboard data (processed test samples, annotated prompt templates, etc.) are made available to users for local evaluation runs.\n\n### ALCUNA: Large Language Models Meet New Knowledge\n- https://github.com/arvid-pku/alcuna\n\nWith the rapid development of NLP, large-scale language models (LLMs) excel in various tasks across multiple domains now. However, existing benchmarks may not adequately measure these models' capabilities, especially when faced with new knowledge. 
In this paper, we address the lack of benchmarks to evaluate LLMs' ability to handle new knowledge, an important and challenging aspect in the rapidly evolving world. We propose an approach called KnowGen that generates new knowledge by altering existing entity attributes and relationships, resulting in artificial entities that are distinct from real-world entities. With KnowGen, we introduce a benchmark named ALCUNA to assess LLMs' abilities in knowledge understanding, differentiation, and association. Benchmarking several LLMs reveals that their performance in the face of new knowledge is not satisfactory, particularly in reasoning between new and internal knowledge. We also explore the impact of entity similarity on the model's understanding of entity knowledge and the influence of contextual entities. We appeal to the need for caution when using LLMs in new scenarios or with new knowledge, and hope that our benchmarks can help drive the development of LLMs in the face of new knowledge.\n\n### HalluQA：Evaluating Hallucinations in Chinese Large Language Models\n- https://github.com/xiami2019/HalluQA/\n\nHalluQA contains 450 meticulously designed adversarial questions, spanning multiple domains, and takes into account Chinese historical culture, customs, and social phenomena. The data collection pipeline has four steps. At step 1, we write questions which we think may induce model hallucinations. At step 2, we use ChatGPT3.5/Puyu/GLM-130B to generate answers and collect adversarial questions. At step 3, we write multiple correct and wrong answers for each adversarial question and add support evidence.
At step 4, we check all annotated question-answer pairs and remove low quality samples.\n\n### GLoRE: Evaluating Logical Reasoning of Large Language Models  \n- https://arxiv.org/abs/2310.09107\n\nRecently, large language models (LLMs), including notable models such as GPT-4 and burgeoning community models, have showcased significant general language understanding abilities. However, there has been a scarcity of attempts to assess the logical reasoning capacities of these LLMs, an essential facet of natural language understanding. To encourage further investigation in this area, we introduce GLoRE, a meticulously assembled General Logical Reasoning Evaluation benchmark comprised of 12 datasets that span three different types of tasks. Our experimental results show that compared to the performance of human and supervised fine-tuning, the logical reasoning capabilities of open LLM models necessitate additional improvement; ChatGPT and GPT-4 show a strong capability of logical reasoning, with GPT-4 surpassing ChatGPT by a large margin. We propose a self-consistency probing method to enhance the accuracy of ChatGPT and a fine-tuned method to boost the performance of an open LLM. 
We release the datasets and evaluation programs to facilitate future research.\n\n### HelpSteer\n- https://huggingface.co/datasets/nvidia/HelpSteer\n\nHelpSteer is an open-source Helpfulness Dataset (CC-BY-4.0) that supports aligning models to become more helpful, factually correct and coherent, while being adjustable in terms of the complexity and verbosity of its responses.\n\nLeveraging this dataset and SteerLM, we train a Llama 2 70B to reach 7.54 on MT Bench, the highest among models trained on open-source datasets based on MT Bench Leaderboard as of 15 Nov 2023.\n\n### AlignBench: A Multi-Dimensional Chinese Alignment Benchmark\n- https://github.com/THUDM/AlignBench\n\nAlignBench is the first benchmark to comprehensively evaluate the alignment of Chinese large language models across multiple dimensions. This repository contains the introduction, data, and code for AlignBench.\n\n### UHGEval\n- https://github.com/IAAR-Shanghai/UHGEval\n- https://arxiv.org/abs/2311.15296\n\nBenchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation\n\n- Safety: Ensuring the security of experimental data is of utmost importance.\n- Flexibility: Easily expandable, with all modules replaceable.\n\n### Purple Llama (Meta)\n- https://ai.meta.com/blog/purple-llama-open-trust-safety-generative-ai/\n\nWe’re announcing Purple Llama, an umbrella project featuring open trust and safety tools and evaluations meant to level the playing field for developers to responsibly deploy generative AI models and experiences in accordance with best practices shared in our Responsible Use Guide.\n\n### OMGEval\n- 
https://github.com/blcuicall/OMGEval\n\nOver the past year, large language models have advanced rapidly, driving fast progress across a range of general-purpose AI techniques, and evaluations of LLM performance have emerged in step.\n\nIn terms of what they measure, current evaluation datasets mostly rely on human exam questions and their reference answers. This style of evaluation leans toward assessing reasoning ability, so the results can deviate from a model's true capability. For example, among English datasets, HELM uses 16 NLP datasets and MMLU uses 57 human exam subjects to evaluate LLMs; among Chinese datasets, GAOKAO and C-Eval likewise use human exam questions. Their automated evaluation pipelines include only questions with reference answers and thus cannot fully measure the overall ability of generative LLMs.\n\nSome work has also turned to open-ended question answering. AlpacaEval, proposed by Stanford, is widely recognized but consists solely of English questions, so it can only assess performance in English. SuperCLUE, the first Chinese dataset to offer open-ended questions, is closed-source and likewise contains only Chinese questions. Existing open-ended QA datasets therefore each evaluate a single language, and an open-source open-ended QA dataset for measuring multilingual ability is still missing.\n\nIn summary, a multilingual open-ended QA dataset is needed to comprehensively evaluate the overall ability of large models. We start with Chinese and will gradually extend to other languages.\n\n### SciGuard&SciMT-Safety\n- https://arxiv.org/abs/2312.06632\n- https://github.com/SciMT/SciMT-benchmark\n\nThe expanding application of Artificial Intelligence (AI) in scientific fields presents unprecedented opportunities for discovery and innovation. However, this growth is not without risks. AI models in science, if misused, can amplify risks like creation of harmful substances, or circumvention of established regulations. In this study, we aim to raise awareness of the dangers of AI misuse in science, and call for responsible AI development and use in this domain. We first itemize the risks posed by AI in scientific contexts, then demonstrate the risks by highlighting real-world examples of misuse in chemical science. These instances underscore the need for effective risk management strategies. In response, we propose a system called SciGuard to control misuse risks for AI models in science. We also propose a red-teaming benchmark SciMT-Safety to assess the safety of different systems. Our proposed SciGuard shows the least harmful impact in the assessment without compromising performance in benign tests. Finally, we highlight the need for a multidisciplinary and collaborative effort to ensure the safe and ethical use of AI models in science.
We hope that our study can spark productive discussions on using AI ethically in science among researchers, practitioners, policymakers, and the public, to maximize benefits and minimize the risks of misuse.\n\n### HaluEval 2.0, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models\n- https://github.com/RUCAIBox/HaluEval-2.0\n- https://arxiv.org/abs/2401.03205\n\nIn the era of large language models (LLMs), hallucination (i.e., the tendency to generate factually incorrect content) poses a great challenge to trustworthy and reliable deployment of LLMs in real-world applications. To tackle LLM hallucination, three key questions should be well studied: how to detect hallucinations (detection), why do LLMs hallucinate (source), and what can be done to mitigate them (mitigation). To address these challenges, this work presents a systematic empirical study on LLM hallucination, focused on the three aspects of hallucination detection, source and mitigation. Specifically, we construct a new hallucination benchmark, HaluEval 2.0, and design a simple yet effective detection method for LLM hallucination. Furthermore, we zoom into the different training or utilization stages of LLMs and extensively analyze the potential factors that lead to LLM hallucination. Finally, we implement and examine a series of widely used techniques to mitigate the hallucinations in LLMs. Our work has led to several important findings to understand the hallucination origin and mitigate the hallucinations in LLMs.\n\n### DebugBench: Evaluating Debugging Capability of Large Language Models\n- https://github.com/thunlp/DebugBench\n- https://arxiv.org/abs/2401.04621\n\nLarge Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored.
Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. We find that (1) while closed-source models like GPT-4 exhibit inferior debugging performance compared to humans, open-source models such as Code Llama fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.\n\n### GenMedicalEval\n- https://github.com/MediaBrain-SJTU/GenMedicalEval\n\nA comprehensive evaluation framework for medical large language models.\n\n### R-Judge\n- https://github.com/Lordog/R-Judge\n- https://arxiv.org/abs/2401.10019\n\nLarge language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. Instead of centering on LLM-generated content safety in most prior studies, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging safety risks given agent interaction records.
R-Judge comprises 162 agent interaction records, encompassing 27 key risk scenarios among 7 application categories and 10 risk types. It incorporates human consensus on safety with annotated safety risk labels and high-quality risk descriptions. Utilizing R-Judge, we conduct a comprehensive evaluation of 8 prominent LLMs commonly employed as the backbone for agents. The best-performing model, GPT-4, achieves 72.29% in contrast to the human score of 89.38%, showing considerable room for enhancing the risk awareness of LLMs. Notably, leveraging risk descriptions as environment feedback significantly improves model performance, revealing the importance of salient safety risk feedback. Furthermore, we design an effective chain of safety analysis technique to help the judgment of safety risks and conduct an in-depth case study to facilitate future research.\n\n### TravelPlanner\n- https://osu-nlp-group.github.io/TravelPlanner/\n\nWe introduce TravelPlanner: a comprehensive benchmark designed to evaluate the planning abilities of language agents in real-world scenarios across multiple dimensions. Without losing generality, TravelPlanner casts travel planning as its test environment, with all relevant information meticulously crafted to minimize data contamination. TravelPlanner does not have a singular ground truth for each query. Instead, the benchmark employs several pre-defined evaluation scripts to assess each tested plan, determining whether the language agent can effectively use tools to create a plan that aligns with both the implicit commonsense and explicit user needs outlined in the query (i.e., commonsense constraint and hard constraint). Every query in TravelPlanner has undergone thorough human verification to guarantee that feasible solutions exist. 
Additionally, TravelPlanner evaluates the language agent's capability by varying the breadth and depth of planning, controlled through the number of travel days and the quantity of hard constraints.\n\n### EasyJailbreak\n- https://github.com/EasyJailbreak/EasyJailbreak\n- http://easyjailbreak.cn/\n- https://easyjailbreak.github.io/EasyJailbreakDoc.github.io/\n\nEasyJailbreak is an easy-to-use Python framework designed for researchers and developers focusing on LLM security. Specifically, EasyJailbreak decomposes the mainstream jailbreaking process into several iterable steps: initialize mutation seeds, select suitable seeds, add constraint, mutate, attack, and evaluate. On this basis, EasyJailbreak provides a component for each step, constructing a playground for further research and attempts. More details can be found in our paper.\n\n## 5 Text Embeddings\n\n### Matryoshka Representation Learning\n- https://arxiv.org/abs/2205.13147\n- https://github.com/RAIVNLab/MRL\n\nLearned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context rigid, fixed capacity representations can be either over or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL) which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations.
The flexibility within the learned Matryoshka Representations offers: (a) up to 14x smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14x real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities -- vision (ViT, ResNet), vision + language (ALIGN) and language (BERT).\n\n### Jina Embeddings\n- https://huggingface.co/jinaai\n\nJina has officially open-sourced two bilingual embedding models, Chinese-English and English-German, the first open-source embedding models worldwide to support 8K bilingual text.\n\n### BGE-M3\n- https://github.com/FlagOpen/FlagEmbedding\n- https://huggingface.co/BAAI/bge-m3\n\nIn this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.\n\n- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.\n- Multi-Linguality: It can support more than 100 working languages.\n- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.\n\n### Nomic Embed\n- https://github.com/nomic-ai/contrastors\n- https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf\n- https://arxiv.org/abs/2402.01613\n- https://huggingface.co/nomic-ai/nomic-embed-text-v1.5\n\nThis technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license.
In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1.\n\n### Moka Massive Mixed Embedding（M3E）\n- https://huggingface.co/moka-ai/m3e-small\n\n- Moka: trained, open-sourced, and evaluated by MokaAI; training uses the uniem script and evaluation uses the MTEB-zh benchmark\n- Massive: trained on a Chinese sentence-pair dataset at the tens-of-millions scale (22M+ pairs)\n- Mixed: supports bilingual Chinese-English homogeneous text similarity, heterogeneous text retrieval, and more, with code retrieval support planned\n- Embedding: a text embedding model that converts natural language into dense vectors\n\n### GRIT\n- https://arxiv.org/abs/2402.09906\n- https://github.com/ContextualAI/gritlm\n\nAll text-based language problems can be reduced to either generation or embedding. Current models only perform well at one or the other. We introduce generative representational instruction tuning (GRIT) whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions. Compared to other open models, our resulting GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB) and outperforms all models up to its size on a range of generative tasks. By scaling up further, GritLM 8x7B outperforms all open generative language models that we tried while still being among the best embedding models. Notably, we find that GRIT matches training on only generative or embedding data, thus we can unify both at no performance loss. Among other benefits, the unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by > 60% for long documents, by no longer requiring separate retrieval and generation models.\n\n### TinyRAG\n- https://github.com/KMnO4-zx/TinyRAG\n\nThis repository is for learning about RAG with large language models. Everything is implemented from scratch, mainly because llama-index and langchain are hard to modify. It makes it convenient to run small experiments while reading papers.\n\n### RAFT\n- https://arxiv.org/abs/2403.10131\n- http://github.com/ShishirPatil/gorilla\n\nPretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm.
When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based-prompting, or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in \"open-book\" in-domain settings. In RAFT, given a question and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This coupled with RAFT's chain-of-thought-style response helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs to in-domain RAG.\n\n### Chat with MLX\n- https://github.com/mlx-chat/mlx-chat-app\n\nChat with MLX is a high-performance macOS application that connects your local documents to a personalized large language model (LLM). By leveraging retrieval-augmented generation (RAG), open source LLMs, and MLX for accelerated machine learning on Apple silicon, you can efficiently search, query, and interact with your documents without information ever leaving your device.\n\n### LLocalSearch\n- https://github.com/nilsherzig/LLocalSearch\n\nThis is a completely locally running meta search engine using LLM Agents. The user can ask a question and the system will use a chain of LLMs to find the answer. The user can see the progress of the agents and the final answer.
No OpenAI or Google API keys are needed.\n\nHere is a video of it in action, running completely locally.\n\n### RAGFlow\n- https://github.com/infiniflow/ragflow/tree/main\n\nRAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLM (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.\n\n### Dot\n- https://github.com/alexpinel/Dot\n\nThis is Dot, a standalone open source app meant for easy use of local LLMs and RAG in particular to interact with documents and files similarly to Nvidia's Chat with RTX. Dot itself is completely standalone and is packaged with all dependencies including a copy of Mistral 7B, this is to ensure the app is as accessible as possible and no prior knowledge of programming or local LLMs is required to use it.\n\n### Ollama Embedding Models\n- https://ollama.com/blog/embedding-models\n\nOllama supports embedding models, making it possible to build retrieval augmented generation (RAG) applications that combine text prompts with existing documents or other data.\n\n### LLM2Vec\n- https://arxiv.org/abs/2404.05961\n\nLarge decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks.
We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.\n\n### gecko\n- https://github.com/google-research/gecko\n- https://arxiv.org/pdf/2403.20327.pdf\n\nWe present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of the Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.\n\n### Cognita \n- https://github.com/truefoundry/cognita\n\nLangchain/LlamaIndex provide easy to use abstractions that can be used for quick experimentation and prototyping on jupyter notebooks. But, when things move to production, there are constraints like the components should be modular, easily scalable and extendable. This is where Cognita comes in action. 
Cognita uses Langchain/LlamaIndex under the hood and brings organisation to your codebase: each RAG component is modular, API-driven, and easily extensible. Cognita can be used easily in a local setup and, at the same time, offers a production-ready environment along with no-code UI support. Cognita also supports incremental indexing by default.\n\n### Piccolo2\n- https://arxiv.org/abs/2405.06932\n- https://huggingface.co/sensenova/piccolo-large-zh-v2\n\nIn this report, we introduce Piccolo2, an embedding model that surpasses other models in the comprehensive evaluation over 6 tasks on the CMTEB benchmark, setting a new state-of-the-art. Piccolo2 primarily leverages an efficient multi-task hybrid loss training approach, effectively harnessing textual data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses MRL training to support more flexible vector dimensions.\n\n### NV-Embed\n- https://huggingface.co/nvidia/NV-Embed-v1\n- https://arxiv.org/pdf/2405.17428\n\nWe introduce NV-Embed, a generalist embedding model that ranks No. 1 on the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024), with 56 tasks, encompassing retrieval, reranking, classification, clustering, and semantic textual similarity tasks. Notably, our model also achieves the highest score of 59.36 on 15 retrieval tasks within this benchmark.\n\nNV-Embed presents several new designs, including having the LLM attend to latent vectors for better pooled embedding output, and demonstrating a two-stage instruction tuning method to enhance the accuracy of both retrieval and non-retrieval tasks.\n\n### RankRAG\n- https://arxiv.org/abs/2407.02485\n\nLarge language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). 
In this work, we propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well by adding a small fraction of ranking data into the training blend, and outperform existing expert ranking models, including the same LLM exclusively fine-tuned on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-sourced model with the state-of-the-art performance on RAG benchmarks. Specifically, our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. It also performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its superb capability for generalization to new domains.\n\n### LightRAG\n- https://github.com/SylphAI-Inc/LightRAG\n\nLightRAG helps developers with both building and optimizing Retriever-Agent-Generator pipelines. It is light, modular, and robust, with a 100% readable codebase.\n\n### GraphRAG\n- https://github.com/microsoft/graphrag\n\nThe GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.\n\n### gte-multilingual\n- https://huggingface.co/Alibaba-NLP\n- https://huggingface.co/collections/Alibaba-NLP/gte-models-6680f0b13f885cb431e6d469\n- https://arxiv.org/abs/2407.19669\n\nWe present systematic efforts in building a long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than the 512 of previous multilingual encoders). 
Then we construct a hybrid TRM and a cross-encoder reranker by contrastive learning. Evaluations show that our text encoder outperforms the same-sized previous state-of-the-art XLM-R. Meanwhile, our TRM and reranker match the performance of large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrates that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit various research and industrial applications.\n\n### nano-graphrag\n- https://github.com/gusye1234/nano-graphrag\n\n😭 GraphRAG is good and powerful, but the official implementation is difficult/painful to read or hack.\n\n😊 This project provides a smaller, faster, cleaner GraphRAG, while retaining the core functionality (see benchmark and issues).\n\n🎁 Excluding tests and prompts, nano-graphrag is about 800 lines of code.\n\n👌 Small yet portable, asynchronous and fully typed.\n\n### MaxKB\n- https://github.com/1Panel-dev/MaxKB\n\nMaxKB = Max Knowledge Base, an open-source knowledge-base question-answering system built on large language models (LLMs), widely used in scenarios such as internal enterprise knowledge bases, customer service, academic research, and education.\n\n### Langchain-Chatchat\n- https://github.com/chatchat-space/Langchain-Chatchat\n\n🤖️ A question-answering application over local knowledge bases built on the ideas of langchain, aiming to provide a knowledge-base Q&A solution that is friendly to Chinese scenarios and open-source models and can run fully offline.\n\n### RAGLite\n- https://github.com/superlinear-ai/raglite\n\nRAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite.\n\n### OpenScholar\n- https://github.com/AkariAsai/OpenScholar\n- https://openscholar.allen.ai/paper\n- https://github.com/AkariAsai/ScholarQABench\n\nScientific progress hinges on our ability to find, synthesize, and build on relevant knowledge from the scientific literature. 
However, the exponential growth of this literature—with millions of papers now published each year—has made it increasingly difficult for scientists to find the information they need or even stay abreast of the latest findings in a single subfield.\n\nTo help scientists effectively navigate and synthesize scientific literature, we introduce OpenScholar, a retrieval-augmented language model (LM) designed to answer user queries by first searching for relevant papers in the literature and then generating responses grounded in those sources.\n\n### MasteringRAG\n- https://github.com/Steven-Luo/MasteringRAG\n\nThis project builds document question answering with LLMs (large language models) using RAG techniques, and will cover nearly all of the common optimizations enterprises use when building RAG-based document Q&A. The project focuses on the algorithmic workflow rather than heavily standardized engineering code, so every notebook file can run independently, without shared-logic abstractions.\n\n### FlashRAG-Paddle\n- https://github.com/RUC-NLPIR/FlashRAG-Paddle\n- https://github.com/PaddlePaddle/PaddleNLP\n- https://arxiv.org/abs/2405.13576\n\nPaddleNLP is a large language model (LLM) development toolkit based on the PaddlePaddle deep-learning framework, supporting efficient large-model training, lossless compression, and high-performance inference on a variety of hardware. PaddleNLP is easy to use and tuned for peak performance, and is dedicated to helping developers build efficient, industrial-grade LLM applications.\n\n### MiniRAG\n- https://github.com/HKUDS/MiniRAG\n- https://arxiv.org/abs/2501.06713\n\nMiniRAG is an extremely simple retrieval-augmented generation framework that enables small models to achieve good RAG performance through heterogeneous graph indexing and lightweight topology-enhanced retrieval.\n\n### XRAG\n- https://github.com/DocAILab/XRAG\n\nXRAG is a benchmarking framework designed to evaluate the foundational components of advanced Retrieval-Augmented Generation (RAG) systems. 
By dissecting and analyzing each core module, XRAG provides insights into how different configurations and components impact the overall performance of RAG systems.\n\n### Chronos\n- https://arxiv.org/abs/2501.00888\n- https://github.com/Alibaba-NLP/CHRONOS\n\nWe propose CHRONOS, a novel retrieval-based approach to Timeline Summarization (TLS) by iteratively posing questions about the topic and the retrieved documents to generate chronological summaries.\n\n### DeepRAG\n- https://arxiv.org/abs/2502.01142\n\nLarge Language Models (LLMs) have shown remarkable potential in reasoning while they still suffer from severe factual hallucinations due to timeliness, accuracy, and coverage of parametric knowledge. Meanwhile, integrating reasoning with retrieval-augmented generation (RAG) remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling strategic and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.\n\n### UltraRAG\n- https://github.com/OpenBMB/UltraRAG\n\nThe UltraRAG framework was jointly proposed by the THUNLP group from Tsinghua University, the NEUIR group from Northeastern University, Modelbest.Inc, and the 9#AISoft team. It is based on agile deployment and modular construction, introducing an automated \"data construction-model fine-tuning-inference evaluation\" knowledge adaptation technology system. This provides a one-stop, researcher and developer-friendly RAG system solution. 
UltraRAG significantly simplifies the entire process from data construction to model fine-tuning in domain adaptation for RAG systems, assisting researchers and developers in efficiently tackling complex tasks.\n\n### CAG\n- arxiv.org/abs/2412.15605\n- github.com/hhhuang/CAG\n\nRetrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing language models by integrating external knowledge sources.\n\n### FlexRAG\n- https://github.com/ictnlp/flexrag\n\nFlexRAG is a flexible and high-performance framework designed for Retrieval-Augmented Generation (RAG) tasks, offering support for multimodal data, seamless configuration management, and out-of-the-box performance for both research and prototyping.\n\n## 6 Others\n### Alpaca-CoT\n- https://github.com/PhoebusSi/Alpaca-CoT\n- https://mp.weixin.qq.com/s/Q5Q3RpQ80XmpbfhSxq2R1Q\n\nAn Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface\n\nThe Alpaca-CoT project explores how to better induce ChatGPT-like interaction and instruction-following abilities in LLMs through instruction tuning. To this end, we extensively collected instructions of different types (especially Chain-of-Thought datasets) and conducted an in-depth empirical study based on LLaMA as a reference for future work. To our knowledge, this is the first work to extend CoT into Alpaca, hence the name \"Alpaca-CoT\".\n\n### Auto-GPT\n- https://github.com/torantulino/auto-gpt\n\nAuto-GPT is an experimental open-source application showcasing the capabilities of the GPT-4 language model. This program, driven by GPT-4, chains together LLM \"thoughts\" to autonomously achieve whatever goal you set. 
As one of the first examples of GPT-4 running fully autonomously, Auto-GPT pushes the boundaries of what is possible with AI.\n\n### ChatPiXiu\n- https://github.com/catqaq/ChatPiXiu\n\nWe are Xianyu AI [xianyu.ai]. Our core members are a group of \"salted fish\" from the foot of Laohe Hill, by the West Lake, and our pond master is called Xianyu. We want to do something meaningful in the era of LLMs! Our slogan is: do OpenNLP and OpenX! We hope to bow out of the arena before CloseAI grinds us down!\n\nPerhaps one day, when GPT-X is released, someone will say NLP no longer exists, but we want to prove that someone was here and loved it! In the LLM era represented by ChatGPT/GPT-4, before being ground down by CloseAI, we launched the OpenNLP plan, whose motto is OpenNLP for everyone!\n\nChatPiXiu is the second official open-source project of the OpenNLP plan, aiming at Open ChatGPT for everyone: doing something meaningful in the LLM era represented by ChatGPT/GPT-4 before OpenAI grinds us down. Perhaps one day, when GPT-X is released, someone will say NLP no longer exists, but we want to prove that someone was here!\n\n### Gorilla\n- https://mp.weixin.qq.com/s/p9tx3q3Lpr4fNqdyxWhzyA\n- gorilla.cs.berkeley.edu\n- arxiv.org/abs/2305.15334\n- https://github.com/ShishirPatil/gorilla/\n\nLarge language models are powerful, but to better apply them to real-world problems, a wide variety of APIs is indispensable.\n\nUC Berkeley and Microsoft Research have built a \"gorilla\": Gorilla is a model that, given a user's natural-language input, selects the appropriate API to execute the corresponding task. In principle, the model can invoke various other AI models according to user needs, so Gorilla has the potential to become an AI that orchestrates other AIs. The project's code, model, data, and demo have all been released.\n\n### HuggingGPT\n- https://mp.weixin.qq.com/s/o51CmLt2JViJ4nsKfBJfwg\n- https://arxiv.org/pdf/2303.17580.pdf\n\nHuggingGPT uses ChatGPT as a controller that connects the various AI models in the HuggingFace community to accomplish complex multimodal tasks.\n\nThis means that, through HuggingGPT, you gain multimodal abilities: text-to-image, text-to-video, and speech are all within reach.\n\n### LLMPruner: A Vocabulary Pruning Tool for Large Language Models\n- https://mp.weixin.qq.com/s/u0UcCxzJOkF4fO_JI6ToQA\n- https://github.com/yangjianxin1/LLMPruner\n\nIn many downstream tasks we often need only one or two languages; in Chinese scenarios, for example, usually only Chinese and English are used. We can therefore prune the vocabulary of a large language model and keep only the needed portion. This fully preserves the model's pretrained knowledge while reducing the parameter count, lowering GPU memory usage, and speeding up training, allowing downstream fine-tuning with fewer GPUs. For these reasons, the author developed the LLMPruner project, which currently provides pruned Bloom models at various parameter scales. Pruning Bloom's vocabulary to keep the common Chinese and English tokens shrinks it from 250880 to 46145 tokens, 18.39% of the original size.\n\n### LLM-Pruner: On the Structural Pruning of Large Language Models\n- https://github.com/horseee/LLM-Pruner\n- https://arxiv.org/abs/2305.11627\n- https://mp.weixin.qq.com/s/feqFfy4n31eztoZfodMieQ\n\nIn this paper, we propose LLM-Pruner, a structured pruning method for large language models. LLM-Pruner aims to compress massive language models in a task-agnostic way while minimizing dependence on the original training corpus and preserving the LLM's linguistic abilities. LLM-Pruner builds the LLM's dependency graph by iteratively examining each neuron in the model as a trigger for identifying dependency groups, and then uses parameter-level and weight-level estimates to evaluate the importance of these groups. Finally, we use LoRA for fast recovery and tuning of the pruned model. We evaluate the effectiveness of LLM-Pruner on three different models (LLaMA, Vicuna, and ChatGLM) using multiple zero-shot datasets. Our experimental results show that LLM-Pruner successfully prunes the models, reducing the computational burden while preserving zero-shot capabilities.\n\n### LLM for Recommendation Systems\n- https://github.com/WLiK/LLM4Rec\n- https://arxiv.org/abs/2305.19860\n- https://mp.weixin.qq.com/s/WCUjCahiak4STbb0QjJInQ\n\nLarge Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP) and have recently gained significant attention in the domain of Recommendation Systems (RS). These models, trained on massive amounts of data using self-supervised learning, have demonstrated remarkable success in learning universal representations and have the potential to enhance various aspects of recommendation systems via effective transfer techniques such as fine-tuning and prompt tuning. The crucial aspect of harnessing the power of language models in enhancing recommendation quality is the utilization of their high-quality representations of textual features and their extensive coverage of external knowledge to establish correlations between items and users. To provide a comprehensive understanding of the existing LLM-based recommendation systems, this survey presents a taxonomy that categorizes these models into two major paradigms, respectively Discriminative LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec), with the latter being systematically sorted out for the first time. Furthermore, we systematically review and analyze existing LLM-based recommendation systems within each paradigm, providing insights into their methodologies, techniques, and performance. 
Additionally, we identify key challenges and several valuable findings to provide researchers and practitioners with inspiration.\n\n### Self-Instruct\n- https://github.com/yizhongw/self-instruct\n- https://arxiv.org/abs/2212.10560\n\nSelf-Instruct is a framework that helps language models improve their ability to follow natural language instructions. It does this by using the model's own generations to create a large collection of instructional data. With Self-Instruct, it is possible to improve the instruction-following capabilities of language models without relying on extensive manual annotation.\n\n### ToolBench&ToolLLM\n- https://github.com/OpenBMB/ToolBench\n- https://arxiv.org/pdf/2304.08354.pdf\n- https://arxiv.org/pdf/2307.16789.pdf\n- https://mp.weixin.qq.com/s/DuoQJj1OBl5iFPvjidDiCg\n\nThis project (ToolBench)  aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. We provide the dataset, the corresponding training and evaluation scripts, and a capable model ToolLLaMA fine-tuned on ToolBench.\n\n🔨This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset. It is constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which is upgraded with enhanced function call capabilities. 
\n\n### Wanda (Pruning by Weights and activations)\n- https://github.com/locuslab/wanda\n- https://mp.weixin.qq.com/s/UoQLCQiFnKZUQPedDM_MCQ\n- https://arxiv.org/pdf/2306.11695.pdf\n\nA Simple and Effective Pruning Approach for Large Language Models\n\n### Streaming LLM\n- https://github.com/mit-han-lab/streaming-llm\n\nDeploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a ``sink'' even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. 
In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup.\n\n### Sheared LLAMA (Structured Pruning)\n- https://xiamengzhou.github.io/sheared-llama/\n\nWe introduce the Sheared-LLaMA models, the strongest 1.3B and 2.7B public base large language models (LLMs). Our models are produced by LLM-Shearing, an efficient method of constructing LLMs by first pruning a larger existing model and then continually pre-training it. Sheared-LLaMA models are first pruned from the LLaMA2-7B model, and then trained on only 50B tokens, 5% of the budget of the previous strongest public 3B model.\n\n### QA-LoRA\n- https://arxiv.org/abs/2309.14717\n- https://github.com/yuhuixu1993/qa-lora\n\nRecent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators, which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. 
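The group-wise operators at the heart of QA-LoRA can be illustrated with a toy per-group quantizer. The sketch below is a minimal numpy illustration, not the authors' implementation, and all function and variable names are invented: each group of 32 weights gets its own INT4 scale and zero-point, which is the extra per-group quantization freedom the paper trades against LoRA's adaptation freedom.

```python
import numpy as np

def quantize_groupwise(w, group_size=32, n_bits=4):
    """Asymmetric per-group quantization of a flat weight vector.

    Every group of `group_size` weights gets its own scale and
    zero-point (here the group minimum), mirroring the group-wise
    freedom that QA-LoRA's quantizer relies on.
    """
    qmax = 2 ** n_bits - 1                      # 15 for INT4
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)   # per-group zero-point
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax              # per-group step size
    scale = np.where(scale == 0, 1.0, scale)    # guard flat groups
    q = np.clip(np.round((groups - w_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, w_min

def dequantize_groupwise(q, scale, w_min):
    """Reconstruct real-valued weights from codes + per-group params."""
    return (q.astype(np.float32) * scale + w_min).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)     # toy weight vector
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, zero)    # per-weight error <= scale/2
```

As we read the paper, it is these per-group zero-points that later absorb the LoRA update at merge time, which is how the merged model stays purely INT4 without a post-hoc quantization loss.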
\n\n### AgentLM (AgentTuning, AgentInstruct)\n- https://github.com/THUDM/AgentTuning\n\nAgentTuning represents the very first attempt to instruction-tune LLMs using interaction trajectories across multiple agent tasks. Evaluation results indicate that AgentTuning enables the agent capabilities of LLMs with robust generalization on unseen agent tasks while remaining good on general language abilities. We have open-sourced the AgentInstruct dataset and AgentLM.\n\n### XAgent\n- https://github.com/OpenBMB/XAgent\n\nXAgent is an open-source experimental Large Language Model (LLM) driven autonomous agent that can automatically solve various tasks. It is designed to be a general-purpose agent that can be applied to a wide range of tasks. XAgent is still in its early stages, and we are working hard to improve it.\n\n🏆 Our goal is to create a super-intelligent agent that can solve any given task!\n\n### OpenAgents\n- https://github.com/xlang-ai/OpenAgents\n\nCurrent language agent frameworks aim to facilitate the construction of proof-of-concept language agents while neglecting the non-expert user access to agents and paying little attention to application-level designs. We built OpenAgents, an open platform for using and hosting language agents in the wild of everyday life.\n\n### gpu_poor\n- https://github.com/RahulSChand/gpu_poor\n\nCalculate how much GPU memory you need & breakdown of where it goes for training/inference of any LLM model with quantization (GGML/bitsandbytes), inference frameworks (vLLM/llama.cpp/HF) & QLoRA.\n\n### CAMEL:Communicative Agents for “Mind” Exploration of Large Scale Language Model Society \n- https://ghli.org/camel.pdf \n- https://github.com/camel-ai/camel \n- https://www.camel-ai.org/\n\nCAMEL-AI.org is an open-source community dedicated to the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. 
To facilitate research in this field, we provide, implement, and support various types of agents, tasks, prompts, models, datasets, and simulated environments.\n\n### Transformer Index for GEnerative Recommenders (TIGER)\n- https://arxiv.org/pdf/2305.05065.pdf\n\nModern recommender systems perform large-scale retrieval by first embedding queries and item candidates in the same unified space, followed by approximate nearest neighbor search to select top candidates given a query embedding. In this paper, we propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates. To that end, we create a semantically meaningful tuple of codewords to serve as a Semantic ID for each item. Given Semantic IDs for items in a user session, a Transformer-based sequence-to-sequence model is trained to predict the Semantic ID of the next item that the user will interact with. To the best of our knowledge, this is the first Semantic ID-based generative model for recommendation tasks. We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets. In addition, we show that incorporating Semantic IDs into the sequence-to-sequence model enhances its ability to generalize, as evidenced by the improved retrieval performance observed for items with no prior interaction history.\n\n### KnowPAT\n- https://github.com/zjukg/KnowPAT\n- https://arxiv.org/abs/2311.06503\n\nKnowledgeable Preference Alignment for LLMs in Domain-specific Question Answering\n\nFor domain-specific application of large language models (LLMs), external knowledge and LLMs should work together to achieve the best user experience. LLMs should acquire the ability to make the right choices about retrieved external knowledge to meet human needs. Knowledgeable Preference AlignmenT (KnowPAT) is a new pipeline to align LLMs with human knowledge preferences. 
KnowPAT incorporates domain knowledge graphs to construct the preference set and designs a new alignment objective to fine-tune the LLMs.\n\n### AuthentiGPT: Detecting Machine-Generated Text\n- https://arxiv.org/abs/2311.07700\n\nLarge language models (LLMs) have opened up enormous opportunities while simultaneously posing ethical dilemmas. One of the major concerns is their ability to create text that closely mimics human writing, which can lead to potential misuse, such as academic misconduct, disinformation, and fraud. To address this problem, we present AuthentiGPT, an efficient classifier that distinguishes between machine-generated and human-written texts. Under the assumption that human-written text resides outside the distribution of machine-generated text, AuthentiGPT leverages a black-box LLM to denoise input text with artificially added noise, and then semantically compares the denoised text with the original to determine if the content is machine-generated. With only one trainable parameter, AuthentiGPT eliminates the need for a large training dataset, watermarking the LLM's output, or computing the log-likelihood. Importantly, the detection capability of AuthentiGPT can be easily adapted to any generative language model. With a 0.918 AUROC score on a domain-specific dataset, AuthentiGPT demonstrates its effectiveness over other commercial algorithms, highlighting its potential for detecting machine-generated text in academic settings.\n\n### Curiosity-driven Red-teaming for Large Language Models\n- https://openreview.net/forum?id=4KqkizXgXU\n\nLarge language models (LLMs) hold great potential for various natural language applications but risk generating incorrect or toxic content. In order to probe when an LLM generates unwanted content, the current paradigm is to recruit human testers to create input prompts (i.e., test cases) designed to elicit unfavorable responses from LLMs. This procedure is called red teaming. 
However, relying solely on human testers can be both expensive and time-consuming. Recent works automate red teaming by training LLMs (i.e., red team LLMs) with reinforcement learning (RL) to maximize the chance of eliciting undesirable responses (i.e., successful test cases) from the target LLMs being evaluated. However, while effective at provoking undesired responses, current RL methods lack test case diversity as RL-based methods tend to consistently generate the same few successful test cases once found. To overcome this limitation, we introduce curiosity-driven exploration to train red team models. This approach jointly maximizes the test case effectiveness and novelty. Maximizing novelty motivates the red-team model to search for new and diverse test cases. We evaluate our method by performing red teaming against LLMs in text continuation and instruction following tasks. Our experiments show that curiosity-driven exploration achieves greater diversity in all the experiments compared to existing RL-based red team methods while maintaining effectiveness. Remarkably, curiosity-driven exploration also enhances the effectiveness when performing red teaming in instruction following test cases, generating a higher number of successful test cases. 
We even demonstrate that curiosity-driven exploration successfully provokes toxic responses from the LLaMA2 model that has undergone finetuning based on human preferences.\n\n### Language Models are Super Mario（DARE, Drop And REscale）\n- https://arxiv.org/pdf/2311.03099.pdf\n- https://github.com/yule-BUAA/MergeLM\n\nIn this work, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without the need for retraining or GPUs.\n\n- We introduce a novel operation called DARE to directly set most of (90% or even 99%) the delta parameters to zeros without affecting the capabilities of SFT LMs.\n- We sparsify delta parameters of multiple SFT homologous models with DARE as a general preprocessing technique and subsequently merge them into a single model by parameter averaging.\n\n### TinyGSM\n- https://arxiv.org/abs/2312.09241\n\nSmall-scale models offer various computational advantages, and yet to which extent size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains to be 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 teacher model (77.4%), from which our model's training data is generated. 
Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, 2) the use of a verifier, which selects the final outputs from multiple candidate generations.\n\n### MathPile\n- https://huggingface.co/papers/2312.17120\n- https://gair-nlp.github.io/MathPile/\n- https://github.com/GAIR-NLP/MathPile\n- https://huggingface.co/datasets/GAIR/MathPile\n- https://huggingface.co/datasets/GAIR/MathPile_Commercial\n\nHigh-quality, large-scale corpora are the cornerstone of building powerful foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. \n\n### Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM\n- https://huggingface.co/ChaiML\n- https://arxiv.org/pdf/2401.02994.pdf\n\nIn conversational AI research, there's a noticeable trend towards developing models with a larger number of parameters, exemplified by models like ChatGPT. While these expansive models tend to generate increasingly better chat responses, they demand significant computational resources and memory. This study explores a pertinent question: Can a combination of smaller models collaboratively achieve comparable or enhanced performance relative to a singular large model? We introduce an approach termed \"blending\", a straightforward yet effective method of integrating multiple chat AIs. Our empirical evidence suggests that when specific smaller models are synergistically blended, they can potentially outperform or match the capabilities of much larger counterparts. For instance, integrating just three models of moderate size (6B/13B parameters) can rival or even surpass the performance metrics of a substantially larger model like ChatGPT (175B+ parameters). This hypothesis is rigorously tested using A/B testing methodologies with a large user base on the Chai research platform over a span of thirty days. 
The findings underscore the potential of the \"blending\" strategy as a viable approach for enhancing chat AI efficacy without a corresponding surge in computational demands.\n\n### Personal LLM Agents - Survey\n- https://github.com/MobileLLM/Personal_LLM_Agents_Survey\n- https://arxiv.org/abs/2401.05459\n\nPersonal LLM Agents are defined as a special type of LLM-based agents that are deeply integrated with personal data, personal devices, and personal services. They are preferably deployed to resource-constrained mobile/edge devices and/or powered by lightweight AI models. The main purpose of personal LLM agents is to assist end-users and augment their abilities, helping them to focus more and do better on interesting and important affairs.\n\n### AUTOACT\n- https://arxiv.org/abs/2401.05268\n- https://github.com/zjunlp/AutoAct\n\nLanguage agents have achieved considerable performance on various complex tasks. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model for multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework that does not rely on large-scale annotated data and synthetic trajectories from closed-source models (e.g., GPT-4). Given limited data with a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrates that AutoAct yields better or parallel performance compared to various strong baselines. 
We even notice that AutoAct, when using the Llama-2-13b model, can achieve performance comparable to that of the zero-shot GPT-3.5-Turbo agent.\n\n### MetaGPT\n- https://arxiv.org/abs/2308.00352\n- https://github.com/geekan/MetaGPT\n\nRemarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. \n\n### Multi-LLM-Agent\n- https://github.com/X-PLUG/Multi-LLM-Agent\n- https://arxiv.org/abs/2401.07324\n\nLarge Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete complex tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers but also excel in task planning, memory management, tool invocation, and result summarization. While traditional approaches focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. 
Moreover, the entire LLM may require retraining when tools are updated. To overcome these challenges, we propose a novel strategy that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with other components to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning.\n\n### More Agents Is All You Need\n- https://arxiv.org/abs/2402.05120\n- https://anonymous.4open.science/r/more_agent_is_all_you_need\n\nWe find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method is orthogonal to existing complicated methods to further enhance LLMs, while the degree of enhancement is correlated with the task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify the presence of our finding, and to study the properties that can facilitate its occurrence.\n\n### AIOS\n- https://github.com/agiresearch/AIOS\n\nAIOS, a Large Language Model (LLM) Agent operating system, embeds large language models into the operating system (OS) as the brain of the OS, enabling an operating system \"with soul\" -- an important step towards AGI. 
AIOS is designed to optimize resource allocation, facilitate context switching across agents, enable concurrent execution of agents, provide tool service for agents, maintain access control for agents, and provide a rich set of toolkits for LLM Agent developers.\n\n### TwoStep\n- https://arxiv.org/abs/2403.17246\n- https://glamor-usc.github.io/twostep\n\nClassical planning formulations like the Planning Domain Definition Language (PDDL) admit action sequences guaranteed to achieve a goal state given an initial state if any are possible. However, reasoning problems defined in PDDL do not capture temporal aspects of action taking, for example, that two agents in the domain can execute an action simultaneously if postconditions of each do not interfere with preconditions of the other. A human expert can decompose a goal into largely independent constituent parts and assign each agent to one of these subgoals to take advantage of simultaneous actions for faster execution of plan steps, each using only single agent planning. By contrast, large language models (LLMs) used for directly inferring plan steps do not guarantee execution success, but do leverage commonsense reasoning to assemble action sequences. We combine the strengths of classical planning and LLMs by approximating human intuitions for two-agent planning goal decomposition. We demonstrate that LLM-based goal decomposition leads to faster planning times than solving multi-agent PDDL problems directly while simultaneously achieving fewer plan execution steps than a single agent plan alone and preserving execution success. 
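The decomposition idea is easy to sketch: an LLM splits the goal set into roughly independent halves, each half is handed to a single-agent planner, and the two plans execute simultaneously, so wall-clock steps equal the longer plan. The functions below are illustrative stand-ins (not from the TwoStep codebase); `llm_decompose` is hard-coded where the paper would prompt an LLM:

```python
# Hypothetical stand-ins: `llm_decompose` plays the role of the LLM
# goal decomposition; `single_agent_plan` stands in for a classical
# (PDDL-style) planner that here emits one action per subgoal.
def llm_decompose(goals):
    mid = len(goals) // 2
    return goals[:mid], goals[mid:]

def single_agent_plan(subgoals):
    return [f"achieve({g})" for g in subgoals]

def two_agent_makespan(goals):
    sub_a, sub_b = llm_decompose(goals)
    plan_a = single_agent_plan(sub_a)
    plan_b = single_agent_plan(sub_b)
    # The agents act simultaneously, so the number of execution
    # steps is the length of the longer of the two plans.
    return max(len(plan_a), len(plan_b))

goals = ["on(a,b)", "on(c,d)", "clear(e)", "holding(f)"]
steps = two_agent_makespan(goals)  # vs. 4 sequential single-agent steps
```

In the real system, the decomposition must also respect pre/postcondition interference between the agents, which this toy split ignores.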
Additionally, we find that LLM-based approximations of subgoals can achieve similar multi-agent execution steps to those specified by human experts.\n\n### Agent-FLAN\n- https://arxiv.org/abs/2403.12881\n- https://github.com/InternLM/Agent-FLAN\n\nOpen-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks; however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both format following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5\\% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Moreover, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs.\n\n### EasyRL4Rec\n- https://github.com/chongminggao/EasyRL4Rec\n\nEasyRL4Rec is a comprehensive and easy-to-use library designed specifically for Reinforcement Learning (RL)-based Recommender Systems (RSs). This library provides lightweight and diverse RL environments based on five public datasets and includes core modules with rich options, simplifying model development. 
It provides unified evaluation standards focusing on long-term outcomes and offers tailored designs for state modeling and action representation for recommendation scenarios. \n\n### Jan\n- https://github.com/janhq/jan\n\nJan is an open-source ChatGPT alternative that runs 100% offline on your computer.\n\n### AgentStudio\n- https://skyworkai.github.io/agent-studio/\n- https://arxiv.org/abs/2403.17918\n\nAgentStudio is an open toolkit covering the entire lifespan of building virtual agents that can interact with everything in digital worlds. Here, we open-source the beta of environment implementations, benchmark suite, data collection pipeline, and graphical interfaces to promote research towards generalist virtual agents of the future.\n\n### AnyTool\n- https://arxiv.org/abs/2402.04253\n- https://github.com/dyabel/AnyTool\n\nWe introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism, which re-activates AnyTool if the initial solution proves impracticable. AnyTool is powered by the function calling feature of GPT-4, eliminating the need for training external modules. We also revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate. By revising the evaluation protocol to better reflect practical application scenarios, we introduce an additional benchmark, termed AnyToolBench. Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. 
For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. \n\n### TinyAgent\n- https://github.com/KMnO4-zx/TinyAgent\n\nAfter ChatGPT burst onto the scene and took the crown from BERT, large language models have only grown hotter, with new models in China appearing one after another in what has been dubbed the \"war of a hundred models\". The capabilities of large models are beyond doubt, but on real-time questions, or questions in certain specialized domains, they can fall short. We therefore need tools to empower large models: giving a model a handle on what is happening in the real world keeps it aligned with reality, and yields a much more useful model.\n\nThis project builds a minimal Agent structure in the ReAct style (in practice it is mostly tool calling); over the summer the author plans to try replacing the ReAct structure with an SOP structure.\n\n### TinyAgent\n- https://github.com/SqueezeAILab/TinyAgent\n\nTinyAgent aims to enable complex reasoning and function calling capabilities in Small Language Models (SLMs) that can be deployed securely and privately at the edge. Traditional Large Language Models (LLMs) like GPT-4 and Gemini-1.5, while powerful, are often too large and resource-intensive for edge deployment, posing challenges in terms of privacy, connectivity, and latency. TinyAgent addresses these challenges by training specialized SLMs with high-quality, curated data, and focusing on function calling with LLMCompiler. As a driving application, TinyAgent can interact with various MacOS applications, assisting users with day-to-day tasks such as composing emails, managing contacts, scheduling calendar events, and organizing Zoom meetings.\n\n### Tree Search for Language Model Agents\n- https://arxiv.org/abs/2407.01476\n- https://jykoh.com/search-agents\n\nAutonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. 
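The general shape of such a search, best-first expansion over environment states, fits in a few lines. The toy `score` and `expand` functions below are illustrative assumptions; in the paper both the value estimates and the candidate actions come from the LM and a real web environment:

```python
import heapq

def best_first_search(start, score, expand, budget=20):
    """Greedy best-first tree search over environment states: repeatedly
    pop the highest-scoring frontier state and expand it with candidate
    actions. `score` plays the role of the agent's value function."""
    frontier = [(-score(start), start)]
    best = start
    for _ in range(budget):
        if not frontier:
            break
        neg, state = heapq.heappop(frontier)
        if -neg > score(best):
            best = state
        for child in expand(state):
            heapq.heappush(frontier, (-score(child), child))
    return best

# Toy environment: states are integers, the goal state is 42, and the
# value function rewards proximity to it.
best = best_first_search(
    0,
    score=lambda s: -abs(42 - s),
    expand=lambda s: [s + 1, s + 10] if s < 42 else [],
)
```

The `budget` parameter is the test-time compute knob: more expansions let the agent explore more of the tree before committing.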
Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary to most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. \n\n### octo-planner\n- https://www.nexa4ai.com/octo-planner#video\n- https://arxiv.org/pdf/2406.18082\n- https://huggingface.co/NexaAIDev/octopus-planning\n\nWe're thrilled to introduce the Octo-planner, the latest breakthrough in on-device language models from Nexa AI. Developed for the Planner-Action Agents Framework, Octo-planner enables rapid and efficient planning without the need for cloud connectivity. Together with Octopus-V2, this model can run locally on edge devices to support AI-agent use cases.\n\n### MindSearch\n- https://github.com/InternLM/MindSearch\n\nMindSearch is an open-source AI Search Engine Framework with Perplexity.ai Pro performance. You can simply deploy it as your own perplexity.ai-style search engine with either closed-source LLMs (GPT, Claude) or open-source LLMs (InternLM2.5-7b-chat). \n\n### AgentInstruct\n- https://arxiv.org/pdf/2407.03502\n\nSynthetic data is becoming increasingly important for accelerating the development of language models, both large and small. 
Despite several successful use cases, researchers have also raised concerns around model collapse and drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically data created by powerful models to teach a new skill or behavior to another model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model Orca-3 to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.\n\n### AgentCourt\n- https://doi.org/10.48550/arXiv.2408.08089\n- https://github.com/relic-yuexi/AgentCourt\n\nIn this paper, we present a simulation system called AgentCourt that simulates the entire courtroom process. The judge, plaintiff's lawyer, defense lawyer, and other participants are autonomous agents driven by large language models (LLMs). 
Our core goal is to enable lawyer agents to learn how to argue a case, as well as to improve their overall legal skills, through courtroom process simulation. To achieve this goal, we propose an adversarial evolutionary approach for the lawyer-agent. Since AgentCourt can simulate the occurrence and development of court hearings based on a knowledge base and LLM, the lawyer agents can continuously learn and accumulate experience from real court cases. The simulation experiments show that after two lawyer-agents have engaged in a thousand adversarial legal cases in AgentCourt (which could take a decade for real-world lawyers), compared to their pre-evolutionary state, the evolved lawyer agents exhibit consistent improvement in their ability to handle legal tasks. To enhance the credibility of our experimental results, we enlisted a panel of professional lawyers to evaluate our simulations. The evaluation indicates that the evolved lawyer agents exhibit notable advancements in responsiveness, as well as expertise and logical rigor. This work paves the way for advancing LLM-driven agent technology in legal scenarios. \n\n### AI-Scientist\n- https://github.com/SakanaAI/AI-Scientist\n\nOne of the grand challenges of artificial intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used to aid human scientists, e.g. 
for brainstorming ideas or writing code, they still require extensive manual supervision or are heavily constrained to a specific task.\n\nWe're excited to introduce The AI Scientist, the first comprehensive system for fully automatic scientific discovery, enabling Foundation Models such as Large Language Models (LLMs) to perform research independently.\n\n### RD-Agent\n- https://github.com/microsoft/RD-Agent\n\nRDAgent aims to automate the most critical and valuable aspects of the industrial R&D process, and we begin by focusing on data-driven scenarios to streamline the development of models and data. Methodologically, we have identified a framework with two key components: 'R' for proposing new ideas and 'D' for implementing them. We believe that the automatic evolution of R&D will lead to solutions of significant industrial value.\n\n### AFlow: Automating Agentic Workflow Generation\n- https://arxiv.org/abs/2410.10762\n- https://github.com/geekan/MetaGPT\n\nLarge language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. 
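AFlow proper searches with MCTS over code-represented workflows; as a drastically simplified stand-in, the skeleton of "modify a candidate, keep it only when execution feedback improves" can be sketched as a seeded hill climb where workflows are lists of integers rather than code (all names here are illustrative, not from the AFlow codebase):

```python
import random

def run_workflow(workflow, case):
    # Execution feedback: score a candidate workflow on a benchmark
    # case (0 is a perfect score in this toy setup).
    return -abs(sum(workflow) - case)

def search_workflows(case, iters=200, seed=0):
    """Modify-and-keep-on-feedback loop: start from a trivial workflow,
    repeatedly apply a small local modification, and keep the modified
    candidate only when execution feedback strictly improves."""
    rng = random.Random(seed)
    best = [0, 0, 0]
    for _ in range(iters):
        cand = list(best)
        cand[rng.randrange(3)] += rng.choice([-1, 1])
        if run_workflow(cand, case) > run_workflow(best, case):
            best = cand
    return best

best = search_workflows(case=7)
```

The real system replaces the local edit with LLM-proposed code modifications and reuses tree-structured experience across branches instead of keeping a single incumbent.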
Empirical evaluations across six benchmark datasets demonstrate AFlow's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. \n\n### swarm\n- https://github.com/openai/swarm\n\nAn educational framework exploring ergonomic, lightweight multi-agent orchestration.\n\n### FinVision\n- https://arxiv.org/abs/2411.08899\n\nFinancial trading has been a challenging task, as it requires the integration of vast amounts of data from various modalities. Traditional deep learning and reinforcement learning methods require large training data and often involve encoding various data types into numerical formats for model input, which limits the explainability of model behavior. Recently, LLM-based agents have demonstrated remarkable advancements in handling multi-modal data, enabling them to execute complex, multi-step decision-making tasks while providing insights into their thought processes. This research introduces a multi-modal multi-agent system designed specifically for financial trading tasks. Our framework employs a team of specialized LLM-based agents, each adept at processing and interpreting various forms of financial data, such as textual news reports, candlestick charts, and trading signal charts. A key feature of our approach is the integration of a reflection module, which conducts analyses of historical trading signals and their outcomes. This reflective process is instrumental in enhancing the decision-making capabilities of the system for future trading scenarios. 
Furthermore, the ablation studies indicate that the visual reflection module plays a crucial role in enhancing the decision-making capabilities of our framework.\n\n### Agent Mental Clinic (AMC)\n- https://arxiv.org/abs/2409.15084\n\nMental health issues, particularly depressive disorders, present significant challenges in contemporary society, necessitating the development of effective automated diagnostic methods. This paper introduces the Agent Mental Clinic (AMC), a self-improving conversational agent system designed to enhance depression diagnosis through simulated dialogues between patient and psychiatrist agents. To enhance the dialogue quality and diagnosis accuracy, we design a psychiatrist agent consisting of a tertiary memory structure, a dialogue control and reflection plugin that acts as a \"supervisor\", and a memory sampling module, fully leveraging the skills reflected by the psychiatrist agent, achieving great accuracy on depression risk and suicide risk diagnosis via conversation. Experiment results on datasets collected in real-life scenarios demonstrate that the system, simulating the procedure of training psychiatrists, can be a promising optimization method for aligning LLMs with real-life distribution in specific domains without modifying the weights of LLMs, even when only a few representative labeled cases are available.\n\n### MedAI\n- https://arxiv.org/abs/2410.04660\n\nBiomedical knowledge is uniquely complex and structured, requiring distinct reasoning strategies compared to other scientific disciplines like physics or chemistry. Biomedical scientists do not rely on a single approach to reasoning; instead, they use various strategies, including rule-based, prototype-based, and case-based reasoning. This diversity calls for flexible approaches that accommodate multiple reasoning strategies while leveraging in-domain knowledge. 
We introduce KGARevion, a knowledge graph (KG) based agent designed to address the complexity of knowledge-intensive medical queries. Upon receiving a query, KGARevion generates relevant triplets by using the knowledge base of the LLM. These triplets are then verified against a grounded KG to filter out erroneous information and ensure that only accurate, relevant data contribute to the final answer. Unlike RAG-based models, this multi-step process ensures robustness in reasoning while adapting to different models of medical reasoning. Evaluations on four gold-standard medical QA datasets show that KGARevion improves accuracy by over 5.2%, outperforming 15 models in handling complex medical questions. To test its capabilities, we curated three new medical QA datasets with varying levels of semantic complexity, where KGARevion achieved a 10.4% improvement in accuracy.\n\n### Agent-0\n- https://github.com/PromtEngineer/Agent-0\n\nAgent-0: Replicating O1's Chain of Thought Reasoning\n\nThis project is a proof of concept that aims to replicate the reasoning capabilities of OpenAI's newly released O1 model. O1 uses chain-of-thought prompting and reinforcement learning to reflect on its solutions, improving responses through iterative reasoning. Our goal is to mimic this behavior using alternative models.\n\nIn this implementation, we use a sequential agent-based system powered by the Gemini API (or any model with function-calling capabilities). The system proposes solutions to coding-related problems and iteratively refines them using chain-of-thought and reflection techniques at each stage. The Gemini API, with its code execution abilities, is ideal for this project. 
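The propose–reflect loop at the heart of this kind of system fits in a few lines. The `propose`, `critique`, and `revise` callables below are hypothetical stand-ins for calls to a function-calling model such as Gemini, not part of the Agent-0 codebase:

```python
def refine(problem, propose, critique, revise, max_rounds=3):
    """Chain-of-thought refinement: draft a solution, reflect on it,
    and revise until the critique passes or the round budget runs out."""
    solution = propose(problem)
    for _ in range(max_rounds):
        feedback = critique(problem, solution)
        if feedback == "ok":
            break
        solution = revise(problem, solution, feedback)
    return solution

# Toy stand-ins: the "model" pads its draft until the critic is happy.
result = refine(
    "toy problem",
    propose=lambda p: "draft",
    critique=lambda p, s: "ok" if len(s) >= 10 else "too short",
    revise=lambda p, s, f: s + " +more",
)
```

In the actual system the critique step would also execute the proposed code, so the feedback is grounded in real results rather than the model's own judgment alone.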
While it works with Gemini Flash, we recommend using the Pro version to avoid issues with external package dependencies, as the Pro version generally sticks to Python's standard library.\n\n### Large Language Model-Brained GUI Agents: A Survey\n- https://arxiv.org/abs/2411.18279\n\nGUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry.\n\nTo provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. 
By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.\n\n### Building effective agents\n- https://www.anthropic.com/research/building-effective-agents\n\nOver the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.\n\nIn this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.\n\n### UI-TARS\n- https://github.com/bytedance/UI-TARS\n\nUI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.\n\n### PaSa\n- https://github.com/bytedance/pasa\n\nPaSa -- an advanced paper search agent powered by large language models. 
It can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries.\n\n### Docling\n- https://github.com/DS4SD/docling\n\nDocling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.\n\n### Eko\n- https://eko.fellou.ai\n- https://github.com/FellouAI/eko\n- https://eko.fellou.ai/docs\n\nEko (pronounced like ‘echo’) is a production-ready JavaScript framework that enables developers to create reliable agents, from simple commands to complex workflows. It provides a unified interface for running agents in both computer and browser environments.\n\n### Search-o1\n- https://arxiv.org/abs/2501.05366\n- https://huggingface.co/papers/2501.05366\n- https://github.com/sunnynexus/Search-o1\n\nLarge Reasoning Models (LRMs) like OpenAI's o1 have showcased remarkable long stepwise reasoning capabilities through large-scale reinforcement learning. Despite their strengths, these models often encounter knowledge insufficiencies during prolonged reasoning processes, resulting in frequent uncertainties and potential errors, as shown in the following figure.\n\n### CogAgent\n- https://arxiv.org/abs/2312.08914\n- https://github.com/THUDM/CogAgent\n\nCogAgent-9B-20241220 model is based on GLM-4V-9B, a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, CogAgent-9B-20241220 achieves significant advancements in GUI perception, inference prediction accuracy, action space completeness, and generalizability across tasks. The model supports bilingual (Chinese and English) interaction with both screenshots and language input. This version of the CogAgent model has already been applied in ZhipuAI's GLM-PC product. 
We hope the release of this model can assist researchers and developers in advancing the research and applications of GUI agents based on vision-language models.\n\n### Proactive Agent\n- https://arxiv.org/abs/2410.12361\n\nAgents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. 
These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.\n\n### Open-source DeepResearch\n- https://huggingface.co/blog/open-deep-research\n\nWe decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!\n\n### RAGEN\n- https://github.com/ZihanWang314/ragen\n\nRAGEN is the first reproduction of the DeepSeek-R1(-Zero) methods for training agentic models.\nWe strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.\n\n### smolagents\n- https://huggingface.co/blog/smolagents\n\n🤗 smolagents: a barebones library for agents. Agents write Python code to call tools and orchestrate other agents.\n\n### Open Deep Research\n- https://github.com/btahir/open-deep-research\n\nAn open-source alternative to Gemini Deep Research, built to generate AI-powered reports from web search results with precision and efficiency. Supporting multiple AI platforms (Google, OpenAI, Anthropic) and models, it offers flexibility in choosing the right AI model for your research needs.\n\n### Octopus v2\n- https://arxiv.org/abs/2404.01744\n- https://huggingface.co/NexaAIDev/Octopus-v2\n\nLanguage models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale language models in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95\\%. 
When compared to Llama-7B with a RAG-based function calling mechanism, our method improves latency by 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.\n\n### ReadAgent\n- https://arxiv.org/abs/2402.09727\n- https://read-agent.github.io/\n- https://github.com/read-agent/read-agent.github.io/blob/main/assets/read_agent_demo.ipynb\n\nInspired by how humans interactively read long documents, we implement ReadAgent as a simple prompting system that uses the advanced language capabilities of LLMs to (1) decide what content to store together in a memory episode, (2) compress those memory episodes into short episodic memories called gist memories, and (3) take actions to look up passages in the original text if ReadAgent needs to remind itself of relevant details to complete a task. We evaluate ReadAgent against baselines using retrieval methods, using the original long contexts, and using the gist memories. These evaluations are performed on three long-document reading comprehension tasks: QuALITY (max 6,000 words), NarrativeQA (max 343,000 words), and QMSum (max 26,300 words). ReadAgent outperforms the baselines on all three tasks while extending the effective context window by 3-20x.\n\nIn addition, we adapt ReadAgent to web navigation, which is a fundamentally very-long-context agent setting.
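ReadAgent's three steps above (paginate into episodes, gist each episode, look up originals on demand) can be caricatured without an LLM. In this sketch, gisting is faked by truncation and lookup by keyword match; both are stand-ins for the model calls the paper actually uses:

```python
def paginate(text, page_size=50):
    """Split a long text into 'episodes' of roughly page_size characters."""
    words, pages, cur = text.split(), [], []
    for w in words:
        cur.append(w)
        if sum(len(x) + 1 for x in cur) >= page_size:
            pages.append(" ".join(cur))
            cur = []
    if cur:
        pages.append(" ".join(cur))
    return pages

def gist(page, n_words=3):
    """Stand-in for LLM compression: keep the first few words of the episode."""
    return " ".join(page.split()[:n_words]) + " ..."

def answer(question_keyword, pages, gists):
    """Reason over the short gists, then expand only the relevant full episode."""
    for i, g in enumerate(gists):
        if question_keyword in g:
            return pages[i]  # look up the original passage on demand
    return None
```

The point of the design is that the model only ever attends to the short gists plus one or two expanded pages, which is what stretches the effective context window.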
We find that ReadAgent is simple to adapt to this setting and shows promising performance.\n\n### STORM\n- https://github.com/stanford-oval/storm\n- https://arxiv.org/abs/2402.14207\n\nSTORM is an LLM system that writes Wikipedia-like articles from scratch based on Internet search.\nWhile the system cannot produce publication-ready articles (which often require a significant number of further edits), experienced Wikipedia editors have found it helpful in their pre-writing stage.\n\n### AgentRun\n- https://github.com/Jonathan-Adly/AgentRun\n\nAgentRun is a Python library that makes it easy to run Python code safely from large language models (LLMs) with a single line of code. Built on top of the Docker Python SDK and RestrictedPython, it provides a simple, transparent, and user-friendly API to manage isolated code execution.\n\n### OS-Copilot\n- https://os-copilot.github.io/\n- https://arxiv.org/pdf/2402.07456.pdf\n- https://github.com/OS-Copilot/OS-Copilot\n\nAutonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific software or website. This narrow focus constrains their applicability for general computer tasks. To this end, we introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks.
We also present numerical and quantitative evidence that FRIDAY learns to control and self-improve on Excel and PowerPoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.\n\n### AutoWebGLM\n- https://github.com/THUDM/AutoWebGLM\n- https://arxiv.org/pdf/2404.03648.pdf\n\nAutoWebGLM is a project aimed at building a more efficient language model-driven automated web navigation agent. This project is built on top of the ChatGLM3-6B model, extending its capabilities to navigate the web more effectively and tackle real-world browsing challenges better.\n\n### Agent Hospital\n- https://arxiv.org/pdf/2405.02957\n\nIn this paper, we introduce a simulacrum of a hospital, called Agent Hospital, that simulates the entire process of treating illness. All patients, nurses, and doctors are autonomous agents powered by large language models (LLMs). Our central goal is to enable a doctor agent to learn how to treat illness within the simulacrum. To do so, we propose a method called MedAgent-Zero. As the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents can keep accumulating experience from both successful and unsuccessful cases. Simulation experiments show that the treatment performance of doctor agents consistently improves on various tasks. More interestingly, the knowledge the doctor agents have acquired in Agent Hospital is applicable to real-world medical benchmarks. After treating around ten thousand patients (real-world doctors may take over two years), the evolved doctor agent achieves a state-of-the-art accuracy of 93.06% on a subset of the MedQA dataset that covers major respiratory diseases.
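MedAgent-Zero's core loop, accumulating experience from successful and unsuccessful cases and consulting it on new patients, can be sketched with a toy case memory. The keyword-overlap matching here is a naive stand-in; the paper's doctor agents use LLM-driven retrieval and reflection:

```python
class CaseMemory:
    """Toy experience store for a doctor agent."""

    def __init__(self):
        self.cases = []  # (symptom set, diagnosis, was_correct)

    def record(self, symptoms, diagnosis, was_correct):
        self.cases.append((set(symptoms), diagnosis, was_correct))

    def suggest(self, symptoms):
        """Return the diagnosis from the most similar *successful* past case."""
        symptoms = set(symptoms)
        best, best_overlap = None, 0
        for past, diagnosis, ok in self.cases:
            overlap = len(symptoms & past)
            if ok and overlap > best_overlap:
                best, best_overlap = diagnosis, overlap
        return best
```

Unsuccessful cases are still recorded (the paper learns from both), but only validated experience is surfaced when treating a new patient.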
This work paves the way for advancing the applications of LLM-powered agent techniques in medical scenarios.\n\n### CodeR\n- https://www.swebench.com\n- https://arxiv.org/pdf/2406.01304\n- https://github.com/NL2Code/CodeR\n\nResolving GitHub issues has recently attracted significant attention from academia and industry, and SWE-bench has been proposed to measure performance on this task. In this paper, we propose CodeR, which adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs and add new features within a code Repository. On SWE-bench lite, CodeR is able to solve 28.33% of issues, submitting only once per issue. We examine the performance impact of each design of CodeR and offer insights to advance this research direction.\n\n### Mobile-Agent-v2\n- https://arxiv.org/abs/2406.01014\n- https://github.com/X-PLUG/MobileAgent\n\nMobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient.
To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent.\n\n### Husky\n- https://arxiv.org/html/2406.06469v1\n- https://github.com/agent-husky/Husky-v1\n\nWe introduce Husky-v1, a holistic, open-source language agent that learns to reason over a unified action space to address a diverse set of complex tasks involving numerical, tabular, and knowledge-based reasoning. Husky iterates between two stages: 1) generating the next action to take towards solving a given task, and 2) executing the action using expert models and updating the current solution state. Husky-v1 uses a code generator, a query generator and a math reasoner as expert models.\n\n### APAM\n- https://arxiv.org/abs/2404.04204\n\nPeople rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Drawing upon interdisciplinary research from communication and psychology, this perspective paper identifies social skill barriers to enter specialized fields. Then we present a solution that leverages large language models for social skill training via a generic framework. Our AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback.
This work ultimately calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.\n\n### Mistral-Interact\n- https://github.com/HBX-hbx/Mistral-Interact\n- https://arxiv.org/abs/2402.09205\n\nCurrent language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions. Although adept at devising strategies and performing tasks, these agents struggle with seeking clarification and grasping precise user intentions. To bridge this gap, we introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users' implicit intentions through explicit queries. Next, we propose the incorporation of model experts as the upstream in agent designs to enhance user-agent interaction. Employing IN3, we empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires about user intentions, and refines them into actionable goals before starting downstream agent task execution. Integrating it into the XAgent framework, we comprehensively evaluate the enhanced agent system regarding user instruction understanding and execution, revealing that our approach notably excels at identifying vague user tasks, recovering and summarizing critical missing information, setting precise and necessary agent execution goals, and minimizing redundant tool usage, thus boosting overall efficiency. All data and code are released.\n\n### AgentLite\n- https://github.com/SalesforceAIResearch/AgentLite\n- https://arxiv.org/abs/2402.15538\n\nAgentLite is a research-oriented library designed for building and advancing LLM-based task-oriented agent systems. It simplifies the implementation of new agent/multi-agent architectures, enabling easy orchestration of multiple agents through a manager agent.
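Manager-style orchestration of the kind AgentLite enables can be reduced to a router that delegates subtasks to labeled worker agents. In this sketch the workers are plain functions standing in for LLM-backed agents; none of these names reflect AgentLite's actual API:

```python
class ManagerAgent:
    """Routes each subtask to the worker whose capability tag matches."""

    def __init__(self, workers):
        self.workers = workers  # {capability: callable}

    def run(self, subtasks):
        results = []
        for capability, payload in subtasks:
            worker = self.workers.get(capability)
            if worker is None:
                results.append(f"no worker for {capability!r}")
            else:
                results.append(worker(payload))
        return results

# Hypothetical workers; in a real system each would wrap an LLM agent.
workers = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: text[:10] + "...",
}
manager = ManagerAgent(workers)
```

The manager itself can also be LLM-backed, deciding the subtask decomposition and routing dynamically rather than from a fixed list.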
Whether you're building individual agents or complex multi-agent systems, AgentLite provides a straightforward and lightweight foundation for your research and development.\n\n### KnowAgent\n- https://www.zjukg.org/project/KnowAgent/\n- https://arxiv.org/abs/2403.03101\n- https://github.com/zjunlp/KnowAgent\n\nOur development is grounded on several key steps: Initially, we create an extensive action knowledge base, which amalgamates action planning knowledge pertinent to specific tasks. This database acts as an external reservoir of information, steering the model's action generation process. Subsequently, by converting action knowledge into text, we enable the model to deeply understand and utilize this knowledge in creating action trajectories. Finally, through a knowledgeable self-learning phase, we use trajectories developed from the model's iterative processes to continually improve its understanding and application of action knowledge. This process not only strengthens the agents' planning abilities but also enhances their potential for application in complex situations.\n\n### LlamaGym\n- https://github.com/KhoomeiK/LlamaGym\n\n\"Agents\" originated in reinforcement learning, where they learn by interacting with an environment and receiving a reward signal. However, LLM-based agents today do not learn online (i.e. continuously in real time) via reinforcement.\n\nOpenAI created Gym to standardize and simplify RL environments, but if you try dropping an LLM-based agent into a Gym environment for training, you'd find it's still quite a bit of code to handle LLM conversation context, episode batches, reward assignment, PPO setup, and more.\n\nLlamaGym seeks to simplify fine-tuning LLM agents with RL. 
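A minimal Gym-style episode loop of the kind LlamaGym wraps looks as follows. The environment and the agent's act/learn hooks are toy stand-ins (no LLM, no PPO); the sketch only shows the control flow such a library has to manage around conversation context and reward assignment:

```python
class CountdownEnv:
    """Toy Gym-like environment: reach zero in as few steps as possible."""

    def reset(self):
        self.state = 5
        return self.state

    def step(self, action):  # action: amount to subtract
        self.state = max(self.state - action, 0)
        reward = 1.0 if self.state == 0 else -0.1
        done = self.state == 0
        return self.state, reward, done

class GreedyAgent:
    def act(self, observation):
        return observation  # subtract everything at once

    def learn(self, episode):
        pass  # an RL update (e.g. PPO on the LLM's weights) would go here

def run_episode(env, agent):
    obs, done, episode = env.reset(), False, []
    while not done:
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        episode.append((action, reward))
    agent.learn(episode)
    return sum(r for _, r in episode)
```

For an LLM agent, `act` would render the observation into the conversation, sample a completion, and parse it into an action, which is exactly the boilerplate the library aims to absorb.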
Right now, it's a single Agent abstract class that handles all the issues mentioned above, letting you quickly iterate and experiment with agent prompting & hyperparameters across any Gym environment.\n\n### WorkArena\n- https://arxiv.org/abs/2403.07718\n- https://github.com/ServiceNow/WorkArena\n- https://github.com/ServiceNow/BrowserGym\n\nWe study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 29 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.\n\n### STE (Simulated Trial and Error)\n- https://arxiv.org/abs/2403.04746\n- https://github.com/microsoft/simulated-trial-and-error\n\nTools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.\n\n### AgentBench\n- https://llmbench.ai/agent\n- https://github.com/THUDM/AgentBench\n\nWe propose AgentBench, a multi-dimensional, evolving benchmark comprising 8 distinct environments for evaluating the reasoning and decision-making abilities of large language models (LLMs) in multi-turn, open-ended generation settings. Extensive testing of 25 language models shows that top commercial LLMs perform strongly in complex environments, with a significant gap remaining between them and open-source models.\n\n### 中文MT-Bench\n- https://github.com/HIT-SCIR/huozi\n\nThis dataset is the Chinese version of the English MT-Bench dialogue-capability evaluation dataset. It contains a series of multi-turn dialogue questions, each of which has been carefully human-proofread and adjusted where necessary to fit the Chinese-language context.\n\n### E-EVAL\n- https://eevalbenchmark.com\n- https://github.com/AI-EDU-LAB/E-EVAL\n- https://arxiv.org/abs/2401.15927\n\nE-EVAL is a comprehensive Chinese K12 education evaluation benchmark that contains 4,352 multiple-choice questions across three difficulty levels, primary, middle and high school, for a total of 23 subjects.\n\n### ConflictingQA\n- https://arxiv.org/abs/2402.11782\n- https://github.com/AlexWan0/rag-convincingness\n\nRetrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as \"is aspartame linked to cancer\". To resolve these ambiguous queries, one must search through a large range of websites and consider \"which, if any, of this evidence do I find convincing?\".
In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.\n\n### Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE)\n- https://arxiv.org/abs/2402.13178\n- https://teddy-xionggz.github.io/benchmark-medical-rag/\n\nWhile large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work.
Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the \"lost-in-the-middle\" effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.\n\n### ∞Bench\n- https://arxiv.org/abs/2402.13718\n- https://github.com/OpenBMB/InfiniteBench\n\nProcessing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently a lack of a standardized benchmark to evaluate this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs in processing longer contexts. In this paper, we propose ∞Bench, the first LLM benchmark featuring an average data length surpassing 100K tokens. ∞Bench comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks in ∞Bench are designed to require a thorough understanding of long dependencies in context, so that simply retrieving a limited number of passages is not sufficient to solve them. In our experiments, based on ∞Bench, we evaluate the state-of-the-art proprietary and open-source LLMs tailored for processing long contexts. The results indicate that existing long context LLMs still require significant advancements to effectively process 100K+ contexts.
We further present three intriguing analyses regarding the behavior of LLMs processing long context.\n\n### Red Teaming Resistance Benchmark\n- https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark\n- https://github.com/haizelabs/redteaming-resistance-benchmark\n\nHello! This repository contains the code and data used to benchmark redteaming prompts against various models as seen in our Huggingface Leaderboard. This project aims to reveal weaknesses in both open-source and black-box language models through redteaming attacks covering a diverse range of behaviors and topics.\n\n### Fin-Eva\n- https://github.com/alipay/financial_evaluation_dataset\n\nAnt Group and Shanghai University of Finance and Economics (SUFE) have jointly released the financial evaluation benchmark Fin-Eva Version 1.0, covering multiple financial scenarios such as wealth management, insurance, and investment research, as well as professional finance subjects, with more than 13,000 evaluation questions in total.\n\nAnt Group's data sources include data from its various business domains and public Internet data, processed through data anonymization, text clustering, corpus filtering, and data rewriting, and finalized with review by financial domain experts. The SUFE data are mainly based on the knowledge outlines of authoritative examinations in the relevant fields, drawing on real and mock exam questions; this effort was led by Associate Professor Zhang Liwen's group at SUFE's School of Statistics and Management, with assistance from Associate Professor Min Min of the School of Finance and faculty from other schools. All data are original, ensuring the accuracy and authority of the sources.\n\nThe Ant Group portion covers five capability categories (financial cognition, financial knowledge, financial logic, content generation, and safety compliance) across 33 sub-dimensions, with 8,445 evaluation questions in total; the SUFE portion covers four domains (finance, economics, accounting, and certification) with 4,661 questions spanning 34 different subjects.\n\n### Cappy\n- https://blog.research.google/2024/03/cappy-outperforming-and-boosting-large.html\n- https://arxiv.org/abs/2311.06720\n\nIn “Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer”, presented at NeurIPS 2023, we propose a novel approach that enhances the performance and efficiency of multi-task LLMs. We introduce a lightweight pre-trained scorer, Cappy, based on continual pre-training on top of RoBERTa with merely 360 million parameters. Cappy takes in an instruction and a candidate response as input, and produces a score between 0 and 1, indicating an estimated correctness of the response with respect to the instruction. Cappy functions either independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance.
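Using a small scorer like Cappy as an auxiliary component amounts to reranking candidate responses by its 0-to-1 correctness estimate. A sketch follows with a stand-in scoring heuristic; Cappy itself is a RoBERTa-based model, not this word-overlap toy:

```python
def toy_score(instruction, response):
    """Stand-in for a learned scorer: fraction of instruction words echoed back."""
    inst = set(instruction.lower().split())
    resp = set(response.lower().split())
    return len(inst & resp) / len(inst) if inst else 0.0

def rerank(instruction, candidates, scorer=toy_score):
    """Pick the candidate the scorer deems most correct for the instruction."""
    return max(candidates, key=lambda c: scorer(instruction, c))
```

Because only the scorer's inputs and outputs are needed, this pattern works with closed-source LLMs: sample several candidates via an API, then let the scorer choose.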
Moreover, Cappy efficiently enables downstream supervision without requiring any finetuning, which avoids the need for back-propagation through LLM parameters and reduces memory requirements. Finally, adaptation with Cappy doesn’t require access to LLM parameters as it is compatible with closed-source multi-task LLMs, such as those only accessible via WebAPIs.\n\n### BAMBOO\n- https://arxiv.org/abs/2309.13345\n- https://github.com/RUCAIBox/BAMBOO\n\nLarge language models (LLMs) have achieved impressive proficiency on NLP tasks of normal length. Recently, multiple studies have committed to extending the context length and enhancing the long text modeling capabilities of LLMs. To comprehensively evaluate the long context ability of LLMs, we propose BAMBOO, a multi-task long context benchmark. BAMBOO has been designed with four principles: comprehensive capacity evaluation, avoidance of data contamination, accurate automatic evaluation, and different length levels. It consists of 10 datasets from 5 different long text understanding tasks, i.e., question answering, hallucination detection, text sorting, language modeling, and code completion, to cover core capacities and various domains of LLMs. We conduct experiments with five long context models on BAMBOO and further discuss four key research questions of long text. We also qualitatively analyze current long context models and point out future directions for enhancing long text modeling capacities.\n\n### Fast-DetectGPT\n- https://openreview.net/forum?id=Bpcgcr8E8Z\n- https://github.com/baoguangsheng/fast-detect-gpt\n- https://arxiv.org/abs/2310.05130\n\nLarge language models (LLMs) have shown the ability to produce fluent and cogent content, presenting both productivity opportunities and societal risks. To build trustworthy AI systems, it is imperative to distinguish between machine-generated and human-authored content.
The leading zero-shot detector, DetectGPT, showcases commendable performance but is marred by its intensive computational costs. In this paper, we introduce the concept of conditional probability curvature to elucidate discrepancies in word choices between LLMs and humans within a given context. Utilizing this curvature as a foundational metric, we present **Fast-DetectGPT**, an optimized zero-shot detector, which substitutes DetectGPT's perturbation step with a more efficient sampling step. Our evaluations on various datasets, source models, and test conditions indicate that Fast-DetectGPT not only surpasses DetectGPT by around 75% in relative terms in both the white-box and black-box settings, but also accelerates the detection process by a factor of 340, as detailed in Table 1.\n\n### GAMA-Bench\n- https://github.com/CUHK-ARISE/GAMABench\n- https://arxiv.org/abs/2403.11807\n\nDecision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the participation of more than two agents simultaneously. Subsequently, we introduce our framework, GAMA-Bench, including eight classical multi-agent games. We design a scoring scheme to assess a model's performance in these games quantitatively. Through GAMA-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we conduct evaluations across various LLMs and find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of 72.5.
Moreover, the increasingly higher scores across the three iterations of GPT-3.5 (0613, 1106, 0125) demonstrate marked advancements in the model's intelligence with each update.\n\n### FineMath\n- https://arxiv.org/pdf/2403.07747.pdf\n\nTo thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath in this paper, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of the mathematical reasoning abilities of LLMs. All 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvement in the mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis of the evaluation process and methods that have been overlooked previously. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities.\n\n### ToolEmu\n- http://toolemu.com/\n- https://arxiv.org/pdf/2309.15817.pdf\n\nRecent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks - such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, manually setting up the environment for each test scenario, and finding risky cases.
As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. To address these challenges, we introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios, without manual instantiation. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment.\n\n### CLongEval\n- https://arxiv.org/abs/2403.03514\n- https://github.com/zexuanqiu/CLongEval\n\nDeveloping Large Language Models (LLMs) with robust long-context capabilities has been the recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating models with context window sizes from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels.
With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide in-depth analysis based on the empirical results, trying to shed light on the critical capabilities that present challenges in long-context settings. The dataset, evaluation scripts, and model outputs will be released.\n\n### Counting-Stars\n- https://arxiv.org/abs/2403.11802\n- https://github.com/nick7nlp/Counting-Stars\n\nWhile recent research endeavors have concentrated on developing Large Language Models (LLMs) with robust long-context capabilities, due to the lack of appropriate evaluation strategies, relatively little is known about the long-context capability and performance of leading LLMs (e.g., GPT-4 Turbo and Kimi Chat). To address this gap, we propose a simple, efficient, and reasonable strategy for evaluating long-context LLMs as a new benchmark, named Counting-Stars. Counting-Stars is designed to require LLMs to fully understand and capture long dependencies in long contexts, and further to collect inter-dependencies across multiple pieces of evidence spanning the entire context to finish the task. Based on Counting-Stars, we conduct experiments to evaluate the two leading long-context LLMs, i.e., GPT-4 Turbo and Kimi Chat. The experimental results indicate that GPT-4 Turbo and Kimi Chat perform well on long contexts from 4K to 128K. We further present several intriguing analyses regarding the behavior of LLMs processing long context.\n\n### InfiCoder-Eval\n- https://arxiv.org/abs/2404.07940\n- https://infi-coder.github.io/inficoder-eval/\n\nLarge Language Models for understanding and generating code (code LLMs) have witnessed tremendous progress in recent years.
With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which span beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiCoder-Eval, a large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages. To evaluate response correctness, InfiCoder-Eval supports four types of model-free metrics, and domain experts carefully choose and concretize the criteria for each question. We conduct a systematic evaluation of more than 80 code LLMs on InfiCoder-Eval, leading to a series of insightful findings. Furthermore, our detailed analyses showcase possible directions for further improvement of code LLMs.\n\n### MathVerse\n- https://arxiv.org/pdf/2403.14624.pdf\n- https://mathverse-cuhk.github.io/\n- https://github.com/ZrrSkywalker/MathVerse\n- https://huggingface.co/datasets/AI4Math/MathVerse\n\nWe introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into 6 distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers.
Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality of MLLMs.\n\n### CoderUJB\n- https://arxiv.org/abs/2403.19287\n- https://github.com/WisdomShell/ujb\n\nCoderUJB (Unified Java Benchmark): A new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java’s prevalence in real-world software production.\n\n### LooGLE\n- https://arxiv.org/abs/2311.04939\n- https://huggingface.co/datasets/bigainlco/LooGLE\n- https://github.com/bigai-nlco/LooGLE\n\nLooGLE is a comprehensive evaluation benchmark for LLM long-context understanding, which contains up-to-date (all after 2022) and extremely long realistic documents (over 24k tokens per document, many of which exceed 100k words) and 6,000 newly generated questions spanning diverse domains and categories.\n\n### McEval\n- https://github.com/MCEVAL/McEval\n- https://mceval.github.io/leaderboard.html\n- https://arxiv.org/abs/2406.07436\n\nTo probe the code capabilities of large language models more comprehensively, this work proposes McEval, a large-scale multilingual, multi-task code evaluation benchmark covering 40 programming languages and containing 16,000 test samples. The evaluation results show that open-source models still lag considerably behind GPT-4 in multilingual programming ability, with the vast majority unable to even surpass GPT-3.5. The tests also show that open-source models such as Codestral, DeepSeek-Coder, CodeQwen, and some of their derivatives exhibit excellent multilingual capabilities. The benchmark is an important step toward advancing multilingual code evaluation.\n\n### CRAG\n- https://arxiv.org/pdf/2406.04744\n\nRetrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Models' (LLMs') lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search.
CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.\n\n### BigCodeBench\n- https://github.com/bigcode-project/bigcodebench\n\nBigCodeBench is an easy-to-use benchmark for code generation with practical and challenging programming tasks. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls. To facilitate the evaluation of LLMs on BigCodeBench, we provide the Python package bigcodebench, which includes the dataset, generation scripts, and evaluation scripts. The package is built on top of the EvalPlus framework, which is a flexible and extensible evaluation framework for code generation tasks.\n\n### Prometheus 2\n- https://arxiv.org/abs/2405.01535\n- https://github.com/prometheus-eval/prometheus-eval\n\nProprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs.
However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. At the same time, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pairwise ranking formats together with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs.\n\n### Open LLM Leaderboard\n- https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard\n\nEvaluating open LLMs\n\n### CriticGPT\n- https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/\n- https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf\n\nCriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF\n\n### Test of Time\n- https://arxiv.org/abs/2406.09170\n- https://huggingface.co/datasets/baharef/ToT\n\nLarge language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks.
However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. \n\n### WebCanvas\n- https://arxiv.org/pdf/2406.12373\n- https://imean.ai/web-canvas\n- https://github.com/iMeanAI/WebCanvas\n- https://huggingface.co/datasets/iMeanAI/Mind2Web-Live\n\nExisting benchmarks for web agent tasks are either offline and static, or operate within a fully reproducible environment with limited Internet dynamics. The WebCanvas project aims to pioneer the online evaluation of web agents. Additionally, we offer a suite of toolkits for scaling and maintaining web agent data to support this endeavor. We welcome any constructive feedback on the project and look forward to partnering with you in developing agents for web tasks!\n\n### Lynx\n- https://arxiv.org/abs/2407.08488\n\nRetrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce information that is unsupported or contradictory to the retrieved contexts. We introduce LYNX, a SOTA hallucination detection LLM that is capable of advanced reasoning on challenging real-world hallucination scenarios. To evaluate LYNX, we present HaluBench, a comprehensive hallucination evaluation benchmark, consisting of 15k samples sourced from various real-world domains. 
Our experimental results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and closed and open-source LLM-as-a-judge models on HaluBench. We release LYNX, HaluBench, and our evaluation code for public access.\n\n### ComplexBench\n- https://arxiv.org/abs/2407.03978\n- https://github.com/thu-coai/ComplexBench\n\nInstruction following is one of the fundamental capabilities of large language models (LLMs). As the abilities of LLMs constantly improve, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate LLMs' ability to follow complex instructions has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, an indispensable constituent of complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types.
ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.\n\n### Mr-Ben\n- https://github.com/dvlab-research/Mr-Ben\n- https://randolph-zeng.github.io/Mr-Ben.github.io/\n- https://arxiv.org/abs/2406.13975\n\nLarge language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, it has been increasingly challenging to evaluate the reasoning capability of LLMs. Concretely, existing outcome-based benchmarks begin to saturate and become less sufficient to monitor the progress. To this end, we present a process-based benchmark Mr.Ben that demands a meta reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. Mr.Ben is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering various subjects such as physics, chemistry, logic, coding, and more. By incorporating this approach, Mr.Ben facilitates a multidimensional evaluation of LLM reasoning abilities. We conducted an extensive assessment of open-source and closed-source LLMs using Mr.Ben, which revealed previously unidentified limitations and weaknesses in their meta-reasoning capabilities across different tasks.\n\n### SimpleQA\n- https://openai.com/index/introducing-simpleqa/\n\nIn SimpleQA, we will focus on short, fact-seeking queries, which reduces the scope of the benchmark but makes measuring factuality much more tractable.\n\n### AppBench\n- https://arxiv.org/pdf/2410.19743\n- https://rulegreen.github.io\n- https://github.com/ruleGreen/AppBench\n\nLarge Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. 
Previous research primarily focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationships between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources (e.g., different apps on the iPhone), especially for complex user instructions. In this paper, we introduce AppBench, the first benchmark to evaluate LLMs' ability to plan and execute multiple APIs from various sources in order to complete the user's task. Specifically, we consider two significant challenges in multiple APIs: *1) graph structures:* some APIs can be executed independently while others need to be executed one by one, resulting in graph-like execution order; and *2) permission constraints:* which source is authorized to execute the API call. We report experimental results on 9 distinct LLMs: e.g., GPT-4o achieves only a 2.0% success rate on the most complex instructions, revealing that existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning and finetuning.\n\n### CompassJudger/JudgerBench\n- https://arxiv.org/abs/2410.16256\n- https://huggingface.co/opencompass\n- https://github.com/open-compass/CompassJudger\n- https://huggingface.co/spaces/opencompass/judgerbench_leaderboard\n\nThe CompassJudger-1 series is a family of all-in-one judge models introduced by OpenCompass. These models not only excel in various evaluation methods through scoring and comparison but can also output reviews with assessment details in a specified format, making them suitable for any evaluation dataset. Moreover, they can perform general tasks akin to a typical instruction model, thus serving as a versatile tool with strong generalization and judging capabilities.\n\n### CMCOQA\nCMCOQA: A Chinese Medical Complex Open-Question Answering Benchmark\nIEEE BIBM 2024.
[paper to be published]\n\n### CodevBench\n- https://arxiv.org/abs/2410.01353\n- https://github.com/LingmaTongyi/Codev-Bench\n- https://huggingface.co/datasets/TongyiLingma/CodevBench\n\nCode completion, a key downstream task in code generation, is one of the most frequent and impactful methods for enhancing developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation benchmark that enables meaningful comparisons between products and guides future advancements. However, existing benchmarks focus on coarse-grained tasks that resemble general code generation, lacking the industrial analysis needed to reflect the real-world scenarios developers encounter. Moreover, these benchmarks often rely on costly and time-consuming human annotation, and the standalone test cases fail to leverage minimal tests for maximum repository-level understanding and code coverage. To address these limitations, we first analyze business data from an industrial code completion tool and redefine the evaluation criteria to better align with the developer's intent and desired completion behavior throughout the coding process. Based on these insights, we introduce Codev-Agent, an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage, ensuring fair and effective comparisons. Using Codev-Agent, we present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Bench assesses whether a code completion tool can capture a developer's immediate intent and suggest appropriate code across diverse contexts, providing a more realistic benchmark for code completion in modern software development.\n\n### FrontierMath\n- https://epochai.org/frontiermath\n\nWe introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.\n\n### GIFT-Eval\n- https://arxiv.org/abs/2410.10393\n- https://github.com/SalesforceAIResearch/gift-eval\n- https://www.salesforce.com/blog/gift-eval-time-series-benchmark/\n\nTime series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training. However, the advancement of these models has been hindered by the lack of comprehensive benchmarks. To address this gap, we introduce the General Time Series Forecasting Model Evaluation, GIFT-Eval, a pioneering benchmark aimed at promoting evaluation across diverse datasets. 
GIFT-Eval encompasses 23 datasets over 144,000 time series and 177 million data points, spanning seven domains, 10 frequencies, multivariate inputs, and prediction lengths ranging from short to long-term forecasts. To facilitate the effective pretraining and evaluation of foundation models, we also provide a non-leaking pretraining dataset containing approximately 230 billion data points. Additionally, we provide a comprehensive analysis of 17 baselines, which includes statistical models, deep learning models, and foundation models. We discuss each model in the context of various benchmark characteristics and offer a qualitative analysis that spans both deep learning and foundation models. We believe the insights from this analysis, along with access to this new standard zero-shot time series forecasting benchmark, will guide future developments in time series foundation models. \n\n### LightEval\n- https://github.com/huggingface/lighteval\n\nLighteval is your all-in-one toolkit for evaluating LLMs across multiple backends—whether it's transformers, tgi, vllm, or nanotron—with ease. 
Dive deep into your model’s performance by saving and exploring detailed, sample-by-sample results to debug and see how your models stack-up.\n\nCustomization at your fingertips: letting you either browse all our existing tasks and metrics or effortlessly create your own, tailored to your needs.\n\nSeamlessly experiment, benchmark, and store your results on the Hugging Face Hub, S3, or locally.\n\n### RMB-Reward-Model-Benchmark\n- https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark\n- https://arxiv.org/abs/2410.09893\n\nRMB is a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization.\n\n### Chinese SimpleQA\n- https://arxiv.org/abs/2411.07140\n- https://openstellarteam.github.io/ChineseSimpleQA\n- https://huggingface.co/datasets/OpenStellarTeam/Chinese-SimpleQA\n- https://github.com/OpenStellarTeam/ChineseSimpleQA\n\nChinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions, and Chinese SimpleQA mainly has five properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, our benchmark covers 6 major topics with 99 diverse subtopics.\n\n### Evalchemy\n- https://github.com/mlfoundations/evalchemy\n\nA framework for gold standard language model evaluations\n\nEvalchemy is a unified and easy-to-use toolkit for evaluating language models, focussing on post-trained models. Evalchemy is developed by the DataComp community and Bespoke Labs and builds on the LM-Eval-Harness to provide a unified, easy-to-use platform for language model evaluation. 
Evalchemy integrates multiple existing benchmarks, such as RepoBench, AlpacaEval, and ZeroEval.\n\n### WebWalker\n- https://github.com/Alibaba-nlp/WebWalker\n- https://alibaba-nlp.github.io/WebWalker\n- https://arxiv.org/pdf/2501.07572\n\nRetrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker, through horizontal and vertical integration in real-world scenarios.\n\n### Getting a Judge-LLM\n- https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/getting-a-judge-llm.md\n\nWhen choosing a judge LLM, you can go for generalist high-capability models, use small specialist models trained specifically to discriminate based on preference data, or train your own.\n\n### PRMBench\n- https://arxiv.org/abs/2501.03124\n- https://prmbench.github.io/\n- https://github.com/ssmisya/PRMBench\n\nProcess-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios.
However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust benchmark for advancing research on PRM evaluation and development.\n\n### OmniDocBench\n- https://arxiv.org/abs/2412.07626\n- https://github.com/opendatalab/OmniDocBench\n\nDocument content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, including academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types.
Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies.\n\n### CodeArena\n- https://arxiv.org/abs/2412.05210v1\n- https://github.com/QwenLM/Qwen2.5-Coder/tree/main/qwencoder-eval/instruct/CodeArena\n- https://codearenaeval.github.io/leaderboard.html\n\nCurrent code LLMs focus on synthesizing correct code snippets, ignoring alignment with human preferences, where queries should be sampled from practical application scenarios and model-generated responses should satisfy human preferences. To bridge the gap between model-generated responses and human preferences, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories, carefully curated from user queries. Further, we propose SynCode-Instruct, a diverse synthetic instruction corpus (nearly 10B tokens) built by scaling instructions from the web. The results reveal performance differences between execution-based code benchmarks and CodeArena. Our systematic experiments with CodeArena on 20+ LLMs reveal a notable performance gap between open code LLMs (e.g., Qwen-Coder) and closed-source LLMs (e.g., the o1 and Claude series), underscoring the importance of alignment with human preferences.\n\n### HALoGEN\n- https://arxiv.org/abs/2501.08292\n\nDespite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context.
However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.\n\n### Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding\n- https://github.com/hemingkx/SpeculativeDecodingPapers\n- https://arxiv.org/abs/2401.07851\n\nTo mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first efficiently drafts several future tokens and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. 
We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, including current leading techniques, the challenges faced, and potential future directions in this field. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.\n\n### QAnything\n- https://github.com/netease-youdao/QAnything\n\nQAnything(Question and Answer based on Anything) is a local knowledge base question-answering system designed to support a wide range of file formats and databases, allowing for offline installation and use.\n\nWith QAnything, you can simply drop any locally stored file of any format and receive accurate, fast, and reliable answers.\n\nCurrently supported formats include: PDF, Word (doc/docx), PPT, Markdown, Eml, TXT, Images (jpg, png, etc.), Web links and more formats coming soon…\n\n### Meta-Prompting\n- https://arxiv.org/abs/2401.12954\n- https://github.com/suzgunmirac/meta-prompting\n\nWe introduce meta-prompting, an effective scaffolding technique designed to enhance the functionality of language models (LMs). This approach transforms a single LM into a multi-faceted conductor, adept at managing and integrating multiple independent LM queries. By employing high-level instructions, meta-prompting guides the LM to break down complex tasks into smaller, more manageable subtasks. These subtasks are then handled by distinct \"expert\" instances of the same LM, each operating under specific, tailored instructions. Central to this process is the LM itself, in its role as the conductor, which ensures seamless communication and effective integration of the outputs from these expert models. It additionally employs its inherent critical thinking and robust verification processes to refine and authenticate the end result. 
This collaborative prompting approach empowers a single LM to simultaneously act as a comprehensive orchestrator and a panel of diverse experts, significantly enhancing its performance across a wide array of tasks. The zero-shot, task-agnostic nature of meta-prompting greatly simplifies user interaction by obviating the need for detailed, task-specific instructions. Furthermore, our research demonstrates the seamless integration of external tools, such as a Python interpreter, into the meta-prompting framework, thereby broadening its applicability and utility. Through rigorous experimentation with GPT-4, we establish the superiority of meta-prompting over conventional scaffolding methods: When averaged across all tasks, including the Game of 24, Checkmate-in-One, and Python Programming Puzzles, meta-prompting, augmented with a Python interpreter functionality, surpasses standard prompting by 17.1%, expert (dynamic) prompting by 17.3%, and multipersona prompting by 15.2%.\n\n### Lepton Search\n- https://github.com/leptonai/search_with_lepton\n\nBuild your own conversational search engine using less than 500 lines of code.\n\n### RLMRec\n- https://github.com/HKUDS/RLMRec\n- https://arxiv.org/abs/2310.15950\n\nRecommender systems have seen significant advancements with the influence of deep learning and graph neural networks, particularly in capturing complex user-item relationships. However, these graph-based recommenders heavily depend on ID-based data, potentially disregarding valuable textual information associated with users and items, resulting in less informative learned representations. Moreover, the utilization of implicit feedback data introduces potential noise and bias, posing challenges for the effectiveness of user preference learning. 
While the integration of large language models (LLMs) into traditional ID-based recommenders has gained attention, challenges such as scalability issues, limitations in text-only reliance, and prompt input constraints need to be addressed for effective implementation in practical recommender systems. To address these challenges, we propose RLMRec, a model-agnostic framework that aims to enhance existing recommenders with LLM-empowered representation learning. It proposes a recommendation paradigm that integrates representation learning with LLMs to capture intricate semantic aspects of user behaviors and preferences. RLMRec incorporates auxiliary textual signals, develops a user/item profiling paradigm empowered by LLMs, and aligns the semantic space of LLMs with the representation space of collaborative relational signals through a cross-view alignment framework. This work further establishes a theoretical foundation demonstrating that incorporating textual signals through mutual information maximization enhances the quality of representations. In our evaluation, we integrate RLMRec with state-of-the-art recommender models, while also analyzing its efficiency and robustness to noisy data.\n\n### Open-Source AI Cookbook\n- https://huggingface.co/learn/cookbook/\n\nThe Open-Source AI Cookbook is a collection of notebooks illustrating practical aspects of building AI applications and solving various machine learning tasks using open-source tools and models.\n\n### MaLA-500\n- https://huggingface.co/MaLA-LM/mala-500\n- https://arxiv.org/abs/2401.13303\n\nLarge language models have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages.
To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our experiments on SIB-200 show that MaLA-500 achieves state-of-the-art in-context learning results.\n\n### NVIDIA Chat with RTX\n- https://www.nvidia.com/en-us/ai-on-rtx/chat-with-rtx-generative-ai/\n\nChat With RTX is a demo app that lets you personalize a GPT large language model (LLM) connected to your own content—docs, notes, videos, or other data. Leveraging retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration, you can query a custom chatbot to quickly get contextually relevant answers. And because it all runs locally on your Windows RTX PC or workstation, you’ll get fast and secure results.\n\n### RAG vs Fine-tuning\n- https://arxiv.org/pdf/2401.08406.pdf\n\nThere are two common ways in which developers are incorporating proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and Fine-Tuning. RAG augments the prompt with the external data, while fine-tuning incorporates the additional knowledge into the model itself. However, the pros and cons of both approaches are not well understood. In this paper, we propose a pipeline for fine-tuning and RAG, and present the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. Our pipeline consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. We propose metrics to assess the performance of different stages of the RAG and fine-tuning pipeline. We conduct an in-depth study on an agricultural dataset. Agriculture as an industry has not seen much penetration of AI, and we study a potentially disruptive application - what if we could provide location-specific insights to a farmer? 
Our results show the effectiveness of our dataset generation pipeline in capturing geographic-specific knowledge, and the quantitative and qualitative benefits of RAG and fine-tuning. We see an accuracy increase of over 6 p.p. when fine-tuning the model, and this is cumulative with RAG, which increases accuracy by 5 p.p. further. In one particular experiment, we also demonstrate that the fine-tuned model leverages information from across geographies to answer specific questions, increasing answer similarity from 47% to 72%. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.\n\n### Chain of Abstraction\n- https://arxiv.org/pdf/2401.17464.pdf\n\nTo achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning in real-world knowledge (e.g., web facts, math and physical rules). Tools help LLMs access this external knowledge, but there remain challenges in fine-tuning LLM agents (e.g., Toolformer) to invoke tools in multi-step reasoning problems, where inter-connected tool calls require holistic and efficient tool usage planning.\nIn this work, we propose a new method for LLMs to better leverage tools in multi-step reasoning. Our method, Chain-of-Abstraction (CoA), trains LLMs to first decode reasoning chains with abstract placeholders, and then call domain tools to reify each reasoning chain by filling in specific knowledge. This planning with abstract chains enables LLMs to learn more general reasoning strategies, which are robust to shifts of domain knowledge (e.g., math results) relevant to different reasoning questions. It also allows LLMs to perform decoding and calling of external tools in parallel, which avoids the inference delay caused by waiting for tool responses. 
In mathematical reasoning and Wiki QA domains, we show that our method consistently outperforms previous chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with an average ~6% absolute QA accuracy improvement. LLM agents trained with our method also show more efficient tool use, with inference speed being on average ~1.4x faster than baseline tool-augmented LLMs.\n\n### 序列猴子 (Sequence Monkey) Open-Source Dataset\n- https://github.com/mobvoi/seq-monkey-data\n\nSequence Monkey (序列猴子) is an ultra-large-scale language model from Mobvoi (出门问问). Built on general-purpose representation and reasoning capabilities, it supports multi-turn interaction and can greatly improve productivity and data-processing capacity, and it is widely used in question answering, natural language processing, machine translation, text summarization, and other areas.\n\nThe Sequence Monkey dataset is the collection of data used to train the Sequence Monkey model; a portion of it has now been opened to the public.\n\n### Transformer Debugger\n- https://github.com/openai/transformer-debugger\n\nTransformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders.\n\nTDB enables rapid exploration before needing to write code, with the ability to intervene in the forward pass and see how it affects a particular behavior. It can be used to answer questions like, \"Why does the model output token A instead of token B for this prompt?\" or \"Why does attention head H attend to token T for this prompt?\" It does so by identifying specific components (neurons, attention heads, autoencoder latents) that contribute to the behavior, showing automatically generated explanations of what causes those components to activate most strongly, and tracing connections between components to help discover circuits.\n\n### RecAI\n- https://arxiv.org/abs/2403.06465\n- https://github.com/microsoft/RecAI\n\nThis paper introduces RecAI, a practical toolkit designed to augment or even revolutionize recommender systems with the advanced capabilities of Large Language Models (LLMs). 
RecAI provides a suite of tools, including Recommender AI Agent, Recommendation-oriented Language Models, Knowledge Plugin, RecExplainer, and Evaluator, to facilitate the integration of LLMs into recommender systems from multifaceted perspectives. The new generation of recommender systems, empowered by LLMs, is expected to be more versatile, explainable, conversational, and controllable, paving the way for more intelligent and user-centric recommendation experiences. We hope that open-sourcing RecAI can help accelerate the evolution of more advanced recommender systems.\n\n### synthetic-data-save-costs\n- https://huggingface.co/blog/synthetic-data-save-costs\n- https://github.com/MoritzLaurer/synthetic-data-blog/tree/main\n\nIn a case study on identifying investor sentiment in the news, we show how to use an open-source LLM to create synthetic data to train your customized model in a few steps. Our resulting custom RoBERTa model can analyze a large news corpus for around $2.7 compared to $3061 with GPT4; emits around 0.12 kg CO2 compared to very roughly 735 to 1100 kg CO2 with GPT4; with a latency of 0.13 seconds compared to often multiple seconds with GPT4; while performing on par with GPT4 at identifying investor sentiment (both 94% accuracy and 0.94 F1 macro).\n\n### Data is Better Together\n- https://github.com/huggingface/data-is-better-together\n\nData is Better Together is a collab between 🤗 Hugging Face, 🏓 Argilla, and the Open Source ML community. Our goal is to empower the open source community to collectively build impactful datasets.\n\n### Large Language Models in Finance\n- https://github.com/adlnlp/finllms\n- https://arxiv.org/abs/2402.02315\n\nA curated list of resources on LLMs in Finance (FinLLMs), including their history, techniques, evaluation, and opportunities and challenges. It's based on our survey paper: A Survey of Large Language Models in Finance (FinLLMs). 
This survey will be actively updated, including further evaluation of advanced financial NLP tasks, a collection of financial datasets, and FinLLM use cases. Please stay tuned!🔥\n\n### WanJuan-CC\n- https://opendatalab.com/OpenDataLab/WanJuanCC\n- https://arxiv.org/abs/2402.19282\n\nThis paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T tokens of safe data and selected 1.0T tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that WanJuan-CC performs better on validation datasets and downstream tasks.\n\n### Actions Speak Louder than Words\n- https://arxiv.org/abs/2402.17152\n\nLarge-scale recommendation systems are characterized by their reliance on high cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on huge volumes of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute.\nInspired by the success achieved by Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems. 
We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework (\"Generative Recommenders\"), and propose a new architecture, HSTU, designed for high cardinality, non-stationary streaming recommendation data.\nHSTU outperforms baselines over synthetic and public datasets by up to 65.8% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192 length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces the carbon footprint needed for future model development, and further paves the way for the first foundational models in recommendations.\n\n### LLM-UM-Reading\n- https://github.com/TamSiuhin/LLM-UM-Reading\n\nThis repository contains a list of papers on large language models for user modeling (LLM-UM) based on our survey paper: User Modeling in the Era of Large Language Models: Current Research and Future Directions (Zhaoxuan Tan and Meng Jiang). 
We categorize existing works based on their approaches and applications.\n\n### so-large-lm\n- https://github.com/datawhalechina/so-large-lm\n\nThis project aims to be a tutorial on large-scale pretrained language models, providing open knowledge that spans data preparation, model construction, training strategies, and model evaluation and improvement, as well as the models' implications for safety, privacy, the environment, and law and ethics.\n\nThe project builds on Stanford's large language models course and Hung-yi Lee's (李宏毅) generative AI course, supplemented and refined by open-source contributors and kept current with frontier LLM developments, to give readers reasonably comprehensive and in-depth theory together with practical methods. Through systematic explanations of model construction, training, evaluation, and improvement, along with hands-on code, we hope to build a project of broad reference value.\n\nOur team members will divide responsibility for organizing and writing the chapters, with an initial version expected within three months. We will then keep updating and refining the content based on community contributions and feedback, to ensure the project's continued development and the timeliness of its knowledge. We hope this project contributes a valuable resource to the field of large language model research and helps drive the rapid development and broad adoption of related technologies.\n\n### Fine-tune Llama 3 with ORPO\n- https://huggingface.co/blog/mlabonne/orpo-llama-3\n\nORPO is an exciting new fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process. This reduces the computational resources and time required for training. Moreover, empirical results demonstrate that ORPO outperforms other alignment methods on various model sizes and benchmarks.\n\nIn this article, we will fine-tune the new Llama 3 8B model using ORPO with the TRL library. 
The code is available on Google Colab and in the LLM Course on GitHub.\n\n### COIG-CQIA\n- https://arxiv.org/pdf/2403.18058\n- https://huggingface.co/datasets/m-a-p/COIG-CQIA\n\nCOIG-CQIA, short for Chinese Open Instruction Generalist - Quality is All You Need, is an open-source, high-quality instruction-tuning dataset that aims to provide the Chinese NLP community with instruction data that is high quality and consistent with real human interaction. COIG-CQIA takes question-answer pairs and articles collected from the Chinese internet as raw data, which are then thoroughly cleaned, restructured, and human-reviewed. Inspired by work such as LIMA: Less Is More for Alignment, which shows that a small amount of high-quality data is enough for a large language model to learn human interaction behavior, we paid particular attention to the source, quality, and diversity of the data during construction; see the dataset introduction and our forthcoming paper for details.\n\n### tiny-universe\n- https://github.com/datawhalechina/tiny-universe\n\nThis project is a hands-on, \"build it yourself\" guide to LLMs that starts from first principles, takes a white-box approach, and covers the full LLM pipeline. It aims to help readers with a traditional deep-learning background build a clear, working LLM system entirely from scratch, including the model itself, a RAG framework, an agent system, and an LLM evaluation suite. Starting from the fundamentals, it dissects each technique and pairs it with a complete code implementation, using detailed explanations and code comments to help readers independently reproduce the core components of an LLM and, through that reproduction, gain a deep understanding and command of how LLMs work.\n\nThe project aims to give learners a clear, usable, and reproducible LLM world, helping every interested learner hand-build their own Tiny LLM Universe.\n\n### llmc\n- https://github.com/ModelTC/llmc\n\nllmc is an off-the-shelf tool designed for compressing LLMs, leveraging state-of-the-art compression algorithms to enhance efficiency and reduce model size without compromising performance.\n\n### LLMBox\n- https://arxiv.org/abs/2303.18223\n- https://llmbook-zh.github.io\n\nLLMBox is a comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation. LLMBox is designed to be a one-stop solution for training and utilizing LLMs. Through a practical library design, we achieve a high level of flexibility and efficiency in both training and utilization stages.\n\n### MarkLLM\n- https://arxiv.org/abs/2405.10051\n- https://github.com/THU-BPM/MarkLLM\n\nMarkLLM is an open-source toolkit developed to facilitate the research and application of watermarking technologies within large language models (LLMs). As the use of large language models (LLMs) expands, ensuring the authenticity and origin of machine-generated text becomes critical. 
MarkLLM simplifies the access, understanding, and assessment of watermarking technologies, making it accessible to both researchers and the broader community.\n\n### MobileCPM\n- https://github.com/OpenBMB/MobileCPM\n\nMobileCPM is the first open-source toolset for on-device large models, designed to help individual or enterprise developers seamlessly integrate on-device large models into their APP products. In the demo APP shown below, MiniCPM provides on-device model capabilities and comes with several example agents such as a translator, poet, storyteller, and motivational coach to cater to various use cases. Moreover, the types of on-device models and agents can be flexibly expanded. Developers can customize agents to meet business needs and scenarios by adding or modifying prompts and replacing on-device models. The image below demonstrates the conversation effect of the built-in \"motivational coach\" agent. Please note that in this example, the network connection has been disconnected, and the on-device model is used directly.\n\n### LLM-Select\n- https://arxiv.org/abs/2407.02694\n\nIn this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., \"blood pressure\") in predicting an outcome of interest (e.g., \"heart failure\"), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. 
We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could potentially benefit practitioners in domains like healthcare, where collecting high-quality data comes at a high cost.\n\n### Larimar\n- https://arxiv.org/abs/2403.11901\n\nEfficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar - a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar not only attains accuracy comparable to the most competitive baselines, even in the challenging sequential editing setup, but also excels in speed - yielding speed-ups of 4-10x depending on the base LLM - and in flexibility, since the proposed architecture is simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting and input context length generalization with Larimar and show their effectiveness.\n\n### PPM\n- https://arxiv.org/pdf/2403.10049.pdf\n\nClick-through rate (CTR) prediction is a core task in recommender systems. Existing methods (IDRec for short), which have prevailed for decades, rely on unique identities to represent distinct users and items. 
On one hand, IDRec often faces significant performance degradation on the cold-start problem; on the other hand, IDRec cannot use longer training data due to constraints imposed by iteration efficiency. Most prior studies alleviate the above problems by introducing pre-trained knowledge (e.g., a pre-trained user model or multi-modal embeddings). However, the explosive growth of online latency can be attributed to the huge number of parameters in the pre-trained model. Therefore, most of them cannot employ the unified model of end-to-end training with IDRec in industrial recommender systems, thus limiting the potential of the pre-trained model. To this end, we propose a Pre-trained Plug-in CTR Model, namely PPM. PPM employs multi-modal features as input and utilizes large-scale data for pre-training. Then, PPM is plugged into the IDRec model to enhance the unified model's performance and iteration efficiency. Upon incorporating the IDRec model, certain intermediate results within the network are cached, with only a subset of the parameters participating in training and serving. Hence, our approach can successfully deploy an end-to-end model without causing huge latency increases. Comprehensive offline experiments and online A/B testing at JD E-commerce demonstrate the efficiency and effectiveness of PPM.\n\n### LLaRA\n- https://arxiv.org/abs/2312.02445\n- https://github.com/ljy0ustc/LLaRA\n\nSequential recommendation aims to predict users' next interaction with items based on their past engagement sequence. Recently, the advent of Large Language Models (LLMs) has sparked interest in leveraging them for sequential recommendation, viewing it as language modeling. Previous studies represent items within LLMs' input prompts as either ID indices or textual metadata. However, these approaches often fail to either encapsulate comprehensive world knowledge or exhibit sufficient behavioral understanding. 
To combine the complementary strengths of conventional recommenders in capturing behavioral patterns of users and LLMs in encoding world knowledge about items, we introduce Large Language-Recommendation Assistant (LLaRA). Specifically, it uses a novel hybrid prompting method that integrates ID-based item embeddings learned by traditional recommendation models with textual item features. Treating the \"sequential behaviors of users\" as a distinct modality beyond texts, we employ a projector to align the traditional recommender's ID embeddings with the LLM's input space. Moreover, rather than directly exposing the hybrid prompt to LLMs, a curriculum learning strategy is adopted to gradually ramp up training complexity. Initially, we warm up the LLM using text-only prompts, which better suit its inherent language modeling ability. Subsequently, we progressively transition to the hybrid prompts, training the model to seamlessly incorporate the behavioral knowledge from the traditional sequential recommender into the LLM. Empirical results validate the effectiveness of our proposed framework.\n\n### Awesome Information Retrieval in the Age of Large Language Model\n- https://github.com/IR-LLM/Awesome-Information-Retrieval-in-the-Age-of-Large-Language-Model\n\nA curated list of awesome papers about information retrieval (IR) in the age of large language models (LLMs). These include retrieval-augmented large language models, large language models for information retrieval, and so on. \n\n### LLMs heart MIR\n- https://github.com/llms-heart-mir/tutorial\n\nA tutorial on Large Language Models for Music Information Retrieval.\n\n### When to Retrieve\n- https://arxiv.org/abs/2404.19705\n\nIn this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. 
Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM's parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token, <RET>, when they do not know the answer to a question. Our evaluation of the Adaptive Retrieval LLM (Adapt-LLM) on the PopQA dataset showcases improvements over the same LLM under three configurations: (i) retrieving information for all the questions, (ii) always using the parametric memory of the LLM, and (iii) using a popularity threshold to decide when to use a retriever. Through our analysis, we demonstrate that Adapt-LLM is able to generate the <RET> token when it determines that it does not know how to answer a question, indicating the need for IR, while it achieves notably high accuracy levels when it chooses to rely only on its parametric memory.\n\n### Lite-LLM4Rec\n- https://arxiv.org/pdf/2402.09543\n\nRecently, sequential recommendation has been adapted to the LLM paradigm to enjoy the power of LLMs. LLM-based methods usually formulate recommendation information into natural language, and the model is trained to predict the next item in an auto-regressive manner. Despite their notable success, the substantial computational overhead of inference poses a significant obstacle to their real-world applicability. In this work, we endeavor to streamline existing LLM-based recommendation models and propose a simple yet highly effective model, Lite-LLM4Rec. The primary goal of Lite-LLM4Rec is to achieve efficient inference for the sequential recommendation task. 
Lite-LLM4Rec circumvents beam search decoding by using a direct item projection head for ranking score generation. This design stems from our empirical observation that beam search decoding is ultimately unnecessary for sequential recommendations. Additionally, Lite-LLM4Rec introduces a hierarchical LLM structure tailored to efficiently handle the extensive contextual information associated with items, thereby reducing computational overhead while enjoying the capabilities of LLMs. Experiments on three publicly available datasets corroborate the effectiveness of Lite-LLM4Rec in both performance and inference efficiency (notably 46.8% performance improvement and 97.28% efficiency improvement on ML-1m) over existing LLM-based methods. Our implementations will be open-sourced.\n\n### A Comprehensive Survey on Self-Supervised Learning for Recommendation\n- https://arxiv.org/abs/2404.03354\n- https://github.com/HKUDS/Awesome-SSLRec-Papers\n- https://github.com/HKUDS/SSLRec\n\nA collection of papers and resources about self-supervised learning (SSL) for recommendation (Rec).\n\nRecommender systems personalize suggestions to combat information overload. Deep learning methods like RNNs, GNNs, and Transformers have improved these systems by understanding user behavior better. However, supervised learning struggles with data sparsity. Self-supervised learning (SSL) overcomes this by using inherent data structures for supervision, reducing dependence on labeled data. SSL-based recommender systems accurately predict and recommend, even with sparse data, by leveraging unlabeled data for meaningful representations.\n\n### NoteLLM\n- https://arxiv.org/abs/2403.01744v2\n\nPeople enjoy sharing \"notes\" including their experiences within online communities. Therefore, recommending notes aligned with user interests has become a crucial task. Existing online methods only input notes into BERT-based models to generate note embeddings for assessing similarity. 
However, they may underutilize some important cues, e.g., hashtags or categories, which represent the key concepts of notes. Indeed, learning to generate hashtags/categories can potentially enhance note embeddings, both of which compress key note information into limited content. Besides, Large Language Models (LLMs) have significantly outperformed BERT in understanding natural languages. It is promising to introduce LLMs into note recommendation. In this paper, we propose a novel unified framework called NoteLLM, which leverages LLMs to address the item-to-item (I2I) note recommendation task. Specifically, we utilize a Note Compression Prompt to compress a note into a single special token, and further learn the potentially related notes' embeddings via a contrastive learning approach. Moreover, we use NoteLLM to summarize the note and generate the hashtag/category automatically through instruction tuning. Extensive validations on real scenarios demonstrate the effectiveness of our proposed method compared with the online baseline and show major improvements in the recommendation system of Xiaohongshu.\n\n### LEARN\n- https://arxiv.org/abs/2405.03988\n\nContemporary recommender systems predominantly rely on collaborative filtering techniques, employing ID-embedding to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance in cold-start scenarios and long-tail user recommendations. Leveraging the capabilities of Large Language Models (LLMs) pretrained on a massive text corpus presents a promising avenue for enhancing recommender systems by integrating open-world domain knowledge. In this paper, we propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge. 
We address computational complexity concerns by utilizing pretrained LLMs as item encoders and freezing LLM parameters to avoid catastrophic forgetting and preserve open-world knowledge. To bridge the gap between the open-world and collaborative domains, we design a twin-tower structure supervised by the recommendation task and tailored for practical industrial application. Through offline experiments on a large-scale industrial dataset and online A/B tests, we demonstrate the efficacy of our approach.\n\n### YAYI-UIE\n- https://github.com/wenge-research/YAYI-UIE\n\nThe YAYI Unified Information Extraction model (YAYI-UIE) is instruction-tuned on millions of manually constructed, high-quality information-extraction examples. It jointly trains information-extraction tasks, including named entity recognition (NER), relation extraction (RE), and event extraction (EE), enabling structured extraction in scenarios such as general text, security, finance, biology, medicine, business, personal data, vehicles, movies, industry, restaurants, and science.\n\n### XRec\n- https://github.com/HKUDS/XRec\n- https://arxiv.org/pdf/2406.02377\n- https://sites.google.com/view/chaoh\n\nThis paper presents a model-agnostic framework, XRec, that integrates the graph-based collaborative filtering framework with Large Language Models (LLMs) to generate comprehensive explanations for recommendations. By leveraging the inherent collaborative user-item relationships and harnessing the powerful textual generation capabilities of LLMs, XRec establishes a strong connection between collaborative signals and language semantics through the utilization of a Mixture of Experts (MoE) adapter.\n\n### Wukong\n- https://arxiv.org/abs/2403.02545\n\nScaling laws play an instrumental role in the sustainable improvement in model quality. Unfortunately, recommendation models to date do not exhibit scaling laws similar to those observed in the domain of large language models, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly more complex real-world datasets. 
In this paper, we propose an effective network architecture based purely on stacked factorization machines, and a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's unique design makes it possible to capture diverse, any-order interactions simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models quality-wise. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its superiority in quality over state-of-the-art models, while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 GFLOP/example, where prior art falls short.\n\n### Leveraging LLM Reasoning Enhances Personalized Recommender Systems\n- https://arxiv.org/abs/2408.00802\n\nRecent advancements have showcased the potential of Large Language Models (LLMs) in executing reasoning tasks, particularly facilitated by Chain-of-Thought (CoT) prompting. While tasks like arithmetic reasoning involve clear, definitive answers and logical chains of thought, the application of LLM reasoning in recommendation systems (RecSys) presents a distinct challenge. RecSys tasks revolve around subjectivity and personalized preferences, an under-explored domain in utilizing LLMs' reasoning capabilities. Our study explores several aspects to better understand reasoning for RecSys and demonstrates how task quality improves by utilizing LLM reasoning in both zero-shot and finetuning settings. Additionally, we propose RecSAVER (Recommender Systems Automatic Verification and Evaluation of Reasoning) to automatically assess the quality of LLM reasoning responses without the requirement of curated gold references or human raters. 
We show that our framework aligns with real human judgment on the coherence and faithfulness of reasoning responses. Overall, our work shows that incorporating reasoning into RecSys can improve personalized tasks, paving the way for further advancements in recommender system methodologies.\n\n### Transformers in music recommendation\n- https://research.google/blog/transformers-in-music-recommendation/\n\nWe present a music recommendation ranking system that uses Transformer models to better understand the sequential nature of user actions based on the current user context.\n\n### Financial Datasets\n- https://github.com/virattt/financial-datasets\n\nFinancial Datasets is an open-source Python library that allows developers to create synthetic financial datasets using Large Language Models (LLMs). With this library, you can generate realistic financial datasets based on SEC filings such as 10-Ks, 10-Qs, and other financial reports.\n\n### MicroLlama-300M\n- https://github.com/keeeeenw/MicroLlama\n\nAs an individual with limited access and compute, I have been wondering if I could build a decent large-language model for a while. As the big mega corporations are focused on getting bigger and bigger models, I am going small!\n\n### Mistral 7B v0.2 JAX\n- https://github.com/yixiaoer/mistral-v0.2-jax\n\nThis project is the JAX implementation of Mistral 7B v0.2 Base, advancing the work of my earlier repository mistral 7B JAX.\n\nIt is supported by Cloud TPUs from Google's TPU Research Cloud (TRC).\n\n### gemma-1.1-7b-it\n- https://huggingface.co/google/gemma-1.1-7b-it\n\nThis is Gemma 1.1 7B (IT), an update over the original instruction-tuned Gemma release.\n\n### h2o-danube2-1.8b-chat\n- https://huggingface.co/h2oai/h2o-danube2-1.8b-chat\n\nh2o-danube2-1.8b-chat is a chat fine-tuned model by H2O.ai with 1.8 billion parameters. 
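Several entries above (synthetic-data-save-costs, Financial Datasets) center on the same recipe: prompt an LLM over source documents and collect structured training pairs. A minimal, library-agnostic sketch of that loop follows; `ask_llm` is a hypothetical stand-in for whatever model client you use, stubbed here so the sketch runs end to end:

```python
import json

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client.
    Stubbed with a canned JSON response so the sketch is self-contained."""
    return json.dumps([{"question": "What was revenue?", "answer": "See the filing."}])

def synthesize_qa(chunks):
    """Turn document chunks into (question, answer) training pairs."""
    pairs = []
    for chunk in chunks:
        prompt = ("From the following excerpt, write question/answer pairs "
                  "as a JSON list of {question, answer} objects:\n" + chunk)
        try:
            # Parse the model's JSON output and accumulate the pairs.
            pairs.extend(json.loads(ask_llm(prompt)))
        except json.JSONDecodeError:
            continue  # skip malformed generations rather than crash
    return pairs
```

The try/except around parsing matters in practice: a small fraction of generations will not be valid JSON, and dropping them is usually cheaper than retrying.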
\n\n### WizardLM-2\n- https://huggingface.co/posts/WizardLM/329547800484476\n- https://wizardlm.github.io/WizardLM2\n- https://huggingface.co/collections/microsoft/wizardlm-661d403f71e6c8257dbd598a\n\nWe introduce and open-source WizardLM-2, our next generation state-of-the-art large language models, which have improved performance on complex chat, multilingual, reasoning, and agent tasks. The new family includes three cutting-edge models: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B.\n\n### RecurrentGemma\n- https://developers.googleblog.com/2024/04/gemma-family-expands.html\n- https://arxiv.org/abs/2402.19427\n- https://github.com/google-deepmind/recurrentgemma/blob/main/colabs/fine_tuning_tutorial_jax.ipynb\n\nRecurrentGemma is a technically distinct model that leverages recurrent neural networks and local attention to improve memory efficiency. \n\n### Nemotron-4 340B\n- https://research.nvidia.com/publication/2024-06_nemotron-4-340b\n\nWe release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows the distribution, modification, and use of the models and their outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. 
To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.\n\n### Gemma-2\n- https://www.kaggle.com/models/google/gemma-2\n\nGemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.\n\n### Gemini Nano\n- https://deepmind.google/technologies/gemini/nano/\n- https://huggingface.co/wave-on-discord/gemini-nano\n- https://huggingface.co/wave-on-discord/gemini-nano-adapter/tree/main\n\nOur most efficient model for on-device tasks\n\n### TTT\n- https://arxiv.org/abs/2407.04620\n- https://github.com/test-time-training/ttt-lm-jax\n- https://github.com/test-time-training/ttt-lm-pytorch\n\nSelf-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. 
TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.\n\n### Arcee-Spark\n- https://huggingface.co/arcee-ai/Arcee-Spark\n\nArcee Spark is a powerful 7B parameter language model that punches well above its weight class. Initialized from Qwen2, this model underwent a sophisticated training process.\n\n### Mistral NeMo\n- https://huggingface.co/mistralai/Mistral-Nemo-Base-2407\n- https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407\n- https://mistral.ai/news/mistral-nemo/\n\nMistral NeMo: our new best small model. A state-of-the-art 12B model with 128k context length, built in collaboration with NVIDIA, and released under the Apache 2.0 license.\n\n### Llama 3.1 405B\n- https://ai.meta.com/blog/meta-llama-3-1/\n\nLlama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. With the release of the 405B model, we’re poised to supercharge innovation—with unprecedented opportunities for growth and exploration. 
We believe the latest generation of Llama will ignite new applications and modeling paradigms, including synthetic data generation to enable the improvement and training of smaller models, as well as model distillation—a capability that has never been achieved at this scale in open source.\n\n### Mistral Large 2\n- https://huggingface.co/mistralai/Mistral-Large-Instruct-2407\n- https://mistral.ai/news/mistral-large-2407/\n\nMistral Large 2 has a 128k context window and supports dozens of languages including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean, along with 80+ coding languages including Python, Java, C, C++, JavaScript, and Bash.\n\n### SmolLM\n- https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966\n\nA series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos.\n\n### DCLM-7B\n- https://huggingface.co/apple/DCLM-7B\n- https://github.com/mlfoundations/dclm\n- https://arxiv.org/abs/2406.11794\n\nDCLM-Baseline-7B is a 7 billion parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. This model is designed to showcase the effectiveness of systematic data curation techniques for improving language model performance.\n\n### Minitron\n- https://github.com/NVlabs/Minitron\n\nMinitron is a family of small language models (SLMs) obtained by pruning NVIDIA's Nemotron-4 15B model. 
We prune model embedding size, attention heads, and MLP intermediate dimension, following which, we perform continued training with distillation to arrive at the final models.\n\n### Gemma 2 2B/ShieldGemma/Gemma Scope\n- https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/\n- https://huggingface.co/collections/google/gemma-2-2b-release-66a20f3796a2ff2a7c76f98f\n\nIn June, we released Gemma 2, our new best-in-class open models, in 27 billion (27B) and 9 billion (9B) parameter sizes. Since its debut, the 27B model quickly became one of the highest-ranking open models on the LMSYS Chatbot Arena leaderboard, even outperforming popular models more than twice its size in real conversations.\n\nBut Gemma is about more than just performance. It's built on a foundation of responsible AI, prioritizing safety and accessibility. To support this commitment, we are excited to announce three new additions to the Gemma 2 family.\n\n### nano-llama31\n- https://github.com/karpathy/nano-llama31\n\nThis repo is to Llama 3.1 what nanoGPT is to GPT-2. i.e. it is a minimal, dependency-free implementation of the Llama 3.1 architecture, and it can train, finetune, and inference it very simply. This is compared to the official code release from Meta and the huggingface implementation, which both feature heavier dependencies and a lot more code (e.g. 
fair).\n\nThe code currently focuses on the 8B base model of Llama 3.1.\n\n### instant-smollm\n- https://huggingface.co/spaces/HuggingFaceTB/instant-smollm\n- https://huggingface.co/blog/smollm\n- https://huggingface.co/spaces/HuggingFaceTB/instant-smol\n\nSmolLM is a series of language models available in three sizes: 135M, 360M, and 1.7B parameters.\n\nThese models are trained on SmolLM-Corpus, a curated collection of high-quality educational and synthetic data designed for training LLMs. For further details, we refer to our blogpost.\n\nTo build SmolLM-Instruct, we finetune the base models on publicly available datasets.\n\n### Jamba 1.5\n- https://www.ai21.com/blog/announcing-jamba-model-family\n- https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini\n\nWe are debuting the Jamba 1.5 family of open models: Jamba 1.5 Mini and Jamba 1.5 Large. Built on our novel SSM-Transformer architecture, these models demonstrate superior long context handling, speed, and quality—outranking competitors in their size class and marking the first time a non-Transformer model has been successfully scaled to the quality and strength of the market’s leading models. \n\n### Phi-3.5\n- https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/ba-p/4225280\n- https://huggingface.co/microsoft/Phi-3.5-mini-instruct\n\nPhi-3.5-mini is a lightweight, state-of-the-art open model built upon datasets used for Phi-3 - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning-dense data. The model belongs to the Phi-3 model family and supports 128K token context length. 
The model underwent a rigorous enhancement process, incorporating supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures.\n\n### 1.5-Pints\n- https://github.com/pints-ai/1.5-Pints\n\nA recipe to pre-train models in 9 days, to become comparable AI assistants to the likes of Apple OpenELM and Microsoft Phi.\nThis repo contains the model architecture, training scripts, and utilities of 1.5-Pints and 0.12-Pint, developed by Pints.AI. By providing access to the model's codebase and architecture, this initiative seeks to facilitate the replication, experimentation, and further open-source development of Pint.\n\n### Llama-3.1-Minitron 4B\n- https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/\n\nLarge language models (LLMs) are now a dominant force in natural language processing and understanding, thanks to their effectiveness and versatility. LLMs such as Llama 3.1 405B and NVIDIA Nemotron-4 340B excel in many challenging tasks, including coding, reasoning, and math. They are, however, resource-intensive to deploy. As such, there is another trend in the industry to develop small language models (SLMs), which are sufficiently proficient in many language tasks but much cheaper to deploy to the masses.\n\n### SmolLM2\n- https://github.com/SinatrasC/entropix-smollm\n- https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct\n- https://ollama.com/library/smollm2\n\nSmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters.\n\n### Ministral 3B/8B\n- https://mistral.ai/news/ministraux/\n\nOn the first anniversary of the release of Mistral 7B, the model that revolutionized independent frontier AI innovation for millions, we are proud to introduce two new state-of-the-art models for on-device computing and at-the-edge use cases. 
We call them les Ministraux: Ministral 3B and Ministral 8B.\n\nThese models set a new frontier in knowledge, commonsense, reasoning, function-calling, and efficiency in the sub-10B category, and can be used or tuned to a variety of uses, from orchestrating agentic workflows to creating specialist task workers. Both models support up to 128k context length (currently 32k on vLLM) and Ministral 8B has a special interleaved sliding-window attention pattern for faster and memory-efficient inference.\n\n### Zamba2-7B\n- https://huggingface.co/Zyphra/Zamba2-7B\n\nZamba2-7B achieves leading and state-of-the-art performance among models ≤8B parameters, outperforming several extremely strong baselines such as Meta's Llama3 series, Google's Gemma series and Mistral-7B. Moreover, due to its unique hybrid SSM architecture, Zamba2-7B achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer based models.\n\nWe believe Zamba2-7B is an ideal generalist model which is cheap and fast to run and fits on the majority of consumer hardware but possesses a powerful intelligence.\n\n### IBM Granite 3.0\n- https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models\n- https://github.com/ibm-granite/granite-3.0-language-models\n\nGranite 3.0 language models are a new set of lightweight state-of-the-art, open foundation models that natively support multilinguality, coding, reasoning, and tool usage, including the potential to be run on constrained compute resources. All the models are publicly released under an Apache 2.0 license for both research and commercial use. 
The models' data curation and training procedure were designed with enterprise usage and customization in mind, with a process that evaluates datasets for governance, risk and compliance (GRC) criteria, in addition to IBM's standard data clearance process and document quality checks.\n\n### Tülu3\n- https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B\n- https://allenai.org/blog/tulu-3-technical\n\nTülu3 is a leading instruction following model family, offering fully open-source data, code, and recipes designed to serve as a comprehensive guide for modern post-training techniques. Tülu3 is designed for state-of-the-art performance on a diversity of tasks in addition to chat, such as MATH, GSM8K, and IFEval.\n\n### Open-O1\n- https://github.com/Open-Source-O1/Open-O1\n\nOur Open O1 aims to match the powerful capabilities of the proprietary OpenAI O1 model, empowering the community with advanced open-source alternatives. Our model has been developed by curating a set of SFT data for CoT Activation, which was then used to train both LLaMA and Qwen models. This training approach has endowed the smaller models with enhanced long-reasoning and problem-solving capabilities.\n\n### open-r1\n- https://github.com/huggingface/open-r1\n- https://huggingface.co/blog/open-r1/update-1\n\nA fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!\n\n### sky-t1\n- https://novasky-ai.github.io/posts/sky-t1/\n\nWe introduce Sky-T1-32B-Preview, our reasoning model that performs on par with o1-preview on popular reasoning and coding benchmarks. Remarkably, Sky-T1-32B-Preview was trained for less than $450, demonstrating that it is possible to replicate high-level reasoning capabilities affordably and efficiently. 
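\n\nR1 reproductions such as open-r1 typically score sampled completions with simple rule-based rewards (a format check plus a verifiable final-answer check) instead of a learned reward model. As a minimal sketch of that idea, assuming a `<think>...</think><answer>...</answer>` completion format (the tags and the equal weighting below are illustrative, not these projects' actual code):

```python
import re

def format_reward(completion):
    # 1.0 if the completion follows the <think>...</think><answer>...</answer> layout
    pattern = r'^<think>.*?</think>\s*<answer>.*?</answer>$'
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion, gold):
    # 1.0 if the text inside <answer> exactly matches the reference (whitespace-stripped)
    m = re.search(r'<answer>(.*?)</answer>', completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def total_reward(completion, gold):
    # GRPO-style trainers sum several such rule-based signals per sampled rollout
    return format_reward(completion) + accuracy_reward(completion, gold)
```

Because both checks are cheap and deterministic, many rollouts can be scored per training step without ever calling a reward model; frameworks like TRL let you plug in plain Python reward functions built from checks like these.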
\n\n### Phi-4\n- https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090\n- https://huggingface.co/collections/microsoft/phi-4-677e9380e514feb5577a40e4\n\nPhi-4 is a 14B parameter state-of-the-art small language model (SLM) that excels at complex reasoning in areas such as math, in addition to conventional language processing. Phi-4 is the latest member of our Phi family of small language models and demonstrates what’s possible as we continue to probe the boundaries of SLMs.\n\n### Dolphin 3.0\n- https://huggingface.co/cognitivecomputations/Dolphin3.0-Llama3.1-8B\n\nDolphin 3.0 is the next generation of the Dolphin series of instruct-tuned models. It is designed to be the ultimate general-purpose local model, enabling coding, math, agentic, function calling, and general use cases.\n\n### Falcon 3\n- https://falconllm.tii.ae/falcon3/index.html\n\nAI continues to redefine industries and transform our interactions with technology, but accessibility remains a critical challenge. Advanced AI models often require robust infrastructure, limiting their reach.\nFalcon 3 has been meticulously designed to address this gap.\nAs an open-source large language model (LLM), Falcon 3 is designed to democratize advanced AI by combining outstanding performance with the ability to run on lightweight devices, including laptops.\nReleased under TII’s Falcon License 2.0, Falcon 3 is a pioneering step toward making advanced AI tools available to all.\n\n### Bamba\n- https://huggingface.co/blog/bamba\n\nWe introduce Bamba-9B, an inference-efficient Hybrid Mamba2 model trained by IBM, Princeton, CMU, and UIUC on completely open data. At inference time, the model demonstrates 2.5x throughput improvement and 2x latency speedup compared to standard transformers in vLLM. To foster community experimentation, the model is immediately available to use in transformers, vLLM, TRL, and llama.cpp. 
We also release tuning, training, and extended pretraining recipes with a stateful data loader, and invite the community to further improve this model. Let's overcome the KV-cache bottleneck together!\n\n### Byte Latent Transformer\n- https://arxiv.org/pdf/2412.09871\n- https://github.com/facebookresearch/blt\n\nWe introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale, with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented dynamically based on the entropy of the next byte, allocating more compute and model capacity where there is more data complexity. The BLT architecture includes new attention mechanisms to maximize the information flow between byte and patch hidden representations and a new type of byte-sequence memory. We present the first scaling study of byte-level models up to 8B parameters and 8T training bytes, showing for the first time that we can train a model end-to-end at scale from bytes with no tokenization or other preprocessing. Scaling trends reveal training and inference efficiency benefits from dynamically selecting very long patches on average, along with qualitative improvements with reasoning and long tail generalization from modeling byte-sequences.\n\n### Llama-3.3-70B-Instruct\n- https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct\n\nThe Meta Llama 3.3 multilingual large language model (LLM) is an instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks.\n\n### Granite 3.1\n- https://www.ibm.com/granite/docs/models/granite/\n\nGranite 3.1 models leverage a new dense architecture. 
These models were trained with 12 trillion tokens across 12 languages and 116 programming languages. IBM is committed to open-source innovation, and these models are no exception. All Granite models are licensed under Apache 2.0.\n\n### mini-deepseek-r1\n- https://www.philschmid.de/mini-deepseek-r1\n- https://x.com/jiayi_pirate/status/1882839370505621655\n\nIn this blog post we want to recreate the small \"aha moment\" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game. We will train an open model using reinforcement learning, trying to teach it self-verification and search abilities all on its own to solve the Countdown Game. The Countdown game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach or get as close as possible to a target number.\n\n### RL, Reasoning & Writing: GRPO on Base model\n- https://colab.research.google.com/drive/1Ty0ovsrpw8i-zJvDhlSAtBIVw3EZfHK5?usp=sharing\n\nThis notebook introduces a series of experiments on RL training on a base model, rather than an instruct-model. It is obviously inspired by R0, the other DeepSeek model trained from DeepSeekv3: while R1 was more classically post-trained by a series of instruct finetuning and RL stages, R0 was trained with RL directly on the base model.\n\nWe reuse the same RL method as R0, GRPO. For a more straightforward way of testing on an instruct model, you can check Will Brown's script that I ported to Google Colab. Here instead we'll take the opportunity to explore alternative forms of RL tuning that fit better with using a base model as a starting point: poetry writing. We're going to make an RL poet.\n\n### encoder-decoder-slm\n- https://github.com/microsoft/encoder-decoder-slm\n\nWhile large language models continue to grow in size, smaller models (≤1B parameters) require thoughtful architectural decisions. 
Our work demonstrates that encoder-decoder models inherently outperform decoder-only architectures before any optimizations.\n\n### CodecLM\n- https://arxiv.org/abs/2404.05875\n\nInstruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor and time cost to collect or annotate data by humans, researchers start to explore the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLM to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities in different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the Encode-Decode principles, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution, and then decode metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. 
Extensive experiments on four open-domain instruction following benchmarks validate the effectiveness of CodecLM over the current state-of-the-arts.\n\n### MEGALODON\n- https://arxiv.org/pdf/2404.08801.pdf\n- https://github.com/XuezheMax/megalodon\n\nThe quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), timestep normalization layer, normalized attention mechanism and pre-norm with two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than Transformer in the scale of 7 billion parameters and 2 trillion training tokens. 
Megalodon reaches a training loss of 1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67).\n\n### Stable LM 2 12B\n- https://huggingface.co/stabilityai/stablelm-2-12b\n\nStable LM 2 12B is a 12.1 billion parameter decoder-only language model pre-trained on 2 trillion tokens of diverse multilingual and code datasets for two epochs.\n\n### Mixtral 8x22B\n- https://twitter.com/mistralai/status/1777869263778291896\n\nNew MoE model by MistralAI\n\n### Phi-3\n- https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/\n- https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3\n- https://aka.ms/phi3-azure-ai\n- https://ollama.com/library/phi3\n- https://export.arxiv.org/abs/2404.14219\n\nWe introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).\n\n### Llama 3\n- https://llama.meta.com/llama3/\n- https://ai.meta.com/blog/meta-llama-3/\n\nToday, we’re excited to share the first two models of the next generation of Llama, Meta Llama 3, available for broad use. This release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that can support a broad range of use cases. 
This next generation of Llama demonstrates state-of-the-art performance on a wide range of industry benchmarks and offers new capabilities, including improved reasoning. We believe these are the best open source models of their class, period. In support of our longstanding open approach, we’re putting Llama 3 in the hands of the community. We want to kickstart the next wave of innovation in AI across the stack—from applications to developer tools to evals to inference optimizations and more. We can’t wait to see what you build and look forward to your feedback.\n\n### OpenELM\n- https://github.com/apple/corenet\n- https://arxiv.org/pdf/2404.14619.pdf\n- https://huggingface.co/apple\n\nCoreNet is a deep neural network toolkit that allows researchers and engineers to train standard and novel small and large-scale models for a variety of tasks, including foundation models (e.g., CLIP and LLM), object classification, object detection, and semantic segmentation.\n\n### base-7b-v0.2\n- https://huggingface.co/internistai/base-7b-v0.2\n\nInternist.ai 7b is a medical domain large language model trained by medical doctors to demonstrate the benefits of a physician-in-the-loop approach. The training data was carefully curated by medical doctors to ensure clinical relevance and required quality for clinical practice.\n\n### FILM-7B\n- https://github.com/microsoft/FILM\n- https://arxiv.org/pdf/2404.16811\n\nWhile many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. 
Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU).\n\n### llama3 implemented from scratch\n- https://github.com/naklecha/llama3-from-scratch\n\nin this file, i implemented llama3 from scratch, one tensor and matrix multiplication at a time.\nalso, im going to load tensors directly from the model file that meta provided for llama3, you need to download the weights before running this file.\n\n### 2.3MParams-LLM-From-Scratch-Python\n- https://github.com/FareedKhan-dev/create-million-parameter-llm-from-scratch\n\nMaking your own Large Language Model (LLM) is a cool thing that many big companies like Google, Twitter, and Facebook are doing. They release different versions of these models, like 7 billion, 13 billion, or 70 billion. Even smaller communities are doing it too. 
You might have read blogs or watched videos on creating your own LLM, but they usually talk a lot about theory and not so much about the actual steps and code.\n\n### KAN-GPT\n- https://github.com/AdityaNG/kan-gpt\n\nThe PyTorch implementation of Generative Pre-trained Transformers (GPTs) using Kolmogorov-Arnold Networks (KANs) for language modeling.\n\n### Aya-23\n- https://huggingface.co/CohereForAI/aya-23-8B\n- https://huggingface.co/CohereForAI/aya-23-35B\n- https://arxiv.org/abs/2405.15032\n\nThis technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.\n\n### Mamba-2\n- https://arxiv.org/abs/2405.21060\n\nWhile Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. 
We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.\n\n### RecurrentGemma\n- https://huggingface.co/google/recurrentgemma-9b\n- https://ai.google.dev/gemma/docs/recurrentgemma/model_card\n\nRecurrentGemma is a family of open language models built on a novel recurrent architecture developed at Google. Both pre-trained and instruction-tuned versions are available in English.\n\nLike Gemma, RecurrentGemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Because of its novel architecture, RecurrentGemma requires less memory than Gemma and achieves faster inference when generating long sequences.\n\n> 持续更新中 (Continuously Updated)... \n"
  }
]