> \[!IMPORTANT\]
>
> **Star us** to receive all release notifications from GitHub without delay ⭐️
## 🧭 Welcome
Welcome to **OpenCompass**!
Just like a compass guides us on our journey, OpenCompass will guide you through the complex landscape of evaluating large language models. With its powerful algorithms and intuitive interface, OpenCompass makes it easy to assess the quality and effectiveness of your NLP models.
🚩🚩🚩 Explore opportunities at OpenCompass! We're currently **hiring full-time researchers/engineers and interns**. If you're passionate about LLM and OpenCompass, don't hesitate to reach out to us via [email](mailto:zhangsongyang@pjlab.org.cn). We'd love to hear from you!
🔥🔥🔥 We are delighted to announce that **OpenCompass has been recommended by Meta AI**. See Llama's [Get Started](https://ai.meta.com/llama/get-started/#validation) guide for more information.
> **Attention**
> Breaking Change Notice: In version 0.4.0, we are consolidating all AMOTIC configuration files (previously located in ./configs/datasets, ./configs/models, and ./configs/summarizers) into the opencompass package. Users are advised to update their configuration references to reflect this structural change.
## 🚀 What's New
- **\[2026.02.05\]** OpenCompass now supports Intern-S1-Pro related general and scientific evaluation benchmarks. Please check [Example for Evaluating Intern-S1-Pro](examples/eval_intern_s1_pro.py) and [Model Card](https://huggingface.co/internlm/Intern-S1-Pro) for more details! 🔥🔥🔥
- **\[2025.12.08\]** OpenCompass now supports evaluation for SciReasoner. Please check [Example for Evaluating SciReasoner](examples/eval_scireasoner.py) and [Project GitHub Repo](https://github.com/InternScience/SciReason) for more details! 🔥🔥🔥
- **\[2025.07.26\]** OpenCompass now supports Intern-S1 related general and scientific evaluation benchmarks. Please check [Tutorial for Evaluating Intern-S1](https://opencompass.readthedocs.io/en/latest/user_guides/interns1.html) for more details! 🔥🔥🔥
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
- **\[2025.03.11\]** We now support evaluation on `SuperGPQA`, a benchmark for measuring the knowledge ability of LLMs. 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models. Please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHVerifyEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
- **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
- **\[2024.11.14\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [BABILong](https://arxiv.org/pdf/2406.10149). Have a look at the [demo](examples/eval_babilong.py) and give it a try! 🔥🔥🔥
- **\[2024.10.14\]** We now support the OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU). Feel free to give it a try! 🔥🔥🔥
- **\[2024.09.19\]** We now support [Qwen2.5](https://huggingface.co/Qwen) (0.5B to 72B) with multiple backends (huggingface/vllm/lmdeploy). Feel free to give them a try! 🔥🔥🔥
- **\[2024.09.17\]** We now support OpenAI o1 (`o1-mini-2024-09-12` and `o1-preview-2024-09-12`). Feel free to give them a try! 🔥🔥🔥
- **\[2024.09.05\]** We now support answer extraction through model post-processing to provide a more accurate representation of the model's capabilities. As part of this update, we have integrated [XFinder](https://github.com/IAAR-Shanghai/xFinder) as our first post-processing model. For more detailed information, please refer to the [documentation](opencompass/utils/postprocessors/xfinder/README.md), and give it a try! 🔥🔥🔥
- **\[2024.08.20\]** OpenCompass now supports the [SciCode](https://github.com/scicode-bench/SciCode): A Research Coding Benchmark Curated by Scientists. 🔥🔥🔥
- **\[2024.08.16\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [RULER](https://arxiv.org/pdf/2404.06654). RULER provides an evaluation of long-context including retrieval, multi-hop tracing, aggregation, and question answering through flexible configurations. Check out the [RULER](configs/datasets/ruler/README.md) evaluation config now! 🔥🔥🔥
- **\[2024.08.09\]** We have released the example data and configuration for CompassBench-202408. See [CompassBench](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/compassbench_intro.html) for more details. 🔥🔥🔥
- **\[2024.08.01\]** We now support the [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) models. Feel free to give them a try! 🔥🔥🔥
- **\[2024.07.23\]** We now support [ModelScope](https://www.modelscope.cn) datasets, which can be loaded on demand without downloading all the data to your local disk. Feel free to give it a try! 🔥🔥🔥
- **\[2024.07.17\]** We are excited to announce the release of NeedleBench's [technical report](http://arxiv.org/abs/2407.11963). We invite you to visit our [support documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html) for detailed evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which offers **outstanding reasoning capability**, a **1M context window**, and **stronger tool use**. You can try the models via [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch and through deployment APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md). 🔥🔥🔥
> [More](docs/en/notes/news.md)
## 📊 Leaderboard
We provide [OpenCompass Leaderboard](https://rank.opencompass.org.cn/home) for the community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address `opencompass@pjlab.org.cn`.
You can also refer to [Guide to Reproducing CompassAcademic Leaderboard Results](https://opencompass.readthedocs.io/zh-cn/latest/academic.html) to quickly reproduce the leaderboard results.
## 🛠️ Installation
Below are the steps for quick installation and dataset preparation.
### 💻 Environment Setup
We highly recommend using conda to manage your Python environment.
- #### Create your virtual environment
```bash
conda create --name opencompass python=3.10 -y
conda activate opencompass
```
- #### Install OpenCompass via pip
```bash
pip install -U opencompass
## Full installation (with support for more datasets)
# pip install "opencompass[full]"
## Environment with model acceleration frameworks
## Manage different acceleration frameworks using virtual environments
## since they usually have dependency conflicts with each other.
# pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"
## API evaluation (e.g. OpenAI, Qwen)
# pip install "opencompass[api]"
```
- #### Install OpenCompass from source
If you want to use the latest OpenCompass features, or develop new ones, you can also build it from source.
```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# pip install -e ".[full]"
# pip install -e ".[vllm]"
```
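To confirm which of the installs above is active in the current environment, a small check with the standard library can help. This is a minimal sketch: it assumes the distribution is registered under the name `opencompass` (the name used in the `pip install` commands above), and `installed_version` is a hypothetical helper, not part of OpenCompass.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the one passed to pip above.
version = installed_version("opencompass")
print(version if version is not None else "opencompass is not installed")
```

An editable install from source should also report a version here, since `pip install -e .` registers the package metadata as well.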
### 📂 Data Preparation
Choose one of the following methods to prepare the datasets.
#### Offline Preparation
You can download and extract the datasets with the following commands:
```bash
# Download dataset to data/ folder
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```
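On systems without `wget` or `unzip`, the same two steps can be sketched in Python with the standard library only. The URL is copied from the command above; the function names and the choice to extract into the current directory are assumptions made for illustration.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the wget command above.
DATA_URL = ("https://github.com/open-compass/opencompass/releases/download/"
            "0.2.2.rc1/OpenCompassData-core-20240207.zip")


def download(url: str, target: Path) -> Path:
    """Download `url` to `target`, skipping the request if the file exists."""
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target


def extract(archive: Path, dest: Path) -> list[str]:
    """Extract `archive` into `dest` and return the archive member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()


def prepare_core_data(workdir: Path = Path(".")) -> list[str]:
    """Mirror the wget + unzip commands: fetch the archive, then unpack it."""
    archive = download(DATA_URL, workdir / "OpenCompassData-core-20240207.zip")
    return extract(archive, workdir)
```

Calling `prepare_core_data()` from the repository root is intended to mirror the shell commands; nothing is downloaded until that function is called.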
#### Automatic Download from OpenCompass
Datasets can also be downloaded automatically from the OpenCompass storage server. Run the evaluation with the extra `--dry-run` flag to download them.
The currently supported datasets are listed [here](https://github.com/open-compass/opencompass/blob/main/opencompass/utils/datasets_info.py#L259). More datasets will be uploaded soon.
#### (Optional) Automatic Download with ModelScope
You can also use [ModelScope](https://www.modelscope.cn) to load the datasets on demand.
Installation:
```bash
pip install "modelscope[framework]"
export DATASET_SOURCE=ModelScope
```
Then submit the evaluation task without downloading all the data to your local disk. Available datasets include:
```bash
humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ceval, math, LCSTS, Xsum, winogrande, openbookqa, AGIEval, gsm8k, nq, race, siqa, mbpp, mmlu, hellaswag, ARC, BBH, xstory_cloze, summedits, GAOKAO-BENCH, OCNLI, cmnli
```
Some third-party features, such as HumanEval and Llama, may require additional steps to work properly. For detailed steps, please refer to the [Installation Guide](https://opencompass.readthedocs.io/en/latest/get_started/installation.html).
## 🏗️ Evaluation
Once OpenCompass is installed and the datasets are prepared as described above, you can start your first evaluation!
### Your first evaluation with OpenCompass!
OpenCompass supports configuring evaluations via the CLI or a Python script. For simple settings we recommend the CLI; for more complex evaluations, a script is the better fit.
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass examples/eval_chat_demo.py
```
You can find more script examples under [examples](./examples) folder.
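For orientation, a Python config script is an ordinary Python file that defines `models` and `datasets`. The sketch below follows the pattern used by the `autotest` configs in this repository; it is not the content of `examples/eval_chat_demo.py`. The `abbr`, `max_out_len`, and `batch_size` values are illustrative assumptions, and the file only runs in an environment where OpenCompass and mmengine are installed.

```python
# Sketch of a config script; requires opencompass and mmengine to be installed.
from mmengine.config import read_base

from opencompass.models import HuggingFacewithChatTemplate

with read_base():
    # Reuse a packaged dataset config instead of redefining it.
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
        gsm8k_datasets  # noqa: F401

datasets = gsm8k_datasets

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='internlm2_5-1_8b-chat-hf',  # illustrative label (assumption)
        path='internlm/internlm2_5-1_8b-chat',
        max_out_len=1024,  # assumption, tune for your task
        batch_size=8,  # assumption, tune for your hardware
        run_cfg=dict(num_gpus=1),
    )
]
```

Saved under a name of your choice (say, a hypothetical `my_eval.py`), such a file would be launched the same way as the demo script: `opencompass my_eval.py`.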
### API evaluation
OpenCompass, by design, does not distinguish between open-source models and API models. You can evaluate both model types in the same way, or even in a single configuration.
```bash
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
# CLI
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass examples/eval_api_demo.py
# For o1 models, use o1_mini_2024_09_12 or o1_preview_2024_09_12; max_completion_tokens defaults to 8192.
```
### Accelerated Evaluation
Additionally, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the command below. Please ensure that you have installed the necessary packages for the chosen backend and that your model supports accelerated inference with it. For more information, see the documentation on inference acceleration backends [here](docs/en/advanced_guides/accelerator_intro.md). Below is an example using LMDeploy:
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
# Python scripts
opencompass examples/eval_lmdeploy_demo.py
```
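When evaluations are launched from another Python program (a scheduler or a notebook, for example), the CLI calls shown so far can be assembled programmatically. The helper below is a hypothetical sketch that only uses flags appearing in this README (`--models`, `--datasets`, `-a`, `--max-num-worker`); passing several names after a single flag is an assumption about the CLI, so check `opencompass --help` before relying on it.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(models: Sequence[str],
                  datasets: Sequence[str],
                  accelerator: Optional[str] = None,
                  max_num_worker: Optional[int] = None) -> List[str]:
    """Assemble an `opencompass` CLI call from flags used in this README."""
    # Assumption: each flag accepts one or more space-separated names.
    cmd = ["opencompass", "--models", *models, "--datasets", *datasets]
    if accelerator is not None:
        cmd += ["-a", accelerator]  # e.g. "lmdeploy" or "vllm"
    if max_num_worker is not None:
        cmd += ["--max-num-worker", str(max_num_worker)]
    return cmd


cmd = build_command(["hf_internlm2_5_1_8b_chat"], ["demo_gsm8k_chat_gen"],
                    accelerator="lmdeploy")
print(shlex.join(cmd))
# prints: opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting issues with model paths.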
### Supported Models and Datasets
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
#### Supported Models
If a model is not on the list but is supported by the Hugging Face `AutoModel` class, or is served by an inference engine that exposes an OpenAI-compatible interface (see [docs](https://opencompass.readthedocs.io/en/latest/advanced_guides/new_model.html) for details), you can still evaluate it with OpenCompass. Contributions to the lists of supported models and datasets are welcome.
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
#### Supported Datasets
OpenCompass provides standard recommended configurations for datasets. Generally, config files ending with `_gen.py` or `_llm_judge_gen.py` point to the recommended config for that dataset. Refer to the [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for more details.
```bash
# Recommended Evaluation Config based on Rules
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# Recommended Evaluation Config based on LLM Judge
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
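The naming convention in this section (`_gen`, `_ppl`, and the LLM-judge variants) can be checked mechanically when browsing a long list of configs. The classifier below is a hypothetical helper, not an OpenCompass API. It treats the name as underscore-separated tokens so that hashed names such as `gsm8k_gen_17d0dc` are handled, and it accepts both the `_llm_judge_gen` and `_llmjudge_gen` spellings that appear in this document.

```python
def config_kind(name: str) -> str:
    """Classify a dataset config name by the convention described here."""
    stem = name[:-3] if name.endswith(".py") else name
    tokens = stem.lower().split("_")
    if "ppl" in tokens:
        return "ppl"  # typically intended for base models
    if "gen" in tokens:
        judged = "llmjudge" in tokens or ("llm" in tokens and "judge" in tokens)
        return "llm_judge_gen" if judged else "gen"
    return "other"  # the name follows neither convention


for example in ("mmlu_ppl_ac766d", "gsm8k_gen_17d0dc",
                "aime2024_llmjudge_gen", "winogrande_5shot_ll_252f01"):
    print(example, "->", config_kind(example))
```

The judge check runs before the plain `gen` result because LLM-judge config names also contain `gen`.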
To evaluate a model on multiple GPUs with data parallelism, use `--max-num-worker`.
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```
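As a rough way to pick a worker count, the arithmetic behind the command above can be sketched as follows: if each model replica needs a fixed number of GPUs (model parallelism), the number of replicas that fit on the visible devices bounds the useful degree of data parallelism. This is an illustration only; `plan_replicas` is a hypothetical helper and does not describe how OpenCompass schedules tasks internally.

```python
from typing import List


def plan_replicas(visible_devices: str, gpus_per_replica: int) -> List[List[int]]:
    """Group visible GPU ids into replicas of `gpus_per_replica` GPUs each."""
    if gpus_per_replica < 1:
        raise ValueError("gpus_per_replica must be at least 1")
    gpu_ids = [int(x) for x in visible_devices.split(",") if x.strip()]
    # Drop leftover GPUs that cannot form a complete replica.
    usable = len(gpu_ids) - len(gpu_ids) % gpus_per_replica
    return [gpu_ids[i:i + gpus_per_replica]
            for i in range(0, usable, gpus_per_replica)]


# Two GPUs, one per replica: two data-parallel workers, as in the command above.
print(plan_replicas("0,1", 1))      # prints: [[0], [1]]
# Four GPUs, two per replica: two workers, each model-parallel over two GPUs.
print(plan_replicas("0,1,2,3", 2))  # prints: [[0, 1], [2, 3]]
```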
> \[!TIP\]
>
> `--hf-num-gpus` controls model parallelism (Hugging Face format); `--max-num-worker` controls data parallelism.
> \[!TIP\]
>
> Configurations with `_ppl` are typically designed for base models.
> Configurations with `_gen` can be used for both base models and chat models.
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.
## 📣 OpenCompass 2.0
We are thrilled to introduce OpenCompass 2.0, an advanced suite featuring three key components: [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home).

**CompassRank** has been significantly enhanced into a set of leaderboards that now incorporate both open-source and proprietary benchmarks. This upgrade allows for a more comprehensive evaluation of models across the industry.
**CompassHub** presents a pioneering benchmark browser interface, designed to simplify and expedite the exploration and utilization of an extensive array of benchmarks for researchers and practitioners alike. To enhance the visibility of your own benchmark within the community, we warmly invite you to contribute it to CompassHub. You may initiate the submission process by clicking [here](https://hub.opencompass.org.cn/dataset-submit).
**CompassKit** is a powerful collection of evaluation toolkits specifically tailored for Large Language Models and Large Vision-language Models. It provides an extensive set of tools to assess and measure the performance of these complex models effectively. You are welcome to try our toolkits in your research and products.
## ✨ Introduction

OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include:
- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.
- **Efficient distributed evaluation**: One line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
- **Diversified evaluation paradigms**: Support for zero-shot, few-shot, and chain-of-thought evaluations, combined with standard or dialogue-type prompt templates, to easily stimulate the maximum performance of various models.
- **Modular design with high extensibility**: Want to add new models or datasets, customize an advanced task division strategy, or even support a new cluster management system? Everything about OpenCompass can be easily expanded!
- **Experiment management and reporting mechanism**: Use config files to fully record each experiment, and support real-time reporting of results.
## 📖 Dataset Support
The documentation on the OpenCompass website provides a statistical list of all datasets that can be used on this platform.
You can quickly find the dataset you need from the list through sorting, filtering, and searching functions.
In addition, we provide a recommended configuration for each dataset, and some datasets also support LLM Judge-based configurations.
Please refer to the dataset statistics chapter of [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for details.
## 🔜 Roadmap
- [x] Subjective Evaluation
  - [x] Release CompassArena.
  - [x] Subjective evaluation.
- [x] Long-context
  - [x] Long-context evaluation with extensive datasets.
  - [ ] Long-context leaderboard.
- [x] Coding
  - [ ] Coding evaluation leaderboard.
  - [x] Non-Python language evaluation service.
- [x] Agent
  - [ ] Support various agent frameworks.
  - [x] Evaluation of tool use of the LLMs.
- [x] Robustness
  - [x] Support various attack methods.
## 👷♂️ Contributing
We appreciate all contributions to improving OpenCompass. Please refer to the [contributing guideline](https://opencompass.readthedocs.io/en/latest/notes/contribution_guide.html) for best practices.
## 🤝 Acknowledgements
Some code in this project is cited and modified from [OpenICL](https://github.com/Shark-NLP/OpenICL).
Some datasets and prompt implementations are modified from [chain-of-thought-hub](https://github.com/FranxYao/chain-of-thought-hub) and [instruct-eval](https://github.com/declare-lab/instruct-eval).
## 🖊️ Citation
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
```
[github-contributors-link]: https://github.com/open-compass/opencompass/graphs/contributors
[github-contributors-shield]: https://img.shields.io/github/contributors/open-compass/opencompass?color=c4f042&labelColor=black&style=flat-square
[github-forks-link]: https://github.com/open-compass/opencompass/network/members
[github-forks-shield]: https://img.shields.io/github/forks/open-compass/opencompass?color=8ae8ff&labelColor=black&style=flat-square
[github-issues-link]: https://github.com/open-compass/opencompass/issues
[github-issues-shield]: https://img.shields.io/github/issues/open-compass/opencompass?color=ff80eb&labelColor=black&style=flat-square
[github-license-link]: https://github.com/open-compass/opencompass/blob/main/LICENSE
[github-license-shield]: https://img.shields.io/github/license/open-compass/opencompass?color=white&labelColor=black&style=flat-square
[github-release-link]: https://github.com/open-compass/opencompass/releases
[github-release-shield]: https://img.shields.io/github/v/release/open-compass/opencompass?color=369eff&labelColor=black&logo=github&style=flat-square
[github-releasedate-link]: https://github.com/open-compass/opencompass/releases
[github-releasedate-shield]: https://img.shields.io/github/release-date/open-compass/opencompass?labelColor=black&style=flat-square
[github-stars-link]: https://github.com/open-compass/opencompass/stargazers
[github-stars-shield]: https://img.shields.io/github/stars/open-compass/opencompass?color=ffcb47&labelColor=black&style=flat-square
[github-trending-shield]: https://trendshift.io/api/badge/repositories/6630
[github-trending-url]: https://trendshift.io/repositories/6630
================================================
FILE: autotest/__init__.py
================================================
"""OpenCompass automated test package."""
__all__ = []
================================================
FILE: autotest/cluster/__init__.py
================================================
"""OpenCompass inference test configurations."""
__all__ = []
================================================
FILE: autotest/cluster/chat_models.py
================================================
from mmengine.config import read_base

from opencompass.models import (HuggingFacewithChatTemplate,
                                TurboMindModelwithChatTemplate,
                                VLLMwithChatTemplate)
from opencompass.utils.text_postprocessors import extract_non_reasoning_content

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
        gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.race.race_gen import \
        race_datasets  # noqa: F401, E501
    # re-design .. including some models and modify all kinds of configs
    from ...rjob import eval, infer  # noqa: F401, E501

Qwen3_0_6B_FP8_hf = dict(
    type=HuggingFacewithChatTemplate,
    abbr='qwen3_0_6b_fp8-hf',
    path='Qwen/Qwen3-0.6B-FP8',
    max_out_len=16384,
    batch_size=8,
    run_cfg=dict(num_gpus=1),
    pred_postprocessor=dict(type=extract_non_reasoning_content))

Qwen3_0_6B_FP8_turbomind = dict(
    type=TurboMindModelwithChatTemplate,
    abbr='qwen3-0_6b-fp8-turbomind',
    path='Qwen/Qwen3-0.6B-FP8',
    engine_config=dict(session_len=32768, max_batch_size=1),
    gen_config=dict(top_k=1, max_new_tokens=16384),
    max_seq_len=32768,
    max_out_len=16384,
    batch_size=1,
    run_cfg=dict(num_gpus=1),
    pred_postprocessor=dict(type=extract_non_reasoning_content))

Qwen3_0_6B_FP8_vllm = dict(
    type=VLLMwithChatTemplate,
    abbr='qwen3-0_6b-fp8-vllm',
    path='Qwen/Qwen3-0.6B-FP8',
    model_kwargs=dict(tensor_parallel_size=1),
    generation_kwargs=dict(temperature=0),  # greedy
    max_seq_len=32768,
    max_out_len=16384,
    batch_size=1,
    run_cfg=dict(num_gpus=1),
)

race_datasets = [race_datasets[1]]
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])

for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'

models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models = [Qwen3_0_6B_FP8_hf, Qwen3_0_6B_FP8_turbomind, Qwen3_0_6B_FP8_vllm]

summarizer = dict(
    dataset_abbrs=[
        'gsm8k',
        'race-middle',
        'race-high',
    ],
    summary_groups=sum(
        [v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
================================================
FILE: autotest/eval/__init__.py
================================================
"""OpenCompass inference test configurations."""
__all__ = []
================================================
FILE: autotest/eval/eval_base_fullbench.py
================================================
from mmengine.config import read_base

with read_base():
    from autotest.eval.models import base_models
    from opencompass.configs.datasets.ARC_c.ARC_c_few_shot_ppl import \
        ARC_c_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.bbh.bbh_gen_98fba6 import \
        bbh_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.cmmlu.cmmlu_ppl_041cbf import \
        cmmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.drop.drop_gen_a2697c import \
        drop_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_d21e37 import \
        GaokaoBench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gpqa.gpqa_few_shot_ppl_4b5a83 import \
        gpqa_datasets  # noqa: F401, E501
    # Corebench v1.7
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import \
        gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.hellaswag.hellaswag_10shot_ppl_59c85e import \
        hellaswag_datasets  # noqa: F401, E501
    # from opencompass.configs.datasets.humaneval.internal_humaneval_gen_ce6b06 import \  # noqa: F401, E501
    #     humaneval_datasets as humaneval_v2_datasets  # noqa: F401, E501
    # from opencompass.configs.datasets.humaneval.internal_humaneval_gen_d2537e import \  # noqa: F401, E501
    #     humaneval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.math.math_4shot_base_gen_43d5b6 import \
        math_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.MathBench.mathbench_2024_few_shot_mixed_4a3fd4 import \
        mathbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_742f0c import \
        sanitized_mbpp_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d import \
        mmlu_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_few_shot_gen_bfaf90 import \
        mmlu_pro_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.nq.nq_open_1shot_gen_20a989 import \
        nq_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.race.race_few_shot_ppl import \
        race_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_few_shot_ppl import \
        BoolQ_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import \
        TheoremQA_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_20a989 import \
        triviaqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.wikibench.wikibench_few_shot_ppl_c23d79 import \
        wikibench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
        winogrande_datasets  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.bbh import \
        bbh_summary_groups  # noqa: F401, E501
    # Summary Groups
    from opencompass.configs.summarizers.groups.cmmlu import \
        cmmlu_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.GaokaoBench import \
        GaokaoBench_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
        mathbench_2024_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mmlu import \
        mmlu_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.mmlu_pro import \
        mmlu_pro_summary_groups  # noqa: F401, E501

models = base_models
race_datasets = [race_datasets[1]]  # Only take RACE-High
bbh_datasets = [
    x for x in bbh_datasets if 'logical_deduction_seven_objects' in x['abbr']
    or 'multistep_arithmetic_two' in x['abbr']
]
cmmlu_datasets = [
    x for x in cmmlu_datasets if x['abbr'].replace('cmmlu-', '') in [
        'ancient_chinese', 'chinese_civil_service_exam',
        'chinese_driving_rule', 'chinese_food_culture',
        'chinese_foreign_policy', 'chinese_history', 'chinese_literature',
        'chinese_teacher_qualification', 'construction_project_management',
        'elementary_chinese', 'elementary_commonsense', 'ethnology',
        'high_school_politics', 'modern_chinese',
        'traditional_chinese_medicine'
    ]
]
mmlu_datasets = [
    x for x in mmlu_datasets if x['abbr'].replace('lukaemon_mmlu_', '') in [
        'business_ethics', 'clinical_knowledge', 'college_medicine',
        'global_facts', 'human_aging', 'management', 'marketing',
        'medical_genetics', 'miscellaneous', 'nutrition',
        'professional_accounting', 'professional_medicine', 'virology'
    ]
]
mmlu_pro_datasets = [mmlu_pro_datasets[0]]
mathbench_datasets = [x for x in mathbench_datasets if 'college' in x['abbr']]
GaokaoBench_datasets = [
    x for x in GaokaoBench_datasets if '2010-2022_Math_II_MCQs' in x['abbr']
    or '2010-2022_Math_II_Fill-in-the-Blank' in x['abbr']
]

datasets = sum((v for k, v in locals().items()
                if k.endswith('_datasets') and 'dingo' not in k.lower()), [])

summary_groups = sum(
    [v for k, v in locals().items() if k.endswith('_summary_groups')], [])
summary_groups.append(
    {
        'name': 'Mathbench',
        'subsets': ['mathbench-a (average)', 'mathbench-t (average)'],
    }, )
summarizer = dict(
    dataset_abbrs=[
        'Language',
        ['race-high', 'accuracy'],
        ['ARC-c', 'accuracy'],
        ['BoolQ', 'accuracy'],
        ['triviaqa_wiki_1shot', 'score'],
        ['nq_open_1shot', 'score'],
        '',
        'General Reasoning',
        ['drop', 'accuracy'],
        ['bbh', 'naive_average'],
        ['GPQA_diamond', 'accuracy'],
        ['hellaswag', 'accuracy'],
        ['TheoremQA', 'score'],
        ['winogrande', 'accuracy'],
        '',
        'Math Calculation',
        ['gsm8k', 'accuracy'],
        ['GaokaoBench', 'weighted_average'],
        'GaokaoBench_2010-2022_Math_II_MCQs',
        'GaokaoBench_2010-2022_Math_II_Fill-in-the-Blank',
        ['math', 'accuracy'],
        ['Mathbench', 'naive_average'],
        '',
        'Knowledge',
        ['wikibench-wiki-single_choice_cncircular', 'perf_4'],
        ['cmmlu', 'naive_average'],
        ['mmlu', 'naive_average'],
        ['mmlu_pro', 'naive_average'],
        '',
        'Code',
        ['openai_humaneval', 'humaneval_pass@1'],
        ['openai_humaneval_v2', 'humaneval_pass@1'],
        ['sanitized_mbpp', 'score'],
        '',
        ['dingo_en_192', 'score'],
        ['dingo_zh_170', 'score'],
        '',
        'mmlu',
        'mmlu-stem',
        'mmlu-social-science',
        'mmlu-humanities',
        ['mmlu-other', 'accuracy'],
        '',
        'cmmlu',
        'cmmlu-stem',
        'cmmlu-social-science',
        'cmmlu-humanities',
        'cmmlu-other',
        ['cmmlu-china-specific', 'accuracy'],
        '',
        'mmlu_pro',
        'mmlu_pro_biology',
        'mmlu_pro_business',
        'mmlu_pro_chemistry',
        'mmlu_pro_computer_science',
        'mmlu_pro_economics',
        'mmlu_pro_engineering',
        'mmlu_pro_health',
        'mmlu_pro_history',
        'mmlu_pro_law',
        'mmlu_pro_math',
        'mmlu_pro_philosophy',
        'mmlu_pro_physics',
        'mmlu_pro_psychology',
        'mmlu_pro_other',
        '',
        'bbh-logical_deduction_seven_objects',
        'bbh-multistep_arithmetic_two',
        '###### MathBench-A: Application Part ######',
        'college',
        'high',
        'middle',
        'primary',
        'arithmetic',
        'mathbench-a (average)',
        '###### MathBench-T: Theory Part ######',
        'college_knowledge',
        'high_knowledge',
        'middle_knowledge',
        'primary_knowledge',
        'mathbench-t (average)',
    ],
    summary_groups=summary_groups,
)

datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])

for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
================================================
FILE: autotest/eval/eval_base_longtext_fullbench.py
================================================
from mmengine.config import read_base

with read_base():
    from autotest.eval.models import base_models
    from opencompass.configs.datasets.longbench.longbench import \
        longbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.needlebench.needlebench_base.needlebench_base_gen import \
        needlebench_datasets  # noqa: F401, E501
    # summarizer
    from opencompass.configs.summarizers.groups.longbench import \
        longbench_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.needlebench import \
        needlebench_internal_200k_summarizer  # noqa: F401, E501
    from opencompass.configs.summarizers.needlebench import (
        needlebench_internal_32k_summarizer,
        needlebench_internal_100k_summarizer)

models = base_models

needlebench_internal_32k_summary_groups = needlebench_internal_32k_summarizer[
    'summary_groups']
needlebench_internal_100k_summary_groups = (
    needlebench_internal_100k_summarizer['summary_groups'])
needlebench_internal_200k_summary_groups = (
    needlebench_internal_200k_summarizer['summary_groups'])

datasets = [
    v[0] for k, v in locals().items()
    if k.endswith('_datasets') and isinstance(v, list) and len(v) > 0
]

for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
================================================
FILE: autotest/eval/eval_chat_longtext_fullbench.py
================================================
from mmengine.config import read_base
with read_base():
    from autotest.eval.models import models
    from opencompass.configs.datasets.babilong.babilong_256k_gen import \
        babiLong_256k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.longbench.longbench import \
        longbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.needlebench.needlebench_128k.needlebench_128k import \
        needlebench_datasets as needlebench_128k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.ruler.ruler_128k_gen import \
        ruler_datasets as ruler_128k_datasets  # noqa: F401, E501
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat_1m import \
        models as lmdeploy_internlm2_5_7b_chat_1m_model  # noqa: F401, E501
    # Summary Groups
    from opencompass.configs.summarizers.groups.babilong import \
        babilong_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.longbench import \
        longbench_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.groups.ruler import \
        ruler_summary_groups  # noqa: F401, E501
    from opencompass.configs.summarizers.needlebench import \
        needlebench_128k_summarizer  # noqa: F401, E501
models = models
datasets = [
    v[0] for k, v in locals().items()
    if k.endswith('_datasets') and isinstance(v, list) and len(v) > 0
]
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
================================================
FILE: autotest/eval/eval_chat_obj_fullbench_other.py
================================================
from mmengine.config import read_base
with read_base():
    # Datasets
    from autotest.eval.models import judge_models, models
    from opencompass.configs.chatml_datasets.C_MHChem.C_MHChem_gen import \
        datasets as C_MHChem_chatml_datasets  # noqa: F401, E501
    from opencompass.configs.chatml_datasets.CPsyExam.CPsyExam_gen import \
        datasets as CPsyExam_chatml_datasets  # noqa: F401, E501
    from opencompass.configs.chatml_datasets.MaScQA.MaScQA_gen import \
        datasets as MaScQA_chatml_datasets  # noqa: F401, E501
    from opencompass.configs.chatml_datasets.UGPhysics.UGPhysics_gen import \
        datasets as UGPhysics_chatml_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.eese.eese_llm_judge_gen import \
        eese_datasets  # noqa: F401, E501
models = models
chatml_datasets = [
    v[0] for k, v in locals().items()
    if k.endswith('_chatml_datasets') and isinstance(v, list) and len(v) > 0
]
datasets = [eese_datasets[0]]
for d in chatml_datasets:
    d['test_range'] = '[0:4]'
for d in datasets:
    if 'reader_cfg' in d:
        d['reader_cfg']['test_range'] = '[0:4]'
    else:
        d['test_range'] = '[0:4]'
    if 'eval_cfg' in d and 'dataset_cfg' in d['eval_cfg'][
            'evaluator'] and 'reader_cfg' in d['eval_cfg']['evaluator'][
                'dataset_cfg']:
        d['eval_cfg']['evaluator']['dataset_cfg']['reader_cfg'][
            'test_range'] = '[0:4]'
    if 'eval_cfg' in d and 'llm_evaluator' in d['eval_cfg'][
            'evaluator'] and 'dataset_cfg' in d['eval_cfg']['evaluator'][
                'llm_evaluator']:
        d['eval_cfg']['evaluator']['llm_evaluator']['dataset_cfg'][
            'reader_cfg']['test_range'] = '[0:4]'
obj_judge_model = judge_models[0]
for d in datasets:
    if 'eval_cfg' in d and 'evaluator' in d['eval_cfg']:
        if 'judge_cfg' in d['eval_cfg']['evaluator']:
            d['eval_cfg']['evaluator']['judge_cfg'] = obj_judge_model
        if 'llm_evaluator' in d['eval_cfg']['evaluator'] and 'judge_cfg' in d[
                'eval_cfg']['evaluator']['llm_evaluator']:
            d['eval_cfg']['evaluator']['llm_evaluator'][
                'judge_cfg'] = obj_judge_model
for d in chatml_datasets:
    if 'judge_cfg' in d['evaluator']:
        d['evaluator']['judge_cfg'] = obj_judge_model
    if 'llm_evaluator' in d['evaluator'] and 'judge_cfg' in d['evaluator'][
            'llm_evaluator']:
        d['evaluator']['llm_evaluator']['judge_cfg'] = obj_judge_model
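# Illustrative sketch, not part of the original config: the judge override
# above, folded into one helper. The helper name is hypothetical. It assumes
# an evaluator is a plain dict that may hold 'judge_cfg' at the top level
# and/or inside a nested 'llm_evaluator' dict; the isinstance guard is an
# extra safety check that the loops above do not perform.
def _set_judge_sketch(evaluator, judge):
    """Overwrite every existing 'judge_cfg' slot in an evaluator dict."""
    if 'judge_cfg' in evaluator:
        evaluator['judge_cfg'] = judge
    nested = evaluator.get('llm_evaluator')
    if isinstance(nested, dict) and 'judge_cfg' in nested:
        nested['judge_cfg'] = judge
    # Slots that were absent are left absent; nothing new is created.
    return evaluator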
================================================
FILE: autotest/eval/eval_chat_obj_fullbench_v5.py
================================================
from mmengine.config import read_base
with read_base():
# read hf models - chat models
# Dataset
from autotest.eval.models import models
from opencompass.configs.datasets.aime2024.aime2024_gen_6e39a4 import \
aime2024_datasets # noqa: F401, E501
from opencompass.configs.datasets.ARC_c.ARC_c_cot_gen_926652 import \
ARC_c_datasets # noqa: F401, E501
# remove because of oom
# from opencompass.configs.datasets.ARC_Prize_Public_Evaluation.arc_prize_public_evaluation_gen_872059 import arc_prize_public_evaluation_datasets # noqa: F401, E501
from opencompass.configs.datasets.bbh.bbh_gen_5b92b0 import \
bbh_datasets # noqa: F401, E501
# from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_complete_gen_faf748 import \ # noqa: F401, E501
# bigcodebench_hard_complete_datasets # noqa: F401, E501
# from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_instruct_gen_8815eb import \ # noqa: F401, E501
# bigcodebench_hard_instruct_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmmlu.cmmlu_0shot_cot_gen_305931 import \
cmmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmo_fib.cmo_fib_gen_ace24b import \
cmo_fib_datasets # noqa: F401, E501
from opencompass.configs.datasets.drop.drop_openai_simple_evals_gen_3857b0 import \
drop_datasets # noqa: F401, E501
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_4c31db import \
GaokaoBench_datasets # noqa: F401, E501
from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
gpqa_datasets # noqa: F401, E501
# new datasets in Fullbench v1.1
from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_6e39a4 import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import \
hellaswag_datasets # noqa: F401, E501
from opencompass.configs.datasets.humaneval.humaneval_openai_sample_evals_gen_dcae0e import \
humaneval_datasets # noqa: F401, E501
from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
ifeval_datasets # noqa: F401, E501
from opencompass.configs.datasets.korbench.korbench_single_0_shot_gen import \
korbench_0shot_single_datasets # noqa: F401, E501
from opencompass.configs.datasets.livecodebench.livecodebench_gen_b2b0fd import \
LCB_datasets # noqa: F401, E501
from opencompass.configs.datasets.math.math_0shot_gen_11c4b5 import \
math_datasets # noqa: F401, E501
from opencompass.configs.datasets.MathBench.mathbench_2024_gen_50a320 import \
mathbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.mbpp.sanitized_mbpp_mdblock_gen_a447ff import \
sanitized_mbpp_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu.mmlu_openai_simple_evals_gen_b618ea import \
mmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
mmlu_pro_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmmlu_lite.mmmlu_lite_gen_c51a84 import \
mmmlu_lite_datasets # noqa: F401, E501
from opencompass.configs.datasets.musr.musr_gen_3622bb import \
musr_datasets # noqa: F401, E501
from opencompass.configs.datasets.nq.nq_open_1shot_gen_2e45e5 import \
nq_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_cot_gen_d95929 import \
race_datasets # noqa: F401, E501
from opencompass.configs.datasets.scicode.scicode_gen_085b98 import \
SciCode_datasets # noqa: F401, E501
from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_cot_gen_1d56df import \
BoolQ_datasets # noqa: F401, E501
from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import \
TheoremQA_datasets # noqa: F401, E501
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_bc5f21 import \
triviaqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.wikibench.wikibench_gen_0978ad import \
wikibench_datasets # noqa: F401, E501
# Summary Groups
from opencompass.configs.summarizers.groups.bbh import \
bbh_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.GaokaoBench import \
GaokaoBench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.korbench import \
korbench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
mathbench_2024_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu import \
mmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.musr_average import \
summarizer as musr_summarizer # noqa: F401, E501
from opencompass.configs.summarizers.groups.scicode import \
scicode_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.mmmlu_lite import \
mmmlu_summary_groups # noqa: F401, E501
models = models
race_datasets = [race_datasets[1]]
bbh_datasets = [
x for x in bbh_datasets if 'logical_deduction_seven_objects' in x['abbr']
or 'multistep_arithmetic_two' in x['abbr']
]
cmmlu_datasets = [
x for x in cmmlu_datasets if x['abbr'].replace('cmmlu-', '') in [
'ancient_chinese', 'chinese_civil_service_exam',
'chinese_driving_rule', 'chinese_food_culture',
'chinese_foreign_policy', 'chinese_history', 'chinese_literature',
'chinese_teacher_qualification', 'construction_project_management',
'elementary_chinese', 'elementary_commonsense', 'ethnology',
'high_school_politics', 'modern_chinese',
'traditional_chinese_medicine'
]
]
mmlu_datasets = [
x for x in mmlu_datasets if x['abbr'].replace('lukaemon_mmlu_', '') in [
'business_ethics', 'clinical_knowledge', 'college_medicine',
'global_facts', 'human_aging', 'management', 'marketing',
'medical_genetics', 'miscellaneous', 'nutrition',
'professional_accounting', 'professional_medicine', 'virology'
]
]
mmlu_pro_datasets = [mmlu_pro_datasets[0]]
mmmlu_lite_datasets = [
x for x in mmmlu_lite_datasets if 'mmlu_lite_AR-XY' in x['abbr']
]
mathbench_datasets = [x for x in mathbench_datasets if 'college' in x['abbr']]
GaokaoBench_datasets = [
x for x in GaokaoBench_datasets if '2010-2022_Math_II_MCQs' in x['abbr']
or '2010-2022_Math_II_Fill-in-the-Blank' in x['abbr']
]
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')
and 'scicode' not in k.lower() and 'teval' not in k and 'human' not in k),
[],
)
datasets += humaneval_datasets
# datasets += SciCode_datasets
musr_summary_groups = musr_summarizer['summary_groups']
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], [])
summary_groups.append(
{
'name': 'Mathbench',
'subsets': ['mathbench-a (average)', 'mathbench-t (average)'],
}, )
# Summarizer
summarizer = dict(
dataset_abbrs=[
'Language',
['race-high', 'accuracy'],
['ARC-c', 'accuracy'],
['BoolQ', 'accuracy'],
['triviaqa_wiki_1shot', 'score'],
['nq_open_1shot', 'score'],
['mmmlu_lite', 'naive_average'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['drop', 'accuracy'],
['bbh', 'naive_average'],
['GPQA_diamond', 'accuracy'],
['hellaswag', 'accuracy'],
['TheoremQA', 'score'],
['musr_average', 'naive_average'],
['korbench_single', 'naive_average'],
['ARC_Prize_Public_Evaluation', 'accuracy'],
'',
'Math Calculation',
['gsm8k', 'accuracy'],
['GaokaoBench', 'weighted_average'],
['math', 'accuracy'],
['cmo_fib', 'accuracy'],
['aime2024', 'accuracy'],
['Mathbench', 'naive_average'],
'',
'Knowledge',
['wikibench-wiki-single_choice_cncircular', 'perf_4'],
['cmmlu', 'naive_average'],
['mmlu', 'naive_average'],
['mmlu_pro', 'naive_average'],
'',
'Code',
['openai_humaneval', 'humaneval_pass@1'],
['sanitized_mbpp', 'score'],
['humanevalx', 'naive_average'],
['ds1000', 'naive_average'],
['lcb_code_generation', 'pass@1'],
['lcb_code_execution', 'pass@1'],
['lcb_test_output', 'pass@1'],
['bigcodebench_hard_instruct', 'pass@1'],
['bigcodebench_hard_complete', 'pass@1'],
'',
'Agent',
['teval', 'naive_average'],
['SciCode', 'accuracy'],
['SciCode', 'sub_accuracy'],
'',
'bbh-logical_deduction_seven_objects',
'bbh-multistep_arithmetic_two',
'',
'mmlu',
'mmlu-stem',
'mmlu-social-science',
'mmlu-humanities',
'mmlu-other',
'',
'cmmlu',
'cmmlu-stem',
'cmmlu-social-science',
'cmmlu-humanities',
'cmmlu-other',
'cmmlu-china-specific',
'',
'mmlu_pro',
'mmlu_pro_biology',
'mmlu_pro_business',
'mmlu_pro_chemistry',
'mmlu_pro_computer_science',
'mmlu_pro_economics',
'mmlu_pro_engineering',
'mmlu_pro_health',
'mmlu_pro_history',
'mmlu_pro_law',
'mmlu_pro_math',
'mmlu_pro_philosophy',
'mmlu_pro_physics',
'mmlu_pro_psychology',
'mmlu_pro_other',
'',
'ds1000_Pandas',
'ds1000_Numpy',
'ds1000_Tensorflow',
'ds1000_Scipy',
'ds1000_Sklearn',
'ds1000_Pytorch',
'ds1000_Matplotlib',
'',
'mmmlu_lite',
'openai_mmmlu_lite_AR-XY',
'openai_mmmlu_lite_BN-BD',
'openai_mmmlu_lite_DE-DE',
'openai_mmmlu_lite_ES-LA',
'openai_mmmlu_lite_FR-FR',
'openai_mmmlu_lite_HI-IN',
'openai_mmmlu_lite_ID-ID',
'openai_mmmlu_lite_IT-IT',
'openai_mmmlu_lite_JA-JP',
'openai_mmmlu_lite_KO-KR',
'openai_mmmlu_lite_PT-BR',
'openai_mmmlu_lite_SW-KE',
'openai_mmmlu_lite_YO-NG',
'openai_mmmlu_lite_ZH-CN',
'',
'###### MathBench-A: Application Part ######',
'college',
'high',
'middle',
'primary',
'arithmetic',
'mathbench-a (average)',
'###### MathBench-T: Theory Part ######',
'college_knowledge',
'high_knowledge',
'middle_knowledge',
'primary_knowledge',
'mathbench-t (average)',
],
summary_groups=summary_groups,
)
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
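# Illustrative sketch, not part of the original config: how sum(..., [])
# concatenates every '*_summary_groups' list into one flat list, as done for
# `summary_groups` above. The helper name is hypothetical and it works on a
# plain dict rather than locals().
def _merge_groups_sketch(namespace):
    """Concatenate all lists whose key ends with '_summary_groups'."""
    return sum(
        [v for k, v in namespace.items() if k.endswith('_summary_groups')],
        [],
    )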
================================================
FILE: autotest/eval/eval_chat_obj_fullbench_v6.py
================================================
from mmengine.config import read_base
with read_base():
from autotest.eval.models import judge_models, models
from opencompass.configs.datasets.aime2024.aime2024_llmjudge_gen_5e9f4f import \
aime2024_datasets # noqa: F401, E501
from opencompass.configs.datasets.aime2025.aime2025_llmjudge_gen_5e9f4f import \
aime2025_datasets # noqa: F401, E501
from opencompass.configs.datasets.ARC_Prize_Public_Evaluation.arc_prize_public_evaluation_gen_fedd04 import \
arc_prize_public_evaluation_datasets # noqa: F401, E501
from opencompass.configs.datasets.bbh.bbh_llmjudge_gen_b5bdf1 import \
bbh_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmo_fib.cmo_fib_gen_2783e5 import \
cmo_fib_datasets # noqa: F401, E501
# General Reasoning
from opencompass.configs.datasets.drop.drop_llmjudge_gen_3857b0 import \
drop_datasets # noqa: F401, E501
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_d16acb import \
GaokaoBench_datasets # noqa: F401, E501
from opencompass.configs.datasets.gpqa.gpqa_0shot_nocot_genericllmeval_gen_772ea0 import \
gpqa_datasets # noqa: F401, E501
# Math Calculation
from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_17d799 import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.hellaswag.hellaswag_llmjudge_gen_809ef1 import \
hellaswag_datasets # noqa: F401, E501
from opencompass.configs.datasets.korbench.korbench_llmjudge_gen_56cf43 import \
korbench_0shot_single_datasets # noqa: F401, E501
from opencompass.configs.datasets.math.math_500_llmjudge_gen_6ff468 import \
math_datasets # noqa: F401, E501
from opencompass.configs.datasets.MathBench.mathbench_2024_gen_4b8f28 import \
mathbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.musr.musr_llmjudge_gen_b47fd3 import \
musr_datasets # noqa: F401, E501
from opencompass.configs.datasets.supergpqa.supergpqa_llmjudge_gen_12b8bc import \
supergpqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.teval.teval_en_gen_1ac254 import \
teval_datasets as teval_en_datasets # noqa: F401, E501
from opencompass.configs.datasets.teval.teval_zh_gen_1ac254 import \
teval_datasets as teval_zh_datasets # noqa: F401, E501
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_c87d61 import \
triviaqa_datasets # noqa: F401, E501
# Summary Groups
from opencompass.configs.summarizers.groups.bbeh import \
bbeh_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.bbh import \
bbh_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.GaokaoBench import \
GaokaoBench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.korbench import \
korbench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
mathbench_2024_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu import \
mmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.musr_average import \
summarizer as musr_summarizer
from opencompass.configs.summarizers.groups.teval import \
teval_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.mmmlu_lite import \
mmmlu_summary_groups # noqa: F401, E501
models = models
datasets = [
v[0] for k, v in locals().items() if k.endswith('_datasets')
and 'scicode' not in k.lower() and 'teval' not in k.lower()
and 'arc_prize' not in k.lower() and isinstance(v, list) and len(v) > 0
]
datasets += arc_prize_public_evaluation_datasets
datasets += teval_en_datasets
datasets += teval_zh_datasets
musr_summary_groups = musr_summarizer['summary_groups']
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], [])
summary_groups.append(
{
'name': 'Mathbench',
'subsets': ['mathbench-a (average)', 'mathbench-t (average)'],
}, )
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
    if 'dataset_cfg' in d['eval_cfg']['evaluator'] and 'reader_cfg' in d[
            'eval_cfg']['evaluator']['dataset_cfg']:
        d['eval_cfg']['evaluator']['dataset_cfg']['reader_cfg'][
            'test_range'] = '[0:4]'
    if 'llm_evaluator' in d['eval_cfg']['evaluator'] and 'dataset_cfg' in d[
            'eval_cfg']['evaluator']['llm_evaluator']:
        d['eval_cfg']['evaluator']['llm_evaluator']['dataset_cfg'][
            'reader_cfg']['test_range'] = '[0:4]'
obj_judge_model = judge_models[0]
for d in datasets:
    if 'judge_cfg' in d['eval_cfg']['evaluator']:
        d['eval_cfg']['evaluator']['judge_cfg'] = obj_judge_model
    if 'llm_evaluator' in d['eval_cfg']['evaluator'] and 'judge_cfg' in d[
            'eval_cfg']['evaluator']['llm_evaluator']:
        d['eval_cfg']['evaluator']['llm_evaluator'][
            'judge_cfg'] = obj_judge_model
================================================
FILE: autotest/eval/eval_chat_obj_fullbench_v7.py
================================================
from mmengine.config import read_base
with read_base():
# Datasets
# Instruction Following
# Math Calculation
from autotest.eval.models import judge_models, models
from opencompass.configs.datasets.aime2024.aime2024_cascade_eval_gen_5e9f4f import \
aime2024_datasets # noqa: F401, E501
from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import \
aime2025_datasets # noqa: F401, E501
# # # General Reasoning
from opencompass.configs.datasets.bbeh.bbeh_llmjudge_gen_86c3a0 import \
bbeh_datasets # noqa: F401, E501
from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_complete_gen_2888d3 import \
bigcodebench_hard_complete_datasets # noqa: F401, E501
from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_instruct_gen_c3d5ad import \
bigcodebench_hard_instruct_datasets # noqa: F401, E501
from opencompass.configs.datasets.chem_exam.competition_gen import \
chem_competition_instruct_datasets # noqa: F401, E501
from opencompass.configs.datasets.chem_exam.gaokao_gen import \
chem_gaokao_instruct_datasets # noqa: F401, E501
from opencompass.configs.datasets.ChemBench.ChemBench_llmjudge_gen_c584cf import \
chembench_datasets # noqa: F401, E501
from opencompass.configs.datasets.ClimaQA.ClimaQA_Gold_llm_judge_gen_f15343 import \
climaqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmmlu.cmmlu_llmjudge_gen_e1cd9a import \
cmmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.Earth_Silver.Earth_Silver_llmjudge_gen import \
earth_silver_mcq_datasets # noqa: F401, E501
from opencompass.configs.datasets.gpqa.gpqa_cascade_eval_gen_772ea0 import \
gpqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.HLE.hle_llmverify_gen_6ff468 import \
hle_datasets # noqa: F401, E501
# # Coding
from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
ifeval_datasets # noqa: F401, E501
from opencompass.configs.datasets.kcle.kcle_llm_judge_gen import \
kcle_datasets # noqa: F401, E501
from opencompass.configs.datasets.korbench.korbench_single_0shot_cascade_eval_gen_56cf43 import \
korbench_0shot_single_datasets # noqa: F401, E501
from opencompass.configs.datasets.livecodebench.livecodebench_gen_a4f90b import \
LCBCodeGeneration_dataset # noqa: F401, E501
from opencompass.configs.datasets.livemathbench.livemathbench_hard_custom_cascade_eval_gen_4bce59 import \
livemathbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.matbench.matbench_llm_judge_gen_0e9276 import \
matbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.math.math_500_cascade_eval_gen_6ff468 import \
math_datasets # noqa: F401, E501
from opencompass.configs.datasets.mbpp.sanitized_mbpp_mdblock_gen_a447ff import \
sanitized_mbpp_datasets # noqa: F401, E501
from opencompass.configs.datasets.MedXpertQA.MedXpertQA_llmjudge_gen import \
medxpertqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu.mmlu_llmjudge_gen_f4336b import \
mmlu_datasets # noqa: F401, E501
# # # Knowledge
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_nocot_genericllmeval_gen_08c1de import \
mmlu_pro_datasets # noqa: F401, E501
from opencompass.configs.datasets.OlymMATH.olymmath_llmverify_gen_97b203 import \
olymmath_datasets # noqa: F401, E501
from opencompass.configs.datasets.OlympiadBench.OlympiadBench_0shot_llmverify_gen_be8b13 import \
olympiadbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.PHYBench.phybench_gen import \
phybench_datasets # noqa: F401, E501
from opencompass.configs.datasets.PHYSICS.PHYSICS_llm_judge_gen_a133a2 import \
physics_datasets # noqa: F401, E501
from opencompass.configs.datasets.ProteinLMBench.ProteinLMBench_llmjudge_gen_a67965 import \
proteinlmbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.R_Bench.rbench_llmjudge_gen_c89350 import \
RBench_datasets # noqa: F401, E501
# # Academic
from opencompass.configs.datasets.SmolInstruct.smolinstruct_0shot_instruct_gen import \
smolinstruct_datasets_0shot_instruct as \
smolinstruct_datasets # noqa: F401, E501
from opencompass.configs.datasets.srbench.srbench_gen import \
srbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.supergpqa.supergpqa_cascade_gen_1545c1 import \
supergpqa_datasets # noqa: F401, E501
# Summary Groups
from opencompass.configs.summarizers.groups.bbeh import \
bbeh_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.korbench import \
korbench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu import \
mmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.OlympiadBench import \
OlympiadBenchPhysics_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.OlympiadBench import ( # noqa: F401, E501
OlympiadBench_summary_groups, OlympiadBenchMath_summary_groups)
from opencompass.configs.summarizers.groups.PHYSICS import \
physics_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.supergpqa import \
supergpqa_summary_groups # noqa: F401, E501
models = models
# Add latest LCB version
LCBCodeGeneration_v6_datasets = LCBCodeGeneration_dataset
LCBCodeGeneration_v6_datasets['abbr'] = 'lcb_code_generation_v6'
LCBCodeGeneration_v6_datasets['release_version'] = 'v6'
LCBCodeGeneration_v6_datasets['eval_cfg']['evaluator'][
'release_version'] = 'v6'
LCBCodeGeneration_v6_datasets = [LCBCodeGeneration_v6_datasets]
repeated_info = [
(math_datasets, 1),
(gpqa_datasets, 1),
(aime2024_datasets, 1),
(aime2025_datasets, 1),
(olympiadbench_datasets, 1),
(livemathbench_datasets, 1),
(olymmath_datasets, 1),
(korbench_0shot_single_datasets, 1),
]
for datasets_, num in repeated_info:
    for dataset_ in datasets_:
        dataset_['n'] = num
        dataset_['k'] = num
datasets = [
v[0] for k, v in locals().items()
if k.endswith('_datasets') and 'bigcode' not in k.lower()
and 'humaneval' not in k.lower() and isinstance(v, list) and len(v) > 0
]
datasets += bigcodebench_hard_instruct_datasets
datasets += bigcodebench_hard_complete_datasets
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
    if 'dataset_cfg' in d['eval_cfg']['evaluator'] and 'reader_cfg' in d[
            'eval_cfg']['evaluator']['dataset_cfg']:
        d['eval_cfg']['evaluator']['dataset_cfg']['reader_cfg'][
            'test_range'] = '[0:4]'
    if 'llm_evaluator' in d['eval_cfg']['evaluator'] and 'dataset_cfg' in d[
            'eval_cfg']['evaluator']['llm_evaluator']:
        d['eval_cfg']['evaluator']['llm_evaluator']['dataset_cfg'][
            'reader_cfg']['test_range'] = '[0:4]'
obj_judge_model = judge_models[0]
for d in datasets:
    if 'judge_cfg' in d['eval_cfg']['evaluator']:
        d['eval_cfg']['evaluator']['judge_cfg'] = obj_judge_model
    if 'llm_evaluator' in d['eval_cfg']['evaluator'] and 'judge_cfg' in d[
            'eval_cfg']['evaluator']['llm_evaluator']:
        d['eval_cfg']['evaluator']['llm_evaluator'][
            'judge_cfg'] = obj_judge_model
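# Illustrative sketch, not part of the original config: the `repeated_info`
# loop earlier in this file stamps one number into both the 'n' and 'k' keys
# of every dataset in a group. The helper name is hypothetical, and it
# assumes each group is a list of plain dicts.
def _apply_repeats_sketch(repeated_info):
    """Set 'n' and 'k' on every dataset of each (group, num) pair."""
    for group, num in repeated_info:
        for dataset in group:
            dataset['n'] = num
            dataset['k'] = num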
================================================
FILE: autotest/eval/eval_chat_obj_fullbench_v8.py
================================================
from mmengine.config import read_base
with read_base():
    # Datasets
    from autotest.eval.models import judge_models, models
    from opencompass.configs.datasets.atlas.atlas_val_gen_b2d1b6 import \
        atlas_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.biodata.biodata_task_gen import \
        biodata_task_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.CMPhysBench.cmphysbench_gen import \
        cmphysbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.IFBench.IFBench_gen import \
        ifbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.livecodebench_pro.livecodebench_pro_gen import \
        lcb_pro_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.MolInstructions_chem.mol_instructions_chem_gen import \
        mol_gen_selfies_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.openswi.openswi_gen import \
        openswi_datasets  # noqa: F401, E501
models = models
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
for d in datasets:
    if 'n' in d:
        d['n'] = 1
    if 'reader_cfg' in d:
        d['reader_cfg']['test_range'] = '[0:4]'
    else:
        d['test_range'] = '[0:4]'
    if 'eval_cfg' in d and 'dataset_cfg' in d['eval_cfg'][
            'evaluator'] and 'reader_cfg' in d['eval_cfg']['evaluator'][
                'dataset_cfg']:
        d['eval_cfg']['evaluator']['dataset_cfg']['reader_cfg'][
            'test_range'] = '[0:4]'
    if 'eval_cfg' in d and 'llm_evaluator' in d['eval_cfg'][
            'evaluator'] and 'dataset_cfg' in d['eval_cfg']['evaluator'][
                'llm_evaluator']:
        d['eval_cfg']['evaluator']['llm_evaluator']['dataset_cfg'][
            'reader_cfg']['test_range'] = '[0:4]'
obj_judge_model = judge_models[0]
for d in datasets:
    if 'eval_cfg' in d and 'evaluator' in d['eval_cfg']:
        if 'atlas' in d['abbr'] and 'judge_cfg' in d['eval_cfg']['evaluator']:
            d['eval_cfg']['evaluator']['judge_cfg'] = dict(
                judgers=[obj_judge_model])
        elif 'judge_cfg' in d['eval_cfg']['evaluator']:
            d['eval_cfg']['evaluator']['judge_cfg'] = obj_judge_model
        elif 'llm_evaluator' in d['eval_cfg'][
                'evaluator'] and 'judge_cfg' in d[  # noqa
                    'eval_cfg']['evaluator']['llm_evaluator']:  # noqa
            d['eval_cfg']['evaluator']['llm_evaluator'][
                'judge_cfg'] = obj_judge_model
================================================
FILE: autotest/eval/eval_chat_obj_v8.py
================================================
from mmengine.config import read_base
with read_base():
    # Datasets
    from autotest.eval.models import judge_models, test_models
    from opencompass.configs.datasets.aime2026.aime2026_cascade_eval_gen_6ff468 import \
        aime2026_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.biodata.biodata_task_gen import \
        biodata_task_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.hmmt2026.hmmt2026_cascade_eval_gen_6ff468 import \
        hmmt2026_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.MolInstructions_chem.mol_instructions_chem_gen import \
        mol_gen_selfies_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.SciReasoner.scireasoner_gen import (  # noqa: F401, E501
        mini_bio_instruction_datasets, mini_composition_material_datasets,
        mini_GUE_datasets, mini_LLM4Mat_datasets,
        mini_modulus_material_datasets, mini_mol_biotext_datasets,
        mini_mol_mol_datasets, mini_mol_protein_datasets, mini_opi_datasets,
        mini_PEER_datasets, mini_Retrosynthesis_uspto50k_datasets,
        mini_smol_datasets, mini_UMG_Datasets, mini_uncond_material_datasets,
        mini_uncond_protein_datasets, mini_uncond_RNA_datasets)
models = test_models
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
obj_judge_model = judge_models[0]
for d in datasets:
    if 'eval_cfg' in d and 'evaluator' in d['eval_cfg']:
        if 'atlas' in d['abbr'] and 'judge_cfg' in d['eval_cfg']['evaluator']:
            d['eval_cfg']['evaluator']['judge_cfg'] = dict(
                judgers=[obj_judge_model])
        elif 'judge_cfg' in d['eval_cfg']['evaluator']:
            d['eval_cfg']['evaluator']['judge_cfg'] = obj_judge_model
        elif 'llm_evaluator' in d['eval_cfg'][
                'evaluator'] and 'judge_cfg' in d[  # noqa
                    'eval_cfg']['evaluator']['llm_evaluator']:  # noqa
            d['eval_cfg']['evaluator']['llm_evaluator'][
                'judge_cfg'] = obj_judge_model
================================================
FILE: autotest/eval/eval_chat_sub_fullbench.py
================================================
from mmengine.config import read_base
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import DefaultSubjectiveSummarizer
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
with read_base():
    # read hf models - chat models
    # Dataset
    from autotest.eval.models import judge_models, models
    from opencompass.configs.datasets.chinese_simpleqa.chinese_simpleqa_gen import \
        csimpleqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.SimpleQA.simpleqa_gen_0283c3 import \
        simpleqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.alignbench.alignbench_v1_1_judgeby_critiquellm_new import \
        alignbench_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4_new import \
        alpacav2_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare_new import \
        arenahard_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.compassarena.compassarena_compare_new import \
        compassarena_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.followbench.followbench_llmeval_new import \
        followbench_llmeval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.multiround.mtbench101_judge_new import \
        mtbench101_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.subjective.wildbench.wildbench_pair_judge_new import \
        wildbench_datasets  # noqa: F401, E501
models = models
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')
and 'mtbench101' not in k and 'wildbench' not in k), [])
datasets += mtbench101_datasets # noqa: F401, E501
datasets += wildbench_datasets # noqa: F401, E501
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
judge_models = judge_models
eval = dict(
partitioner=dict(type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
summary_groups = []
summary_groups.append({
'name': 'compassarena_language',
'subsets': [
['compassarena_language', '内容总结'],
],
})
summary_groups.append({
'name': 'compassarena_knowledge',
'subsets': [
['compassarena_knowledge', '生活常识_ZH'],
],
})
summary_groups.append({
'name': 'compassarena_reason_v2',
'subsets': [
['compassarena_reason_v2', 'reasoning'],
],
})
summary_groups.append({
'name': 'compassarena_math_v2',
'subsets': [
['compassarena_math_v2', '高等数学_ZH'],
],
})
summary_groups.append({
'name': 'compassarena_creationv2_zh',
'subsets': [
['compassarena_creationv2_zh', '内容扩写_ZH'],
],
})
summary_groups.append({
'name':
'CompassArena',
'subsets': [
'compassarena_language',
'compassarena_knowledge',
'compassarena_reason_v2',
'compassarena_math_v2',
'compassarena_creationv2_zh',
],
})
summary_groups.append({
'name':
'FoFo',
'subsets': [['fofo_test_prompts', 'overall'],
['fofo_test_prompts_cn', 'overall']],
})
summary_groups.append({
'name':
'Followbench',
'subsets': [
['followbench_llmeval_en', 'HSR_AVG'],
['followbench_llmeval_en', 'SSR_AVG'],
],
})
# Summarizer
summarizer = dict(
dataset_abbrs=[
['alignment_bench_v1_1', '总分'],
['alpaca_eval', 'total'],
['arenahard', 'score'],
['Followbench', 'naive_average'],
['CompassArena', 'naive_average'],
['FoFo', 'naive_average'],
['mtbench101', 'avg'],
['wildbench', 'average'],
['simpleqa', 'accuracy_given_attempted'],
['chinese_simpleqa', 'given_attempted_accuracy'],
'',
['alignment_bench_v1_1', '专业能力'],
['alignment_bench_v1_1', '数学计算'],
['alignment_bench_v1_1', '基本任务'],
['alignment_bench_v1_1', '逻辑推理'],
['alignment_bench_v1_1', '中文理解'],
['alignment_bench_v1_1', '文本写作'],
['alignment_bench_v1_1', '角色扮演'],
['alignment_bench_v1_1', '综合问答'],
['alpaca_eval', 'helpful_base'],
['alpaca_eval', 'koala'],
['alpaca_eval', 'oasst'],
['alpaca_eval', 'selfinstruct'],
['alpaca_eval', 'vicuna'],
['compassarena_language', 'naive_average'],
['compassarena_knowledge', 'naive_average'],
['compassarena_reason_v2', 'naive_average'],
['compassarena_math_v2', 'naive_average'],
['compassarena_creationv2_zh', 'naive_average'],
['fofo_test_prompts', 'overall'],
['fofo_test_prompts_cn', 'overall'],
['followbench_llmeval_en', 'HSR_AVG'],
['followbench_llmeval_en', 'SSR_AVG'],
['followbench_llmeval_en', 'HSR_L1'],
['followbench_llmeval_en', 'HSR_L2'],
['followbench_llmeval_en', 'HSR_L3'],
['followbench_llmeval_en', 'HSR_L4'],
['followbench_llmeval_en', 'HSR_L5'],
['followbench_llmeval_en', 'SSR_L1'],
['followbench_llmeval_en', 'SSR_L2'],
['followbench_llmeval_en', 'SSR_L3'],
['followbench_llmeval_en', 'SSR_L4'],
['followbench_llmeval_en', 'SSR_L5'],
['simpleqa', 'f1'],
],
type=DefaultSubjectiveSummarizer,
summary_groups=summary_groups,
)
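# Illustrative note (not part of the original config): several summary groups
# above are reported through the 'naive_average' metric. The sketch below shows
# what that aggregation amounts to, assuming 'naive_average' is the unweighted
# mean of a group's subset scores. The helper name and the scores are
# hypothetical and exist only for this example.

```python
from statistics import mean


def naive_average(subset_scores):
    """Return the unweighted mean of the per-subset scores in a group."""
    if not subset_scores:
        raise ValueError('a summary group needs at least one subset score')
    return mean(list(subset_scores.values()))


# Hypothetical scores keyed by (dataset_abbr, metric), mirroring the shape of
# the 'Followbench' group defined above.
followbench_scores = {
    ('followbench_llmeval_en', 'HSR_AVG'): 50.0,
    ('followbench_llmeval_en', 'SSR_AVG'): 70.0,
}
followbench_group_score = naive_average(followbench_scores)
```

# Under this assumption each subset contributes equally, regardless of how many
# samples it contains.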
================================================
FILE: autotest/eval/models.py
================================================
from opencompass.models import TurboMindModel, TurboMindModelwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
models = [
dict(type=TurboMindModelwithChatTemplate,
abbr='qwen3-8b-fullbench',
path='Qwen/Qwen3-8B',
engine_config=dict(session_len=32768, max_batch_size=1, tp=1),
gen_config=dict(do_sample=False, enable_thinking=True),
max_seq_len=32768,
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
]
test_models = [
dict(type=TurboMindModelwithChatTemplate,
abbr='test_model',
path='intern/Intern-S1-Pro',
engine_config=dict(session_len=32768, max_batch_size=1, tp=16),
gen_config=dict(do_sample=False, enable_thinking=True),
max_seq_len=32768,
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=16),
pred_postprocessor=dict(type=extract_non_reasoning_content))
]
judge_models = [
dict(type=TurboMindModelwithChatTemplate,
abbr='qwen3-8b-fullbench',
path='Qwen/Qwen3-8B',
engine_config=dict(session_len=46000, max_batch_size=1, tp=1),
gen_config=dict(do_sample=False, enable_thinking=True),
max_seq_len=46000,
max_out_len=46000,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
]
base_models = [
dict(
type=TurboMindModel,
abbr='qwen3-8b-base-fullbench',
path='Qwen/Qwen3-8B-Base',
engine_config=dict(session_len=32768, max_batch_size=1, tp=1),
gen_config=dict(top_k=1,
temperature=1e-6,
top_p=0.9,
max_new_tokens=1024),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1),
)
]
================================================
FILE: autotest/model/__init__.py
================================================
"""OpenCompass inference test configurations."""
__all__ = []
================================================
FILE: autotest/model/base_datasets.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.gpqa.gpqa_few_shot_ppl_4b5a83 import \
        gpqa_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import \
        gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.infinitebench.infinitebenchretrievepasskey.infinitebench_retrievepasskey_gen import \
        InfiniteBench_retrievepasskey_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_few_shot_gen_bfaf90 import \
        mmlu_pro_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
        winogrande_datasets  # noqa: F401, E501
# humaneval_datasets = [humaneval_datasets[0]]
mmlu_pro_datasets = [mmlu_pro_datasets[0]]
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
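# Illustrative note (not part of the original config): the last lines of this
# file gather every list whose variable name ends with '_datasets' and then
# restrict each dataset to its first four samples. The sketch below reproduces
# that pattern on plain dicts. The helper name and the sample entries are
# hypothetical stand-ins, not real OpenCompass dataset configs.

```python
def collect_datasets(namespace):
    """Concatenate every list whose variable name ends with '_datasets'."""
    return sum((v for k, v in namespace.items() if k.endswith('_datasets')),
               [])


# Hypothetical namespace, standing in for locals() in a config file.
namespace = {
    'gsm8k_datasets': [{'abbr': 'gsm8k', 'reader_cfg': {}}],
    'gpqa_datasets': [
        {'abbr': 'gpqa_a', 'reader_cfg': {}},
        {'abbr': 'gpqa_b', 'reader_cfg': {}},
    ],
    'models': ['ignored because the name does not end with _datasets'],
}

datasets = collect_datasets(namespace)
for d in datasets:
    # Keep only the first four samples so the smoke test stays fast.
    d['reader_cfg']['test_range'] = '[0:4]'
```

# Because the selection is name-based, any variable that should not be
# evaluated must avoid the '_datasets' suffix.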
================================================
FILE: autotest/model/chat_datasets.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import \
        aime2025_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
        gsm8k_datasets  # noqa: F401, E501
    # from opencompass.configs.datasets.mmlu_pro.mmlu_pro_gen import \
    #     mmlu_pro_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.HLE.hle_gen import \
        hle_datasets  # noqa: F401, E501
    # from opencompass.configs.datasets.humaneval.humaneval_gen import \
    #     humaneval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.IFEval.IFEval_gen import \
        ifeval_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.infinitebench.infinitebenchretrievepasskey.infinitebench_retrievepasskey_gen import \
        InfiniteBench_retrievepasskey_datasets  # noqa: F401, E501
# humaneval_datasets = [humaneval_datasets[0]]
ifeval_datasets = [ifeval_datasets[0]]
# mmlu_pro_datasets = [mmlu_pro_datasets[0]]
hle_datasets = [hle_datasets[0]]
aime2025_datasets = [aime2025_datasets[0]]
aime2025_datasets[0]['n'] = 2
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:4]'
    if 'dataset_cfg' in d['eval_cfg']['evaluator'] and 'reader_cfg' in d[
            'eval_cfg']['evaluator']['dataset_cfg']:
        d['eval_cfg']['evaluator']['dataset_cfg']['reader_cfg'][
            'test_range'] = '[0:4]'
    if 'llm_evaluator' in d['eval_cfg']['evaluator'] and 'dataset_cfg' in d[
            'eval_cfg']['evaluator']['llm_evaluator']:
        d['eval_cfg']['evaluator']['llm_evaluator']['dataset_cfg'][
            'reader_cfg']['test_range'] = '[0:4]'
================================================
FILE: autotest/model/constant.py
================================================
meta_template = dict(
begin=dict(
role='SYSTEM',
api_role='SYSTEM',
prompt='''
Your answers should be full of happy and lovely tone. Answer the question simply and clearly. Don\'t use any abbreviations and don\'t use any punctuation. Don\'t think too much.''', # noqa
),
round=[ # noqa
dict(role='HUMAN', api_role='HUMAN', prompt='{input}'),
dict(role='BOT', api_role='BOT', generate=True),
])
================================================
FILE: autotest/model/infer_api.py
================================================
from mmengine.config import read_base
from opencompass.models.openai_api import OpenAISDK
from opencompass.models.openai_streaming import OpenAISDKStreaming
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
    from autotest.model.chat_datasets import datasets
    from autotest.model.constant import meta_template as test_meta_template
datasets = datasets
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
API_BASE = 'http://localhost:23333/v1'
MODEL_PATH = 'Qwen/Qwen3-8B'
TOKENIZER_PATH = 'Qwen/Qwen3-8B'
BASE_API = dict(
type=OpenAISDK,
key='EMPTY',
openai_api_base=API_BASE,
path=MODEL_PATH,
tokenizer_path=TOKENIZER_PATH,
rpm_verbose=True,
meta_template=api_meta_template,
query_per_second=128,
batch_size=128,
retry=20,
pred_postprocessor=dict(type=extract_non_reasoning_content),
)
BASE_STREAMING = dict(
type=OpenAISDKStreaming,
key='EMPTY',
openai_api_base=API_BASE,
path=MODEL_PATH,
tokenizer_path=TOKENIZER_PATH,
rpm_verbose=True,
meta_template=api_meta_template,
query_per_second=128,
batch_size=128,
stream=True,
retry=20,
pred_postprocessor=dict(type=extract_non_reasoning_content),
)
API_BASIC = dict(
**BASE_API,
abbr='lmdeploy-api-test',
max_out_len=1024,
max_seq_len=4096,
)
API_STREAMING = dict(
**BASE_STREAMING,
abbr='lmdeploy-api-streaming-test',
max_out_len=1024,
max_seq_len=4096,
)
API_STREAMING_CHUNK = dict(
**BASE_STREAMING,
abbr='lmdeploy-api-streaming-test-chunk',
max_out_len=1024,
max_seq_len=4096,
stream_chunk_size=10,
verbose=True,
)
API_MAXLEN = dict(
**BASE_API,
abbr='lmdeploy-api-test-maxlen',
max_out_len=4096,
max_seq_len=4096,
)
API_MAXLEN_MID = dict(
**BASE_API,
abbr='lmdeploy-api-test-maxlen-mid',
max_out_len=3896,
max_seq_len=4096,
mode='mid',
)
API_NOTHINK = dict(
**BASE_API,
abbr='lmdeploy-api-test-nothink',
max_out_len=4096,
max_seq_len=4096,
extra_body={'enable_thinking': False},
)
API_IGNORE_EOS = dict(
**BASE_API,
abbr='lmdeploy-api-test-ignore-eos',
max_out_len=128,
max_seq_len=4096,
extra_body={
'ignore_eos': True,
},
)
API_CHAT_TEMPLATE = dict(
**BASE_API,
abbr='lmdeploy-api-test-chat-template',
max_out_len=1024,
max_seq_len=1024,
extra_body={'enable_thinking': False},
)
API_CHAT_TEMPLATE['meta_template'] = test_meta_template
API_OPENAI_STOP = dict(
**BASE_API,
abbr='lmdeploy-api-test-openai-stop',
max_out_len=512,
max_seq_len=4096,
openai_extra_kwargs=dict(
stop=[' and', '', ' to', '\n\n', 'Question:', 'Answer:'], ),
)
API_OPENAI_LOGPROBS = dict(
**BASE_API,
abbr='lmdeploy-api-test-openai-logprobs',
max_out_len=256,
max_seq_len=4096,
openai_extra_kwargs=dict(
logprobs=True,
top_logprobs=5,
),
)
API_OPENAI_COMBINE = dict(
**BASE_API,
abbr='lmdeploy-api-test-openai-combine',
max_out_len=512,
max_seq_len=4096,
openai_extra_kwargs=dict(
presence_penalty=0.3,
frequency_penalty=0.2,
top_p=0.85,
seed=42,
user='opencompass-regression',
),
)
API_LONG_OUTPUT_128K = dict(
**BASE_API,
abbr='lmdeploy-api-test-long-output-128k',
max_out_len=4096,
max_seq_len=131072,
)
models = [
API_BASIC,
API_STREAMING,
API_STREAMING_CHUNK,
API_MAXLEN,
API_MAXLEN_MID,
API_NOTHINK,
API_IGNORE_EOS,
API_CHAT_TEMPLATE,
API_OPENAI_STOP,
API_OPENAI_LOGPROBS,
API_OPENAI_COMBINE,
API_LONG_OUTPUT_128K,
]
for m in models:
    m['temperature'] = 0
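# Illustrative note (not part of the original config): every API_* variant in
# this file is built with dict(**BASE, extra_key=...). The sketch below shows
# how that pattern behaves in plain Python. The BASE contents here are
# hypothetical placeholders, not the real settings.

```python
# Hypothetical base settings shared by every variant.
BASE = dict(key='EMPTY', retry=20, meta_template={'round': ['default']})

# dict(**BASE, ...) builds a new top-level dict, so adding keys to the
# variant leaves BASE untouched.
variant = dict(**BASE, abbr='variant', max_out_len=1024)

# The copy is shallow: nested values are shared until a key is rebound.
shares_nested_value = variant['meta_template'] is BASE['meta_template']

# Rebinding a key on the variant does not affect BASE.
variant['meta_template'] = {'round': ['custom']}

# Repeating a key that BASE already defines is rejected by dict(...).
try:
    dict(**BASE, retry=5)
    duplicate_key_rejected = False
except TypeError:
    duplicate_key_rejected = True

# A dict display lets later keys win, which is one way to override a default.
override = {**BASE, 'retry': 5}
```

# This is why a variant that needs a different meta_template can assign the key
# after construction, as API_CHAT_TEMPLATE does, instead of passing it to
# dict(...).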
================================================
FILE: autotest/model/infer_api_rollout.py
================================================
from mmengine.config import read_base
from opencompass.models import OpenAISDKRollout
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
    from autotest.model.chat_datasets import datasets
datasets = datasets
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
API_BASE = 'http://localhost:23333/v1'
MODEL_PATH = 'Qwen/Qwen3-8B'
TOKENIZER_PATH = 'Qwen/Qwen3-8B'
BASE_ROLLOUT = dict(
type=OpenAISDKRollout,
key='EMPTY',
openai_api_base=API_BASE,
path=MODEL_PATH,
tokenizer_path=TOKENIZER_PATH,
rpm_verbose=True,
meta_template=api_meta_template,
query_per_second=128,
batch_size=128,
retry=20,
pred_postprocessor=dict(type=extract_non_reasoning_content),
)
API_ROLLOUT_BASIC = dict(
**BASE_ROLLOUT,
abbr='lmdeploy-api-test-rollout',
max_out_len=1024,
max_seq_len=4096,
temperature=0.01,
logprobs=True,
top_logprobs=5,
extra_body=dict(top_k=20),
openai_extra_kwargs=dict(top_p=0.95),
)
API_ROLLOUT_STOP = dict(
**BASE_ROLLOUT,
abbr='lmdeploy-api-test-rollout-stop',
max_out_len=512,
max_seq_len=4096,
temperature=0.2,
logprobs=True,
top_logprobs=5,
openai_extra_kwargs=dict(
stop=[' and', '', ' to', '\n\n', 'Question:', 'Answer:'],
top_p=0.9,
),
)
API_ROLLOUT_COMBINE = dict(
**BASE_ROLLOUT,
abbr='lmdeploy-api-test-rollout-combine',
max_out_len=512,
max_seq_len=4096,
temperature=0.2,
logprobs=True,
top_logprobs=5,
openai_extra_kwargs=dict(
presence_penalty=0.3,
frequency_penalty=0.2,
top_p=0.85,
seed=42,
user='opencompass-regression',
),
)
API_ROLLOUT_IGNORE_EOS = dict(
**BASE_ROLLOUT,
abbr='lmdeploy-api-test-rollout-ignore-eos',
max_out_len=128,
max_seq_len=4096,
temperature=0.2,
logprobs=True,
top_logprobs=5,
extra_body={
'ignore_eos': True,
},
)
API_ROLLOUT_NO_THINK = dict(
**BASE_ROLLOUT,
abbr='lmdeploy-api-test-rollout-no-think',
max_out_len=128,
max_seq_len=4096,
temperature=0.2,
logprobs=True,
top_logprobs=5,
extra_body={
'enable_thinking': False,
},
)
API_ROLLOUT_LONG_OUTPUT_128K = dict(
**BASE_ROLLOUT,
abbr='lmdeploy-api-test-rollout-long-output-128k',
max_out_len=1024,
max_seq_len=131072,
temperature=0.01,
logprobs=True,
top_logprobs=5,
)
models = [
API_ROLLOUT_BASIC,
API_ROLLOUT_STOP,
API_ROLLOUT_COMBINE,
API_ROLLOUT_IGNORE_EOS,
API_ROLLOUT_NO_THINK,
API_ROLLOUT_LONG_OUTPUT_128K,
]
for m in models:
    if 'openai_extra_kwargs' not in m:
        m['openai_extra_kwargs'] = dict(top_k=1,
                                        temperature=1.0,
                                        repetition_penalty=1.0)
    else:
        m['openai_extra_kwargs']['top_k'] = 1
        m['openai_extra_kwargs']['temperature'] = 1.0
        m['openai_extra_kwargs']['repetition_penalty'] = 1.0
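# Illustrative note (not part of the original config): the loop above forces
# three sampling keys into every model's openai_extra_kwargs, creating the
# nested dict when it is missing. The sketch below is one possible compact
# equivalent built on dict.setdefault and dict.update. The helper name and the
# sample models are hypothetical.

```python
ROLLOUT_OVERRIDES = dict(top_k=1, temperature=1.0, repetition_penalty=1.0)


def apply_rollout_overrides(models, overrides=ROLLOUT_OVERRIDES):
    """Force the override keys into each model's openai_extra_kwargs."""
    for m in models:
        # Create the nested dict when it is missing, then overwrite the keys.
        m.setdefault('openai_extra_kwargs', {}).update(overrides)
    return models


# Hypothetical stand-ins for the model dicts defined above.
sample_models = [
    dict(abbr='no-extra-kwargs'),
    dict(abbr='with-extra-kwargs', openai_extra_kwargs=dict(top_p=0.95)),
]
apply_rollout_overrides(sample_models)
```

# Keys that a model already sets, such as top_p, are preserved. Only the
# override keys are replaced.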
================================================
FILE: autotest/model/infer_lmdeploy_base.py
================================================
from mmengine.config import read_base
from opencompass.models import TurboMindModel
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
    from autotest.model.base_datasets import datasets
    from autotest.model.constant import meta_template as test_meta_template
datasets = datasets
Qwen3_0_6B_Base = dict(
type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(max_batch_size=1, session_len=128000),
gen_config=dict(do_sample=False),
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
Qwen3_0_6B_Base_PYTORCH = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-pytorch',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(backend='pytorch',
session_len=32768,
max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_BACKEND = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-backend',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=32768,
max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
backend='pytorch',
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_IGNORE_EOS = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-ignore-eos',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(do_sample=False,
max_new_tokens=128,
ignore_eos=True),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_TEMP0 = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-temp0',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.0, do_sample=False),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_BAD_WORDS = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-bad-words',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(
temperature=0.0,
do_sample=False,
bad_words=['', '', ' to']),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_SESSION_LEN = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-session-len',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=10,
max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False),
max_seq_len=32768,
max_out_len=8192,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_new_tokens and min_new_tokens
# which should generate between 90 and 100 tokens
Qwen3_0_6B_Base_NEW_TOKENS = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-new-tokens',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
min_new_tokens=90,
max_new_tokens=100),
max_seq_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_MAX_SEQ_LEN = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-max-seq-len',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=200,
max_out_len=100,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_STOP_WORDS = dict(
type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-stop-words',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096, max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
stopping_criteria=[' and', '', ' to']),
max_seq_len=4096,
max_out_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_TEMPLATE = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-template',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=32768,
max_batch_size=1),
gen_config=dict(do_sample=False,
max_new_tokens=256),
max_seq_len=32768,
batch_size=1,
meta_template=test_meta_template,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_DROP_MIDDLE = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-drop-middle',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=32768,
max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=2048,
max_out_len=2000,
batch_size=1,
drop_middle=True,
run_cfg=dict(num_gpus=1))
# Test case for combined parameters
Qwen3_0_6B_BASE_COMBINED = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-combined',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.1,
top_p=0.5,
do_sample=False,
repetition_penalty=0.000001,
random_seed=42,
max_new_tokens=128,
skip_special_tokens=True),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for do_sample=True
Qwen3_0_6B_Base_DO_SAMPLE = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-do-sample',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(do_sample=True,
temperature=0.7,
top_p=0.9,
max_new_tokens=1024),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_token_ids: none of the listed token IDs should appear in the output
Qwen3_0_6B_Base_STOP_TOKEN_IDS = dict(
type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-stop-token-ids',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096, max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
max_new_tokens=1024,
stop_token_ids=[151645, 151668]),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for bad_token_ids: none of the listed token IDs should appear in the output
Qwen3_0_6B_Base_BAD_TOKEN_IDS = dict(
type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-bad-token-ids',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096, max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
max_new_tokens=1024,
bad_token_ids=[151645, 151668]),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for logprobs
Qwen3_0_6B_Base_LOGPROBS = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-logprobs',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
max_new_tokens=1024,
logprobs=5),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_ENDSTR = dict(type=TurboMindModel,
abbr='lmdeploy-qwen3-0_6b-base-end-str',
path='Qwen/Qwen3-0.6B-Base',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
max_new_tokens=1024),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
end_str='',
run_cfg=dict(num_gpus=1))
models = [
Qwen3_0_6B_Base, Qwen3_0_6B_Base_PYTORCH, Qwen3_0_6B_Base_BACKEND,
Qwen3_0_6B_Base_DROP_MIDDLE, Qwen3_0_6B_Base_IGNORE_EOS,
Qwen3_0_6B_Base_TEMP0, Qwen3_0_6B_Base_DO_SAMPLE,
Qwen3_0_6B_Base_BAD_WORDS, Qwen3_0_6B_Base_STOP_WORDS,
Qwen3_0_6B_Base_STOP_TOKEN_IDS, Qwen3_0_6B_Base_BAD_TOKEN_IDS,
Qwen3_0_6B_Base_NEW_TOKENS, Qwen3_0_6B_Base_MAX_SEQ_LEN,
Qwen3_0_6B_Base_SESSION_LEN, Qwen3_0_6B_BASE_COMBINED,
Qwen3_0_6B_Base_LOGPROBS, Qwen3_0_6B_Base_TEMPLATE, Qwen3_0_6B_Base_ENDSTR
]
================================================
FILE: autotest/model/infer_lmdeploy_chat.py
================================================
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
    from autotest.model.chat_datasets import datasets
    from autotest.model.constant import meta_template as test_meta_template
datasets = datasets
# Baseline test case
Qwen3_0_6B_FP8 = dict(
type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-base',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(max_batch_size=1, session_len=128000),
gen_config=dict(do_sample=False),
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
# Test case for PyTorch backend
Qwen3_0_6B_FP8_PYTORCH = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-pytorch',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(backend='pytorch',
session_len=128000,
max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=128000,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for the top-level backend argument; it should behave the same as the PyTorch backend case above
Qwen3_0_6B_FP8_BACKEND = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-backend',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=128000,
max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=128000,
max_out_len=1024,
batch_size=1,
backend='pytorch',
run_cfg=dict(num_gpus=1))
# Test case for ignore_eos, which is used in some models
Qwen3_0_6B_FP8_IGNORE_EOS = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-ignore-eos',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(do_sample=False,
max_new_tokens=128,
ignore_eos=True),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_TEMP0 = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-temp0',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.0, do_sample=False),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_BAD_WORDS = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-bad-words',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(
temperature=0.0,
do_sample=False,
bad_words=['', '', ' to']),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_SESSION_LEN = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-session-len',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=10,
max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False),
max_seq_len=32768,
max_out_len=8192,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_new_tokens and min_new_tokens,
# which should generate between 90 and 100 tokens
Qwen3_0_6B_FP8_NEW_TOKENS = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-new-tokens',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
min_new_tokens=90,
max_new_tokens=100),
max_seq_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_seq_len and max_out_len
Qwen3_0_6B_FP8_MAX_SEQ_LEN = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-max-seq-len',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=200,
max_out_len=100,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_words: no stop word (for example, ' to') should appear
# in the output
Qwen3_0_6B_FP8_STOP_WORDS = dict(
type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-stop-words',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096, max_batch_size=1),
gen_config=dict(temperature=0.0, do_sample=False),
max_seq_len=4096,
max_out_len=4096,
batch_size=1,
stop_words=[' and', '', ' to', '\n\n', 'Question:', 'Answer:'],
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_TEMPLATE = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-template',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=32768,
max_batch_size=1),
gen_config=dict(do_sample=False,
max_new_tokens=256),
max_seq_len=32768,
batch_size=1,
meta_template=test_meta_template,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_DROP_MIDDLE = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-drop-middle',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=32768,
max_batch_size=1),
gen_config=dict(do_sample=False),
max_seq_len=2048,
max_out_len=2000,
batch_size=1,
drop_middle=True,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_HF_OVER = dict(
type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-hf-over',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(
session_len=32768,
max_batch_size=1,
hf_overrides=dict(
rope_scaling=dict(rope_type='yarn',
factor=4.0,
original_max_position_embeddings=32768))),
gen_config=dict(top_p=0.9,
temperature=0.7,
do_sample=False,
max_new_tokens=1024),
max_seq_len=128000,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Additional test cases for various top_k values
Qwen3_0_6B_FP8_TOPK50 = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-topk50',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(top_k=50,
temperature=1.2,
max_new_tokens=1024),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for combined parameters
Qwen3_0_6B_FP8_COMBINED = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-combined',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.1,
top_p=0.5,
do_sample=False,
repetition_penalty=0.000001,
random_seed=42,
max_new_tokens=128,
skip_special_tokens=True),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for do_sample=True
Qwen3_0_6B_FP8_DO_SAMPLE = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-do-sample',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(do_sample=True,
max_new_tokens=1024),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_token_ids: none of the listed token IDs should appear in the output
Qwen3_0_6B_FP8_STOP_TOKEN_IDS = dict(
type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-stop-token-ids',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096, max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
max_new_tokens=1024,
stop_token_ids=[151645, 151668]),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for bad_token_ids: none of the listed token IDs should appear in the output
Qwen3_0_6B_FP8_BAD_TOKEN_IDS = dict(
type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-bad-token-ids',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096, max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
max_new_tokens=1024,
bad_token_ids=[151645, 151668]),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for logprobs
Qwen3_0_6B_FP8_LOGPROBS = dict(type=TurboMindModelwithChatTemplate,
abbr='lmdeploy-qwen3-0_6b-fp8-logprobs',
path='Qwen/Qwen3-0.6B-FP8',
engine_config=dict(session_len=4096,
max_batch_size=1),
gen_config=dict(temperature=0.0,
do_sample=False,
max_new_tokens=1024,
logprobs=5),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
models = [
Qwen3_0_6B_FP8,
Qwen3_0_6B_FP8_PYTORCH,
Qwen3_0_6B_FP8_BACKEND,
Qwen3_0_6B_FP8_DROP_MIDDLE,
Qwen3_0_6B_FP8_IGNORE_EOS,
Qwen3_0_6B_FP8_TEMP0,
Qwen3_0_6B_FP8_TOPK50,
Qwen3_0_6B_FP8_DO_SAMPLE,
Qwen3_0_6B_FP8_BAD_WORDS,
Qwen3_0_6B_FP8_STOP_WORDS,
Qwen3_0_6B_FP8_STOP_TOKEN_IDS,
Qwen3_0_6B_FP8_BAD_TOKEN_IDS,
Qwen3_0_6B_FP8_NEW_TOKENS,
Qwen3_0_6B_FP8_MAX_SEQ_LEN,
Qwen3_0_6B_FP8_SESSION_LEN,
Qwen3_0_6B_FP8_LOGPROBS,
Qwen3_0_6B_FP8_TEMPLATE,
Qwen3_0_6B_FP8_HF_OVER,
Qwen3_0_6B_FP8_COMBINED,
]
================================================
FILE: autotest/model/infer_transformers_base.py
================================================
from mmengine.config import read_base
from opencompass.models import HuggingFaceBaseModel
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
    from autotest.model.base_datasets import datasets
    from autotest.model.constant import meta_template as test_meta_template
datasets = datasets
# Base model testcase
Qwen3_0_6B_Base = dict(
type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(top_k=1, max_new_tokens=32768),
max_seq_len=128000,
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
Qwen3_0_6B_Base_TEMP0 = dict(type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base-temp0',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_new_tokens and min_new_tokens,
# which should generate between 90 and 100 tokens
Qwen3_0_6B_Base_NEW_TOKENS = dict(type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base-new-tokens',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(temperature=0.0,
top_k=1,
min_new_tokens=90,
max_new_tokens=100),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_seq_len and max_out_len
Qwen3_0_6B_Base_MAX_SEQ_LEN = dict(type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base-max-seq-len',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(temperature=0.0,
top_k=1),
max_seq_len=200,
max_out_len=100,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_words: no stop word should appear in the output
Qwen3_0_6B_Base_STOP_WORDS = dict(
type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base-stop-words',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=4096,
max_out_len=4096,
batch_size=1,
stop_words=[' and', '', ' to', '\n\n', 'Question:', 'Answer:'],
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_TEMPLATE = dict(type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base-template',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(top_k=1,
max_new_tokens=256),
max_seq_len=4096,
batch_size=1,
meta_template=test_meta_template,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_DROP_MIDDLE = dict(type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base-drop-middle',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(top_k=1,
max_new_tokens=8192),
max_seq_len=2048,
max_out_len=2000,
batch_size=1,
drop_middle=True,
run_cfg=dict(num_gpus=1))
# Test case for combined parameters
Qwen3_0_6B_Base_COMBINED = dict(type=HuggingFaceBaseModel,
abbr='hf-qwen3-0_6b-base-combined',
path='Qwen/Qwen3-0.6B-Base',
generation_kwargs=dict(
temperature=0.1,
top_p=0.5,
repetition_penalty=0.000001,
max_new_tokens=128,
),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
models = [
Qwen3_0_6B_Base, Qwen3_0_6B_Base_TEMP0, Qwen3_0_6B_Base_STOP_WORDS,
Qwen3_0_6B_Base_NEW_TOKENS, Qwen3_0_6B_Base_MAX_SEQ_LEN,
Qwen3_0_6B_Base_COMBINED, Qwen3_0_6B_Base_TEMPLATE,
Qwen3_0_6B_Base_DROP_MIDDLE
]
================================================
FILE: autotest/model/infer_transformers_chat.py
================================================
from mmengine.config import read_base
from opencompass.models import HuggingFacewithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
from autotest.model.chat_datasets import datasets
from autotest.model.constant import meta_template as test_meta_template
datasets = datasets
# Base model testcase
Qwen3_0_6B_FP8 = dict(
type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-base',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(top_k=1, max_new_tokens=32768),
max_seq_len=128000,
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
Qwen3_0_6B_FP8_TEMP0 = dict(type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-temp0',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_new_tokens and min_new_tokens
# which should generate between 90 and 100 tokens
Qwen3_0_6B_FP8_NEW_TOKENS = dict(type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-new-tokens',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(temperature=0.0,
top_k=1,
min_new_tokens=90,
max_new_tokens=100),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_seq_len and max_out_len
Qwen3_0_6B_FP8_MAX_SEQ_LEN = dict(type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-max-seq-len',
path='Qwen/Qwen3-0.6B-FP8',
max_seq_len=200,
max_out_len=100,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_words, no stop tokens should be in the output
Qwen3_0_6B_FP8_STOP_WORDS = dict(
type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-stop-words',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=4096,
max_out_len=4096,
batch_size=1,
stop_words=[' and', '', ' to', '\n\n', 'Question:', 'Answer:'],
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_TEMPLATE = dict(type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-template',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(top_k=1,
max_new_tokens=256),
max_seq_len=4096,
batch_size=1,
meta_template=test_meta_template,
run_cfg=dict(num_gpus=1))
# Test case for combined parameters
Qwen3_0_6B_FP8_COMBINED = dict(type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-combined',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(
temperature=0.1,
top_p=0.5,
repetition_penalty=0.000001,
random_seed=42,
max_new_tokens=128,
skip_special_tokens=True,
),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_TOKENIZER_ONLY = dict(type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-tokenizer-only',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(temperature=0.0,
top_k=1),
max_seq_len=4096,
max_out_len=1024,
tokenizer_only=True,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_MID = dict(type=HuggingFacewithChatTemplate,
abbr='hf-qwen3-0_6b-fp8-mid',
path='Qwen/Qwen3-0.6B-FP8',
generation_kwargs=dict(top_k=1),
max_seq_len=2048,
max_out_len=2000,
batch_size=1,
mode='mid',
run_cfg=dict(num_gpus=1))
models = [
Qwen3_0_6B_FP8,
Qwen3_0_6B_FP8_TEMP0,
Qwen3_0_6B_FP8_STOP_WORDS,
Qwen3_0_6B_FP8_NEW_TOKENS,
Qwen3_0_6B_FP8_MAX_SEQ_LEN,
Qwen3_0_6B_FP8_TEMPLATE,
Qwen3_0_6B_FP8_COMBINED,
Qwen3_0_6B_FP8_TOKENIZER_ONLY,
]
================================================
FILE: autotest/model/infer_vllm_base.py
================================================
from mmengine.config import read_base
from opencompass.models import VLLM
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
from autotest.model.base_datasets import datasets
from autotest.model.constant import meta_template as test_meta_template
datasets = [x for x in datasets.copy() if 'InfiniteBench' not in x['abbr']]
# Base model testcase
Qwen3_0_6B_Base = dict(
type=VLLM,
abbr='vllm-qwen3-0_6b-base',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=32768,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(top_k=1, max_tokens=32768),
max_seq_len=32768,
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
Qwen3_0_6B_Base_TEMP0 = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-temp0',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_BAD_WORDS = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-bad-words',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(
temperature=0.0,
top_k=1,
bad_words=['', '', ' to'],
),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_SESSION_LEN = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-session-len',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=10,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1),
max_seq_len=32768,
max_out_len=8192,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_tokens and min_tokens (the vLLM sampling parameters),
# which should generate between 90 and 100 tokens
Qwen3_0_6B_Base_NEW_TOKENS = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-new-tokens',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1,
min_tokens=90,
max_tokens=100),
max_seq_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_seq_len and max_out_len
Qwen3_0_6B_Base_MAX_SEQ_LEN = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-max-seq-len',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1),
max_seq_len=200,
max_out_len=100,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_words, no stop tokens should be in the output
Qwen3_0_6B_Base_STOP_WORDS = dict(
type=VLLM,
abbr='vllm-qwen3-0_6b-base-stop-words',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=4096,
max_out_len=4096,
batch_size=1,
stop_words=[' and', '', ' to', '\n\n', 'Question:', 'Answer:'],
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_TEMPLATE = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-template',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=32768,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(top_k=1,
max_tokens=256),
max_seq_len=32768,
batch_size=1,
meta_template=test_meta_template,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_MID = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-mid',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=32768,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(top_k=1, max_tokens=8192),
max_seq_len=2048,
max_out_len=2000,
batch_size=1,
mode='mid',
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_Base_HF_OVER = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-hf-over',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=32768,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
hf_overrides=dict(rope_scaling=dict(
rope_type='yarn',
factor=4.0,
original_max_position_embeddings=32768,
)),
),
generation_kwargs=dict(top_p=0.9,
temperature=0.7,
max_tokens=1024),
max_seq_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for combined parameters
Qwen3_0_6B_Base_COMBINED = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-combined',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(
temperature=0.1,
top_p=0.5,
repetition_penalty=0.000001,
random_seed=42,
max_tokens=128,
skip_special_tokens=True,
),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_token_ids
Qwen3_0_6B_Base_STOP_TOKEN_IDS = dict(
type=VLLM,
abbr='vllm-qwen3-0_6b-base-stop-token-ids',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1,
max_tokens=1024,
stop_token_ids=[151645, 151668]),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for logprobs
Qwen3_0_6B_Base_LOGPROBS = dict(type=VLLM,
abbr='vllm-qwen3-0_6b-base-logprobs',
path='Qwen/Qwen3-0.6B-Base',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1,
max_tokens=1024,
logprobs=5),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
models = [
Qwen3_0_6B_Base, Qwen3_0_6B_Base_MID, Qwen3_0_6B_Base_TEMP0,
Qwen3_0_6B_Base_BAD_WORDS, Qwen3_0_6B_Base_STOP_WORDS,
Qwen3_0_6B_Base_STOP_TOKEN_IDS, Qwen3_0_6B_Base_NEW_TOKENS,
Qwen3_0_6B_Base_MAX_SEQ_LEN, Qwen3_0_6B_Base_SESSION_LEN,
Qwen3_0_6B_Base_COMBINED, Qwen3_0_6B_Base_LOGPROBS,
Qwen3_0_6B_Base_TEMPLATE, Qwen3_0_6B_Base_HF_OVER
]
================================================
FILE: autotest/model/infer_vllm_chat.py
================================================
from mmengine.config import read_base
from opencompass.models import VLLMwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
from autotest.model.chat_datasets import datasets
from autotest.model.constant import meta_template as test_meta_template
datasets = [x for x in datasets.copy() if 'InfiniteBench' not in x['abbr']]
# Base model testcase
Qwen3_0_6B_FP8 = dict(
type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-base',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=32768,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(top_k=1, max_tokens=32768),
max_seq_len=32768,
max_out_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content))
Qwen3_0_6B_FP8_TEMP0 = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-temp0',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=32768,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_SESSION_LEN = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-session-len',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=10,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1),
max_seq_len=32768,
max_out_len=8192,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_tokens and min_tokens (the vLLM sampling parameters),
# which should generate between 90 and 100 tokens
Qwen3_0_6B_FP8_NEW_TOKENS = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-new-tokens',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1,
min_tokens=90,
max_tokens=100),
max_seq_len=32768,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for max_seq_len and max_out_len
Qwen3_0_6B_FP8_MAX_SEQ_LEN = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-max-seq-len',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1),
max_seq_len=200,
max_out_len=100,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_words, no stop tokens should be in the output
Qwen3_0_6B_FP8_STOP_WORDS = dict(
type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-stop-words',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0, top_k=1),
max_seq_len=4096,
max_out_len=4096,
batch_size=1,
stop_words=[' and', '', ' to', '\n\n', 'Question:', 'Answer:'],
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_TEMPLATE = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-template',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=32768,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(top_k=1, max_tokens=256),
max_seq_len=32768,
batch_size=1,
meta_template=test_meta_template,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_HF_OVER = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-hf-over',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=32768,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
hf_overrides=dict(rope_scaling=dict(
rope_type='yarn',
factor=4.0,
original_max_position_embeddings=32768,
)),
),
generation_kwargs=dict(top_p=0.9,
temperature=0.7,
max_tokens=1024),
max_seq_len=128000,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Additional test cases for various top_k values
Qwen3_0_6B_FP8_TOPK50 = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-topk50',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(top_k=50,
temperature=1.2,
max_tokens=1024),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for combined parameters
Qwen3_0_6B_FP8_COMBINED = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-combined',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(
temperature=0.1,
top_p=0.5,
repetition_penalty=0.000001,
random_seed=42,
max_tokens=128,
skip_special_tokens=True,
),
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for stop_token_ids
Qwen3_0_6B_FP8_STOP_TOKEN_IDS = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-stop-token-ids',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(
temperature=0.0,
top_k=1,
max_tokens=1024,
stop_token_ids=[151645, 151668]),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
# Test case for logprobs
Qwen3_0_6B_FP8_LOGPROBS = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-logprobs',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1,
max_tokens=1024,
logprobs=5),
max_seq_len=4096,
max_out_len=1024,
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_NOTHINK = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-nothink',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1,
max_tokens=1024),
max_seq_len=4096,
max_out_len=1024,
chat_template_kwargs=dict(enable_thinking=False),
batch_size=1,
run_cfg=dict(num_gpus=1))
Qwen3_0_6B_FP8_TOKENIZER_ONLY = dict(type=VLLMwithChatTemplate,
abbr='vllm-qwen3-0_6b-fp8-tokenizer-only',
path='Qwen/Qwen3-0.6B-FP8',
model_kwargs=dict(
max_model_len=4096,
max_num_seqs=1,
tensor_parallel_size=1,
gpu_memory_utilization=0.7,
),
generation_kwargs=dict(temperature=0.0,
top_k=1,
max_tokens=1024),
max_seq_len=4096,
max_out_len=1024,
tokenizer_only=True,
batch_size=1,
run_cfg=dict(num_gpus=1))
models = [
Qwen3_0_6B_FP8,
Qwen3_0_6B_FP8_TEMP0,
Qwen3_0_6B_FP8_TOPK50,
Qwen3_0_6B_FP8_STOP_WORDS,
Qwen3_0_6B_FP8_STOP_TOKEN_IDS,
Qwen3_0_6B_FP8_NEW_TOKENS,
Qwen3_0_6B_FP8_MAX_SEQ_LEN,
Qwen3_0_6B_FP8_SESSION_LEN,
Qwen3_0_6B_FP8_LOGPROBS,
Qwen3_0_6B_FP8_TEMPLATE,
Qwen3_0_6B_FP8_HF_OVER,
Qwen3_0_6B_FP8_COMBINED,
Qwen3_0_6B_FP8_NOTHINK,
Qwen3_0_6B_FP8_TOKENIZER_ONLY,
]
================================================
FILE: autotest/oc_score_baseline.yaml
================================================
qwen2.5-7b-hf:
  demo_gsm8k_accuracy: 78.12
  race-middle_accuracy: 90.46
  race-high_accuracy: 86.54
internlm3-8b-instruct-lmdeploy:
  demo_gsm8k_accuracy: 75.00
  race-middle_accuracy: 93.31
  race-high_accuracy: 90.28
internlm3-8b-instruct_hf-lmdeploy:
  demo_gsm8k_accuracy: 73.44
  race-middle_accuracy: 93.38
  race-high_accuracy: 90.34
Qwen2.5-7B_hf:
  demo_gsm8k_accuracy: 78.12
  race-middle_accuracy: 90.46
  race-high_accuracy: 86.54
Qwen3-0.6B_hf-vllm:
  demo_gsm8k_accuracy: 53.12
  race-middle_accuracy: 38.23
  race-high_accuracy: 28.07
lmdeploy-api-test:
  IFEval_Prompt-level-strict-accuracy: 81.25
  hle_llmjudge_accuracy: 75.00
  mmlu_pro_math_accuracy: 18.75
  mmlu_pro_other_accuracy: 25.00
  race-middle_accuracy: 87.50
  race-high_accuracy: 81.25
  gsm8k_accuracy: 18.75
lmdeploy-api-streaming-test:
  IFEval_Prompt-level-strict-accuracy: 68.75
  hle_llmjudge_accuracy: 81.25
  mmlu_pro_math_accuracy: 18.75
  mmlu_pro_other_accuracy: 31.25
  race-middle_accuracy: 81.25
  race-high_accuracy: 75.00
  gsm8k_accuracy: 25.00
lmdeploy-api-streaming-test-chunk:
  IFEval_Prompt-level-strict-accuracy: 81.25
  hle_llmjudge_accuracy: 75
  mmlu_pro_math_accuracy: 18.75
  mmlu_pro_other_accuracy: 37.50
  race-middle_accuracy: 75
  race-high_accuracy: 81.25
  gsm8k_accuracy: 37.5
lmdeploy-api-test-maxlen:
  IFEval_Prompt-level-strict-accuracy: 93.75
  hle_llmjudge_accuracy: 56.25
  mmlu_pro_math_accuracy: 62.50
  mmlu_pro_other_accuracy: 43.75
  race-middle_accuracy: 81.25
  race-high_accuracy: 81.25
  gsm8k_accuracy: 31.25
lmdeploy-api-test-maxlen-mid:
  IFEval_Prompt-level-strict-accuracy: 12.50
  hle_llmjudge_accuracy: 81.25
  mmlu_pro_math_accuracy: 0
  mmlu_pro_other_accuracy: 0
  race-middle_accuracy: 18.75
  race-high_accuracy: 12.50
  gsm8k_accuracy: 43.75
lmdeploy-api-test-nothink:
  IFEval_Prompt-level-strict-accuracy: 93.75
  hle_llmjudge_accuracy: 62.50
  mmlu_pro_math_accuracy: 62.50
  mmlu_pro_other_accuracy: 50.00
  race-middle_accuracy: 87.50
  race-high_accuracy: 87.50
  gsm8k_accuracy: 31.25
lmdeploy-api-test-chat-template:
  IFEval_Prompt-level-strict-accuracy: 68.75
  hle_llmjudge_accuracy: 62.50
  mmlu_pro_math_accuracy: 18.75
  mmlu_pro_other_accuracy: 6.25
  race-middle_accuracy: 81.25
  race-high_accuracy: 68.75
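The baseline values in this file are consumed by `assert_score` in `autotest/utils/oc_score_assert.py`, which accepts a score when it falls inside a tolerance band around the baseline. The following is a minimal sketch of that banding logic, not the project's implementation: the helper names `threshold_for` and `within_tolerance` are hypothetical, while the prefixes and threshold values (5, 10, 3.2) are copied from `assert_score`.

```python
# Hypothetical helpers that mirror the tolerance rules used by assert_score.
WIDE_PREFIXES = ('dingo', 'GPQA', 'high', 'mmlu_pro_', 'alpaca_eval',
                 'compassarena_')


def threshold_for(dataset: str) -> float:
    """Pick the tolerance applied to batch-style comparisons."""
    if dataset.startswith(WIDE_PREFIXES):
        return 5
    if dataset.startswith('humanevalx') or dataset == 'large_threshold':
        return 10
    return 3.2


def within_tolerance(score, baseline, threshold) -> bool:
    """Return True when score lies in [baseline - threshold, baseline + threshold]."""
    low = float(baseline) - threshold
    high = float(baseline) + threshold
    return low <= float(score) <= high


if __name__ == '__main__':
    # Scores arrive as strings because they are read from a CSV summary.
    print(within_tolerance('50.0', 53.12, threshold_for('demo_gsm8k_accuracy')))
```

With the default threshold of 3.2, a `demo_gsm8k_accuracy` score of 50.0 is accepted against the 53.12 baseline, whereas 49.0 would be rejected.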
================================================
FILE: autotest/utils/compare_results.py
================================================
import filecmp
import os
import fire
def compare_results(folder1,
                    folder2,
                    compare_type: str = 'predictions',
                    results_ignore_list: list = [
                        'srbench.json', 'dingo_en_192.json',
                        'dingo_zh_170.json', 'qa_dingo_cn.json'
                    ]):
    # Initialize ignore_list if not provided
    if results_ignore_list is None:
        results_ignore_list = []
    # Verify that both folders exist
    assert os.path.isdir(folder1), f'Folder does not exist: {folder1}'
    assert os.path.isdir(folder2), f'Folder does not exist: {folder2}'
    # Use the first entry found under each folder as its run directory
    sub_folder1 = get_all_subpaths(folder1)[0]
    sub_folder2 = get_all_subpaths(folder2)[0]
    print(f'compare {compare_type}')
    compare_folders(os.path.join(sub_folder1, compare_type),
                    os.path.join(sub_folder2, compare_type),
                    results_ignore_list=results_ignore_list)
def compare_folders(folder1, folder2, results_ignore_list=None):
    '''
    Compare the contents of files with the same name in two folders
    and their subfolders,
    ignoring files specified in results_ignore_list.
    :param folder1: Path to the first folder
    :param folder2: Path to the second folder
    :param results_ignore_list: List of filenames to ignore
    (e.g., ['temp.txt', '.DS_Store'])
    :raises: AssertionError if any non-ignored files are missing or
    have different content
    '''
    # Initialize ignore_list if not provided
    if results_ignore_list is None:
        results_ignore_list = []
    # Verify that both folders exist
    assert os.path.isdir(folder1), f'Folder does not exist: {folder1}'
    assert os.path.isdir(folder2), f'Folder does not exist: {folder2}'
    # List to store differences found
    diff_files = []
    # Walk through the first folder and compare files
    for root, dirs, files in os.walk(folder1):
        for file in files:
            # Skip files in the ignore list
            if os.path.basename(file) in results_ignore_list:
                print('ignore case: ' + os.path.basename(file))
                continue
            # Get relative path to maintain folder structure
            rel_path = os.path.relpath(os.path.join(root, file), folder1)
            file2 = os.path.join(folder2, rel_path)
            # Check if file exists in the second folder
            if not os.path.exists(file2):
                diff_files.append((rel_path, 'File missing in second folder'))
                continue
            # Compare file content (shallow=False for content comparison)
            if not filecmp.cmp(os.path.join(root, file), file2, shallow=False):
                diff_files.append((rel_path, 'Content differs'))
    # Check for files in folder2 that don't exist in folder1
    for root, dirs, files in os.walk(folder2):
        for file in files:
            # Skip files in the ignore list
            if file in results_ignore_list:
                continue
            rel_path = os.path.relpath(os.path.join(root, file), folder2)
            file1 = os.path.join(folder1, rel_path)
            if not os.path.exists(file1):
                diff_files.append((rel_path, 'File missing in first folder'))
    # Raise AssertionError if any differences were found
    if diff_files:
        error_msg = 'Found differences in files:\n'
        error_msg += '\n'.join(f'{path}: {reason}'
                               for path, reason in diff_files)
        raise AssertionError(error_msg)
def get_all_subpaths(directory):
    '''
    Get all subpaths (files and directories) within a given directory.
    Args:
        directory (str): The root directory path to search
    Returns:
        list: A list of all complete subpaths
    '''
    subpaths = []
    # Verify the directory exists
    if not os.path.isdir(directory):
        raise ValueError(f'Directory does not exist: {directory}')
    # Walk through the directory tree
    for root, dirs, files in os.walk(directory):
        # Add directories first
        for dir_name in dirs:
            full_path = os.path.join(root, dir_name)
            subpaths.append(full_path)
        # Then add files
        for file_name in files:
            full_path = os.path.join(root, file_name)
            subpaths.append(full_path)
    return subpaths
if __name__ == '__main__':
    fire.Fire()
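The comparison approach used by `compare_folders` (walk one tree, map each file to the same relative path in the other tree, then compare bytes with `filecmp.cmp(..., shallow=False)`) can be tried in isolation. The sketch below is illustrative only: `diff_trees` and `demo` are hypothetical names, the file names are invented, and it checks just the left-to-right direction.

```python
import filecmp
import os
import tempfile


def diff_trees(left, right, ignore=()):
    """Return sorted (relative_path, reason) pairs for files under `left`."""
    diffs = []
    for root, _dirs, files in os.walk(left):
        for name in files:
            if name in ignore:
                continue
            source = os.path.join(root, name)
            rel = os.path.relpath(source, left)
            other = os.path.join(right, rel)
            if not os.path.exists(other):
                diffs.append((rel, 'missing in right'))
            elif not filecmp.cmp(source, other, shallow=False):
                # shallow=False forces a byte-by-byte content comparison
                diffs.append((rel, 'content differs'))
    return sorted(diffs)


def demo():
    """Build two small trees and diff them with and without an ignore list."""
    with tempfile.TemporaryDirectory() as left, \
            tempfile.TemporaryDirectory() as right:
        for folder, text in ((left, 'a'), (right, 'b')):
            with open(os.path.join(folder, 'pred.json'), 'w') as f:
                f.write(text)
            with open(os.path.join(folder, 'same.json'), 'w') as f:
                f.write('x')
        with open(os.path.join(left, 'only_left.json'), 'w') as f:
            f.write('x')
        return (diff_trees(left, right),
                diff_trees(left, right, ignore=('pred.json', )))


if __name__ == '__main__':
    full, ignored = demo()
    print(full)
    print(ignored)
```

Ignoring by base name, as the original does, skips a file wherever it appears in the tree, which suits per-dataset result files that are known to be nondeterministic.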
================================================
FILE: autotest/utils/health_check.py
================================================
from time import sleep
import fire
import requests
def health_check(url: str = 'http://0.0.0.0:23333', timeout: int = 300):
    # Poll the /health endpoint every 5 seconds until it returns HTTP 200
    for _ in range(int(timeout / 5)):
        try:
            sleep(5)
            response = requests.get(f'{url}/health', timeout=5)
            if response.status_code == 200:
                return True
        except requests.exceptions.RequestException:
            continue
    raise TimeoutError(f'Health check failed after {timeout} seconds.')


if __name__ == '__main__':
    fire.Fire(health_check)
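The polling pattern in `health_check` can be exercised without a running server by injecting the probe and the sleep function. This is a sketch under stated assumptions rather than part of the repository: `wait_until_ready`, its `probe` and `sleep` parameters, and the choice to retry on `OSError` are all hypothetical.

```python
import time


def wait_until_ready(probe, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Call probe() once per interval until it returns True.

    Raises TimeoutError after int(timeout / interval) unsuccessful attempts.
    """
    for _ in range(int(timeout / interval)):
        sleep(interval)
        try:
            if probe():
                return True
        except OSError:
            # Treat connection-style errors as "not ready yet" and retry.
            continue
    raise TimeoutError(f'Not ready after {timeout} seconds.')


if __name__ == '__main__':
    attempts = []

    def flaky_probe():
        # Fails twice with a connection error, then reports ready.
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError('service still starting')
        return True

    print(wait_until_ready(flaky_probe, timeout=30, interval=5,
                           sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the retry budget logic testable in milliseconds; in real use the default `time.sleep` applies.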
================================================
FILE: autotest/utils/oc_score_assert.py
================================================
import csv
import os
import pytest
import yaml
output_path = 'regression_result_daily'


@pytest.fixture()
def baseline_scores(request):
    config_path = os.path.join(request.config.rootdir,
                               'autotest/oc_score_baseline.yaml')
    with open(config_path) as f:
        config = yaml.load(f.read(), Loader=yaml.SafeLoader)
    return config


@pytest.fixture()
def result_scores():
    file = find_csv_files(output_path)
    if file is None:
        return None
    return read_csv_file(file)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores')
class TestCmdCase:

    @pytest.mark.case1
    @pytest.mark.parametrize('model, dataset',
                             [('qwen2.5-7b-hf', 'race-middle_accuracy'),
                              ('qwen2.5-7b-hf', 'race-high_accuracy'),
                              ('qwen2.5-7b-hf', 'demo_gsm8k_accuracy')])
    def test_cmd_case1(self, baseline_scores, result_scores, model, dataset):
        base_score = baseline_scores.get(model).get(dataset)
        result_score = result_scores.get(model).get(dataset)
        assert_score(model, result_score, base_score, dataset)

    @pytest.mark.case2
    @pytest.mark.parametrize(
        'model, dataset',
        [('qwen2.5-7b-hf', 'race-middle_accuracy'),
         ('qwen2.5-7b-hf', 'race-high_accuracy'),
         ('qwen2.5-7b-hf', 'demo_gsm8k_accuracy'),
         ('internlm3-8b-instruct-lmdeploy', 'race-middle_accuracy'),
         ('internlm3-8b-instruct-lmdeploy', 'race-high_accuracy'),
         ('internlm3-8b-instruct-lmdeploy', 'demo_gsm8k_accuracy')])
    def test_cmd_case2(self, baseline_scores, result_scores, model, dataset):
        base_score = baseline_scores.get(model).get(dataset)
        result_score = result_scores.get(model).get(dataset)
        assert_score(model + '_batch', result_score, base_score, dataset)

    @pytest.mark.case3
    @pytest.mark.parametrize('model, dataset',
                             [('Qwen2.5-7B_hf', 'race-middle_accuracy'),
                              ('Qwen2.5-7B_hf', 'race-high_accuracy'),
                              ('Qwen2.5-7B_hf', 'demo_gsm8k_accuracy')])
    def test_cmd_case3(self, baseline_scores, result_scores, model, dataset):
        base_score = baseline_scores.get(model).get(dataset)
        result_score = result_scores.get(model).get(dataset)
        assert_score(model, result_score, base_score, dataset)

    @pytest.mark.case4
    @pytest.mark.parametrize(
        'model, dataset',
        [('internlm3-8b-instruct_hf-lmdeploy', 'race-middle_accuracy'),
         ('internlm3-8b-instruct_hf-lmdeploy', 'race-high_accuracy'),
         ('internlm3-8b-instruct_hf-lmdeploy', 'demo_gsm8k_accuracy')])
    def test_cmd_case4(self, baseline_scores, result_scores, model, dataset):
        base_score = baseline_scores.get(model).get(dataset)
        result_score = result_scores.get(model).get(dataset)
        assert_score(model + '_batch', result_score, base_score, dataset)

    @pytest.mark.case5
    @pytest.mark.parametrize('model, dataset',
                             [('Qwen3-0.6B_hf-vllm', 'race-middle_accuracy'),
                              ('Qwen3-0.6B_hf-vllm', 'race-high_accuracy'),
                              ('Qwen3-0.6B_hf-vllm', 'demo_gsm8k_accuracy')])
    def test_cmd_case5(self, baseline_scores, result_scores, model, dataset):
        base_score = baseline_scores.get(model).get(dataset)
        result_score = result_scores.get(model).get(dataset)
        assert_score(model + '_batch', result_score, base_score, dataset)
def assert_score(model_type, score, baseline, dataset: str = ''):
    if baseline is None:
        return
    if score is None or score == '-':
        assert False, 'value is none'
    if 'batch' not in model_type:
        if float(score) <= (float(baseline) +
                            0.01) and float(score) >= (float(baseline) - 0.01):
            print(' '.join([score, 'is equal', str(baseline)]))
            assert True
        else:
            print(' '.join([score, 'is not equal', str(baseline)]))
            assert False, ' '.join([score, 'is not equal', str(baseline)])
    else:
        if dataset.startswith('dingo') or dataset.startswith(
                'GPQA') or dataset.startswith('high') or dataset.startswith(
                    'mmlu_pro_') or dataset.startswith(
                        'alpaca_eval') or dataset.startswith('compassarena_'):
            threshold = 5
        elif dataset.startswith('humanevalx') or dataset == 'large_threshold':
            threshold = 10
        else:
            threshold = 3.2
        if float(score) <= (baseline + threshold) and float(score) >= (
                baseline - threshold):
            print(' '.join([
                score, 'is between',
                str(baseline - threshold), 'and',
                str(baseline + threshold)
            ]))
            assert True
        else:
            print(' '.join([
                score, 'is not between',
                str(baseline - threshold), 'and',
                str(baseline + threshold)
            ]))
            assert False, ' '.join([
                score, 'is not between',
                str(baseline - threshold), 'and',
                str(baseline + threshold)
            ])
def find_csv_files(directory):
    csv_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.csv') and file.startswith('summary'):
                csv_files.append(os.path.join(root, file))
    # Return None when no summary CSV exists; the result_scores fixture
    # checks for this instead of hitting an IndexError below
    if not csv_files:
        return None
    # Pick the most recently created summary file
    csv_files_with_time = {f: os.path.getctime(f) for f in csv_files}
    sorted_csv_files = sorted(csv_files_with_time.items(), key=lambda x: x[1])
    latest_csv_file = sorted_csv_files[-1][0]
    return latest_csv_file
def read_csv_file(file_path):
    with open(file_path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        filtered_data = []
        for row in reader:
            if row['metric'] is not None and 'bpb' not in row[
                    'metric'] and '_' != row['metric']:
                filtered_row = row
                filtered_row['dataset'] = row['dataset'] + '_' + row['metric']
                del filtered_row['version']
                del filtered_row['metric']
                del filtered_row['mode']
                filtered_data.append(filtered_row)
    result = {}
    for data in filtered_data:
        dataset = data.get('dataset')
        for key in data.keys():
            if key == 'dataset':
                continue
            else:
                if key in result.keys():
                    result.get(key)[dataset] = data.get(key)
                else:
                    result[key] = {dataset: data.get(key)}
    return result
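`read_csv_file` pivots the summary CSV from one row per dataset into a nested mapping keyed by model and then by `<dataset>_<metric>`, which is the shape the test cases index with `result_scores.get(model).get(dataset)`. A compact sketch of the same reshaping on in-memory text follows. The function name `pivot_summary`, the model column names, and the sample rows are assumptions made for illustration; only the `dataset`, `version`, `metric`, and `mode` columns come from the original code.

```python
import csv
import io

# Hypothetical summary content; real files are written by an evaluation run.
SAMPLE = ('dataset,version,metric,mode,model-a,model-b\n'
          'race-high,abc123,accuracy,gen,86.54,90.28\n'
          'demo_gsm8k,def456,accuracy,gen,78.12,-\n'
          'some_ppl_set,ghi789,bpb,ppl,1.23,1.31\n')

META_COLUMNS = ('dataset', 'version', 'metric', 'mode')


def pivot_summary(csv_text):
    """Return {model: {'<dataset>_<metric>': score_string}}."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        metric = row['metric']
        # Same filter as the original: skip missing, bpb, and '_' metrics.
        if metric is None or 'bpb' in metric or metric == '_':
            continue
        key = f"{row['dataset']}_{metric}"
        for column, value in row.items():
            if column not in META_COLUMNS:
                result.setdefault(column, {})[key] = value
    return result


if __name__ == '__main__':
    print(pivot_summary(SAMPLE))
```

Scores stay as strings here, as they do in the original, so a placeholder such as `-` survives for `assert_score` to reject.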
================================================
FILE: dataset-index.yml
================================================
- ifeval:
name: IFEval
category: Instruction Following
paper: https://arxiv.org/pdf/2311.07911
configpath: opencompass/configs/datasets/IFEval/IFEval_gen.py
configpath_llmjudge: ''
- nphard:
name: NPHardEval
category: Reasoning
paper: https://arxiv.org/pdf/2312.14890v2
configpath: opencompass/configs/datasets/NPHardEval/NPHardEval_gen.py
configpath_llmjudge: ''
- pmmeval:
name: PMMEval
category: Language
paper: https://arxiv.org/pdf/2411.09116v1
configpath: opencompass/configs/datasets/PMMEval/pmmeval_gen.py
configpath_llmjudge: ''
- pi_llm:
name: PI-LLM
category: Memory
paper: https://arxiv.org/abs/2506.08184
configpath: opencompass/configs/datasets/PI_LLM/pi_llm_gen.py
configpath_llmjudge: ''
- theoremqa:
name: TheoremQA
category: Reasoning
paper: https://arxiv.org/pdf/2305.12524
configpath: opencompass/configs/datasets/TheroremQA/TheoremQA_gen.py
configpath_llmjudge: ''
- agieval:
name: AGIEval
category: Examination
paper: https://arxiv.org/pdf/2304.06364
configpath: opencompass/configs/datasets/agieval/agieval_gen.py
configpath_llmjudge: ''
- babilong:
name: BABILong
category: Long Context
paper: https://arxiv.org/pdf/2406.10149
configpath: opencompass/configs/datasets/babilong
configpath_llmjudge: ''
- bigcodebench:
name: BigCodeBench
category: Code
paper: https://arxiv.org/pdf/2406.15877
configpath: opencompass/configs/datasets/bigcodebench/bigcodebench_gen.py
configpath_llmjudge: ''
- calm:
name: CaLM
category: Reasoning
paper: https://arxiv.org/pdf/2405.00622
configpath: opencompass/configs/datasets/calm/calm.py
configpath_llmjudge: ''
- infinitebench:
name: InfiniteBench (∞Bench)
category: Long Context
paper: https://aclanthology.org/2024.acl-long.814.pdf
configpath: opencompass/configs/datasets/infinitebench/infinitebench.py
configpath_llmjudge: ''
- korbench:
name: KOR-Bench
category: Reasoning
paper: https://arxiv.org/pdf/2410.06526v1
configpath: opencompass/configs/datasets/korbench/korbench_gen.py
configpath_llmjudge: opencompass/configs/datasets/korbench/korbench_llm_judge_gen.py
- lawbench:
name: LawBench
category: Knowledge / Law
paper: https://arxiv.org/pdf/2309.16289
configpath:
- opencompass/configs/datasets/lawbench/lawbench_zero_shot_gen_002588.py
- opencompass/configs/datasets/lawbench/lawbench_one_shot_gen_002588.py
configpath_llmjudge: ''
- leval:
name: L-Eval
category: Long Context
paper: https://arxiv.org/pdf/2307.11088v1
configpath: opencompass/configs/datasets/leval/leval.py
configpath_llmjudge: ''
- livecodebench:
name: LiveCodeBench
category: Code
paper: https://arxiv.org/pdf/2403.07974
configpath: opencompass/configs/datasets/livecodebench/livecodebench_gen.py
configpath_llmjudge: ''
- livecodebench_pro:
name: LiveCodeBench Pro
category: Code
paper: https://arxiv.org/pdf/2506.11928
configpath: opencompass/configs/datasets/livecodebench_pro/livecodebench_pro_gen.py
configpath_llmjudge: ''
- livemathbench:
name: LiveMathBench
category: Math
paper: https://arxiv.org/pdf/2412.13147
configpath: opencompass/configs/datasets/livemathbench/livemathbench_gen.py
configpath_llmjudge: ''
- livereasonbench:
name: LiveReasonBench
category: Reasoning
paper: ''
configpath: opencompass/configs/datasets/livereasonbench/livereasonbench_gen.py
configpath_llmjudge: ''
- longbench:
name: LongBench
category: Long Context
paper: https://github.com/THUDM/LongBench
configpath:
- opencompass/configs/datasets/longbench/longbench.py
- opencompass/configs/datasets/longbenchv2/longbenchv2_gen.py
configpath_llmjudge: ''
- lveval:
name: LV-Eval
category: Long Context
paper: https://arxiv.org/pdf/2402.05136
configpath: opencompass/configs/datasets/lveval/lveval.py
configpath_llmjudge: ''
- mastermath2024v1:
name: Mastermath2024v1
category: Math
paper: ''
configpath: opencompass/configs/datasets/mastermath2024v1/mastermath2024v1_gen.py
configpath_llmjudge: ''
- matbench:
name: matbench
category: Science / Material
paper: 'https://www.nature.com/articles/s41524-020-00406-3'
configpath: opencompass/configs/datasets/matbench/matbench_gen_f71840.py
configpath_llmjudge: ''
- medbench:
name: MedBench
category: Knowledge / Medicine
paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10778138
configpath: opencompass/configs/datasets/MedBench/medbench_gen.py
configpath_llmjudge: ''
- MedCalc_Bench:
name: MedCalc_Bench
category: Knowledge / Medicine
paper: https://arxiv.org/abs/2406.12036
configpath: opencompass/configs/datasets/MedCalc_Bench/MedCalcBench_official_gen_a5155f.py
configpath_llmjudge: ''
- MedQA:
name: MedQA
category: Knowledge / Medicine
paper: https://arxiv.org/abs/2009.13081
configpath: opencompass/configs/datasets/MedQA/MedQA_gen.py
configpath_llmjudge: opencompass/configs/datasets/MedQA/MedQA_llmjudge_gen.py
- MedXpertQA:
name: MedXpertQA
category: Knowledge / Medicine
paper: https://arxiv.org/abs/2501.18362
configpath: opencompass/configs/datasets/MedXpertQA/MedXpertQA_gen.py
configpath_llmjudge: opencompass/configs/datasets/MedXpertQA/MedXpertQA_llmjudge_gen.py
- ClinicBench:
name: ClinicBench
category: Knowledge / Medicine
paper: https://arxiv.org/abs/2405.00716
configpath: ''
configpath_llmjudge: opencompass/configs/datasets/ClinicBench/ClinicBench_llmjudge_gen.py
- ScienceQA:
name: ScienceQA
category: Knowledge / Medicine
paper: https://arxiv.org/abs/2209.09513
configpath: ''
configpath_llmjudge: opencompass/configs/datasets/ScienceQA/ScienceQA_llmjudge_gen.py
- PubMedQA:
name: PubMedQA
category: Knowledge / Medicine
paper: https://arxiv.org/abs/1909.06146
configpath: ''
configpath_llmjudge: opencompass/configs/datasets/PubMedQA/PubMedQA_llmjudge_gen.py
- musr:
name: MuSR
category: Reasoning
paper: https://arxiv.org/pdf/2310.16049
configpath: opencompass/configs/datasets/musr/musr_gen.py
configpath_llmjudge: opencompass/configs/datasets/musr/musr_llm_judge_gen.py
- needlebench:
name: NeedleBench V1 (Deprecated)
category: Long Context
paper: https://arxiv.org/abs/2407.11963v1
configpath: opencompass/configs/datasets/needlebench
configpath_llmjudge: ''
- needlebench_v2:
name: NeedleBench V2
category: Long Context
paper: https://arxiv.org/abs/2407.11963v2
configpath: opencompass/configs/datasets/needlebench_v2
configpath_llmjudge: ''
- ruler:
name: RULER
category: Long Context
paper: https://arxiv.org/pdf/2404.06654
configpath: opencompass/configs/datasets/ruler
configpath_llmjudge: ''
- alignment:
name: AlignBench
category: Subjective / Alignment
paper: https://arxiv.org/pdf/2311.18743
configpath: opencompass/configs/datasets/subjective/alignbench
configpath_llmjudge: ''
- alpaca:
name: AlpacaEval
category: Subjective / Instruction Following
paper: https://github.com/tatsu-lab/alpaca_eval
configpath: opencompass/configs/datasets/subjective/alpaca_eval
configpath_llmjudge: ''
- arenahard:
name: Arena-Hard
category: Subjective / Chatbot
paper: https://lmsys.org/blog/2024-04-19-arena-hard/
configpath: opencompass/configs/datasets/subjective/arena_hard
configpath_llmjudge: ''
- flames:
name: FLAMES
category: Subjective / Alignment
paper: https://arxiv.org/pdf/2311.06899
configpath: opencompass/configs/datasets/subjective/flames/flames_gen.py
configpath_llmjudge: ''
- fofo:
name: FOFO
category: Subjective / Format Following
paper: https://arxiv.org/pdf/2402.18667
configpath: opencompass/configs/datasets/subjective/fofo
configpath_llmjudge: ''
- followbench:
name: FollowBench
category: Subjective / Instruction Following
paper: https://arxiv.org/pdf/2310.20410
configpath: opencompass/configs/datasets/subjective/followbench
configpath_llmjudge: ''
- hellobench:
name: HelloBench
category: Subjective / Long Context
paper: https://arxiv.org/pdf/2409.16191
configpath: opencompass/configs/datasets/subjective/hellobench
configpath_llmjudge: ''
- judgerbench:
name: JudgerBench
category: Subjective / Long Context
paper: https://arxiv.org/pdf/2410.16256
configpath: opencompass/configs/datasets/subjective/judgerbench
configpath_llmjudge: ''
- multiround:
name: MT-Bench-101
category: Subjective / Multi-Round
paper: https://arxiv.org/pdf/2402.14762
configpath: opencompass/configs/datasets/subjective/multiround
configpath_llmjudge: ''
- wildbench:
name: WildBench
category: Subjective / Real Task
paper: https://arxiv.org/pdf/2406.04770
configpath: opencompass/configs/datasets/subjective/wildbench
configpath_llmjudge: ''
- teval:
name: T-Eval
category: Tool Utilization
paper: https://arxiv.org/pdf/2312.14033
configpath:
- opencompass/configs/datasets/teval/teval_en_gen.py
- opencompass/configs/datasets/teval/teval_zh_gen.py
configpath_llmjudge: ''
- financeiq:
name: FinanceIQ
category: Knowledge / Finance
paper: https://github.com/Duxiaoman-DI/XuanYuan/tree/main/FinanceIQ
configpath: opencompass/configs/datasets/FinanceIQ/FinanceIQ_gen.py
configpath_llmjudge: ''
- gaokaobench:
name: GAOKAOBench
category: Examination
paper: https://arxiv.org/pdf/2305.12474
configpath: opencompass/configs/datasets/GaokaoBench/GaokaoBench_gen.py
configpath_llmjudge: ''
- lcbench:
name: LCBench
category: Code
paper: https://github.com/open-compass/CodeBench/
configpath: opencompass/configs/datasets/LCBench/lcbench_gen.py
configpath_llmjudge: ''
- MMLUArabic:
name: ArabicMMLU
category: Language
paper: https://arxiv.org/pdf/2402.12840
configpath: opencompass/configs/datasets/MMLUArabic/MMLUArabic_gen.py
configpath_llmjudge: ''
- OpenFinData:
name: OpenFinData
category: Knowledge / Finance
paper: https://github.com/open-compass/OpenFinData
configpath: opencompass/configs/datasets/OpenFinData/OpenFinData_gen.py
configpath_llmjudge: ''
- QuALITY:
name: QuALITY
category: Long Context
paper: https://arxiv.org/pdf/2112.08608
configpath: opencompass/configs/datasets/QuALITY/QuALITY_gen.py
configpath_llmjudge: ''
- advglue:
name: Adversarial GLUE
category: Safety
paper: https://openreview.net/pdf?id=GF9cSKI3A_q
configpath:
- opencompass/configs/datasets/adv_glue/adv_glue_mnli/adv_glue_mnli_gen.py
- opencompass/configs/datasets/adv_glue/adv_glue_mnli_mm/adv_glue_mnli_mm_gen.py
- opencompass/configs/datasets/adv_glue/adv_glue_qnli/adv_glue_qnli_gen.py
- opencompass/configs/datasets/adv_glue/adv_glue_qqp/adv_glue_qqp_gen.py
- opencompass/configs/datasets/adv_glue/adv_glue_rte/adv_glue_rte_gen.py
- opencompass/configs/datasets/adv_glue/adv_glue_sst2/adv_glue_sst2_gen.py
configpath_llmjudge: ''
- afqmc:
name: CLUE / AFQMC
category: Language
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_afqmc/CLUE_afqmc_gen.py
configpath_llmjudge: ''
- aime2024:
name: AIME2024
category: Examination
paper: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
configpath: opencompass/configs/datasets/aime2024/aime2024_gen.py
configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llmjudge_gen.py
- anli:
name: Adversarial NLI
category: Reasoning
paper: https://arxiv.org/pdf/1910.14599v2
configpath: opencompass/configs/datasets/anli/anli_gen.py
configpath_llmjudge: ''
- anthropics_evals:
name: Anthropics Evals
category: Safety
paper: https://arxiv.org/pdf/2212.09251
configpath:
- opencompass/configs/datasets/anthropics_evals/airisk_gen.py
- opencompass/configs/datasets/anthropics_evals/persona_gen.py
- opencompass/configs/datasets/anthropics_evals/sycophancy_gen.py
configpath_llmjudge: ''
- apps:
name: APPS
category: Code
paper: https://arxiv.org/pdf/2105.09938
configpath:
- opencompass/configs/datasets/apps/apps_gen.py
- opencompass/configs/datasets/apps/apps_mini_gen.py
configpath_llmjudge: ''
- arc:
name: ARC
category: Reasoning
paper: https://arxiv.org/pdf/1803.05457
configpath:
- opencompass/configs/datasets/ARC_c/ARC_c_gen.py
- opencompass/configs/datasets/ARC_e/ARC_e_gen.py
configpath_llmjudge: ''
- arc_prize_public_eval:
name: ARC Prize
category: ARC-AGI
paper: https://arcprize.org/guide#private
configpath: opencompass/configs/datasets/ARC_Prize_Public_Evaluation/arc_prize_public_evaluation_gen.py
configpath_llmjudge: ''
- ax:
name: SuperGLUE / AX
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath:
- opencompass/configs/datasets/SuperGLUE_AX_b/SuperGLUE_AX_b_gen.py
- opencompass/configs/datasets/SuperGLUE_AX_g/SuperGLUE_AX_g_gen.py
configpath_llmjudge: ''
- bbh:
name: BIG-Bench Hard
category: Reasoning
paper: https://arxiv.org/pdf/2210.09261
configpath: opencompass/configs/datasets/bbh/bbh_gen.py
configpath_llmjudge: opencompass/configs/datasets/bbh/bbh_llm_judge_gen.py
- bbeh:
name: BIG-Bench Extra Hard
category: Reasoning
paper: https://arxiv.org/abs/2502.19187
configpath: opencompass/configs/datasets/bbeh
configpath_llmjudge: ''
- BoolQ:
name: SuperGLUE / BoolQ
category: Knowledge
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_BoolQ/SuperGLUE_BoolQ_gen.py
configpath_llmjudge: ''
- c3:
name: CLUE / C3 (C³)
category: Understanding
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_C3/CLUE_C3_gen.py
configpath_llmjudge: ''
- CARDBiomedBench:
name: CARDBiomedBench
category: Knowledge / Medicine
paper: https://www.biorxiv.org/content/10.1101/2025.01.15.633272v1
configpath: opencompass/configs/datasets/CARDBiomedBench
configpath_llmjudge: 'opencompass/configs/datasets/CARDBiomedBench/CARDBiomedBench_llmjudge_gen_99a231.py'
- cb:
name: SuperGLUE / CB
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_CB/SuperGLUE_CB_gen.py
configpath_llmjudge: ''
- ceval:
name: C-EVAL
category: Examination
paper: https://arxiv.org/pdf/2305.08322v1
configpath: opencompass/configs/datasets/ceval/ceval_gen.py
configpath_llmjudge: ''
- charm:
name: CHARM
category: Reasoning
paper: https://arxiv.org/pdf/2403.14112
configpath: opencompass/configs/datasets/CHARM/charm_reason_gen.py
configpath_llmjudge: ''
- chembench:
name: ChemBench
category: Knowledge / Chemistry
paper: https://arxiv.org/pdf/2404.01475
configpath: opencompass/configs/datasets/ChemBench/ChemBench_gen.py
configpath_llmjudge: ''
- chid:
name: FewCLUE / CHID
category: Language
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_chid/FewCLUE_chid_gen.py
configpath_llmjudge: ''
- chinese_simpleqa:
name: Chinese SimpleQA
category: Knowledge
paper: https://arxiv.org/pdf/2411.07140
configpath: opencompass/configs/datasets/chinese_simpleqa/chinese_simpleqa_gen.py
configpath_llmjudge: ''
- cibench:
name: CIBench
category: Code
paper: https://www.arxiv.org/pdf/2407.10499
configpath:
- opencompass/configs/datasets/CIBench/CIBench_generation_gen_8ab0dc.py
- opencompass/configs/datasets/CIBench/CIBench_template_gen_e6b12a.py
- opencompass/configs/datasets/CIBench/CIBench_template_oracle_gen_fecda1.py
configpath_llmjudge: ''
- civilcomments:
name: CivilComments
category: Safety
paper: https://arxiv.org/pdf/1903.04561
configpath: opencompass/configs/datasets/civilcomments/civilcomments_clp.py
configpath_llmjudge: ''
- clozeTest_maxmin:
name: Cloze Test-max/min
category: Code
paper: https://arxiv.org/pdf/2102.04664
configpath: opencompass/configs/datasets/clozeTest_maxmin/clozeTest_maxmin_gen.py
configpath_llmjudge: ''
- cluewsc:
name: FewCLUE / CLUEWSC
category: Language / WSC
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_cluewsc/FewCLUE_cluewsc_gen.py
configpath_llmjudge: ''
- cmb:
name: CMB
category: Knowledge / Medicine
paper: https://arxiv.org/pdf/2308.08833
configpath: opencompass/configs/datasets/cmb/cmb_gen.py
configpath_llmjudge: ''
- cmmlu:
name: CMMLU
category: Understanding
paper: https://arxiv.org/pdf/2306.09212
configpath: opencompass/configs/datasets/cmmlu/cmmlu_gen.py
configpath_llmjudge: opencompass/configs/datasets/cmmlu/cmmlu_llm_judge_gen.py
- cmnli:
name: CLUE / CMNLI
category: Reasoning
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_gen.py
configpath_llmjudge: ''
- cmo_fib:
name: cmo_fib
category: Examination
paper: ''
configpath: opencompass/configs/datasets/cmo_fib/cmo_fib_gen.py
configpath_llmjudge: ''
- cmrc:
name: CLUE / CMRC
category: Understanding
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_CMRC/CLUE_CMRC_gen.py
configpath_llmjudge: ''
- commonsenseqa:
name: CommonSenseQA
category: Knowledge
paper: https://arxiv.org/pdf/1811.00937v2
configpath: opencompass/configs/datasets/commonsenseqa/commonsenseqa_gen.py
configpath_llmjudge: ''
- commonsenseqa_cn:
name: CommonSenseQA-CN
category: Knowledge
paper: ''
configpath: opencompass/configs/datasets/commonsenseqa_cn/commonsenseqacn_gen.py
configpath_llmjudge: ''
- copa:
name: SuperGLUE / COPA
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_COPA/SuperGLUE_COPA_gen.py
configpath_llmjudge: ''
- crowspairs:
name: CrowsPairs
category: Safety
paper: https://arxiv.org/pdf/2010.00133
configpath: opencompass/configs/datasets/crowspairs/crowspairs_gen.py
configpath_llmjudge: ''
- crowspairs_cn:
name: CrowsPairs-CN
category: Safety
paper: ''
configpath: opencompass/configs/datasets/crowspairs_cn/crowspairscn_gen.py
configpath_llmjudge: ''
- cvalues:
name: CVALUES
category: Safety
paper: http://xdp-expriment.oss-cn-zhangjiakou.aliyuncs.com/shanqi.xgh/release_github/CValues.pdf
configpath: opencompass/configs/datasets/cvalues/cvalues_responsibility_gen.py
configpath_llmjudge: ''
- drcd:
name: CLUE / DRCD
category: Understanding
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_DRCD/CLUE_DRCD_gen.py
configpath_llmjudge: ''
- drop:
name: DROP (DROP Simple Eval)
category: Understanding
paper: https://arxiv.org/pdf/1903.00161
configpath: opencompass/configs/datasets/drop/drop_gen.py
configpath_llmjudge: opencompass/configs/datasets/drop/drop_llm_judge_gen.py
- ds1000:
name: DS-1000
category: Code
paper: https://arxiv.org/pdf/2211.11501
configpath:
- opencompass/configs/datasets/ds1000/ds1000_gen_5c4bec.py
configpath_llmjudge: ''
- eprstmt:
name: FewCLUE / EPRSTMT
category: Understanding
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_eprstmt/FewCLUE_eprstmt_gen.py
configpath_llmjudge: ''
- flores:
name: Flores
category: Language
paper: https://aclanthology.org/D19-1632.pdf
configpath: opencompass/configs/datasets/flores/flores_gen.py
configpath_llmjudge: ''
- game24:
name: Game24
category: Math
paper: https://huggingface.co/datasets/nlile/24-game
configpath: opencompass/configs/datasets/game24/game24_gen.py
configpath_llmjudge: ''
- govrepcrs:
name: Government Report Dataset
category: Long Context
paper: https://aclanthology.org/2021.naacl-main.112.pdf
configpath: opencompass/configs/datasets/govrepcrs/govrepcrs_gen.py
configpath_llmjudge: ''
- gpqa:
name: GPQA
category: Knowledge
paper: https://arxiv.org/pdf/2311.12022v1
configpath: opencompass/configs/datasets/gpqa/gpqa_gen.py
configpath_llmjudge: opencompass/configs/datasets/gpqa/gpqa_llm_judge_gen.py
- gsm8k:
name: GSM8K
category: Math
paper: https://arxiv.org/pdf/2110.14168v2
configpath: opencompass/configs/datasets/gsm8k/gsm8k_gen.py
configpath_llmjudge: ''
- gsm_hard:
name: GSM-Hard
category: Math
paper: https://proceedings.mlr.press/v202/gao23f/gao23f.pdf
configpath: opencompass/configs/datasets/gsm_hard/gsmhard_gen.py
configpath_llmjudge: ''
- hle:
name: HLE (Humanity's Last Exam)
category: Reasoning
paper: https://lastexam.ai/paper
configpath: opencompass/configs/datasets/HLE/hle_gen.py
configpath_llmjudge: ''
- hellaswag:
name: HellaSwag
category: Reasoning
paper: https://arxiv.org/pdf/1905.07830
configpath: opencompass/configs/datasets/hellaswag/hellaswag_gen.py
configpath_llmjudge: opencompass/configs/datasets/hellaswag/hellaswag_llm_judge_gen.py
- humaneval:
name: HumanEval
category: Code
paper: https://arxiv.org/pdf/2107.03374v2
configpath: opencompass/configs/datasets/humaneval/humaneval_gen.py
configpath_llmjudge: ''
- humaneval_cn:
name: HumanEval-CN
category: Code
paper: ''
configpath: opencompass/configs/datasets/humaneval_cn/humaneval_cn_gen.py
configpath_llmjudge: ''
- humaneval_multi:
name: Multi-HumanEval
category: Code
paper: https://arxiv.org/pdf/2210.14868
configpath: opencompass/configs/datasets/humaneval_multi/humaneval_multi_gen.py
configpath_llmjudge: ''
- humaneval_plus:
name: HumanEval+
category: Code
paper: https://arxiv.org/pdf/2305.01210
configpath: opencompass/configs/datasets/humaneval_plus/humaneval_plus_gen.py
configpath_llmjudge: ''
- humanevalx:
name: HumanEval-X
category: Code
paper: https://dl.acm.org/doi/pdf/10.1145/3580305.3599790
configpath: opencompass/configs/datasets/humanevalx/humanevalx_gen.py
configpath_llmjudge: ''
- humaneval_pro:
name: HumanEval Pro
category: Code
paper: https://arxiv.org/abs/2412.21199
configpath: opencompass/configs/datasets/humaneval_pro/humaneval_pro_gen.py
configpath_llmjudge: ''
- hungarian_math:
name: Hungarian_Math
category: Math
paper: https://huggingface.co/datasets/keirp/hungarian_national_hs_finals_exam
configpath: opencompass/configs/datasets/hungarian_exam/hungarian_exam_gen.py
configpath_llmjudge: ''
- iwslt2017:
name: IWSLT2017
category: Language
paper: https://cris.fbk.eu/bitstream/11582/312796/1/iwslt17-overview.pdf
configpath: opencompass/configs/datasets/iwslt2017/iwslt2017_gen.py
configpath_llmjudge: ''
- jigsawmultilingual:
name: JigsawMultilingual
category: Safety
paper: https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data
configpath: opencompass/configs/datasets/jigsawmultilingual/jigsawmultilingual_clp.py
configpath_llmjudge: ''
- lambada:
name: LAMBADA
category: Understanding
paper: https://arxiv.org/pdf/1606.06031
configpath: opencompass/configs/datasets/lambada/lambada_gen.py
configpath_llmjudge: ''
- lcsts:
name: LCSTS
category: Understanding
paper: https://aclanthology.org/D15-1229.pdf
configpath: opencompass/configs/datasets/lcsts/lcsts_gen.py
configpath_llmjudge: ''
- livestembench:
name: LiveStemBench
category: ''
paper: ''
configpath: opencompass/configs/datasets/livestembench/livestembench_gen.py
configpath_llmjudge: ''
- llm_compression:
name: LLM Compression
category: Bits Per Character (BPC)
paper: https://arxiv.org/pdf/2404.09937
configpath: opencompass/configs/datasets/llm_compression/llm_compression.py
configpath_llmjudge: ''
- math:
name: MATH
category: Math
paper: https://arxiv.org/pdf/2103.03874
configpath: opencompass/configs/datasets/math/math_gen.py
configpath_llmjudge: opencompass/configs/datasets/math/math_llm_judge_gen.py
- math500:
name: MATH500
category: Math
paper: https://github.com/openai/prm800k
configpath: opencompass/configs/datasets/math/math_prm800k_500_gen.py
configpath_llmjudge: opencompass/configs/datasets/math/math_prm800k_500_llm_judge_gen.py
- math401:
name: MATH 401
category: Math
paper: https://arxiv.org/pdf/2304.02015
configpath: opencompass/configs/datasets/math401/math401_gen.py
configpath_llmjudge: ''
- mathbench:
name: MathBench
category: Math
paper: https://arxiv.org/pdf/2405.12209
configpath: opencompass/configs/datasets/mathbench/mathbench_gen.py
configpath_llmjudge: ''
- mbpp:
name: MBPP
category: Code
paper: https://arxiv.org/pdf/2108.07732
configpath: opencompass/configs/datasets/mbpp/mbpp_gen.py
configpath_llmjudge: ''
- mbpp_cn:
name: MBPP-CN
category: Code
paper: ''
configpath: opencompass/configs/datasets/mbpp_cn/mbpp_cn_gen.py
configpath_llmjudge: ''
- mbpp_plus:
name: MBPP-PLUS
category: Code
paper: ''
configpath: opencompass/configs/datasets/mbpp_plus/mbpp_plus_gen.py
configpath_llmjudge: ''
- mbpp_pro:
name: MBPP Pro
category: Code
paper: https://arxiv.org/abs/2412.21199
configpath: opencompass/configs/datasets/mbpp_pro/mbpp_pro_gen.py
configpath_llmjudge: ''
- mgsm:
name: MGSM
category: Language / Math
paper: https://arxiv.org/pdf/2210.03057
configpath: opencompass/configs/datasets/mgsm/mgsm_gen.py
configpath_llmjudge: ''
- mmlu:
name: MMLU
category: Understanding
paper: https://arxiv.org/pdf/2009.03300
configpath: opencompass/configs/datasets/mmlu/mmlu_gen.py
configpath_llmjudge: opencompass/configs/datasets/mmlu/mmlu_llm_judge_gen.py
- SciEval:
name: SciEval
category: Understanding
paper: https://arxiv.org/pdf/2308.13149
configpath: opencompass/configs/datasets/SciEval/SciEval_gen.py
configpath_llmjudge: opencompass/configs/datasets/SciEval/SciEval_llm_judge_gen.py
- mmlu_cf:
name: MMLU-CF
category: Understanding
paper: https://arxiv.org/pdf/2412.15194
configpath: opencompass/configs/datasets/mmlu_cf/mmlu_cf_gen.py
configpath_llmjudge: ''
- mmlu_pro:
name: MMLU-Pro
category: Understanding
paper: https://arxiv.org/pdf/2406.01574
configpath: opencompass/configs/datasets/mmlu_pro/mmlu_pro_gen.py
configpath_llmjudge: opencompass/configs/datasets/mmlu_pro/mmlu_pro_llm_judge_gen.py
- mmmlu:
name: MMMLU
category: Language / Understanding
paper: https://huggingface.co/datasets/openai/MMMLU
configpath:
- opencompass/configs/datasets/mmmlu/mmmlu_gen.py
- opencompass/configs/datasets/mmmlu_lite/mmmlu_lite_gen.py
configpath_llmjudge: ''
- multirc:
name: SuperGLUE / MultiRC
category: Understanding
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_MultiRC/SuperGLUE_MultiRC_gen.py
configpath_llmjudge: ''
- multipl_e:
name: MultiPL-E
category: Code
paper: https://arxiv.org/pdf/2210.14868
configpath: opencompass/configs/datasets/multipl_e
configpath_llmjudge: ''
- narrativeqa:
name: NarrativeQA
category: Understanding
paper: https://github.com/google-deepmind/narrativeqa
configpath: opencompass/configs/datasets/narrativeqa/narrativeqa_gen.py
configpath_llmjudge: ''
- natural_question:
name: NaturalQuestions
category: Knowledge
paper: https://github.com/google-research-datasets/natural-questions
configpath: opencompass/configs/datasets/nq/nq_gen.py
configpath_llmjudge: ''
- natural_question_cn:
name: NaturalQuestions-CN
category: Knowledge
paper: ''
configpath: opencompass/configs/datasets/nq_cn/nqcn_gen.py
configpath_llmjudge: ''
- obqa:
name: OpenBookQA
category: Knowledge
paper: https://arxiv.org/pdf/1809.02789v1
configpath: opencompass/configs/datasets/obqa/obqa_gen.py
configpath_llmjudge: ''
- olymmath:
name: OlymMATH
category: Math
paper: https://arxiv.org/abs/2503.21380
configpath: ''
configpath_llmjudge: opencompass/configs/datasets/OlymMATH/olymmath_llm_judeg_gen.py
- piqa:
name: PIQA
category: Knowledge / Physics
paper: https://arxiv.org/pdf/1911.11641v1
configpath: opencompass/configs/datasets/piqa/piqa_gen.py
configpath_llmjudge: ''
- ProteinLMBench:
name: ProteinLMBench
category: Knowledge / Biology (Protein)
paper: https://arxiv.org/abs/2406.05540
configpath: opencompass/configs/datasets/ProteinLMBench/ProteinLMBench_gen.py
configpath_llmjudge: opencompass/configs/datasets/ProteinLMBench/ProteinLMBench_llmjudge_gen.py
- py150:
name: py150
category: Code
paper: https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line
configpath: opencompass/configs/datasets/py150/py150_gen.py
configpath_llmjudge: ''
- qasper:
name: Qasper
category: Long Context
paper: https://arxiv.org/pdf/2105.03011
configpath: opencompass/configs/datasets/qasper/qasper_gen.py
configpath_llmjudge: ''
- qaspercut:
name: Qasper-Cut
category: Long Context
paper: ''
configpath: opencompass/configs/datasets/qaspercut/qaspercut_gen.py
configpath_llmjudge: ''
- race:
name: RACE
category: Examination
paper: https://arxiv.org/pdf/1704.04683
configpath: opencompass/configs/datasets/race/race_gen.py
configpath_llmjudge: ''
- rbench:
name: R-Bench
category: Reasoning
paper: https://arxiv.org/pdf/2505.02018
configpath: opencompass/configs/datasets/R-Bench/rbench_gen_37cbaf8.py
configpath_llmjudge: ''
- realtoxicprompts:
name: RealToxicPrompts
category: Safety
paper: https://arxiv.org/pdf/2009.11462
configpath: opencompass/configs/datasets/realtoxicprompts/realtoxicprompts_gen.py
configpath_llmjudge: ''
- record:
name: SuperGLUE / ReCoRD
category: Understanding
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_ReCoRD/SuperGLUE_ReCoRD_gen.py
configpath_llmjudge: ''
- rte:
name: SuperGLUE / RTE
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_RTE/SuperGLUE_RTE_gen.py
configpath_llmjudge: ''
- ocnli:
name: CLUE / OCNLI
category: Reasoning
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_gen.py
configpath_llmjudge: ''
- ocnlifc:
name: FewCLUE / OCNLI-FC
category: Reasoning
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_ocnli_fc/FewCLUE_ocnli_fc_gen.py
configpath_llmjudge: ''
- rolebench:
name: RoleBench
category: Role Play
paper: https://arxiv.org/pdf/2310.00746
configpath: opencompass/configs/datasets/rolebench
configpath_llmjudge: ''
- s3eval:
name: S3Eval
category: Long Context
paper: https://aclanthology.org/2024.naacl-long.69.pdf
configpath: opencompass/configs/datasets/s3eval/s3eval_gen.py
configpath_llmjudge: ''
- scibench:
name: SciBench
category: Reasoning
paper: https://sxkdz.github.io/files/publications/ICML/SciBench/SciBench.pdf
configpath: opencompass/configs/datasets/scibench/scibench_gen.py
configpath_llmjudge: ''
- scicode:
name: SciCode
category: Code
paper: https://arxiv.org/pdf/2407.13168
configpath: opencompass/configs/datasets/scicode/scicode_gen.py
configpath_llmjudge: ''
- seedbench:
name: SeedBench
category: Knowledge
paper: 'https://aclanthology.org/2025.acl-long.1516.pdf'
configpath: opencompass/configs/datasets/SeedBench/seedbench_gen.py
configpath_llmjudge: ''
- simpleqa:
name: SimpleQA
category: Knowledge
paper: https://arxiv.org/pdf/2411.04368
configpath: opencompass/configs/datasets/SimpleQA/simpleqa_gen.py
configpath_llmjudge: ''
- siqa:
name: SocialIQA
category: Reasoning
paper: https://arxiv.org/pdf/1904.09728
configpath: opencompass/configs/datasets/siqa/siqa_gen.py
configpath_llmjudge: ''
- squad20:
name: SQuAD2.0
category: Understanding
paper: https://arxiv.org/pdf/1806.03822
configpath: opencompass/configs/datasets/squad20/squad20_gen.py
configpath_llmjudge: ''
- storycloze:
name: StoryCloze
category: Reasoning
paper: https://aclanthology.org/2022.emnlp-main.616.pdf
configpath: opencompass/configs/datasets/storycloze/storycloze_gen.py
configpath_llmjudge: ''
- strategyqa:
name: StrategyQA
category: Reasoning
paper: https://arxiv.org/pdf/2101.02235
configpath: opencompass/configs/datasets/strategyqa/strategyqa_gen.py
configpath_llmjudge: ''
- summedits:
name: SummEdits
category: Language
paper: https://aclanthology.org/2023.emnlp-main.600.pdf
configpath: opencompass/configs/datasets/summedits/summedits_gen.py
configpath_llmjudge: ''
- summscreen:
name: SummScreen
category: Understanding
paper: https://arxiv.org/pdf/2104.07091v1
configpath: opencompass/configs/datasets/summscreen/summscreen_gen.py
configpath_llmjudge: ''
- svamp:
name: SVAMP
category: Math
paper: https://aclanthology.org/2021.naacl-main.168.pdf
configpath: opencompass/configs/datasets/SVAMP/svamp_gen.py
configpath_llmjudge: ''
- tabmwp:
name: TabMWP
category: Math / Table
paper: https://arxiv.org/pdf/2209.14610
configpath: opencompass/configs/datasets/TabMWP/TabMWP_gen.py
configpath_llmjudge: ''
- taco:
name: TACO
category: Code
paper: https://arxiv.org/pdf/2312.14852
configpath: opencompass/configs/datasets/taco/taco_gen.py
configpath_llmjudge: ''
- tnews:
name: FewCLUE / TNEWS
category: Understanding
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_tnews/FewCLUE_tnews_gen.py
configpath_llmjudge: ''
- bustm:
name: FewCLUE / BUSTM
category: Reasoning
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_bustm/FewCLUE_bustm_gen.py
configpath_llmjudge: ''
- csl:
name: FewCLUE / CSL
category: Understanding
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_csl/FewCLUE_csl_gen.py
configpath_llmjudge: ''
- ocnli_fc:
name: FewCLUE / OCNLI-FC
category: Reasoning
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_ocnli_fc
configpath_llmjudge: ''
- triviaqa:
name: TriviaQA
category: Knowledge
paper: https://arxiv.org/pdf/1705.03551v2
configpath: opencompass/configs/datasets/triviaqa/triviaqa_gen.py
configpath_llmjudge: ''
- triviaqarc:
name: TriviaQA-RC
category: Knowledge / Understanding
paper: ''
configpath: opencompass/configs/datasets/triviaqarc/triviaqarc_gen.py
configpath_llmjudge: ''
- truthfulqa:
name: TruthfulQA
category: Safety
paper: https://arxiv.org/pdf/2109.07958v2
configpath: opencompass/configs/datasets/truthfulqa/truthfulqa_gen.py
configpath_llmjudge: ''
- tydiqa:
name: TyDi-QA
category: Language
paper: https://storage.googleapis.com/tydiqa/tydiqa.pdf
configpath: opencompass/configs/datasets/tydiqa/tydiqa_gen.py
configpath_llmjudge: ''
- wic:
name: SuperGLUE / WiC
category: Language
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_WiC/SuperGLUE_WiC_gen.py
configpath_llmjudge: ''
- wsc:
name: SuperGLUE / WSC
category: Language / WSC
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_WSC/SuperGLUE_WSC_gen.py
configpath_llmjudge: ''
- winogrande:
name: WinoGrande
category: Language / WSC
paper: https://arxiv.org/pdf/1907.10641v2
configpath: opencompass/configs/datasets/winogrande/winogrande_gen.py
configpath_llmjudge: ''
- xcopa:
name: XCOPA
category: Language
paper: https://arxiv.org/pdf/2005.00333
configpath: opencompass/configs/datasets/XCOPA/XCOPA_ppl.py
configpath_llmjudge: ''
- xiezhi:
name: Xiezhi
category: Knowledge
paper: https://arxiv.org/pdf/2306.05783
configpath: opencompass/configs/datasets/xiezhi/xiezhi_gen.py
configpath_llmjudge: ''
- xlsum:
name: XLSum
category: Understanding
paper: https://arxiv.org/pdf/2106.13822v1
configpath: opencompass/configs/datasets/XLSum/XLSum_gen.py
configpath_llmjudge: ''
- xsum:
name: XSum
category: Understanding
paper: https://arxiv.org/pdf/1808.08745
configpath: opencompass/configs/datasets/Xsum/Xsum_gen.py
configpath_llmjudge: ''
- cola:
name: GLUE / CoLA
category: Understanding
paper: https://arxiv.org/pdf/1804.07461
configpath: opencompass/configs/datasets/GLUE_CoLA/GLUE_CoLA_ppl.py
configpath_llmjudge: ''
- mprc:
name: GLUE / MRPC
category: Understanding
paper: https://arxiv.org/pdf/1804.07461
configpath: opencompass/configs/datasets/GLUE_MRPC/GLUE_MRPC_ppl.py
configpath_llmjudge: ''
- qqp:
name: GLUE / QQP
category: Understanding
paper: https://arxiv.org/pdf/1804.07461
configpath: opencompass/configs/datasets/GLUE_QQP/GLUE_QQP_ppl.py
configpath_llmjudge: ''
- omni_math:
name: Omni-MATH
category: Math
paper: https://omni-math.github.io/
configpath: opencompass/configs/datasets/omni_math/omni_math_gen.py
configpath_llmjudge: ''
- wikibench:
name: WikiBench
category: Knowledge
paper: ''
configpath: opencompass/configs/datasets/wikibench/wikibench_gen.py
configpath_llmjudge: ''
- supergpqa:
name: SuperGPQA
category: Knowledge
paper: https://arxiv.org/pdf/2502.14739
configpath: opencompass/configs/datasets/supergpqa
configpath_llmjudge: ''
- climaqa:
name: ClimaQA
category: Science
paper: https://arxiv.org/pdf/2410.16701
configpath: ''
configpath_llmjudge:
- opencompass/configs/datasets/ClimaQA/ClimaQA_Gold_llm_judge.py
- opencompass/configs/datasets/ClimaQA/ClimaQA_Silver_llm_judge.py
- physics:
name: PHYSICS
category: Science
paper: https://arxiv.org/pdf/2503.21821
configpath: ''
configpath_llmjudge: opencompass/configs/datasets/PHYSICS/PHYSICS_llm_judge_gen_a133a2.py
- smolinstruct:
name: SmolInstruct
category: Science /Chemistry
paper: https://arxiv.org/pdf/2402.09391
configpath: opencompass/configs/datasets/SmolInstruct/smolinstruct_gen.py
configpath_llmjudge: ''
- SciKnowEval:
name: SciKnowEval
category: Science
paper: https://arxiv.org/abs/2406.09098
configpath: opencompass/configs/datasets/SciKnowEval/SciKnowEval_gen_ebe47d.py
configpath_llmjudge: opencompass/configs/datasets/SciKnowEval/SciKnowEval_llmjudge_gen_ebe47d.py
- internsandbox:
name: InternSandbox
category: Reasoning/Code/Agent
paper: ''
configpath: opencompass/configs/datasets/internsandbox/internsandbox_gen_44b982.py
configpath_llmjudge: ''
- nejmaibench:
name: nejmaibench
category: Science /Medicine
paper: https://arxiv.org/pdf/2308.04709
configpath: opencompass/configs/datasets/nejm_ai_benchmark/nejmaibench_gen.py
configpath_llmjudge: opencompass/configs/datasets/nejm_ai_benchmark/nejmaibench_llmjudge_gen.py
- medbullets:
name: Medbullets
category: Science /Medicine
paper: https://arxiv.org/pdf/2402.18060
configpath: opencompass/configs/datasets/Medbullets/medbullets_gen.py
configpath_llmjudge: opencompass/configs/datasets/Medbullets/medbullets_llmjudge_gen.py
- medmcqa:
name: medmcqa
category: Science /Medicine
paper: https://arxiv.org/pdf/2203.14371
configpath: opencompass/configs/datasets/medmcqa/medmcqa_gen.py
configpath_llmjudge: opencompass/configs/datasets/medmcqa/medmcqa_llmjudge_gen.py
- phybench:
name: PHYBench
category: Science /Physics
paper: https://arxiv.org/abs/2504.16074
configpath: opencompass/configs/datasets/PHYBench/phybench_gen.py
configpath_llmjudge: ''
- beyondaime:
name: BeyondAIME
category: Math
paper: ''
configpath: opencompass/configs/datasets/BeyondAIME/beyondaime_gen.py
configpath_llmjudge: ''
- eese:
name: EESE
category: Science
paper: https://arxiv.org/abs/2507.16514
configpath: opencompass/configs/datasets/eese/eese_llm_judge_gen.py
configpath_llmjudge: opencompass/configs/datasets/eese/eese_llm_judge_gen.py
================================================
FILE: docs/en/.readthedocs.yaml
================================================
version: 2
# Set the version of Python and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.8"
formats:
  - epub
sphinx:
  configuration: docs/en/conf.py
python:
  install:
    - requirements: requirements/docs.txt
================================================
FILE: docs/en/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/en/_static/css/readthedocs.css
================================================
.header-logo {
background-image: url("../image/logo.svg");
background-size: 275px 80px;
height: 80px;
width: 275px;
}
@media screen and (min-width: 1100px) {
.header-logo {
top: -25px;
}
}
pre {
white-space: pre;
}
@media screen and (min-width: 2000px) {
.pytorch-content-left {
width: 1200px;
margin-left: 30px;
}
article.pytorch-article {
max-width: 1200px;
}
.pytorch-breadcrumbs-wrapper {
width: 1200px;
}
.pytorch-right-menu.scrolling-fixed {
position: fixed;
top: 45px;
left: 1580px;
}
}
article.pytorch-article section code {
padding: .2em .4em;
background-color: #f3f4f7;
border-radius: 5px;
}
/* Disable the change in tables */
article.pytorch-article section table code {
padding: unset;
background-color: unset;
border-radius: unset;
}
table.autosummary td {
width: 50%
}
img.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
article.pytorch-article p.rubric {
font-weight: bold;
}
================================================
FILE: docs/en/_static/js/custom.js
================================================
var collapsedSections = ['Dataset Statistics'];
$(document).ready(function () {
$('.dataset').DataTable({
"stateSave": false,
"lengthChange": false,
"pageLength": 20,
"order": [],
"language": {
"info": "Show _START_ to _END_ Items(Totally _TOTAL_ )",
"infoFiltered": "(Filtered from _MAX_ Items)",
"search": "Search:",
"zeroRecords": "Item Not Found",
"paginate": {
"next": "Next",
"previous": "Previous"
},
}
});
});
================================================
FILE: docs/en/_templates/404.html
================================================
{% extends "layout.html" %}
{% block body %}
Page Not Found
The page you are looking for cannot be found.
If you just switched documentation versions, it is likely that the page you were on has been moved. You can look for it in
the table of contents on the left, or go to the homepage.
{% endblock %}
================================================
FILE: docs/en/_templates/autosummary/class.rst
================================================
.. role:: hidden
:class: hidden-section
.. currentmodule:: {{ module }}
{{ name | underline}}
.. autoclass:: {{ name }}
:members:
..
autogenerated from _templates/autosummary/class.rst
note it does not have :inherited-members:
================================================
FILE: docs/en/_templates/callable.rst
================================================
.. role:: hidden
:class: hidden-section
.. currentmodule:: {{ module }}
{{ name | underline}}
.. autoclass:: {{ name }}
:members:
:special-members: __call__
..
autogenerated from _templates/callable.rst
note it does not have :inherited-members:
================================================
FILE: docs/en/advanced_guides/accelerator_intro.md
================================================
# Accelerate Evaluation Inference with vLLM or LMDeploy
## Background
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, some scenarios call for more efficient inference backends to speed up the process, such as vLLM or LMDeploy.
- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
## Preparation for Acceleration
First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:
### LMDeploy Installation Method
Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):
```bash
pip install lmdeploy
```
### vLLM Installation Method
Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
```bash
pip install vllm
```
## Accelerated Evaluation Using vLLM or LMDeploy
### Method 1: Using Command Line Parameters to Change the Inference Backend
OpenCompass offers one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models to vLLM or LMDeploy models. Below is an example configuration for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:
```python
# eval_gsm8k.py
from mmengine.config import read_base

with read_base():
    # Select a dataset list
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
    # Select the model to evaluate
    from ..models.hf_llama.hf_llama3_8b_instruct import models
```
Here, `hf_llama3_8b_instruct` specifies the original Huggingface model configuration, as shown below:
```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
        stop_words=['<|end_of_text|>', '<|eot_id|>'],
    )
]
```
To evaluate the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model, use:
```bash
python run.py config/eval_gsm8k.py
```
To accelerate the evaluation using vLLM or LMDeploy, you can use the following script:
```bash
python run.py config/eval_gsm8k.py -a vllm
```
or
```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```
### Method 2: Accelerating Evaluation via Deployed Inference Acceleration Service API
OpenCompass also supports accelerating evaluation by deploying vLLM or LMDeploy inference acceleration service APIs. Follow these steps:
1. Install the openai package:
```bash
pip install openai
```
2. Deploy the inference acceleration service API for vLLM or LMDeploy. Below is an example for LMDeploy:
```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```
Parameters for starting the api_server can be checked using `lmdeploy serve api_server -h`, such as --tp for tensor parallelism, --session-len for the maximum context window length, --cache-max-entry-count for adjusting the k/v cache memory usage ratio, etc.
3. Once the service is successfully deployed, modify the evaluation script by changing the model configuration path to the service address, as shown below:
```python
from opencompass.models import OpenAISDK
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
models = [
dict(
abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
type=OpenAISDK,
key='EMPTY', # API key
openai_api_base='http://0.0.0.0:23333/v1', # Service address
path='Meta-Llama-3-8B-Instruct', # Model name for service request
tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', # The tokenizer name or path, if set to `None`, uses the default `gpt-4` tokenizer
rpm_verbose=True, # Whether to print request rate
meta_template=api_meta_template, # Service request template
query_per_second=1, # Service request rate
max_out_len=1024, # Maximum output length
max_seq_len=4096, # Maximum input length
temperature=0.01, # Generation temperature
batch_size=8, # Batch size
retry=3, # Number of retries
)
]
```
## Acceleration Effect and Performance Comparison
The table below compares accuracy and inference time when evaluating the Llama-3-8B-Instruct model on the GSM8k dataset with vLLM or LMDeploy on a single A800 GPU:
| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------------- | -------- | -------------------------------- | --------------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| VLLM | 72.63 | 07:52 | 3.1 |
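The speedup column can be reproduced from the timing column. The following is a minimal sketch, assuming that speedup is the Huggingface wall-clock time divided by each backend's time, rounded to one decimal place. The helper name `to_seconds` is illustrative and not part of OpenCompass.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string into a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Inference times copied from the table above.
timings = {"Huggingface": "24:26", "LMDeploy": "11:15", "vLLM": "07:52"}

baseline = to_seconds(timings["Huggingface"])
# Assumption: speedup = baseline time / backend time, rounded to one decimal.
speedups = {
    name: round(baseline / to_seconds(value), 1)
    for name, value in timings.items()
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup}x")
```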
================================================
FILE: docs/en/advanced_guides/circular_eval.md
================================================
# CircularEval
## Background
For multiple-choice questions, a large language model (LLM) choosing the correct option does not necessarily imply that it truly understood and reasoned about the question; it could be a guess. To differentiate these scenarios and reduce LLM bias towards particular options, CircularEval can be used. A multiple-choice question is augmented by shuffling its options, and the LLM is considered correct under CircularEval only if it answers all variations of the augmented question correctly.
## Adding Your Own CircularEval Dataset
Generally, to evaluate a dataset using CircularEval, both its loading and evaluation methods need to be rewritten. Modifications are required in both the OpenCompass main library and configuration files. We will use C-Eval as an example for explanation.
OpenCompass main library:
```python
from opencompass.datasets.ceval import CEvalDataset
from opencompass.datasets.circular import CircularDatasetMeta


class CircularCEvalDataset(CEvalDataset, metaclass=CircularDatasetMeta):
    # The overloaded dataset class
    dataset_class = CEvalDataset
    # Splits of the DatasetDict that need CircularEval. CEvalDataset loads
    # [dev, val, test]; only 'val' and 'test' need CircularEval, not 'dev'
    default_circular_splits = ['val', 'test']
    # List of keys to be shuffled
    default_option_keys = ['A', 'B', 'C', 'D']
    # Use this when the value stored under 'answer_key' is one of
    # ['A', 'B', 'C', 'D']. It tells the metaclass which field to update after
    # the options are shuffled. Choose either this or
    # default_answer_key_switch_method
    default_answer_key = 'answer'
    # Use this when the value stored under 'answer_key' is not one of
    # ['A', 'B', 'C', 'D']: a function that sets the correct answer after the
    # options are shuffled. Choose either this or default_answer_key
    # def default_answer_key_switch_method(item, circular_pattern):
    #     # 'item' is the original data item
    #     # 'circular_pattern' is a tuple indicating the order after shuffling
    #     # options, e.g., ('D', 'A', 'B', 'C') means the original option A is
    #     # now D, and so on
    #     item['answer'] = circular_pattern['ABCD'.index(item['answer'])]
    #     return item
```
`CircularCEvalDataset` accepts the `circular_pattern` parameter with two values:
- `circular`: Indicates a single cycle. It is the default value. ABCD is expanded to ABCD, BCDA, CDAB, DABC, a total of 4 variations.
- `all_possible`: Indicates all permutations. ABCD is expanded to ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, ..., a total of 24 variations.
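A minimal sketch of how the two patterns expand, using only the standard library. The helper names (`circular_patterns`, `all_possible_patterns`, `apply_pattern`) are hypothetical and are not OpenCompass APIs. The answer remapping follows the convention described in the commented-out `default_answer_key_switch_method` above, where `('D', 'A', 'B', 'C')` means the original option A is now D.

```python
from itertools import permutations


def circular_patterns(keys):
    """Return every rotation of `keys` (the 'circular' pattern)."""
    keys = list(keys)
    return [tuple(keys[i:] + keys[:i]) for i in range(len(keys))]


def all_possible_patterns(keys):
    """Return every permutation of `keys` (the 'all_possible' pattern)."""
    return list(permutations(keys))


def apply_pattern(item, pattern, option_keys="ABCD", answer_key="answer"):
    """Move each original option to its new key and remap the answer.

    Assumption: pattern[i] is the new key of the original option_keys[i].
    """
    new_item = dict(item)
    for old_key, new_key in zip(option_keys, pattern):
        new_item[new_key] = item[old_key]
    new_item[answer_key] = pattern[option_keys.index(item[answer_key])]
    return new_item


question = {"A": "1", "B": "2", "C": "3", "D": "4", "answer": "B"}
print(["".join(p) for p in circular_patterns("ABCD")])
print(len(all_possible_patterns("ABCD")))
print(apply_pattern(question, ("D", "A", "B", "C")))
```

In this sketch the text of the correct option ("2") stays attached to the answer key after shuffling, which is the property any answer remapping has to preserve.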
Additionally, we provide a `CircularEvaluator` to replace `AccEvaluator`. This Evaluator also accepts `circular_pattern`, and it should be consistent with the above. It produces the following metrics:
- `acc_{origin|circular|all_possible}`: Treating each question with shuffled options as separate, calculating accuracy.
- `perf_{origin|circular|all_possible}`: Following Circular logic, a question is considered correct only if all its variations with shuffled options are answered correctly, calculating accuracy.
- `more_{num}_{origin|circular|all_possible}`: According to Circular logic, a question is deemed correct if the number of its variations answered correctly is greater than or equal to num, calculating accuracy.
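The three metric families can be illustrated with a small sketch. It assumes that each question comes with one boolean per shuffled variation and that scores are reported as percentages; the function name `circular_metrics` and the input layout are hypothetical and do not mirror the internals of `CircularEvaluator`.

```python
def circular_metrics(results, pattern="circular"):
    """Compute acc/perf/more_{num} scores from per-variation correctness.

    `results` maps a question id to a list of booleans, one per variation.
    Assumption: every question has the same number of variations.
    """
    per_question = list(results.values())
    n_questions = len(per_question)
    n_variations = len(per_question[0])

    metrics = {}
    # acc: every shuffled variation counts as a separate question.
    correct_variations = sum(sum(flags) for flags in per_question)
    metrics[f"acc_{pattern}"] = 100 * correct_variations / (n_questions * n_variations)
    # perf: a question counts only if all of its variations are correct.
    metrics[f"perf_{pattern}"] = 100 * sum(all(flags) for flags in per_question) / n_questions
    # more_{num}: a question counts if at least `num` variations are correct.
    for num in range(1, n_variations + 1):
        hits = sum(sum(flags) >= num for flags in per_question)
        metrics[f"more_{num}_{pattern}"] = 100 * hits / n_questions
    return metrics


example = {
    "q1": [True, True, True, True],
    "q2": [True, True, False, False],
    "q3": [False, False, False, False],
}
for name, value in circular_metrics(example).items():
    print(f"{name}: {value:.2f}")
```

With four variations, `more_4` coincides with `perf`, since both require every variation to be answered correctly.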
OpenCompass configuration file:
```python
from mmengine.config import read_base
from opencompass.datasets.circular import CircularCEvalDataset, CircularEvaluator

with read_base():
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets

for d in ceval_datasets:
    # Override the loading method
    d['type'] = CircularCEvalDataset
    # Rename to differentiate from the non-circular evaluation version
    d['abbr'] = d['abbr'] + '-circular-4'
    # Override the evaluation method
    d['eval_cfg']['evaluator'] = {'type': CircularEvaluator}

# The dataset after the above operations looks like this:
# dict(
#     type=CircularCEvalDataset,
#     path='./data/ceval/formal_ceval',  # Unchanged
#     name='computer_network',  # Unchanged
#     abbr='ceval-computer_network-circular-4',
#     reader_cfg=dict(...),  # Unchanged
#     infer_cfg=dict(...),  # Unchanged
#     eval_cfg=dict(evaluator=dict(type=CircularEvaluator), ...),
# )
```
Additionally, for better presentation of results in CircularEval, consider using the following summarizer:
```python
from mmengine.config import read_base
from opencompass.summarizers import CircularSummarizer

with read_base():
    from ...summarizers.groups.ceval import ceval_summary_groups

new_summary_groups = []
for item in ceval_summary_groups:
    new_summary_groups.append(
        {
            'name': item['name'] + '-circular-4',
            'subsets': [i + '-circular-4' for i in item['subsets']],
        }
    )

summarizer = dict(
    type=CircularSummarizer,
    # Select specific metrics to view
    metric_types=['acc_origin', 'perf_circular'],
    dataset_abbrs=[
        'ceval-circular-4',
        'ceval-humanities-circular-4',
        'ceval-stem-circular-4',
        'ceval-social-science-circular-4',
        'ceval-other-circular-4',
    ],
    summary_groups=new_summary_groups,
)
```
For more complex evaluation examples, refer to this sample code: https://github.com/open-compass/opencompass/tree/main/examples/eval_circular.py
================================================
FILE: docs/en/advanced_guides/code_eval.md
================================================
# Code Evaluation Tutorial
This tutorial primarily focuses on evaluating a model's coding proficiency, using `humaneval` and `mbpp` as examples.
## pass@1
If you only need to generate a single response to evaluate the pass@1 performance, you can directly use [configs/datasets/humaneval/humaneval_gen_8e312c.py](https://github.com/open-compass/opencompass/blob/main/configs/datasets/humaneval/humaneval_gen_8e312c.py) and [configs/datasets/mbpp/deprecated_mbpp_gen_1e1056.py](https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/deprecated_mbpp_gen_1e1056.py), referring to the general [quick start tutorial](../get_started/quick_start.md).
For multilingual evaluation, please refer to the [Multilingual Code Evaluation Tutorial](./code_eval_service.md).
## pass@k
If you need to generate multiple responses for a single example to evaluate the pass@k performance, consider the following two situations. Here we take 10 responses as an example:
### Typical Situation
For most models that support the `num_return_sequences` parameter in HF's generation, we can use it directly to obtain multiple responses. Refer to the following configuration file:
```python
from mmengine.config import read_base
from opencompass.datasets import MBPPDatasetV2, MBPPPassKEvaluator
from opencompass.models import HuggingFaceCausalLM

with read_base():
    from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
    from .datasets.mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets

mbpp_datasets[0]['type'] = MBPPDatasetV2
mbpp_datasets[0]['eval_cfg']['evaluator']['type'] = MBPPPassKEvaluator
mbpp_datasets[0]['reader_cfg']['output_column'] = 'test_column'

datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        # ... (other model settings elided)
        generation_kwargs=dict(
            num_return_sequences=10,
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
        ),
        # ... (other model settings elided)
    )
]
```
For `mbpp`, new changes are needed in the dataset and evaluation, so we simultaneously modify the `type`, `eval_cfg.evaluator.type`, `reader_cfg.output_column` fields to accommodate these requirements.
We also need model responses with randomness, so setting the `generation_kwargs` parameter is necessary. `num_return_sequences` sets the number of responses generated for each example.
Note: `num_return_sequences` must be greater than or equal to k, as pass@k itself is a probability estimate.
You can specifically refer to the following configuration file [examples/eval_code_passk.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk.py)
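For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with `n` generated samples of which `c` pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula with the standard library. Whether the OpenCompass evaluators use exactly this form is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    subset of k samples contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        # This is why num_return_sequences must be at least k.
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples per problem (num_return_sequences=10), 3 of them correct.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

The dataset-level score is the mean of this value over all problems.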
### For Models That Do Not Support Multiple Responses
This applies to some HF models with poorly designed APIs or missing features. In this case, we need to repeatedly construct datasets to achieve multiple response effects. Refer to the following configuration:
```python
from mmengine.config import read_base
from opencompass.datasets import MBPPDatasetV2, MBPPPassKEvaluator
from opencompass.models import HuggingFaceCausalLM

with read_base():
    from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
    from .datasets.mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets

humaneval_datasets[0]['abbr'] = 'openai_humaneval_pass10'
humaneval_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['abbr'] = 'mbpp_pass10'
mbpp_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['type'] = MBPPDatasetV2
mbpp_datasets[0]['eval_cfg']['evaluator']['type'] = MBPPPassKEvaluator
mbpp_datasets[0]['reader_cfg']['output_column'] = 'test_column'

datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        # ... (other model settings elided)
        generation_kwargs=dict(
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
        ),
        # ... (other model settings elided)
    )
]
```
The dataset prompts themselves are unchanged; the dataset is repeated by overriding the corresponding configuration fields.
You need to modify these fields:
- `num_repeats`: the number of times the dataset is repeated
- `abbr`: It's best to change the dataset abbreviation together with the number of repetitions. Repeating the dataset changes its size, and a new abbreviation avoids mismatches with the cached values in `.cache/dataset_size.json`.
For `mbpp`, modify the `type`, `eval_cfg.evaluator.type`, `reader_cfg.output_column` fields as well.
We also need model responses with randomness, thus setting the `generation_kwargs` parameter is necessary.
You can specifically refer to the following configuration file [examples/eval_code_passk_repeat_dataset.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk_repeat_dataset.py)
================================================
FILE: docs/en/advanced_guides/code_eval_service.md
================================================
# Code Evaluation Docker Tutorial
To evaluate the code capability of LLMs, we need a separate evaluation environment so that erroneous generated code is not executed in the development environment, where it could cause damage. OpenCompass currently uses the [code-evaluator](https://github.com/open-compass/code-evaluator) project as its code evaluation service. The rest of this tutorial is organized around this service and the datasets it supports.
1. humaneval-x
This is a multi-programming language dataset [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x).
You can download the dataset from this [download link](https://github.com/THUDM/CodeGeeX2/tree/main/benchmark/humanevalx). Please download the language file (××.jsonl.gz) that needs to be evaluated and place it in the `./data/humanevalx` folder.
The currently supported languages are `python`, `cpp`, `go`, `java`, `js`.
2. DS1000
This is a Python multi-algorithm library dataset [ds1000](https://github.com/xlang-ai/DS-1000).
You can download the dataset from this [download link](https://github.com/xlang-ai/DS-1000/blob/main/ds1000_data.zip).
The currently supported algorithm libraries are `Pandas`, `Numpy`, `Tensorflow`, `Scipy`, `Sklearn`, `Pytorch`, `Matplotlib`.
## Launching the Code Evaluation Service
1. Ensure you have installed Docker, please refer to [Docker installation document](https://docs.docker.com/engine/install/).
2. Pull the source code of the code evaluation service project and build the Docker image.
Choose the Dockerfile corresponding to the dataset you need, and replace `{your-dataset}` in the command below with `humanevalx` or `ds1000`.
```shell
git clone https://github.com/open-compass/code-evaluator.git
cd code-evaluator
docker build -t code-eval-{your-dataset}:latest -f docker/{your-dataset}/Dockerfile .
```
3. Create a container with the following commands:
```shell
# Log output format
docker run -it -p 5000:5000 code-eval-{your-dataset}:latest python server.py
# Run the program in the background
# docker run -itd -p 5000:5000 code-eval-{your-dataset}:latest python server.py
# Using different ports
# docker run -itd -p 5001:5001 code-eval-{your-dataset}:latest python server.py --port 5001
```
**Note:**
- If you encounter a timeout during the evaluation of Go, please use the following command when creating the container.
```shell
docker run -it -p 5000:5000 -e GO111MODULE=on -e GOPROXY=https://goproxy.io code-eval-{your-dataset}:latest python server.py
```
4. To ensure you have access to the service, use the following commands to check the connection between the inference environment and the evaluation service. (If inference and code evaluation run on the same host, skip this step.)
```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
## Local Code Evaluation
When the model inference and code evaluation services are running on the same host or within the same local area network, inference and evaluation can be performed directly. **Note: DS1000 is currently not supported, please proceed with remote evaluation.**
### Configuration File
We provide [the configuration file](https://github.com/open-compass/opencompass/blob/main/examples/eval_codegeex2.py) for evaluating `codegeex2` on `humanevalx` as a reference.
The dataset and related post-processing configuration files can be found at this [link](https://github.com/open-compass/opencompass/tree/main/configs/datasets/humanevalx). Pay attention to the `evaluator` field in `humanevalx_eval_cfg_dict`.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator
humanevalx_reader_cfg = dict(
input_columns=['prompt'], output_column='task_id', train_split='test')
humanevalx_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template='{prompt}'),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=1024))
humanevalx_eval_cfg_dict = {
lang : dict(
evaluator=dict(
type=HumanevalXEvaluator,
language=lang,
ip_address="localhost", # replace to your code_eval_server ip_address, port
port=5000), # refer to https://github.com/open-compass/code-evaluator to launch a server
pred_role='BOT')
for lang in ['python', 'cpp', 'go', 'java', 'js'] # do not support rust now
}
humanevalx_datasets = [
dict(
type=HumanevalXDataset,
abbr=f'humanevalx-{lang}',
language=lang,
path='./data/humanevalx',
reader_cfg=humanevalx_reader_cfg,
infer_cfg=humanevalx_infer_cfg,
eval_cfg=humanevalx_eval_cfg_dict[lang])
for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
### Task Launch
Refer to the [Quick Start](../get_started/quick_start.md).
## Remote Code Evaluation
When the model inference and code evaluation services run on different machines that cannot reach each other directly, run model inference first and then collect the results for code evaluation. The configuration file and inference process from the previous section can be reused.
### Collect Inference Results (Only for Humanevalx)
OpenCompass's `tools` folder provides a script, `collect_code_preds.py`, that processes and collects the inference results. Pass it the configuration file used to launch the task and the working directory of that task.
The `-r` option works the same way as in `run.py`. For more details, see the [documentation](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html#launching-evaluation).
```shell
python tools/collect_code_preds.py [config] [-r latest]
```
The collected results are organized as follows under the `-r` folder:
```
workdir/humanevalx
├── codegeex2-6b
│ ├── humanevalx_cpp.json
│ ├── humanevalx_go.json
│ ├── humanevalx_java.json
│ ├── humanevalx_js.json
│ └── humanevalx_python.json
├── CodeLlama-13b
│ ├── ...
├── CodeLlama-13b-Instruct
│ ├── ...
├── CodeLlama-13b-Python
│ ├── ...
├── ...
```
For DS1000, you just need to obtain the corresponding prediction file generated by `opencompass`.
### Code Evaluation
Make sure your code evaluation service is started, and use `curl` to request:
#### The following only supports Humanevalx
```shell
curl -X POST -F 'file=@{result_absolute_path}' -F 'dataset={dataset/language}' {your_service_ip_address}:{your_service_port}/evaluate
```
For example:
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' localhost:5000/evaluate
```
Then we have:
```
"{\"pass@1\": 37.19512195121951%}"
```
Additionally, we offer an option named `with_prompt` (defaults to `True`), since some models (like `WizardCoder`) generate complete code that does not need the prompt concatenated with the prediction. You may refer to the following command for evaluation.
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' -H 'with-prompt: False' localhost:5000/evaluate
```
#### The following only supports DS1000
Make sure the code evaluation service is started, then use `curl` to submit a request:
```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' localhost:5000/evaluate
```
DS1000 supports additional debug parameters. Be aware that a large volume of logs will be generated when debugging is turned on:
- `full`: Additional print out of the original prediction for each error sample, post-processing prediction, running program, and final error.
- `half`: Additional print out of the running program and final error for each error sample.
- `error`: Additional print out of the final error for each error sample.
```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' -F 'debug=error' localhost:5000/evaluate
```
You can also modify the `num_workers` in the same way to control the degree of parallelism.
## Advanced Tutorial
Besides evaluating the supported humaneval-x dataset, users might also need to:
### Support New Dataset
Please refer to the [tutorial on supporting new datasets](./new_dataset.md).
### Modify Post-Processing
1. For local evaluation, follow the post-processing section in the tutorial on supporting new datasets to modify the post-processing method.
2. For remote evaluation, please modify the post-processing part in the tool's `collect_code_preds.py`.
3. Some parts of post-processing can also be modified in the code evaluation service; see the next section for details.
### Debugging Code Evaluation Service
When supporting new datasets or modifying post-processors, it is possible that modifications need to be made to the original code evaluation service. Please make changes based on the following steps:
1. Remove the installation of the `code-evaluator` in `Dockerfile`, mount the `code-evaluator` when starting the container instead:
```shell
docker run -it -p 5000:5000 -v /local/path/of/code-evaluator:/workspace/code-evaluator code-eval:latest bash
```
2. Install and start the code evaluation service locally. At this point, any necessary modifications can be made to the local copy of the `code-evaluator`.
```shell
cd code-evaluator && pip install -r requirements.txt
python server.py
```
================================================
FILE: docs/en/advanced_guides/contamination_eval.md
================================================
# Data Contamination Assessment
**Data Contamination** refers to the phenomenon where data intended for downstream testing tasks appear in the training data of large language models (LLMs), resulting in artificially inflated performance metrics in downstream tasks (such as summarization, natural language inference, text classification), which do not accurately reflect the model's true generalization capabilities.
Since the source of data contamination lies in the training data used by LLMs, the most direct way to detect it is to match the test data against the training data and report the extent of overlap between the two. The classic GPT-3 [paper](https://arxiv.org/pdf/2005.14165.pdf) reported on this in Table C.1.
However, today's open-source community often only publishes model parameters, not training datasets. In such cases, how to determine the presence and extent of data contamination remains unsolved. OpenCompass offers two possible solutions.
## Contamination Data Annotation Based on Self-Built Co-Distribution Data
Referencing the method mentioned in Section 5.2 of [Skywork](https://arxiv.org/pdf/2310.19341.pdf), we directly used the dataset [mock_gsm8k_test](https://huggingface.co/datasets/Skywork/mock_gsm8k_test) uploaded to HuggingFace by Skywork.
In this method, the authors used GPT-4 to synthesize data similar to the original GSM8K style, and then calculated the perplexity on the GSM8K training set (train), GSM8K test set (test), and GSM8K reference set (ref). Since the GSM8K reference set was newly generated, the authors considered it as clean, not belonging to any training set of any model. They posited:
- If the test set's perplexity is significantly lower than the reference set's, the test set might have appeared in the model's training phase;
- If the training set's perplexity is significantly lower than the test set's, the training set might have been overfitted by the model.
The following configuration file can be referenced:
```python
from mmengine.config import read_base
with read_base():
    from .datasets.gsm8k_contamination.gsm8k_contamination_ppl_ecdd22 import gsm8k_datasets  # includes training, test, and reference sets
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model  # model under review
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model
datasets = [*gsm8k_datasets]
models = [*hf_qwen_7b_model, *hf_yi_6b_model]
```
An example output is as follows:
```text
dataset version metric mode internlm-7b-hf qwen-7b-hf yi-6b-hf chatglm3-6b-base-hf qwen-14b-hf baichuan2-13b-base-hf internlm-20b-hf aquila2-34b-hf ...
--------------- --------- ----------- ------- ---------------- ------------ ---------- --------------------- ------------- ----------------------- ----------------- ---------------- ...
gsm8k-train-ppl 0b8e46 average_ppl unknown 1.5 0.78 1.37 1.16 0.5 0.76 1.41 0.78 ...
gsm8k-test-ppl 0b8e46 average_ppl unknown 1.56 1.33 1.42 1.3 1.15 1.13 1.52 1.16 ...
gsm8k-ref-ppl f729ba average_ppl unknown 1.55 1.2 1.43 1.35 1.27 1.19 1.47 1.35 ...
```
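The two rules above can be sketched as a small comparison over the three perplexities. This is an illustration only: the helper name `ppl_signals` and the `margin` of 0.1 are assumptions, since neither the text nor the table defines what counts as "significantly lower". The input values are taken from the example output above.

```python
def ppl_signals(train_ppl, test_ppl, ref_ppl, margin=0.1):
    """Apply the two perplexity heuristics with an assumed margin."""
    return {
        # Test set clearly easier than freshly generated reference data.
        'test_may_be_contaminated': (ref_ppl - test_ppl) > margin,
        # Training set clearly easier than the test set.
        'train_may_be_overfitted': (test_ppl - train_ppl) > margin,
    }


# Values from the qwen-14b-hf and internlm-7b-hf columns above.
qwen_14b = ppl_signals(train_ppl=0.5, test_ppl=1.15, ref_ppl=1.27)
internlm_7b = ppl_signals(train_ppl=1.5, test_ppl=1.56, ref_ppl=1.55)
print(qwen_14b)
print(internlm_7b)
```

With this margin, both signals fire for the first column and neither fires for the second. A different margin would change the outcome, so the result should be read as a hint rather than a verdict.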
Currently, this solution only supports the GSM8K dataset. We welcome the community to contribute more datasets.
Consider citing the following papers if you find this helpful:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Contamination Data Annotation Based on Classic Pre-trained Sets
Thanks to [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) and @liyucheng09 for providing this method.
In this method, the authors search the test datasets (such as C-Eval, ARC, HellaSwag, etc.) using the Common Crawl database and Bing search engine, then mark each test sample as clean / question contaminated / both question and answer contaminated.
During testing, OpenCompass reports the accuracy or perplexity of C-Eval on the subsets formed by these three labels. Accuracy generally increases in this order: clean, question contaminated, both question and answer contaminated. The authors believe:
- If the performance of the three is relatively close, the contamination level of the model on that test set is light; otherwise, it is heavy.
The following configuration file can be referenced [link](https://github.com/open-compass/opencompass/blob/main/examples/eval_contamination.py):
```python
from mmengine.config import read_base
with read_base():
    from .datasets.ceval.ceval_clean_ppl import ceval_datasets  # ceval dataset with contamination tags
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model  # model under review
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model
    from .summarizers.contamination import ceval_summarizer as summarizer  # output formatting
datasets = [*ceval_datasets]
models = [*hf_yi_6b_model, *hf_qwen_7b_model]
```
An example output is as follows:
```text
dataset version mode yi-6b-hf - - qwen-7b-hf - - ...
---------------------------------------------- --------- ------ ---------------- ----------------------------- --------------------------------------- ---------------- ----------------------------- --------------------------------------- ...
- - - accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated ...
...
ceval-humanities - ppl 74.42 75.00 82.14 67.44 50.00 70.54 ...
ceval-stem - ppl 53.70 57.14 85.61 47.41 52.38 67.63 ...
ceval-social-science - ppl 81.60 84.62 83.09 76.00 61.54 72.79 ...
ceval-other - ppl 72.31 73.91 75.00 58.46 39.13 61.88 ...
ceval-hard - ppl 44.35 37.50 70.00 41.13 25.00 30.00 ...
ceval - ppl 67.32 71.01 81.17 58.97 49.28 67.82 ...
```
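One rough way to read such a table is to look at the spread between the three subset accuracies. The sketch below is not part of OpenCompass or Contamination_Detector. `contamination_gap` is a hypothetical helper, and how large a gap counts as "heavy" is left to the reader, since the source gives no threshold. The inputs are the overall `ceval` row from the example output above.

```python
def contamination_gap(clean, input_contaminated, input_and_label_contaminated):
    """Spread (in accuracy points) between the best and worst subset."""
    scores = (clean, input_contaminated, input_and_label_contaminated)
    return round(max(scores) - min(scores), 2)


# Overall `ceval` row for the two models in the table.
yi_gap = contamination_gap(67.32, 71.01, 81.17)
qwen_gap = contamination_gap(58.97, 49.28, 67.82)
print('yi-6b-hf gap:', yi_gap)
print('qwen-7b-hf gap:', qwen_gap)
```

A small gap corresponds to the "relatively close" case described above; a large gap suggests that contaminated samples behave differently from clean ones.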
Currently, this solution only supports C-Eval, MMLU, HellaSwag and ARC. [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) also includes CSQA and WinoGrande, but these have not yet been implemented in OpenCompass. We welcome the community to contribute more datasets.
Consider citing the following papers if you find this helpful:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@article{Li2023AnOS,
title={An Open Source Data Contamination Report for Llama Series Models},
author={Yucheng Li},
journal={ArXiv},
year={2023},
volume={abs/2310.17589},
url={https://api.semanticscholar.org/CorpusID:264490711}
}
```
================================================
FILE: docs/en/advanced_guides/custom_dataset.md
================================================
# Dataset Quick Evaluation Tutorial
OpenCompass provides two paths for quickly evaluating the provided data, the data format protocol based on ChatMLDataset and the data format protocol based on CustomDataset.
Compared to the complete dataset integration process in [new_dataset.md](./new_dataset.md), these two evaluation paths are more convenient and efficient, being able to directly enter the evaluation process without adding new configuration files.
But if you have specific needs for custom reading/inference/evaluation, it is recommended to still follow the complete integration process to add a new dataset.
## Data Format Protocol and Fast Evaluation Based on ChatMLDataset
OpenCompass has recently launched a dataset evaluation mode based on the ChatML dialogue template. It allows users to provide a dataset `.json` file that conforms to the ChatML dialogue template and to start evaluating directly by setting a dataset config, in the same way as a model config.
### Format Requirements for Data Files
This evaluation method only supports data files in `.json` format, and each sample must comply with the following format:
The format of a text-only dataset with a simple structure:
```jsonl
{
"question":[
{
"role": "system", # Omittable
"content": Str
},
{
"role": "user",
"content": Str
}
],
"answer":[
Str
]
}
{
...
}
...
```
The format of multi-turn and multimodal datasets:
```jsonl
{
"question":[
{
"role": "system",
"content": Str,
},
{
"role": "user",
"content": Str or List
[
{
"type": Str, # "image"
"image_url": Str,
},
...
{
"type": Str, # "text"
"text": Str,
},
]
},
{
"role": "assistant",
"content": Str
},
{
"role": "user",
"content": Str or List
},
...
],
"answer":[
Str,
Str,
...
]
}
{
...
}
...
```
(As OpenCompass currently does not support multimodal evaluation, the template above is for reference only.)
When ChatMLDataset reads `.json` files, it uses `pydantic` to perform simple format validation on them.
You can use `tools/chatml_fformat_test.py` to check your provided data file.
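To make the required structure concrete, the following is a minimal sketch of the kind of check described above, written in plain Python. It is not the `pydantic` model that ChatMLDataset uses, and `check_sample` and `ALLOWED_ROLES` are hypothetical names introduced for illustration. It only covers the fields shown in the templates.

```python
ALLOWED_ROLES = {'system', 'user', 'assistant'}


def check_sample(sample):
    """Return a list of problems found in one sample (empty means OK)."""
    problems = []
    question = sample.get('question')
    if not isinstance(question, list) or not question:
        problems.append('"question" must be a non-empty list of messages')
    else:
        for i, msg in enumerate(question):
            if not isinstance(msg, dict):
                problems.append(f'message {i} must be an object')
                continue
            if msg.get('role') not in ALLOWED_ROLES:
                problems.append(f'message {i} has an unknown role')
            # Content is a string, or a list of typed parts.
            if not isinstance(msg.get('content'), (str, list)):
                problems.append(f'message {i} needs str or list content')
    answer = sample.get('answer')
    if not isinstance(answer, list) or not all(
            isinstance(a, str) for a in answer):
        problems.append('"answer" must be a list of strings')
    return problems


good = {
    'question': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'What is 1 + 1?'},
    ],
    'answer': ['2'],
}
bad = {'question': [{'role': 'bot', 'content': 'hi'}], 'answer': '2'}
print(check_sample(good))
print(check_sample(bad))
```

The provided format test tool remains the authoritative check; this sketch only illustrates which parts of a sample are required.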
After format checking, please add a config dictionary named `chatml_datasets` in your running config file to convert the data file into an OpenCompass dataset at runtime.
An example is as follows:
```python
chatml_datasets = [
dict(
abbr='YOUR_DATASET_NAME',
path='YOUR_DATASET_PATH',
evaluator=dict(
type='cascade_evaluator',
rule_evaluator=dict(
type='math_evaluator',
),
llm_evaluator=dict(
type='llm_evaluator',
prompt="YOUR_JUDGE_PROMPT",
judge_cfg=dict(), # YOUR Judge Model Config
)
),
n=1, # Repeat Number
),
]
```
The ChatML evaluation module currently provides four preset evaluators: `mcq_rule_evaluator` for multiple-choice (MCQ) evaluation, `math_evaluator` for LaTeX mathematical formula evaluation, `llm_evaluator` for answers that are open-ended or difficult to extract, and `cascade_evaluator`, which cascades a rule evaluator and an LLM evaluator.
In addition, if you have a long-term need to use datasets based on ChatML templates, you can contribute your dataset config to `opencompass/config/chatml_datasets`.
An eval example of calling these dataset configs is provided in `examples/evalchat_datasets.py`.
## Data Format Protocol and Fast Evaluation Based on CustomDataset
(This module is no longer being updated, but it can still be used when quick evaluation from the command line is needed.)
This module supports two types of tasks: multiple choice (`mcq`) and question & answer (`qa`). For `mcq`, both ppl and gen inference are supported; for `qa`, only gen inference is supported.
### Dataset Format
We support datasets in both `.jsonl` and `.csv` formats.
#### Multiple Choice (`mcq`)
For `mcq` datasets, the default fields are as follows:
- `question`: The stem of the multiple-choice question.
- `A`, `B`, `C`, ...: Single uppercase letters representing the options, with no limit on the number. By default, consecutive letters starting from `A` are parsed as options.
- `answer`: The correct answer to the multiple-choice question, which must be one of the options used above, such as `A`, `B`, `C`, etc.
Non-default fields will be read in but are not used by default. To use them, specify in the `.meta.json` file.
An example of the `.jsonl` format:
```jsonl
{"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B"}
{"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A"}
{"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B"}
{"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C"}
```
An example of the `.csv` format:
```csv
question,A,B,C,answer
127+545+588+620+556+199=,2632,2635,2645,B
735+603+102+335+605=,2376,2380,2410,B
506+346+920+451+910+142+659+850=,4766,4774,4784,C
504+811+870+445=,2615,2630,2750,B
```
#### Question & Answer (`qa`)
For `qa` datasets, the default fields are as follows:
- `question`: The text of the question.
- `answer`: The correct answer to the question. It can be omitted, which indicates that the dataset has no reference answer.
Non-default fields will be read in but are not used by default. To use them, specify in the `.meta.json` file.
An example of the `.jsonl` format:
```jsonl
{"question": "752+361+181+933+235+986=", "answer": "3448"}
{"question": "712+165+223+711=", "answer": "1811"}
{"question": "921+975+888+539=", "answer": "3323"}
{"question": "752+321+388+643+568+982+468+397=", "answer": "4519"}
```
An example of the `.csv` format:
```csv
question,answer
123+147+874+850+915+163+291+604=,3967
149+646+241+898+822+386=,3142
332+424+582+962+735+798+653+214=,4700
649+215+412+495+220+738+989+452=,4170
```
### Command Line List
Custom datasets can be directly called for evaluation through the command line.
```bash
python run.py \
--models hf_llama2_7b \
--custom-dataset-path xxx/test_mcq.csv \
--custom-dataset-data-type mcq \
--custom-dataset-infer-method ppl
```
```bash
python run.py \
--models hf_llama2_7b \
--custom-dataset-path xxx/test_qa.jsonl \
--custom-dataset-data-type qa \
--custom-dataset-infer-method gen
```
In most cases, `--custom-dataset-data-type` and `--custom-dataset-infer-method` can be omitted. OpenCompass will set them based on the following logic:
- If options like `A`, `B`, `C`, etc., can be parsed from the dataset file, it is considered an `mcq` dataset; otherwise, it is considered a `qa` dataset.
- The default `infer_method` is `gen`.
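The defaulting logic above can be sketched as follows. This is an illustration of the described behaviour, not OpenCompass's own code: `parse_options` and `guess_settings` are hypothetical names, and treating a lone `A` column as enough to count as `mcq` is an assumption.

```python
import string


def parse_options(columns):
    """Collect consecutive uppercase letters starting from 'A'."""
    options = []
    for letter in string.ascii_uppercase:
        if letter not in columns:
            break  # stop at the first gap, e.g. A, B, D yields A, B
        options.append(letter)
    return options


def guess_settings(columns):
    """Infer data_type from the column names; infer_method defaults to gen."""
    options = parse_options(columns)
    return {
        'data_type': 'mcq' if options else 'qa',
        'infer_method': 'gen',
        'options': options,
    }


mcq = guess_settings(['question', 'A', 'B', 'C', 'answer'])
qa = guess_settings(['question', 'answer'])
print(mcq)
print(qa)
```

Passing the two command line flags explicitly overrides this guess, which is the safer choice when column names are unusual.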
### Configuration File
In the original configuration file, simply add a new item to the `datasets` variable. Custom datasets can be mixed with regular datasets.
```python
datasets = [
{"path": "xxx/test_mcq.csv", "data_type": "mcq", "infer_method": "ppl"},
{"path": "xxx/test_qa.jsonl", "data_type": "qa", "infer_method": "gen"},
]
```
### Supplemental Information for Dataset `.meta.json`
OpenCompass will try to parse the input dataset file by default, so in most cases the `.meta.json` file is **not necessary**. However, if the dataset field names differ from the defaults, or if custom prompts are required, these should be specified in the `.meta.json` file.
The file is placed in the same directory as the dataset, with the filename followed by `.meta.json`. An example file structure is as follows:
```tree
.
├── test_mcq.csv
├── test_mcq.csv.meta.json
├── test_qa.jsonl
└── test_qa.jsonl.meta.json
```
Possible fields in this file include:
- `abbr` (str): Abbreviation of the dataset, serving as its ID.
- `data_type` (str): Type of dataset, options are `mcq` and `qa`.
- `infer_method` (str): Inference method, options are `ppl` and `gen`.
- `human_prompt` (str): User prompt template for generating prompts. Variables in the template are enclosed in `{}`, like `{question}`, `{opt1}`, etc. If `template` exists, this field will be ignored.
- `bot_prompt` (str): Bot prompt template for generating prompts. Variables in the template are enclosed in `{}`, like `{answer}`, etc. If `template` exists, this field will be ignored.
- `template` (str or dict): Question template for generating prompts. Variables in the template are enclosed in `{}`, like `{question}`, `{opt1}`, etc. The syntax is described in the [prompt template documentation](../prompt/prompt_template.md), under `infer_cfg['prompt_template']['template']`.
- `input_columns` (list): List of input fields for reading data.
- `output_column` (str): Output field for reading data.
- `options` (list): List of options for reading data, valid only when `data_type` is `mcq`.
For example:
```json
{
"human_prompt": "Question: 127 + 545 + 588 + 620 + 556 + 199 =\nA. 2632\nB. 2635\nC. 2645\nAnswer: Let's think step by step, 127 + 545 + 588 + 620 + 556 + 199 = 672 + 588 + 620 + 556 + 199 = 1260 + 620 + 556 + 199 = 1880 + 556 + 199 = 2436 + 199 = 2635. So the answer is B.\nQuestion: {question}\nA. {A}\nB. {B}\nC. {C}\nAnswer: ",
"bot_prompt": "{answer}"
}
```
or
```json
{
"template": "Question: {my_question}\nX. {X}\nY. {Y}\nZ. {Z}\nW. {W}\nAnswer:",
"input_columns": ["my_question", "X", "Y", "Z", "W"],
"output_column": "my_answer"
}
```
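As a rough illustration of how such a file relates to the data, the sketch below derives the `.meta.json` path from a dataset path and fills a string `template` with one row. It assumes that `{}` placeholders behave like Python's `str.format`. `meta_path`, `render`, and the sample row are hypothetical and do not reflect OpenCompass internals.

```python
def meta_path(dataset_path):
    """The meta file sits next to the dataset, with `.meta.json` appended."""
    return dataset_path + '.meta.json'


def render(meta, row):
    """Fill a string template using only the declared input columns."""
    fields = {name: row[name] for name in meta['input_columns']}
    return meta['template'].format(**fields)


meta = {
    'template': 'Question: {my_question}\nX. {X}\nY. {Y}\nZ. {Z}\nW. {W}\nAnswer:',
    'input_columns': ['my_question', 'X', 'Y', 'Z', 'W'],
    'output_column': 'my_answer',
}
# Hypothetical row with the non-default column names used above.
row = {'my_question': '1+1=', 'X': '1', 'Y': '2', 'Z': '3', 'W': '4',
       'my_answer': 'Y'}

prompt = render(meta, row)
print(meta_path('xxx/test_mcq.csv'))
print(prompt)
```

The `output_column` value is deliberately left out of the prompt; it names the field that holds the reference answer.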
================================================
FILE: docs/en/advanced_guides/evaluation_lightllm.md
================================================
# Evaluation with Lightllm
We now support evaluating large language models with [LightLLM](https://github.com/ModelTC/lightllm) as the inference backend. Developed by SenseTime, LightLLM is a Python-based LLM inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. It supports a variety of large language models and lets users deploy a model locally as a service. During evaluation, OpenCompass sends data to LightLLM through its API and processes the responses. This tutorial shows how to use OpenCompass to evaluate models with LightLLM as the inference backend.
## Setup
### Install OpenCompass
Please follow the [instructions](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) to install the OpenCompass and prepare the evaluation datasets.
### Install Lightllm
Please follow the [Lightllm homepage](https://github.com/ModelTC/lightllm) to install the Lightllm. Pay attention to aligning the versions of relevant dependencies, especially the version of the Transformers.
## Evaluation
We use the evaluation of Humaneval with the llama2-7B model as an example.
### Step-1: Deploy the model locally as a service using Lightllm.
```shell
python -m lightllm.server.api_server --model_dir /path/llama2-7B \
--host 0.0.0.0 \
--port 1030 \
--nccl_port 2066 \
--max_req_input_len 4096 \
--max_req_total_len 6144 \
--tp 1 \
--trust_remote_code \
--max_total_token_num 120000
```
**Note:** `tp` can be configured to enable tensor-parallel inference across several GPUs, which is suitable for very large models.
**Note:** The `max_total_token_num` in the above command affects throughput during testing. It can be configured according to the documentation on the [Lightllm homepage](https://github.com/ModelTC/lightllm). As long as the service does not run out of memory, it is often better to set it as high as possible.
**Note:** If you want to start multiple LightLLM services on the same machine, you need to assign a different `port` and `nccl_port` to each of them.
You can use the following Python script to quickly test whether the current service has been successfully started.
```python
import json

import requests

# The port must match the `--port` value used when starting the service.
url = 'http://localhost:1030/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'What is AI?',
    'parameters': {
        'do_sample': False,
        'ignore_eos': False,
        'max_new_tokens': 1024,
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)
```
### Step-2: Evaluate the above model using OpenCompass.
```shell
python run.py examples/eval_lightllm.py
```
The evaluation results are reported once inference and evaluation finish.
**Note:** In `eval_lightllm.py`, please make sure the configured URL matches the service address from the previous step.
================================================
FILE: docs/en/advanced_guides/evaluation_lmdeploy.md
================================================
# Evaluation with LMDeploy
We now support evaluating models accelerated by [LMDeploy](https://github.com/InternLM/lmdeploy), a toolkit for compressing, deploying, and serving LLMs with strong inference performance. This guide shows how to evaluate a model with LMDeploy in OpenCompass.
## Setup
### Install OpenCompass
Please follow the [instructions](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) to install the OpenCompass and prepare the evaluation datasets.
### Install LMDeploy
Install lmdeploy via pip (Python 3.8+):
```shell
pip install lmdeploy
```
The default prebuilt package is compiled against CUDA 12. If CUDA 11+ is required instead, you can install lmdeploy with:
```shell
export LMDEPLOY_VERSION=0.6.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
## Evaluation
When evaluating a model, it is necessary to prepare an evaluation configuration that specifies information such as the evaluation dataset, the model, and inference parameters.
Taking [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) as an example, the evaluation config is as follows:
```python
# configure the dataset
from mmengine.config import read_base
with read_base():
    # choose a list of datasets
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_a58960 import \
        gsm8k_datasets
    # and output the results in a chosen format
    from .summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
# configure lmdeploy
from opencompass.models import TurboMindModelwithChatTemplate
# configure the model
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr=f'internlm2-chat-7b-lmdeploy',
# model path, which can be the address of a model repository on the Hugging Face Hub or a local path
path='internlm/internlm2-chat-7b',
# inference backend of LMDeploy. It can be either 'turbomind' or 'pytorch'.
# If the model is not supported by 'turbomind', it will fallback to
# 'pytorch'
backend='turbomind',
# For the detailed engine config and generation config, please refer to
# https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/messages.py
engine_config=dict(tp=1),
gen_config=dict(do_sample=False),
# the max size of the context window
max_seq_len=7168,
# the max number of new tokens
max_out_len=1024,
# the max number of prompts that LMDeploy receives
# in `generate` function
batch_size=5000,
run_cfg=dict(num_gpus=1),
)
]
```
Place the aforementioned configuration in a file, such as "configs/eval_internlm2_lmdeploy.py". Then, in the home folder of OpenCompass, start evaluation by the following command:
```shell
python run.py configs/eval_internlm2_lmdeploy.py -w outputs
```
The evaluation results are reported once inference and evaluation finish.
================================================
FILE: docs/en/advanced_guides/llm_judge.md
================================================
# LLM as Judge Evaluation
## Introduction
The GenericLLMEvaluator is particularly useful for scenarios where rule-based methods (like regular expressions) cannot perfectly judge outputs, such as:
- Cases where models output answer content without option identifiers
- Factual judgment datasets that are difficult to evaluate with rules
- Open-ended responses requiring complex understanding and reasoning
- Evaluation that requires a lot of rules to be designed
OpenCompass provides the GenericLLMEvaluator component to facilitate LLM-as-judge evaluations.
## Dataset Format
The dataset for LLM judge evaluation should be in either JSON Lines (.jsonl) or CSV format. Each entry should contain at least:
- A problem or question
- A reference answer or gold standard
- (The model's prediction will be generated during evaluation)
Example JSONL format:
```json
{"problem": "What is the capital of France?", "answer": "Paris"}
```
Example CSV format:
```csv
problem,answer
"What is the capital of France?","Paris"
```
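Both files can be produced with the Python standard library. The sketch below writes the example record above in each format; the file names and the temporary output directory are placeholders chosen for illustration.

```python
import csv
import json
import os
import tempfile

records = [
    {'problem': 'What is the capital of France?', 'answer': 'Paris'},
]

out_dir = tempfile.mkdtemp()  # placeholder location
jsonl_path = os.path.join(out_dir, 'your_dataset.jsonl')
csv_path = os.path.join(out_dir, 'your_dataset.csv')

# JSON Lines: one JSON object per line.
with open(jsonl_path, 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# CSV: header row followed by one row per record.
with open(csv_path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['problem', 'answer'])
    writer.writeheader()
    writer.writerows(records)

print(jsonl_path)
print(csv_path)
```

Using `csv.DictWriter` rather than manual string joins ensures that questions containing commas or quotes are escaped correctly.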
## Configuration
### Using LLM for Evaluation via Command Line
Some datasets in OpenCompass already include LLM judge configurations.
You need to use a model service (such as OpenAI or DeepSeek's official API) or start a model service locally using tools like LMDeploy, vLLM, or SGLang.
Then, you can set the environment variables for the evaluation service and evaluate models using the following commands:
```bash
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1
```
Note that by default, OpenCompass will use these three environment variables, but if you use configuration files to configure the evaluation service, these environment variables will not take effect.
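Before launching a long evaluation, it can be useful to confirm that all three variables are actually set. The following pre-flight check is a hypothetical helper, not an OpenCompass function; it only inspects the environment and does not contact the judge service.

```python
import os

JUDGE_ENV_VARS = ('OC_JUDGE_MODEL', 'OC_JUDGE_API_KEY', 'OC_JUDGE_API_BASE')


def missing_judge_env(environ=None):
    """Return the judge-related variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in JUDGE_ENV_VARS if not environ.get(name)]


missing = missing_judge_env()
if missing:
    print('Missing judge settings:', ', '.join(missing))
else:
    print('All judge environment variables are set.')
```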
### Using LLM for Evaluation via Configuration Files
To set up an LLM judge evaluation, you'll need to configure three main components:
1. Dataset Reader Configuration
```python
reader_cfg = dict(
input_columns=['problem'], # Column name for the question
output_column='answer' # Column name for the reference answer
)
```
2. Inference Configuration
```python
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{problem}', # Template for prompting the model
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
```
3. Evaluation Configuration with LLM Judge
```python
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator, # Using LLM as evaluator
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=YOUR_JUDGE_TEMPLATE), # Template for the judge
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='path/to/your/dataset',
file_name='your_dataset.jsonl',
reader_cfg=reader_cfg,
),
judge_cfg=YOUR_JUDGE_MODEL_CONFIG, # Configuration for the judge model
dict_postprocessor=dict(type=generic_llmjudge_postprocess), # Post-processing the judge's output
),
)
```
## Using CustomDataset with GenericLLMEvaluator
Here's how to set up a complete configuration for LLM judge evaluation:
```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
# Import your judge model configuration
with read_base():
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
        models as judge_model,
    )
# Define your judge template
JUDGE_TEMPLATE = """
Please evaluate whether the following response correctly answers the question.
Question: {problem}
Reference Answer: {answer}
Model Response: {prediction}
Is the model response correct? If correct, answer "A"; if incorrect, answer "B".
""".strip()
# Dataset reader configuration
reader_cfg = dict(input_columns=['problem'], output_column='answer')
# Inference configuration for the model being evaluated
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{problem}',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration with LLM judge
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=JUDGE_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='path/to/your/dataset',
file_name='your_dataset.jsonl',
reader_cfg=reader_cfg,
),
judge_cfg=judge_model[0],
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
# Dataset configuration
datasets = [
dict(
type=CustomDataset,
abbr='my-dataset',
path='path/to/your/dataset',
file_name='your_dataset.jsonl',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
]
# Model configuration for the model being evaluated
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='model-to-evaluate',
path='path/to/your/model',
# ... other model configurations
)
]
# Output directory
work_dir = './outputs/llm_judge_eval'
```
## GenericLLMEvaluator
The GenericLLMEvaluator is designed to use an LLM as a judge for evaluating model outputs. Key features include:
1. Flexible prompt templates for instructing the judge
2. Support for various judge models (local or API-based)
3. Customizable evaluation criteria through prompt engineering
4. Post-processing of judge outputs to extract structured evaluations
**Important Note**: The current generic version of the judge template only supports outputs in the format of "A" (correct) or "B" (incorrect), and does not support other output formats (like "CORRECT" or "INCORRECT"). This is because the post-processing function `generic_llmjudge_postprocess` is specifically designed to parse this format.
The evaluator works by:
1. Taking the original problem, reference answer, and model prediction
2. Formatting them into a prompt for the judge model
3. Parsing the judge's response to determine the evaluation result (looking for "A" or "B")
4. Aggregating results across the dataset
If you would like to see the full details of evaluation results, you can add `--dump-eval-details` to the command line when you start the job.
Example evaluation output:
```python
{
'accuracy': 75.0, # Percentage of responses judged as correct
'details': [
{
'origin_prompt': """
Please evaluate whether the following response correctly answers the question.
Question: What is the capital of France?
Reference Answer: Paris
Model Response: Paris
Is the model response correct? If correct, answer "A"; if incorrect, answer "B".
""",
'gold': 'Paris',
'prediction': 'A',
},
# ... more results
]
}
```
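The parse-and-aggregate steps can be sketched in a few lines. This is a simplified illustration and not the actual `generic_llmjudge_postprocess` implementation: `parse_verdict` and `judged_accuracy` are hypothetical names, the regular expression is a naive assumption about how a verdict might be extracted, and unparseable responses are counted as incorrect here.

```python
import re


def parse_verdict(judge_response):
    """Return 'A', 'B', or None if no standalone verdict letter is found."""
    match = re.search(r'\b([AB])\b', judge_response.strip())
    return match.group(1) if match else None


def judged_accuracy(judge_responses):
    """Percentage of samples whose verdict is 'A' (correct)."""
    verdicts = [parse_verdict(r) for r in judge_responses]
    correct = sum(v == 'A' for v in verdicts)
    return 100.0 * correct / len(verdicts)


responses = ['A', 'B', 'A', 'A']
print(judged_accuracy(responses))
```

A naive pattern like this can misfire on free-form judge text, which is one reason the judge template asks for a bare "A" or "B".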
## CascadeEvaluator
OpenCompass also provides a CascadeEvaluator that combines the strengths of rule-based evaluation and LLM-based evaluation. The cascade evaluator has two modes:
1. **Cascade Mode (parallel=False)**: First evaluates all samples with a rule-based evaluator, then only sends samples that were deemed incorrect by the rule-based evaluation to an LLM judge for re-evaluation. This approach reduces reliance on LLM judgments while maintaining accuracy, thus lowering evaluation costs and time.
2. **Parallel Mode (parallel=True)**: Evaluates all samples with both the rule-based evaluator and LLM judge, then considers a sample correct if either method marks it as correct. This approach can increase the leniency of evaluation but may result in higher costs since all samples require LLM evaluation.
### Configuring CascadeEvaluator
Here's an example of how to configure the CascadeEvaluator:
```python
# Define a rule-based evaluator
rule_evaluator = dict(type=MATHVerifyEvaluator)

# Define an LLM judge evaluator
llm_judge_evaluator = dict(
    type=GenericLLMEvaluator,
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin=[
                dict(
                    role='SYSTEM',
                    fallback_role='HUMAN',
                    prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
                )
            ],
            round=[
                dict(role='HUMAN', prompt=YOUR_JUDGE_TEMPLATE),
            ],
        ),
    ),
    dataset_cfg=dict(
        type=YourDataset,
        path='path/to/your/dataset',
        reader_cfg=reader_cfg,
    ),
    judge_cfg=dict(),  # Can use environment variables to configure the judge model
)

# Configure cascade evaluator (cascade mode)
cascade_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=False,  # Cascade mode
)

# For parallel mode, set parallel=True
parallel_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=True,  # Parallel mode
)

# Use the cascade evaluator in your dataset evaluation config
eval_cfg = dict(evaluator=cascade_evaluator)
```
### Evaluation Results
The cascade evaluator outputs detailed evaluation statistics including:
- Accuracy of the rule-based evaluation
- Accuracy of the LLM evaluation (for samples that failed rule-based evaluation in cascade mode)
- Final combined accuracy
Example output:
```python
{
    'accuracy': 85.0,  # Final accuracy
    'cascade_stats': {
        'total_samples': 100,
        'rule_correct': 70,  # Number of samples correct by rule evaluation
        'rule_accuracy': 70.0,  # Accuracy of rule evaluation
        'llm_evaluated': 30,  # Number of samples evaluated by LLM (failed samples in cascade mode)
        'llm_correct': 15,  # Number of samples correct by LLM evaluation
        'llm_accuracy': 50.0,  # Accuracy of LLM evaluation
        'final_correct': 85,  # Total correct samples
        'final_accuracy': 85.0,  # Final accuracy
        'parallel_mode': False,  # Whether parallel mode was used
    },
    'details': [
        # Detailed evaluation results for each sample
    ]
}
```
The cascade evaluator is particularly useful for:
1. Scenarios that require balancing evaluation cost and accuracy
2. Cases where rule-based evaluators are available but might not be comprehensive
3. Evaluation tasks that need more nuanced judgment for edge cases
## Complete Example
For a complete working example using GenericLLMEvaluator, refer to the `eval_llm_judge.py` file in the examples directory, which demonstrates how to evaluate mathematical problem-solving.
For a complete working example using CascadeEvaluator, refer to the `eval_cascade_evaluator.py` file in the examples directory, which demonstrates how to evaluate mathematical problem-solving.
================================================
FILE: docs/en/advanced_guides/longeval.md
================================================
# Long Context Evaluation Guidance
## Introduction
Although large-scale language models (LLMs) such as GPT-4 have demonstrated significant advantages in handling natural language tasks, most current open-source models can only handle texts with a length of a few thousand tokens, which limits their ability to process long contexts such as reading books and writing text summaries. To explore the performance of models in dealing with long contexts, we use the [L-Eval](https://github.com/OpenLMLab/LEval) and [LongBench](https://github.com/THUDM/LongBench) datasets to test the model's ability to handle long contexts.
## Existing Algorithms and Models
When dealing with long context inputs, the two main challenges faced by large models are the inference time cost and catastrophic forgetting. Recently, a large amount of research has been devoted to extending the model length, focusing on three improvement directions:
- Attention mechanisms. The ultimate goal of these methods is to reduce the computation cost of query-key pairs, but they may affect the performance of downstream tasks.
- Input methods. Some studies divide long context inputs into chunks or retrieve pre-existing text segments to enhance the model's ability to handle long contexts, but these methods are only effective for some tasks and are difficult to adapt to multiple downstream tasks.
- Position encoding. This research includes RoPE, ALiBi, Position Interpolation etc., which have shown good results in length extrapolation. These methods have been used to train long context models such as ChatGLM2-6B-32k and LongChat-32k.
First, we introduce some popular position encoding algorithms.
### RoPE
RoPE is a type of positional embedding that injects position information into the Transformer. It encodes the absolute position with a rotation matrix while incorporating explicit relative position dependency into the self-attention formulation. A graphic illustration of RoPE is shown below.
RoPE comes with valuable properties such as the flexibility to be extended to any sequence length, inter-token dependency that decays with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
RoPE is adopted in many LLMs including LLaMA, LLaMA 2 and Vicuna-7b-v1.5-16k.
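The relative-position property can be checked numerically with a small pure-Python sketch. It assumes the commonly used frequency schedule with base 10000 and rotates consecutive pairs of dimensions; the function names `rope_rotate` and `dot` are illustrative, not taken from any model's code.

```python
import math
from typing import List


def rope_rotate(vec: List[float], position: float, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)  # assumed to be even
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # one frequency per pair
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same relative offset (2), very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 103))
print(abs(score_near - score_far) < 1e-9)  # True
```

Although each vector is rotated by its absolute position, the query-key score depends only on the offset between the two positions.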
### ALiBi
Though RoPE and other alternatives to the original sinusoidal position method (like T5 bias) have improved extrapolation, they are considerably slower than the sinusoidal approach and use extra memory and parameters. Therefore, Attention with Linear Biases (ALiBi) was introduced to facilitate efficient extrapolation.
For an input subsequence of length L, the attention sublayer computes the attention scores for the i-th query
```{math}
q_{i} \in \mathbb{R}^{1 \times d}, \quad (1 \leq i \leq L)
```
in each head, given the first i keys
```{math}
K \in \mathbb{R}^{i \times d}
```
where d is the head dimension. Without any positional bias, the attention weights are
```{math}
\mathrm{softmax}(q_{i}K^{T})
```
ALiBi negatively biases attention scores with a linearly decreasing penalty proportional to the distance between the relevant key and query. The only modification it applies is after the query-key dot product, where it adds a static, non-learned bias.
```{math}
\mathrm{softmax}(q_{i}K^{T}+m\cdot[-(i-1),...,-2,-1,0])
```
where scalar m is a head-specific slope fixed before training.
ALiBi eliminates position embeddings and it is as fast as the sinusoidal approach. It is used in LLMs including mpt-7b-storywriter, which is prepared to handle extremely long inputs.
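The biased softmax above can be written out for a single query and a single head as follows. This is a minimal pure-Python sketch; the function name `alibi_attention_weights` and the example slope value are illustrative assumptions.

```python
import math
from typing import List


def alibi_attention_weights(q_i: List[float], keys: List[List[float]],
                            slope: float) -> List[float]:
    """Compute softmax(q_i K^T + m * [-(i-1), ..., -1, 0]) for the i-th query."""
    i = len(keys)  # the query attends to the first i keys
    scores = [sum(a * b for a, b in zip(q_i, key)) for key in keys]
    # Key j (0-indexed) lies (i - 1 - j) positions before the query.
    biased = [s - slope * (i - 1 - j) for j, s in enumerate(scores)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 0.0]
keys = [[1.0, 0.0]] * 4  # identical keys, so only the bias differs

weights = alibi_attention_weights(q, keys, slope=0.5)
uniform = alibi_attention_weights(q, keys, slope=0.0)
print(weights == sorted(weights))  # True: closer keys receive more weight
```

With identical keys and a zero slope the weights are uniform; a positive slope shifts attention toward the most recent keys.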
### Position Interpolation (PI)
Many existing pre-trained LLMs, including LLaMA, use positional encodings that have weak extrapolation properties (e.g. RoPE). Position Interpolation was proposed to enable very long context windows while preserving model quality relatively well for tasks within the original context window size.
The key idea of Position Interpolation is to directly down-scale the position indices so that the maximum position index matches the previous context window limit in the pre-training stage. In other words, to accommodate more input tokens, the algorithm interpolates position encodings at neighboring integer positions, utilizing the fact that position encodings can be applied on non-integer positions, as opposed to extrapolating outside the trained positions, which may lead to catastrophic values. The algorithm requires only a very short period of fine-tuning for the model to fully adapt to greatly extended context windows.
An illustration of Position Interpolation method is shown below. Lower left illustrates Position Interpolation where it downscales the position indices (blue and green dots) themselves from \[0, 4096\] to \[0, 2048\] to force them to reside in the pretrained range.
Position Interpolation empowers ChatGLM2-6B-32k, a model based on ChatGLM2-6B, to deal with a 32k context window size.
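The down-scaling of position indices can be illustrated in a few lines of Python. The sketch below uses the 4096-to-2048 example from the text; `interpolate_positions` is a hypothetical helper, and a real model would feed the resulting fractional positions into its position encoding.

```python
from typing import List


def interpolate_positions(seq_len: int, trained_len: int) -> List[float]:
    """Scale positions 0..seq_len-1 so that they stay inside the trained range."""
    # Never stretch short inputs; only compress inputs longer than the trained length.
    scale = min(1.0, trained_len / seq_len)
    return [p * scale for p in range(seq_len)]


positions = interpolate_positions(seq_len=4096, trained_len=2048)
print(positions[:4])   # [0.0, 0.5, 1.0, 1.5]
print(max(positions))  # 2047.5
```

Every position now falls inside the pretrained range, at the price of neighboring tokens being only half a position apart.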
Next, we introduce some long context language models we evaluate.
### XGen-7B-8k
XGen-7B-8k is trained with standard dense attention on up to 8k sequence length for up to 1.5T tokens. To mitigate slow training, XGen-7B-8k introduces training in stages with increasing sequence length. First, 800B tokens with sequence length of 2k tokens are observed, then 400B tokens with 4k, finally, 300B tokens with 8k length.
### Vicuna-7b-v1.5-16k
Vicuna-7b-v1.5-16k is fine-tuned from LLaMA 2 with supervised instruction fine-tuning and linear RoPE scaling. The training data is around 125K conversations collected from ShareGPT, a website where users can share their ChatGPT conversations. These conversations are packed into sequences that contain 16k tokens each.
### LongChat-7b-v1.5-32k
LongChat-7b-v1.5-32k is fine-tuned from LLaMA 2 models, which were originally pretrained with a 4k context length. The training recipe can be conceptually described in two steps. The first step is condensing RoPE. Since the LLaMA model has not observed scenarios where position_ids > 4096 during the pre-training phase, LongChat condenses position_ids > 4096 to be within 0 to 4096. The second step is fine-tuning the LongChat model on curated conversation data. In this step, the data is cleaned using the FastChat data pipeline and truncated to the maximum length of the model.
### ChatGLM2-6B-32k
The ChatGLM2-6B-32k further strengthens the ability to understand long texts based on the ChatGLM2-6B. Based on the method of Positional Interpolation, and trained with a 32K context length during the dialogue alignment, ChatGLM2-6B-32k can better handle up to 32K context length.
## [L-Eval](https://github.com/OpenLMLab/LEval)
L-Eval is a long context dataset built by OpenLMLab, consisting of 18 subtasks, including texts from various fields such as law, economy, and technology. The dataset consists of a total of 411 documents, over 2000 test cases, with an average document length of 7217 words. The subtasks in this dataset are divided into close-ended and open-ended categories, with 5 close-ended tasks evaluated using the exact match criterion and 13 open-ended tasks evaluated using Rouge scores.
## [LongBench](https://github.com/THUDM/LongBench)
LongBench is a long context dataset built by THUDM, consisting of 21 subtasks with a total of 4750 test cases. This dataset is the first long context dataset that includes both English and Chinese texts, with an average English text length of 6711 words and an average Chinese text length of 13386 characters. The 21 subtasks are divided into 6 types, providing a more comprehensive evaluation of the model's capabilities in various aspects.
## Evaluation Method
Different models accept different maximum input lengths. To compare these models fairly, when an input exceeds a model's maximum input length, we trim the middle part of the input text, which avoids losing the prompt words.
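A minimal sketch of middle truncation on a token list is shown below. The even split between head and tail is an assumption made for illustration; the text does not state the exact proportions used, and `truncate_middle` is a hypothetical helper name.

```python
from typing import List


def truncate_middle(tokens: List[int], max_len: int) -> List[int]:
    """Keep the first and last tokens and drop the middle when the input is too long."""
    if len(tokens) <= max_len:
        return tokens
    head = max_len // 2       # tokens kept from the start
    tail = max_len - head     # tokens kept from the end
    return tokens[:head] + tokens[len(tokens) - tail:]


tokens = list(range(10))
print(truncate_middle(tokens, 4))   # [0, 1, 8, 9]
print(truncate_middle(tokens, 20))  # unchanged, already short enough
```

Keeping both ends preserves any instructions at the start of the prompt and the question at its end.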
## Long Context Ability Ranking
In the LongBench and L-Eval ability rankings, we select the average ranking **(The lower the better)** of each model in the subtask as the standard. It can be seen that GPT-4 and GPT-3.5-turbo-16k still occupy a leading position in long context tasks, while models like ChatGLM2-6B-32k also show significant improvement in long context ability after position interpolation based on ChatGLM2-6B.
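The average-ranking criterion can be reproduced from per-subtask scores. The sketch below uses three L-Eval rows and three models from the score table that follows, so its ranks are relative to this subset only; `average_rank` is a hypothetical helper, and ties are not given special treatment.

```python
from typing import Dict


def average_rank(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Map scores[subtask][model] to each model's average rank (1 is best)."""
    totals: Dict[str, float] = {}
    for per_model in scores.values():
        # Higher score means better rank within a subtask.
        ordered = sorted(per_model, key=per_model.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
    return {model: total / len(scores) for model, total in totals.items()}


scores = {
    'coursera': {'GPT-4': 61.05, 'GPT-3.5-turbo-16k': 50, 'chatglm2-6b-32k': 45.35},
    'tpo': {'GPT-4': 72.93, 'GPT-3.5-turbo-16k': 74.72, 'chatglm2-6b-32k': 56.51},
    'legal_contract_qa': {'GPT-4': 31.23, 'GPT-3.5-turbo-16k': 27.97, 'chatglm2-6b-32k': 34.21},
}

ranks = average_rank(scores)
print(sorted(ranks, key=ranks.get))
# ['GPT-4', 'GPT-3.5-turbo-16k', 'chatglm2-6b-32k']
```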
The original scores are shown below.
| L-Eval | GPT-4 | GPT-3.5-turbo-16k | chatglm2-6b-32k | vicuna-7b-v1.5-16k | xgen-7b-8k | internlm-chat-7b-8k | longchat-7b-v1.5-32k | chatglm2-6b |
| ----------------- | ----- | ----------------- | --------------- | ------------------ | ---------- | ------------------- | -------------------- | ----------- |
| coursera | 61.05 | 50 | 45.35 | 26.74 | 33.72 | 40.12 | 27.91 | 38.95 |
| gsm100 | 92 | 78 | 27 | 11 | 8 | 19 | 5 | 8 |
| quality | 81.19 | 62.87 | 44.55 | 11.39 | 33.66 | 45.54 | 29.7 | 41.09 |
| tpo | 72.93 | 74.72 | 56.51 | 17.47 | 44.61 | 60.59 | 17.1 | 56.51 |
| topic_retrieval | 100 | 79.33 | 44.67 | 24.67 | 1.33 | 0 | 25.33 | 1.33 |
| | | | | | | | | |
| financialqa | 53.49 | 50.32 | 35.41 | 44.59 | 39.28 | 25.09 | 34.07 | 17.82 |
| gov_report | 50.84 | 50.48 | 42.97 | 48.17 | 38.52 | 31.29 | 36.52 | 41.88 |
| legal_contract_qa | 31.23 | 27.97 | 34.21 | 24.25 | 21.36 | 19.28 | 13.32 | 17.59 |
| meeting_summ | 31.44 | 33.54 | 29.13 | 28.52 | 27.96 | 17.56 | 22.32 | 15.98 |
| multidocqa | 37.81 | 35.84 | 28.6 | 26.88 | 24.41 | 22.43 | 21.85 | 19.66 |
| narrativeqa | 25.87 | 25.73 | 18.24 | 20.58 | 16.87 | 13.81 | 16.87 | 1.16 |
| nq | 67.36 | 66.91 | 41.06 | 36.44 | 29.43 | 16.42 | 35.02 | 0.92 |
| news_summ | 34.52 | 40.41 | 32.72 | 33.98 | 26.87 | 22.48 | 30.33 | 29.51 |
| paper_assistant | 42.26 | 41.76 | 34.59 | 35.83 | 25.39 | 28.25 | 30.42 | 30.43 |
| patent_summ | 48.61 | 50.62 | 46.04 | 48.87 | 46.53 | 30.3 | 41.6 | 41.25 |
| review_summ | 31.98 | 33.37 | 21.88 | 29.21 | 26.85 | 16.61 | 20.02 | 19.68 |
| scientificqa | 49.76 | 48.32 | 31.27 | 31 | 27.43 | 33.01 | 20.98 | 13.61 |
| tvshow_summ | 34.84 | 31.36 | 23.97 | 27.88 | 26.6 | 14.55 | 25.09 | 19.45 |
| LongBench | GPT-4 | GPT-3.5-turbo-16k | chatglm2-6b-32k | longchat-7b-v1.5-32k | vicuna-7b-v1.5-16k | internlm-chat-7b-8k | chatglm2-6b | xgen-7b-8k |
| ------------------- | ----- | ----------------- | --------------- | -------------------- | ------------------ | ------------------- | ----------- | ---------- |
| NarrativeQA | 31.2 | 25.79 | 19.27 | 19.19 | 23.65 | 12.24 | 13.09 | 18.85 |
| Qasper | 42.77 | 43.4 | 33.93 | 30.36 | 31.45 | 24.81 | 22.52 | 20.18 |
| MultiFieldQA-en | 55.1 | 54.35 | 45.58 | 44.6 | 43.38 | 25.41 | 38.09 | 37 |
| MultiFieldQA-zh | 64.4 | 61.92 | 52.94 | 32.35 | 44.65 | 36.13 | 37.67 | 14.7 |
| | | | | | | | | |
| HotpotQA | 59.85 | 52.49 | 46.41 | 34.43 | 34.17 | 27.42 | 27.35 | 28.78 |
| 2WikiMQA | 67.52 | 41.7 | 33.63 | 23.06 | 20.45 | 26.24 | 22.83 | 20.13 |
| Musique | 37.53 | 27.5 | 21.57 | 12.42 | 13.92 | 9.75 | 7.26 | 11.34 |
| DuReader (zh) | 38.65 | 29.37 | 38.53 | 20.25 | 20.42 | 11.11 | 17.18 | 8.57 |
| | | | | | | | | |
| GovReport | 32.09 | 29.92 | 32.47 | 29.83 | 29.27 | 18.38 | 22.86 | 23.37 |
| QMSum | 24.37 | 23.67 | 23.19 | 22.71 | 23.37 | 18.45 | 21.23 | 21.12 |
| Multi_news | 28.52 | 27.05 | 25.12 | 26.1 | 27.83 | 24.52 | 24.7 | 23.69 |
| VCSUM (zh) | 15.54 | 16.88 | 15.95 | 13.46 | 15.76 | 12.91 | 14.07 | 0.98 |
| | | | | | | | | |
| TREC | 78.5 | 73.5 | 30.96 | 29.23 | 32.06 | 39 | 24.46 | 29.31 |
| TriviaQA | 92.19 | 92.75 | 80.64 | 64.19 | 46.53 | 79.55 | 64.19 | 69.58 |
| SAMSum | 46.32 | 43.16 | 29.49 | 25.23 | 25.23 | 43.05 | 20.22 | 16.05 |
| LSHT (zh) | 41.5 | 34.5 | 22.75 | 20 | 24.75 | 20.5 | 16 | 18.67 |
| | | | | | | | | |
| Passage Count | 8.5 | 3 | 3 | 1 | 3 | 1.76 | 3 | 1 |
| PassageRetrieval-en | 75 | 73 | 57.5 | 20.5 | 16.5 | 7 | 5.5 | 12 |
| PassageRetrieval-zh | 96 | 82.5 | 58 | 15 | 21 | 2.29 | 5 | 3.75 |
| | | | | | | | | |
| LCC | 59.25 | 53.49 | 53.3 | 51.46 | 49.3 | 49.32 | 46.59 | 44.1 |
| RepoBench-P | 55.42 | 55.95 | 46.66 | 52.18 | 41.49 | 35.86 | 41.97 | 41.83 |
================================================
FILE: docs/en/advanced_guides/math_verify.md
================================================
# General Math Evaluation Guidance
## Introduction
Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.
## Dataset Format
The math evaluation dataset should be in either JSON Lines (.jsonl) or CSV format. Each problem should contain at least:
- A problem statement
- A solution/answer (typically in LaTeX format with the final answer in \\boxed{})
Example JSONL format:
```json
{"problem": "Find the value of x if 2x + 3 = 7", "solution": "Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"}
```
Example CSV format:
```csv
problem,solution
"Find the value of x if 2x + 3 = 7","Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"
```
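When generating such a file programmatically, letting a JSON library handle the escaping of backslashes and newlines avoids malformed records. The following sketch uses only the Python standard library; the record content and the helper name `to_jsonl` are illustrative.

```python
import json
from typing import Dict, List

records = [
    {
        'problem': 'Find the value of x if 2x + 3 = 7',
        'solution': '2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}',
    },
]


def to_jsonl(rows: List[Dict[str, str]]) -> str:
    """Serialize one JSON object per line; json.dumps escapes newlines and backslashes."""
    return '\n'.join(json.dumps(row, ensure_ascii=False) for row in rows)


jsonl_text = to_jsonl(records)

# Round trip: every line parses back to the original record.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
print(parsed == records)  # True
```

Writing `jsonl_text` to a `.jsonl` file then gives a dataset in the format shown above.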
## Configuration
To evaluate mathematical reasoning, you'll need to set up three main components:
1. Dataset Reader Configuration
```python
math_reader_cfg = dict(
    input_columns=['problem'],  # Column name for the question
    output_column='solution',  # Column name for the answer
)
```
2. Inference Configuration
```python
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)
```
3. Evaluation Configuration
```python
math_eval_cfg = dict(
    evaluator=dict(type=MATHVerifyEvaluator),
)
```
## Using CustomDataset
Here's how to set up a complete configuration for math evaluation:
```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset

math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',  # Dataset abbreviation
        path='path/to/your/dataset',  # Path to your dataset file
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]
```
## MATHVerifyEvaluator
The MATHVerifyEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
The MATHVerifyEvaluator:
1. Extracts answers from both predictions and references using LaTeX extraction
2. Handles various LaTeX formats and environments
3. Verifies mathematical equivalence between predicted and reference answers
4. Provides detailed evaluation results including:
   - Accuracy score
   - Detailed comparison between predictions and references
   - Parse results of both predicted and reference answers
The evaluator supports:
- Basic arithmetic operations
- Fractions and decimals
- Algebraic expressions
- Trigonometric functions
- Roots and exponents
- Mathematical symbols and operators
Example evaluation output:
```python
{
    'accuracy': 85.0,  # Percentage of correct answers
    'details': [
        {
            'predictions': 'x = 2',  # Parsed prediction
            'references': 'x = 2',  # Parsed reference
            'correct': True,  # Whether they match
        },
        # ... more results
    ]
}
```
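To give a feel for the extraction step, here is a simplified standard-library sketch that pulls the content of the last `\boxed{...}` out of a response while respecting nested braces. It is not the MATHVerifyEvaluator implementation, which relies on the math_verify library and also checks mathematical equivalence; `extract_boxed` is a hypothetical helper.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in `text`, or None if absent."""
    marker = '\\boxed{'
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 0  # nesting level of braces opened inside the box
    for idx in range(begin, len(text)):
        ch = text[idx]
        if ch == '{':
            depth += 1
        elif ch == '}':
            if depth == 0:
                return text[begin:idx]
            depth -= 1
    return None  # unbalanced braces


print(extract_boxed('x = 2\nTherefore, \\boxed{2}'))       # 2
print(extract_boxed('The result is \\boxed{\\frac{1}{2}}'))  # \frac{1}{2}
```

A plain string comparison of the extracted answers would treat `\frac{1}{2}` and `0.5` as different, which is why an equivalence-aware verifier is used in practice.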
## Complete Example
Here's a complete example of how to set up math evaluation:
```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHVerifyEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer

# Dataset reader configuration
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

# Inference configuration
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

# Evaluation configuration
math_eval_cfg = dict(
    evaluator=dict(type=MATHVerifyEvaluator),
)

# Dataset configuration
math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',
        path='path/to/your/dataset.jsonl',  # or .csv
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]

# Model configuration
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='your-model-name',
        path='your/model/path',
        # ... other model configurations
    )
]

# Output directory
work_dir = './outputs/math_eval'
```
================================================
FILE: docs/en/advanced_guides/needleinahaystack_eval.md
================================================
# Needle In A Haystack Evaluation
## Introduction to the Needle In A Haystack Test
The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method where key information is randomly inserted into long texts to form the prompt for large language models (LLMs). This test aims to assess whether LLMs can extract critical information from long texts, thereby evaluating their fundamental ability to comprehend and process long-context documents.
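The construction of such a prompt can be sketched as follows. This is a simplified illustration rather than the NeedleBench implementation: `insert_needle` is a hypothetical helper that works on whole sentences and takes an explicit depth, whereas a randomized test would draw the depth at random.

```python
from typing import List


def insert_needle(haystack_sentences: List[str], needle: str,
                  depth_percent: float) -> str:
    """Insert `needle` at the given depth (0 = start, 100 = end) of the haystack."""
    if not 0 <= depth_percent <= 100:
        raise ValueError('depth_percent must be between 0 and 100')
    index = round(len(haystack_sentences) * depth_percent / 100)
    sentences = haystack_sentences[:index] + [needle] + haystack_sentences[index:]
    return ' '.join(sentences)


haystack = ['S1.', 'S2.', 'S3.', 'S4.']
needle = 'The secret number is 42.'

print(insert_needle(haystack, needle, depth_percent=50))
# S1. S2. The secret number is 42. S3. S4.
```

The model is then asked a question whose answer is contained only in the needle, and the response is checked against it.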
## Task Overview
Within the `OpenCompass` framework, under `NeedleBench`, we designed a series of progressively challenging evaluation tasks to comprehensively assess LLMs' long-text information extraction and reasoning capabilities. For a complete description, please refer to our [technical report](https://arxiv.org/abs/2407.11963).
- **Single-Needle Retrieval Task (S-RT)**: Evaluates the LLM's ability to retrieve a single piece of key information from a long text, testing precise recall of specific details within extensive narratives. This corresponds to the **original Needle In A Haystack test** setup.
- **Multi-Needle Retrieval Task (M-RT)**: Explores the LLM's ability to retrieve multiple relevant pieces of information from long texts, simulating complex queries over comprehensive documents.
- **Multi-Needle Reasoning Task (M-RS)**: Assesses LLMs' abilities to integrate multiple key pieces of information extracted from long texts for reasoning, requiring a comprehensive understanding of content.
- **Ancestral Trace Challenge (ATC)**: Tests LLMs' capabilities in handling multi-layer logical challenges within realistic long-text contexts through "kinship trace needles." In the ATC task, no irrelevant (haystack) texts are added; every piece of text is critical, and models must reason through all details for accurate answers.
> **Note:** NeedleBench (v2) includes several optimizations and adjustments in dataset construction and task details. For a detailed comparison between the old and new versions, as well as a summary of updates, please refer to [opencompass/configs/datasets/needlebench_v2/readme.md](https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/needlebench_v2/readme.md).
## Evaluation Steps
> Note: In the latest `OpenCompass` codebase, the NeedleBench dataset is automatically loaded from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench), with no need for manual download or configuration.
### `OpenCompass` Environment Setup
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```
### Dataset Configuration
We have pre-configured various long-context settings (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench_v2`, and you can flexibly define your parameters by adjusting the configuration files.
### Evaluation Example
#### Evaluating with `VLLM` Deployed `Qwen2-5-7B` Model
To evaluate the `Qwen2-5-7B` model deployed with `VLLM` on all tasks under NeedleBench-128K, use the following command. This leverages pre-defined model and dataset configuration files without needing additional configuration:
##### Local Evaluation
If evaluating locally, the command will use all available GPUs. You can control GPU visibility using `CUDA_VISIBLE_DEVICES`:
```bash
# Local evaluation
python run.py --datasets needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer
```
##### Evaluation on Slurm Cluster
For Slurm environments, you can add options like `--slurm -p partition_name -q reserved --max-num-workers 16`:
```bash
# Slurm evaluation
python run.py --datasets needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating Specific Subsets
If you only want to test the original Needle In A Haystack task (e.g., single-needle 128k), adjust the dataset parameter:
```bash
python run.py --datasets needlebench_v2_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
To evaluate only Chinese versions, specify the subset dataset after `/`:
```bash
python run.py --datasets needlebench_v2_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
Ensure `VLLM` is installed beforehand:
```bash
# Install vLLM with CUDA 12.4.
# For other CUDA versions, please refer to the [official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html)
pip install vllm
```
#### Evaluating Other `Huggingface` Models
For other models, it is recommended to write your own config file (such as `examples/eval_needlebench_v2.py`) to adjust `max_seq_len` and `max_out_len`, so that the model can process the full context.
You can then run evaluation with:
```bash
python run.py examples/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```
No need to manually specify `--datasets`, `--models`, or `--summarizer` again.
### Visualization
NeedleBench's latest version has built-in visualization integrated into the summarizer. You can find corresponding visualizations in the `plots` directory under the output folder without needing additional scripts.
### Citation
If you use NeedleBench, please cite us:
```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
year={2025},
eprint={2407.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.11963},
}
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished={\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{LLMTest_NeedleInAHaystack,
title={LLMTest Needle In A Haystack - Pressure Testing LLMs},
author={gkamradt},
year={2023},
howpublished={\url{https://github.com/gkamradt/LLMTest_NeedleInAHaystack}}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei L\"u and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
================================================
FILE: docs/en/advanced_guides/new_dataset.md
================================================
# Add a dataset
Although OpenCompass already includes most commonly used datasets, you can add support for a new dataset by following the steps below:
1. Add a dataset script `mydataset.py` to the `opencompass/datasets` folder. This script should include:
- The dataset and its loading method. Define a `MyDataset` class that implements the data loading method `load` as a static method. This method should return data of type `datasets.Dataset`. We use the Hugging Face dataset as the unified interface for datasets to avoid introducing additional logic. Here's an example:
```python
import datasets

from .base import BaseDataset


class MyDataset(BaseDataset):

    @staticmethod
    def load(**kwargs) -> datasets.Dataset:
        pass
```
- (Optional) If the existing evaluators in OpenCompass do not meet your needs, you need to define a `MyDatasetEvaluator` class that implements the scoring method `score`. This method should take `predictions` and `references` as input and return the desired dictionary. Since a dataset may have multiple metrics, the method should return a dictionary containing the metrics and their corresponding scores. Here's an example:
```python
from typing import List

from opencompass.openicl.icl_evaluator import BaseEvaluator


class MyDatasetEvaluator(BaseEvaluator):

    def score(self, predictions: List, references: List) -> dict:
        pass
```
- (Optional) If the existing postprocessors in OpenCompass do not meet your needs, you need to define the `mydataset_postprocess` method. This method takes an input string and returns the corresponding postprocessed result string. Here's an example:
```python
def mydataset_postprocess(text: str) -> str:
    pass
```
2. After defining the dataset loading, data postprocessing, and evaluator methods, you need to add the following configurations to the configuration file:
```python
from opencompass.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess

mydataset_eval_cfg = dict(
    evaluator=dict(type=MyDatasetEvaluator),
    pred_postprocessor=dict(type=mydataset_postprocess))

mydataset_datasets = [
    dict(
        type=MyDataset,
        ...,
        reader_cfg=...,
        infer_cfg=...,
        eval_cfg=mydataset_eval_cfg)
]
```
- To facilitate the access of your datasets to other users, you need to specify the channels for downloading the datasets in the configuration file. Specifically, you need to first fill in a dataset name given by yourself in the `path` field in the `mydataset_datasets` configuration, and this name will be mapped to the actual download path in the `opencompass/utils/datasets_info.py` file. Here's an example:
```python
mmlu_datasets = [
    dict(
        ...,
        path='opencompass/mmlu',
        ...,
    )
]
```
- Next, you need to create a dictionary key in `opencompass/utils/datasets_info.py` with the same name as the one you provided above. If you have already hosted the dataset on HuggingFace or Modelscope, please add a dictionary key to the `DATASETS_MAPPING` dictionary and fill in the HuggingFace or Modelscope dataset address in the `hf_id` or `ms_id` key, respectively. You can also specify a default local address. Here's an example:
```python
"opencompass/mmlu": {
    "ms_id": "opencompass/mmlu",
    "hf_id": "opencompass/mmlu",
    "local": "./data/mmlu/",
}
```
- If you wish for the provided dataset to be directly accessible from the OpenCompass OSS repository when used by others, you need to submit the dataset files in the Pull Request phase. We will then transfer the dataset to the OSS on your behalf and create a new dictionary key in the `DATASET_URL`.
- To keep the data source selectable, extend the `load` method in the dataset script `mydataset.py` so that it switches among download sources based on the environment variable `DATASET_SOURCE`. If `DATASET_SOURCE` is not set, the dataset is downloaded from the OSS repository by default. Here's an example from `opencompass/datasets/cmmlu.py`:
```python
from os import environ


def load(path: str, name: str, **kwargs):
    ...
    # Choose the download source from the DATASET_SOURCE environment variable
    if environ.get('DATASET_SOURCE') == 'ModelScope':
        ...
    else:
        ...
    return dataset
```
3. After completing the dataset script and config file, you need to register the information of your new dataset in the file `dataset-index.yml` at the main directory, so that it can be added to the dataset statistics list on the OpenCompass website.
- The keys that need to be filled in include `name`: the name of your dataset, `category`: the category of your dataset, `paper`: the URL of the paper or project, and `configpath`: the path to the dataset config file. Here's an example:
```yaml
- mydataset:
    name: MyDataset
    category: Understanding
    paper: https://arxiv.org/pdf/xxxxxxx
    configpath: opencompass/configs/datasets/MyDataset
```
```
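The source-switching logic from step 2 can be sketched without OpenCompass as follows. `EXAMPLE_MAPPING` is a hypothetical stand-in for the mapping in `opencompass/utils/datasets_info.py`, `resolve_dataset_path` is an illustrative helper, the `'HF'` value is an assumption, and the fallback to the local path stands in for the default OSS download described above.

```python
import os
from typing import Dict, Optional

# Hypothetical stand-in for the mapping in opencompass/utils/datasets_info.py.
EXAMPLE_MAPPING: Dict[str, Dict[str, str]] = {
    'opencompass/mmlu': {
        'ms_id': 'opencompass/mmlu',
        'hf_id': 'opencompass/mmlu',
        'local': './data/mmlu/',
    },
}


def resolve_dataset_path(name: str, source: Optional[str] = None) -> str:
    """Pick the dataset location for `name` based on the requested source."""
    if source is None:
        source = os.environ.get('DATASET_SOURCE')
    entry = EXAMPLE_MAPPING[name]
    if source == 'ModelScope':
        return entry['ms_id']
    if source == 'HF':  # assumed value for the HuggingFace source
        return entry['hf_id']
    return entry['local']  # default when no known source is requested


print(resolve_dataset_path('opencompass/mmlu', source='ModelScope'))  # opencompass/mmlu
print(resolve_dataset_path('opencompass/mmlu', source='local'))       # ./data/mmlu/
```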
For detailed dataset configuration files and other required configuration files, refer to the [Configuration Files](../user_guides/config.md) tutorial. For guides on launching tasks, refer to the [Quick Start](../get_started/quick_start.md) tutorial.
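As an illustration of the optional pieces in step 1, the sketch below fills in a postprocessor and a scoring function. Both are written as standalone functions so that they run without OpenCompass installed; in a real dataset script the body of `exact_match_score` (a hypothetical name) would live in `MyDatasetEvaluator.score`, and the exact-match metric is only one possible choice.

```python
from typing import Dict, List


def mydataset_postprocess(text: str) -> str:
    """Keep only the first line of the model output, without surrounding whitespace."""
    stripped = text.strip()
    return stripped.splitlines()[0].strip() if stripped else ''


def exact_match_score(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Return a metric dictionary, as expected from an evaluator's `score` method."""
    if len(predictions) != len(references):
        raise ValueError('predictions and references must have the same length')
    if not references:
        return {'accuracy': 0.0}
    correct = sum(p == r for p, r in zip(predictions, references))
    return {'accuracy': 100.0 * correct / len(references)}


raw_outputs = [' Paris \nbecause it is the capital.', 'Lyon']
predictions = [mydataset_postprocess(o) for o in raw_outputs]
print(exact_match_score(predictions, ['Paris', 'Paris']))  # {'accuracy': 50.0}
```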
================================================
FILE: docs/en/advanced_guides/new_model.md
================================================
# Add a Model
Currently, we support HF models, some model APIs, and some third-party models.
## Adding API Models
To add a new API-based model, create a new file named `mymodel_api.py` under the `opencompass/models` directory. In this file, inherit from `BaseAPIModel` and implement the `generate` method for inference and the `get_token_len` method to calculate the length of tokens. Once you have defined the model, you can modify the corresponding configuration file.
```python
from typing import Dict, List, Optional

from ..base_api import BaseAPIModel


class MyModelAPI(BaseAPIModel):

    is_api: bool = True

    def __init__(self,
                 path: str,
                 max_seq_len: int = 2048,
                 query_per_second: int = 1,
                 retry: int = 2,
                 meta_template: Optional[Dict] = None,
                 **kwargs):
        # meta_template must be accepted here before it can be forwarded.
        super().__init__(path=path,
                         max_seq_len=max_seq_len,
                         meta_template=meta_template,
                         query_per_second=query_per_second,
                         retry=retry)
        ...

    def generate(
        self,
        inputs,
        max_out_len: int = 512,
        temperature: float = 0.7,
    ) -> List[str]:
        """Generate results given a list of inputs."""
        pass

    def get_token_len(self, prompt: str) -> int:
        """Get lengths of the tokenized string."""
        pass
```
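To make the two required methods more concrete, here is a rough, self-contained sketch. It is an assumption-laden toy: `EchoAPISketch` and `_call_api` are hypothetical, the class does not inherit from `BaseAPIModel`, and no real service is called, so it runs without OpenCompass installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class EchoAPISketch:
    """Stand-in showing the shape of `generate` and `get_token_len`."""

    def __init__(self, max_workers: int = 4):
        self.max_workers = max_workers

    def _call_api(self, prompt: str, max_out_len: int) -> str:
        # Placeholder for one request to a remote model.
        return prompt.upper()[:max_out_len]

    def generate(self, inputs: List[str], max_out_len: int = 512) -> List[str]:
        # Send one request per prompt concurrently; map() keeps input order.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(
                pool.map(lambda p: self._call_api(p, max_out_len), inputs))

    def get_token_len(self, prompt: str) -> int:
        # Crude whitespace-based estimate; a real wrapper would use the
        # provider's tokenizer.
        return len(prompt.split())
```

A real wrapper would replace `_call_api` with the provider's request logic and add rate limiting and retries.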
## Adding Third-Party Models
To add a new third-party model, create a new file named `mymodel.py` under the `opencompass/models` directory. In this file, inherit from `BaseModel` and implement the `generate` method for generative inference, the `get_ppl` method for discriminative inference, and the `get_token_len` method to calculate the length of tokens. Once you have defined the model, you can modify the corresponding configuration file.
```python
from typing import Dict, List, Optional

from ..base import BaseModel


class MyModel(BaseModel):

    def __init__(self,
                 pkg_root: str,
                 ckpt_path: str,
                 tokenizer_only: bool = False,
                 meta_template: Optional[Dict] = None,
                 **kwargs):
        ...

    def get_token_len(self, prompt: str) -> int:
        """Get lengths of the tokenized strings."""
        pass

    def generate(self, inputs: List[str], max_out_len: int) -> List[str]:
        """Generate results given a list of inputs."""
        pass

    def get_ppl(self,
                inputs: List[str],
                mask_length: Optional[List[int]] = None) -> List[float]:
        """Get perplexity scores given a list of inputs."""
        pass
```
================================================
FILE: docs/en/advanced_guides/objective_judgelm_evaluation.md
================================================
# Using Large Models as JudgeLLM for Objective Evaluation
## Introduction
Traditional objective evaluations often rely on standard answers for reference. However, in practical applications, the predicted results of models may vary due to differences in the model's instruction-following capabilities or imperfections in post-processing functions. This can lead to incorrect extraction of answers and comparison with standard answers, resulting in potentially inaccurate evaluation outcomes. To address this issue, we have adopted a process similar to subjective evaluations: after prediction, a JudgeLLM assesses the consistency between model responses and standard answers ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
Currently, all models supported by the opencompass repository can be directly used as JudgeLLM. Additionally, we are planning to support dedicated JudgeLLMs.
## Currently Supported Objective Evaluation Datasets
1. MATH ([https://github.com/hendrycks/math](https://github.com/hendrycks/math))
## Custom JudgeLLM Objective Dataset Evaluation
OpenCompass currently supports most datasets that use `GenInferencer` for inference. The specific process for custom JudgeLLM objective evaluation includes:
1. Building evaluation configurations using API models or open-source models for inference of question answers.
2. Employing a selected evaluation model (JudgeLLM) to assess the outputs of the model.
### Step One: Building Evaluation Configurations, Using MATH as an Example
Below is the config for evaluating the MATH dataset with JudgeLLM, with the evaluated model being *Llama3-8b-instruct* and the JudgeLLM being *Llama3-70b-instruct*. For more detailed config settings, please refer to `examples/eval_math_llm_judge.py`. The following is an abridged, annotated version to help users understand the configuration file.
```python
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
from mmengine.config import read_base
with read_base():
    from .models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model # noqa: F401, F403
    from .models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model # noqa: F401, F403
    from .datasets.math.math_llm_judge import math_datasets # noqa: F401, F403
from opencompass.datasets import math_judement_preprocess
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import AllObjSummarizer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
# ------------- Prompt Settings ----------------------------------------
# Evaluation template, please modify the template as needed, JudgeLLM typically uses [Yes] or [No] as the response. For the MATH dataset, the evaluation template is as follows:
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
[Yes]
Expression 1: 3/2
Expression 2: 1.5
[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
[Yes]
Expression 1: 3245/5
Expression 2: 649
[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2:
[No]
(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
Expression 2: {prediction}
"""
# ------------- Inference Phase ----------------------------------------
# Models to be evaluated
models = [*hf_llama3_8b_instruct_model]
# Judge models (JudgeLLM)
judge_models = hf_llama3_70b_instruct_model

eng_datasets = [*math_datasets]
chn_datasets = []
datasets = eng_datasets + chn_datasets

for d in eng_datasets:
    d['eval_cfg'] = dict(
        evaluator=dict(
            type=LMEvaluator,
            # If you need to preprocess model predictions before judging,
            # you can specify a pred_postprocessor function here
            pred_postprocessor=dict(type=math_judement_preprocess),
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[
                    dict(
                        role='HUMAN',
                        prompt=eng_obj_prompt
                    ),
                ]),
            ),
        ),
        pred_role="BOT",
    )

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=40000),
    runner=dict(
        type=LocalRunner,
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)

# ------------- Evaluation Configuration --------------------------------
eval = dict(
    partitioner=dict(
        type=SubjectiveSizePartitioner, max_task_size=80000, mode='singlescore', models=models, judge_models=judge_models,
    ),
    runner=dict(type=LocalRunner,
                max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(
    type=AllObjSummarizer
)
# Output folder
work_dir = 'outputs/obj_all/'
```
### Step Two: Launch Evaluation and Output Results
```shell
python run.py eval_math_llm_judge.py
```
This will initiate two rounds of evaluation. The first round involves model inference to obtain predicted answers to questions, and the second round involves JudgeLLM evaluating the consistency between the predicted answers and the standard answers, and scoring them.
- The results of model predictions will be saved in `output/.../timestamp/predictions/xxmodel/xxx.json`
- The JudgeLLM's evaluation responses will be saved in `output/.../timestamp/results/xxmodel/xxx.json`
- The evaluation report will be output to `output/.../timestamp/summary/timestamp/xxx.csv`
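
The second round can be pictured with the small sketch below. It is only an illustration under stated assumptions: `JUDGE_TEMPLATE` is a shortened stand-in for the full prompt, and `build_judge_prompt` and `judge_accuracy` are hypothetical helpers, not the logic of `LMEvaluator` or `AllObjSummarizer`. Only the `{obj_gold}` and `{prediction}` placeholders and the `[Yes]`/`[No]` answers come from the configuration above.

```python
from typing import List

# Shortened stand-in for the judge template.
JUDGE_TEMPLATE = 'Expression 1: {obj_gold}\nExpression 2: {prediction}\n'


def build_judge_prompt(obj_gold: str, prediction: str) -> str:
    """Fill the reference answer and the model prediction into the template."""
    return JUDGE_TEMPLATE.format(obj_gold=obj_gold, prediction=prediction)


def judge_accuracy(judge_responses: List[str]) -> float:
    """Percentage of judge responses that contain "[Yes]"."""
    if not judge_responses:
        return 0.0
    hits = sum('[Yes]' in response for response in judge_responses)
    return 100.0 * hits / len(judge_responses)
```

For example, `judge_accuracy(['[Yes]', '[No]', '[Yes]', '[No]'])` evaluates to `50.0`.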
## Results
Using Llama3-8b-instruct as the evaluated model and Llama3-70b-instruct as the JudgeLLM, the MATH dataset was assessed with the following results:
| Model | JudgeLLM Evaluation | Naive Evaluation |
| ------------------- | ------------------- | ---------------- |
| llama-3-8b-instruct | 27.7 | 27.8 |
================================================
FILE: docs/en/advanced_guides/persistence.md
================================================
# Evaluation Results Persistence
## Introduction
Normally, the evaluation results of OpenCompass will be saved to your work directory. But in some cases, there may be a need for data sharing among users or quickly browsing existing public evaluation results. Therefore, we provide an interface that can quickly transfer evaluation results to external public data stations, and on this basis, provide functions such as uploading, overwriting, and reading.
## Quick Start
### Uploading
By adding `args` to the evaluation command or adding configuration in the Eval script, the results of evaluation can be stored in the path you specify. Here are the examples:
(Approach 1) Add an `args` option to the command and specify your public path address.
```bash
opencompass ... -sp '/your_path'
```
(Approach 2) Add configuration in the Eval script.
```python
station_path = '/your_path'
```
### Overwriting
Before uploading, the storage method above checks whether a result for the same task already exists in the data station, based on the `abbr` attribute in the model and dataset configuration. If a result already exists, the upload is skipped. To update existing results, add the `--station-overwrite` option to the command. Here is an example:
```bash
opencompass ... -sp '/your_path' --station-overwrite
```
### Reading
You can directly read existing results from the data station to avoid duplicate evaluation tasks. The read results will directly participate in the 'summarize' step. When using this configuration, only tasks that do not store results in the data station will be initiated. Here is an example:
```bash
opencompass ... -sp '/your_path' --read-from-station
```
### Command Combination
1. Only upload the results under your latest working directory to the data station, without running tasks whose results are missing:
```bash
opencompass ... -sp '/your_path' -r latest -m viz
```
## Storage Format of the Data Station
In the data station, the evaluation results are stored as one `json` file per `model-dataset` pair. The directory layout is `/your_path/dataset_name/model_name.json`. Each `json` file stores a dictionary with the keys `predictions`, `results`, and `cfg`. Here is an example:
```python
Result = {
    'predictions': List[Dict],
    'results': Dict,
    'cfg': {
        'models': Dict,
        'datasets': Dict,
        'judge_models': Dict,  # only for subjective datasets
    },
}
```
Among these three keys, `predictions` records the predictions of the model on each item of data in the dataset, `results` records the total score of the model on the dataset, and `cfg` records the detailed configurations of the model and the dataset in this evaluation task.
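
The upload, overwrite, and read behaviour described in this document can be sketched in a few lines. This is a hedged illustration of the described semantics, not OpenCompass's implementation: `station_file`, `save_to_station`, and `read_from_station` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Optional


def station_file(station_path: str, dataset_abbr: str, model_abbr: str) -> Path:
    # Mirrors the layout /your_path/dataset_name/model_name.json.
    return Path(station_path) / dataset_abbr / f'{model_abbr}.json'


def save_to_station(station_path: str, dataset_abbr: str, model_abbr: str,
                    result: dict, overwrite: bool = False) -> bool:
    """Write one result file; return False if an existing file was kept."""
    target = station_file(station_path, dataset_abbr, model_abbr)
    if target.exists() and not overwrite:
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(result, ensure_ascii=False), encoding='utf-8')
    return True


def read_from_station(station_path: str, dataset_abbr: str,
                      model_abbr: str) -> Optional[dict]:
    """Return the stored result, or None when the task still has to run."""
    target = station_file(station_path, dataset_abbr, model_abbr)
    if not target.exists():
        return None
    return json.loads(target.read_text(encoding='utf-8'))
```

A `None` return corresponds to the case where a task has no stored result and would therefore be launched.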
================================================
FILE: docs/en/advanced_guides/prompt_attack.md
================================================
# Prompt Attack
We support prompt attacks following the idea of [PromptBench](https://github.com/microsoft/promptbench). The main purpose is to evaluate the robustness of the prompt instruction: when the prompt that instructs the task is attacked or modified, how well does the model perform compared with the original prompt?
## Set up environment
Some components are required for the prompt attack experiment, so we need to set up the environment first.
```shell
git clone https://github.com/microsoft/promptbench.git
pip install textattack==0.3.8
export PYTHONPATH=$PYTHONPATH:promptbench/
```
## How to attack
### Add a dataset config
We will use the GLUE-wnli dataset as an example. For most configuration settings, refer to [config.md](../user_guides/config.md).
First, we need a basic dataset config. You can find the existing config files in `configs` or write your own according to [new-dataset](./new_dataset.md).
Taking the following `infer_cfg` as an example, we need to define the prompt template. `adv_prompt` is the placeholder for the prompt to be attacked in the experiment. `sentence1` and `sentence2` are the input columns of this dataset. The attack only modifies `adv_prompt`.
Then, we use `AttackInferencer` with `original_prompt_list` and `adv_key` to tell the inferencer where to attack and which text to attack.
For more details, refer to the `configs/datasets/promptbench/promptbench_wnli_gen_50662f.py` config file.
```python
original_prompt_list = [
    'Are the following two sentences entailment or not_entailment? Answer me with "A. entailment" or "B. not_entailment", just one word. ',
    "Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'.",
    ...,
]

wnli_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(
                role="HUMAN",
                prompt="""{adv_prompt}
Sentence 1: {sentence1}
Sentence 2: {sentence2}
Answer:"""),
        ]),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(
        type=AttackInferencer,
        original_prompt_list=original_prompt_list,
        adv_key='adv_prompt'))
```
### Add an eval config
We should use `OpenICLAttackTask` here for the attack task. `NaivePartitioner` should also be used, because the attack experiment runs the whole dataset repeatedly, up to hundreds of times, to search for the best attack, so splitting the dataset is not convenient.
```note
Please choose a small dataset (for example, fewer than 1000 samples) for the attack. Because of the repeated search described above, the time cost is otherwise enormous.
```
There are several other options in the `attack` config:
- `attack`: attack type. Available options include `textfooler`, `textbugger`, `deepwordbug`, `bertattack`, `checklist`, `stresstest`;
- `query_budget`: upper bound on the number of queries, that is, the total number of times the dataset is run;
- `prompt_topk`: number of top-k prompts to be attacked. In most cases, the original prompt list contains more than 10 prompts, and running the whole set is time-consuming.
```python
# Please run the whole dataset at a time, i.e. use `NaivePartitioner` only
# Please use `OpenICLAttackTask` if you want to perform an attack experiment
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=8,
        task=dict(type=OpenICLAttackTask),
        retry=0),
)

attack = dict(
    attack='textfooler',
    query_budget=100,
    prompt_topk=2,
)
```
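As a rough illustration of what `prompt_topk` limits, the sketch below ranks prompts by their accuracy before the attack and keeps the best `topk`. This is an assumption about the selection step, and `select_topk_prompts` is a hypothetical helper rather than a function of OpenCompass or PromptBench.

```python
from typing import Dict, List


def select_topk_prompts(prompt_acc: Dict[str, float], topk: int) -> List[str]:
    """Return the `topk` prompts with the highest pre-attack accuracy."""
    ranked = sorted(prompt_acc.items(), key=lambda kv: kv[1], reverse=True)
    return [prompt for prompt, _ in ranked[:topk]]
```

With `prompt_topk=2`, only the two best-performing prompts would then be attacked, which keeps the number of full dataset runs manageable.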
### Run the experiment
Please use `--mode infer` when running the attack experiment, and make sure the `PYTHONPATH` environment variable is set.
```shell
python run.py examples/eval_attack.py --mode infer
```
All the results will be saved in the `attack` folder.
The content includes the accuracy of the original prompts and, for the `topk` prompts, the attacked prompt together with its drop in accuracy, for instance:
```
Prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'., acc: 59.15%
Prompt: Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'., acc: 57.75%
Prompt: Analyze the two provided sentences and decide if their relationship is 'A. entailment' or 'B. not_entailment'., acc: 56.34%
Prompt: Identify whether the given pair of sentences demonstrates entailment or not_entailment. Answer with 'A. entailment' or 'B. not_entailment'., acc: 54.93%
...
Original prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'.
Attacked prompt: b"Assess the attach between the following sentences and sorted it as 'A. entailment' or 'B. not_entailment'."
Original acc: 59.15%, attacked acc: 40.85%, dropped acc: 18.31%
```
================================================
FILE: docs/en/advanced_guides/subjective_evaluation.md
================================================
# Subjective Evaluation Guidance
## Introduction
Subjective evaluation aims to assess the model's performance in tasks that align with human preferences. The key criterion for this evaluation is human preference, but it comes with a high cost of annotation.
To explore the model's subjective capabilities, we employ JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
Two evaluation methods are popular:
- Compare Mode: comparing model responses pairwise to calculate their win rate.
- Score Mode: scoring a single model response ([Chatbot Arena](https://chat.lmsys.org/)).
We support the use of GPT-4 (or another JudgeLLM) for the subjective evaluation of models based on the above methods.
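
As a small worked example of Compare Mode, the sketch below turns a list of pairwise verdicts into win rates. The verdict labels (`'A'`, `'B'`, `'tie'`) and the rule that a tie counts as half a win for each side are assumptions made for illustration; individual datasets define their own scoring.

```python
from collections import Counter
from typing import Dict, List


def win_rates(judgements: List[str], model_a: str,
              model_b: str) -> Dict[str, float]:
    """Each judgement is 'A', 'B' or 'tie'; a tie is half a win for both."""
    total = len(judgements)
    if total == 0:
        return {model_a: 0.0, model_b: 0.0}
    counts = Counter(judgements)
    half_ties = counts['tie'] / 2
    return {
        model_a: (counts['A'] + half_ties) / total,
        model_b: (counts['B'] + half_ties) / total,
    }
```

For the verdicts `['A', 'A', 'B', 'tie']`, the first model gets a win rate of 0.625 and the second 0.375.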
## Currently Supported Subjective Evaluation Datasets
1. AlignBench Chinese Scoring Dataset (https://github.com/THUDM/AlignBench)
2. MTBench English Scoring Dataset, two-turn dialogue (https://github.com/lm-sys/FastChat)
3. MTBench101 English Scoring Dataset, multi-turn dialogue (https://github.com/mtbench101/mt-bench-101)
4. AlpacaEvalv2 English Compare Dataset (https://github.com/tatsu-lab/alpaca_eval)
5. ArenaHard English Compare Dataset, mainly focused on coding (https://github.com/lm-sys/arena-hard/tree/main)
6. Fofo English Scoring Dataset (https://github.com/SalesforceAIResearch/FoFo/)
7. Wildbench English Score and Compare Dataset (https://github.com/allenai/WildBench)
## Initiating Subjective Evaluation
Similar to existing objective evaluation methods, you can configure related settings in `examples/eval_subjective.py`.
### Basic Parameters: Specifying models, datasets, and judgemodels
Similar to objective evaluation, import the models and datasets that need to be evaluated, for example:
```
with read_base():
    from .datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from .datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import subjective_datasets as alpacav2
    from .models.qwen.hf_qwen_7b import models
```
It is worth noting that since the model setup parameters for subjective evaluation are often different from those for objective evaluation, it often requires setting up `do_sample` for inference instead of `greedy`. You can modify the relevant parameters in the configuration file as needed, for example:
```
models = [
    dict(
        type=HuggingFaceChatGLM3,
        abbr='chatglm3-6b-hf2',
        path='THUDM/chatglm3-6b',
        tokenizer_path='THUDM/chatglm3-6b',
        model_kwargs=dict(
            device_map='auto',
            trust_remote_code=True,
        ),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        generation_kwargs=dict(
            do_sample=True,
        ),
        meta_template=api_meta_template,
        max_out_len=2048,
        max_seq_len=4096,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```
The judgemodel is usually set to a powerful model like GPT4, and you can directly enter your API key according to the configuration in the config file, or use a custom model as the judgemodel.
### Specifying Other Parameters
In addition to the basic parameters, you can also modify the `infer` and `eval` fields in the config to set a more appropriate partitioning method. Three partitioning methods are currently supported: `NaivePartitioner`, `SizePartitioner`, and `NumWorkerPartitioner`. You can also specify your own `work_dir` to save related files.
## Subjective Evaluation with Custom Dataset
The specific process includes:
1. Data preparation
2. Model response generation
3. Evaluate the response with a JudgeLLM
4. Generate JudgeLLM's response and calculate the metric
### Step-1: Data Preparation
This step requires preparing the dataset file and implementing your own dataset class under `opencompass/datasets/subjective/`, returning the read data in the format of `list of dict`.
You can prepare the data in any format you like (csv, json, jsonl, etc.). However, to make it easier to get started, it is recommended to construct the data according to the format of the existing subjective datasets or according to the following json format.
We provide mini test sets for **Compare Mode** and **Score Mode** below:
```python
###COREV2
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },...]

###CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家,我让你给谁发送什么主题的邮件,你就帮我扩充好邮件正文,并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题,来斟酌用词,并使用合适的敬语。现在请给导师发送邮件,询问他是否可以下周三下午15:00进行科研同步会,大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
```
The json must include the following fields:
- 'question': Question description.
- 'capability': The capability dimension of the question.
- 'others': Other needed information.
If you want to modify the prompt for each individual question, you can put additional information into 'others' and build the prompt from it.
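
A dataset reader for this format could start from a sketch like the one below. It only covers reading and validating the file: `load_subjective_json` and `REQUIRED_FIELDS` are hypothetical names, and wrapping the result in an OpenCompass dataset class is left out.

```python
import json
from typing import Dict, List

# Field names taken from the list above.
REQUIRED_FIELDS = ('question', 'capability', 'others')


def load_subjective_json(path: str) -> List[Dict]:
    """Read a json file and return a list of dicts with the required fields."""
    with open(path, 'r', encoding='utf-8') as f:
        records = json.load(f)
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            raise ValueError(f'record {index} is missing fields: {missing}')
    return records
```

Failing early on a missing field is a design choice here; it surfaces malformed data before any model inference is spent on it.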
### Step-2: Evaluation Configuration (Compare Mode)
Taking Alignbench as an example, `configs/datasets/subjective/alignbench/alignbench_judgeby_critiquellm.py`:
1. First, you need to set `subjective_reader_cfg` to receive the relevant fields returned from the custom Dataset class and specify the output fields when saving files.
2. Then, you need to specify the root path `data_path` of the dataset and the dataset filename `subjective_all_sets`. If there are multiple sub-files, you can add them to this list.
3. Specify `subjective_infer_cfg` and `subjective_eval_cfg` to configure the corresponding inference and evaluation prompts.
4. Specify additional information such as `mode` at the corresponding location. Note that the fields required for different subjective datasets may vary.
5. Define post-processing and score statistics. For example, see the post-processing function `alignbench_postprocess` located under `opencompass/opencompass/datasets/subjective/alignbench`.
### Step-3: Launch the Evaluation
```shell
python run.py config/eval_subjective_score.py -r
```
The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.
The response of JudgeLLM will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`.
The evaluation report will be output to `output/.../summary/timestamp/report.csv`.
## Multi-round Subjective Evaluation in OpenCompass
In OpenCompass, we also support subjective multi-turn dialogue evaluation. For instance, the evaluation of MT-Bench can be referred to in `configs/datasets/subjective/multiround`.
In the multi-turn dialogue evaluation, you need to organize the data format into the following dialogue structure:
```
"dialogue": [
{
"role": "user",
"content": "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position? Where is the person you just overtook?"
},
{
"role": "assistant",
"content": ""
},
{
"role": "user",
"content": "If the \"second person\" is changed to \"last person\" in the above question, what would the answer be?"
},
{
"role": "assistant",
"content": ""
}
],
```
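How the empty assistant turns get filled can be sketched as follows. This is a simplified illustration, not OpenCompass's multi-turn inferencer: `run_dialogue` is hypothetical, and `reply` stands for any callable that maps the conversation so far to the next model response.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]


def run_dialogue(dialogue: List[Turn],
                 reply: Callable[[List[Turn]], str]) -> List[Turn]:
    """Fill each empty assistant turn using the history that precedes it."""
    history: List[Turn] = []
    for turn in dialogue:
        turn = dict(turn)  # copy so the input data stays untouched
        if turn['role'] == 'assistant' and not turn['content']:
            turn['content'] = reply(history)
        history.append(turn)
    return history
```

Because each response is appended to the history before the next user turn, later answers can depend on earlier ones, which is what distinguishes multi-turn from single-turn evaluation.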
It's important to note that, because the different question types in MTBench have different temperature settings, we need to divide the original data files into three subsets according to the temperature for separate inference. For different subsets, we can set different temperatures. For specific settings, please refer to `configs/datasets/subjective/multiround/mtbench_single_judge_diff_temp.py`.
================================================
FILE: docs/en/conf.py
================================================
# flake8: noqa
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import subprocess
import sys
import pytorch_sphinx_theme
from sphinx.builders.html import StandaloneHTMLBuilder
sys.path.insert(0, os.path.abspath('../../'))
# -- Project information -----------------------------------------------------
project = 'OpenCompass'
copyright = '2023, OpenCompass'
author = 'OpenCompass Authors'
# The full version, including alpha/beta/rc tags
version_file = '../../opencompass/__init__.py'
def get_version():
    with open(version_file, 'r') as f:
        exec(compile(f.read(), version_file, 'exec'))
    return locals()['__version__']
release = get_version()
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'myst_parser',
'sphinx_copybutton',
'sphinx_tabs.tabs',
'notfound.extension',
'sphinxcontrib.jquery',
'sphinx_design',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = {
'.rst': 'restructuredtext',
'.md': 'markdown',
}
language = 'en'
# The master toctree document.
root_doc = 'index'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'pytorch_sphinx_theme'
html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
# yapf: disable
html_theme_options = {
'menu': [
{
'name': 'GitHub',
'url': 'https://github.com/open-compass/opencompass'
},
],
# Specify the language of shared menu
'menu_lang': 'en',
# Disable the default edit on GitHub
'default_edit_on_github': False,
}
# yapf: enable
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
'css/readthedocs.css'
]
html_js_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
'js/custom.js'
]
html_context = {
'github_version': 'main',
}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'opencompassdoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(root_doc, 'opencompass.tex', 'OpenCompass Documentation', author,
'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(root_doc, 'opencompass', 'OpenCompass Documentation', [author],
1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(root_doc, 'opencompass', 'OpenCompass Documentation', author,
'OpenCompass Authors', 'AGI evaluation toolbox and benchmark.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# set priority when building html
StandaloneHTMLBuilder.supported_image_types = [
'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
]
# -- Extension configuration -------------------------------------------------
# Ignore >>> when copying code
copybutton_prompt_text = r'>>> |\.\.\. '
copybutton_prompt_is_regexp = True
# Auto-generated header anchors
myst_heading_anchors = 3
# Enable "colon_fence" extension of myst.
myst_enable_extensions = ['colon_fence', 'dollarmath']
# Configuration for intersphinx
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
'numpy': ('https://numpy.org/doc/stable', None),
'torch': ('https://pytorch.org/docs/stable/', None),
'mmengine': ('https://mmengine.readthedocs.io/en/latest/', None),
'transformers':
('https://huggingface.co/docs/transformers/main/en/', None),
}
napoleon_custom_sections = [
# Custom sections for data elements.
('Meta fields', 'params_style'),
('Data fields', 'params_style'),
]
# Disable docstring inheritance
autodoc_inherit_docstrings = False
# Mock some imports during generate API docs.
autodoc_mock_imports = ['rich', 'attr', 'einops']
# Disable displaying type annotations, these can be very verbose
autodoc_typehints = 'none'
# The not found page
notfound_template = '404.html'
def builder_inited_handler(app):
    subprocess.run(['./statis.py'])


def setup(app):
    app.connect('builder-inited', builder_inited_handler)
================================================
FILE: docs/en/docutils.conf
================================================
[html writers]
table_style: colwidths-auto
================================================
FILE: docs/en/get_started/faq.md
================================================
# FAQ
## General
### What are the differences and connections between `ppl` and `gen`?
`ppl` stands for perplexity, an index used to evaluate a model's language modeling capabilities. In the context of OpenCompass, it generally refers to a method of answering multiple-choice questions: given a context, the model needs to choose the most appropriate option from multiple choices. In this case, we concatenate the n options with the context to form n sequences, then calculate the model's perplexity for these n sequences. We consider the option corresponding to the sequence with the lowest perplexity as the model's reasoning result for this question. This evaluation method is simple and direct in post-processing, with high certainty.
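
The selection rule can be written down in a few lines. The sketch below assumes the per-token log-probabilities of each of the n sequences are already available; `perplexity` and `choose_by_ppl` are hypothetical helpers, not OpenCompass functions.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """exp of the negative mean log-probability of a sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_by_ppl(option_logprobs: List[List[float]]) -> int:
    """Index of the option whose sequence has the lowest perplexity."""
    scores = [perplexity(logprobs) for logprobs in option_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)
```

Averaging over tokens before exponentiating keeps longer options from being penalized merely for their length.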
`gen` is an abbreviation for generate. In the context of OpenCompass, it refers to the model's continuation writing result given a context as the reasoning result for a question. Generally, the string obtained from continuation writing requires a heavier post-processing process to extract reliable answers and complete the evaluation.
In terms of usage, multiple-choice questions and some multiple-choice-like questions of the base model use `ppl`, while the base model's multiple-selection and non-multiple-choice questions use `gen`. All questions of the chat model use `gen`, as many commercial API models do not expose the `ppl` interface. However, there are exceptions, such as when we want the base model to output the problem-solving process (e.g., Let's think step by step), we will also use `gen`, but the overall usage is as shown in the following table:
| | ppl | gen |
| ---------- | -------------- | -------------------- |
| Base Model | Only MCQ Tasks | Tasks Other Than MCQ |
| Chat Model | None | All Tasks |
Similar to `ppl`, conditional log probability (`clp`) calculates the probability of the next token given a context. It is also only applicable to multiple-choice questions, and the range of probability calculation is limited to the tokens corresponding to the option numbers. The option corresponding to the token with the highest probability is considered the model's reasoning result. Compared to `ppl`, `clp` calculation is more efficient, requiring only one inference, whereas `ppl` requires n inferences. However, the drawback is that `clp` is subject to the tokenizer. For example, the presence or absence of space symbols before and after an option can change the tokenizer's encoding result, leading to unreliable test results. Therefore, `clp` is rarely used in OpenCompass.
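The `clp` rule can be sketched in the same toy style: a single forward pass yields next-token logits, and only the tokens that stand for the option labels are compared. The logits and the label-to-token-id mapping below are invented for illustration, and `pick_by_clp` is a hypothetical helper, not an OpenCompass function.

```python
import math


def pick_by_clp(next_token_logits, option_token_ids):
    """Softmax over the option tokens only; return (best label, probabilities)."""
    labels = list(option_token_ids)
    logits = [next_token_logits[option_token_ids[label]] for label in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Made-up logits indexed by token id, and an assumed mapping from labels to token ids.
next_token_logits = [0.1, 2.3, 0.4, 3.1, -1.0]
option_token_ids = {"A": 1, "B": 3, "C": 4}
best, probs = pick_by_clp(next_token_logits, option_token_ids)
print(best)
```

The tokenizer sensitivity mentioned above shows up in `option_token_ids`: if `"A"` and `" A"` map to different token ids, the lookup reads a different logit.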
### How does OpenCompass control the number of shots in few-shot evaluations?
In the dataset configuration file, there is a retriever field indicating how to recall samples from the dataset as context examples. The most commonly used is `FixKRetriever`, which means using a fixed k samples, hence k-shot. There is also `ZeroRetriever`, which means not using any samples, which in most cases implies 0-shot.
On the other hand, in-context samples can also be directly specified in the dataset template. In this case, `ZeroRetriever` is also used, but the evaluation is not 0-shot and needs to be determined based on the specific template. Refer to [prompt](../prompt/prompt_template.md) for more details.
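To make the idea of a fixed-k retriever concrete, here is a toy illustration in plain Python. It is not the actual `FixKRetriever` implementation; `fix_k_examples`, `build_prompt`, and the sample data are hypothetical and only show how a fixed set of indices yields a k-shot prompt, while an empty set yields a 0-shot prompt.

```python
def fix_k_examples(train_set, fix_ids):
    """Always return the same in-context examples, selected by fixed indices."""
    return [train_set[i] for i in fix_ids]


def build_prompt(examples, question):
    """Concatenate in-context examples (possibly none) with the test question."""
    shots = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


train_set = [
    {"question": "1 + 1 = ?", "answer": "2"},
    {"question": "2 + 3 = ?", "answer": "5"},
    {"question": "4 + 4 = ?", "answer": "8"},
]
two_shot = build_prompt(fix_k_examples(train_set, [0, 1]), "3 + 5 = ?")
zero_shot = build_prompt([], "3 + 5 = ?")
print(two_shot)
```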
### How does OpenCompass allocate GPUs?
OpenCompass processes evaluation requests in units called "tasks". Each task is an independent combination of model(s) and dataset(s). The GPU resources needed for a task are determined entirely by the model being evaluated, specifically by the `num_gpus` parameter.
During evaluation, OpenCompass deploys multiple workers to execute tasks in parallel. These workers continuously try to secure GPU resources and run tasks until they succeed. As a result, OpenCompass always strives to leverage all available GPU resources to their maximum capacity.
For instance, if you're using OpenCompass on a local machine equipped with 8 GPUs, and each task demands 4 GPUs, then by default, OpenCompass will employ all 8 GPUs to concurrently run 2 tasks. However, if you adjust the `--max-num-workers` setting to 1, then only one task will be processed at a time, utilizing just 4 GPUs.
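The arithmetic in this example can be written down as a small helper. This is a simplified model of the behaviour described above, not OpenCompass code: `concurrent_tasks` is a hypothetical name, and the worker cap of 16 is an arbitrary value chosen only to be larger than the number of tasks that fit.

```python
def concurrent_tasks(total_gpus, gpus_per_task, max_num_workers):
    """Tasks that can run at once: bounded by free GPUs and by the worker cap."""
    return min(max_num_workers, total_gpus // gpus_per_task)


# 8 local GPUs, 4 GPUs per task, worker cap large enough not to matter.
unconstrained = concurrent_tasks(total_gpus=8, gpus_per_task=4, max_num_workers=16)
# Same machine, but with --max-num-workers set to 1.
single_worker = concurrent_tasks(total_gpus=8, gpus_per_task=4, max_num_workers=1)

print(unconstrained, unconstrained * 4)   # tasks in flight, GPUs in use
print(single_worker, single_worker * 4)
```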
### Why doesn't the GPU behavior of HuggingFace models align with my expectations?
This is a complex issue that needs to be explained from both the supply and demand sides:
The supply side refers to how many tasks are being run. A task is a combination of a model and a dataset, and it primarily depends on how many models and datasets need to be tested. Additionally, since OpenCompass splits a larger task into multiple smaller tasks, the number of data entries per sub-task (`--max-partition-size`) also affects the number of tasks. (The `--max-partition-size` is proportional to the actual number of data entries, but the relationship is not 1:1).
The demand side refers to how many workers are running. Since OpenCompass instantiates multiple models for inference simultaneously, we use `--hf-num-gpus` to specify how many GPUs each instance uses. Note that `--hf-num-gpus` is a parameter specific to HuggingFace models and setting this parameter for non-HuggingFace models will not have any effect. We also use `--max-num-workers` to indicate the maximum number of instances running at the same time. Lastly, due to issues like GPU memory and insufficient load, OpenCompass also supports running multiple instances on the same GPU, which is managed by the parameter `--max-num-workers-per-gpu`. Therefore, it can be generally assumed that we will use a total of `--hf-num-gpus` * `--max-num-workers` / `--max-num-workers-per-gpu` GPUs.
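The estimate at the end of this paragraph can be checked with a few lines of Python. The function name `estimated_gpus` is hypothetical, and rounding up to a whole GPU is an assumption added here; the formula itself is the one stated above.

```python
import math


def estimated_gpus(hf_num_gpus, max_num_workers, max_num_workers_per_gpu=1):
    """GPUs in use ~= hf_num_gpus * max_num_workers / max_num_workers_per_gpu."""
    return math.ceil(hf_num_gpus * max_num_workers / max_num_workers_per_gpu)


print(estimated_gpus(1, 8))      # one GPU per instance, 8 instances
print(estimated_gpus(1, 8, 2))   # two instances share each GPU
print(estimated_gpus(2, 4))      # two GPUs per instance, 4 instances
```

Sharing a GPU between instances halves the GPU count in the second call, which is why `--max-num-workers-per-gpu` helps when the per-instance load is low.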
In summary, when tasks run slowly or the GPU load is low, we first need to check if the supply is sufficient. If not, consider reducing `--max-partition-size` to split the tasks into finer parts. Next, we need to check if the demand is sufficient. If not, consider increasing `--max-num-workers` and `--max-num-workers-per-gpu`. Generally, **we set `--hf-num-gpus` to the minimum value that meets the demand and do not adjust it further.**
### How do I control the number of GPUs that OpenCompass occupies?
Currently, there isn't a direct method to specify the number of GPUs OpenCompass can utilize. However, the following are some indirect strategies:
**If evaluating locally:**
You can limit OpenCompass's GPU access by setting the `CUDA_VISIBLE_DEVICES` environment variable. For instance, using `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` will only expose the first four GPUs to OpenCompass, ensuring it uses no more than these four GPUs simultaneously.
**If using Slurm or DLC:**
Although OpenCompass doesn't have direct access to the resource pool, you can adjust the `--max-num-workers` parameter to restrict the number of evaluation tasks being submitted simultaneously. This will indirectly manage the number of GPUs that OpenCompass employs. For instance, if each task requires 4 GPUs, and you wish to allocate a total of 8 GPUs, then you should set `--max-num-workers` to 2.
### `libGL.so.1` not found
opencv-python depends on some dynamic libraries that are not present in the environment. The simplest solution is to uninstall opencv-python and then install opencv-python-headless.
```bash
pip uninstall opencv-python
pip install opencv-python-headless
```
Alternatively, you can install the corresponding dependency libraries according to the error message:
```bash
sudo apt-get update
sudo apt-get install -y libgl1 libglib2.0-0
```
## Network
### My tasks failed with error: `('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))` or `urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443)`
Because of HuggingFace's implementation, OpenCompass requires network access (especially a connection to HuggingFace) the first time it loads some datasets and models. Additionally, it connects to HuggingFace each time it is launched. For a successful run, you may:
- Work behind a proxy by specifying the environment variables `http_proxy` and `https_proxy`;
- Use the cache files from other machines. You may first run the experiment on a machine that has access to the Internet, and then copy the cached files to the offline one. The cached files are located at `~/.cache/huggingface/` by default ([doc](https://huggingface.co/docs/datasets/cache#cache-directory)). When the cached files are ready, you can start the evaluation in offline mode:
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```
After that, no network connection is needed for the evaluation. However, an error will still be raised if the files of any dataset or model are missing from the cache.
- Use a mirror such as [hf-mirror](https://hf-mirror.com/):
```bash
HF_ENDPOINT=https://hf-mirror.com python run.py ...
```
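If the evaluation is launched from a Python script rather than a shell, the same environment variables can be prepared programmatically. The sketch below is only an illustration: `hf_env` is a hypothetical helper, and the variable names are the ones used in the commands above.

```python
import os


def hf_env(offline=False, endpoint=None, base_env=None):
    """Copy an environment and add the HuggingFace variables described above."""
    env = dict(os.environ if base_env is None else base_env)
    if offline:
        for name in ("HF_DATASETS_OFFLINE", "TRANSFORMERS_OFFLINE", "HF_EVALUATE_OFFLINE"):
            env[name] = "1"
    if endpoint is not None:
        env["HF_ENDPOINT"] = endpoint
    return env


offline_env = hf_env(offline=True, base_env={})
mirror_env = hf_env(endpoint="https://hf-mirror.com", base_env={})
print(offline_env)
print(mirror_env)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when starting `run.py`.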
### My server cannot connect to the Internet, how can I use OpenCompass?
Use the cache files from other machines, as suggested in the answer to [Network-Q1](#my-tasks-failed-with-error-connection-aborted-connectionreseterror104-connection-reset-by-peer-or-urllib3exceptionsmaxretryerror-httpsconnectionpoolhostcdn-lfshuggingfaceco-port443).
### In evaluation phase, I'm running into an error saying that `FileNotFoundError: Couldn't find a module script at opencompass/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.`
HuggingFace tries to load the metric (e.g. `accuracy`) as a module online, and this can fail if the network is unreachable. Please refer to [Network-Q1](#my-tasks-failed-with-error-connection-aborted-connectionreseterror104-connection-reset-by-peer-or-urllib3exceptionsmaxretryerror-httpsconnectionpoolhostcdn-lfshuggingfaceco-port443) for guidelines to fix your network issue.
The issue has been fixed in the latest version of OpenCompass, so you might also consider upgrading to the latest version.
## Efficiency
### Why does OpenCompass partition each evaluation request into tasks?
Given the extensive evaluation time and the vast quantity of datasets, conducting a comprehensive linear evaluation on LLM models can be immensely time-consuming. To address this, OpenCompass divides the evaluation request into multiple independent "tasks". These tasks are then dispatched to various GPU groups or nodes, achieving full parallelism and maximizing the efficiency of computational resources.
### How does task partitioning work?
Each task in OpenCompass represents a combination of specific model(s) and portions of the dataset awaiting evaluation. OpenCompass offers a variety of task partitioning strategies, each tailored for different scenarios. During the inference stage, the prevalent partitioning method seeks to balance task size, or computational cost. This cost is heuristically derived from the dataset size and the type of inference.
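A deliberately simplified picture of size-based partitioning is shown below. It splits a dataset purely by entry count, whereas the actual strategy balances a heuristic cost; `partition` is a hypothetical helper and the numbers are illustrative.

```python
def partition(num_entries, max_partition_size):
    """Split [0, num_entries) into consecutive (start, end) ranges of bounded size."""
    return [
        (start, min(start + max_partition_size, num_entries))
        for start in range(0, num_entries, max_partition_size)
    ]


parts = partition(num_entries=5000, max_partition_size=2000)
print(parts)
print(len(parts), "tasks for one model on this dataset")
```

A smaller partition size produces more, shorter tasks; a larger one produces fewer, longer tasks.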
### Why does it take more time to evaluate LLM models on OpenCompass?
There is a tradeoff between the number of tasks and the time to load the model. For example, if we partition a request that evaluates a model against a dataset into 100 tasks, the model will be loaded 100 times in total. When resources are abundant, these 100 tasks can be executed in parallel, so the additional time spent on model loading can be ignored. However, if resources are limited, these 100 tasks will operate more sequentially, and repeated loadings can become a bottleneck in execution time.
Hence, if users find that the number of tasks greatly exceeds the available GPUs, we advise setting the `--max-partition-size` to a larger value.
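A back-of-the-envelope model makes the tradeoff visible. It assumes equal-sized tasks that run in waves of `slots` tasks at a time; `wall_time` and all timing values are hypothetical and only meant to illustrate the reasoning.

```python
import math


def wall_time(num_tasks, slots, load_time, infer_time_total):
    """Estimate wall-clock seconds when equal tasks run `slots` at a time."""
    waves = math.ceil(num_tasks / slots)
    per_task = load_time + infer_time_total / num_tasks  # each task reloads the model
    return waves * per_task


# Assumed numbers: 60 s to load the model, 6000 s of inference in total.
abundant = wall_time(num_tasks=100, slots=100, load_time=60, infer_time_total=6000)
limited = wall_time(num_tasks=100, slots=2, load_time=60, infer_time_total=6000)
coarse = wall_time(num_tasks=2, slots=2, load_time=60, infer_time_total=6000)
print(abundant, limited, coarse)
```

With only two slots, the coarse partition finishes sooner than the fine one because the model is loaded twice instead of one hundred times.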
## Model
### How to use the downloaded huggingface models?
If you have already downloaded the model checkpoints, you can specify the local path of the model. For example:
```bash
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path /path/to/model
```
## Dataset
### How to build a new dataset?
- For building new objective dataset: [new_dataset](../advanced_guides/new_dataset.md)
- For building new subjective dataset: [subjective_evaluation](../advanced_guides/subjective_evaluation.md)
================================================
FILE: docs/en/get_started/installation.md
================================================
# Installation
## Basic Installation
1. Prepare the OpenCompass runtime environment using Conda:
```bash
conda create --name opencompass python=3.10 -y
# conda create --name opencompass_lmdeploy python=3.10 -y
conda activate opencompass
```
If you want to customize the PyTorch version or related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
2. Install OpenCompass:
- pip Installation
```bash
# For support of most datasets and models
pip install -U opencompass
# Complete installation (supports more datasets)
# pip install "opencompass[full]"
# API Testing (e.g., OpenAI, Qwen)
# pip install "opencompass[api]"
```
- Building from Source Code

  If you want to use the latest features of OpenCompass, build it from source:
```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```
## Other Installations
### Inference Backends
```bash
# Model inference backends. Since these backends often have dependency conflicts,
# we recommend using separate virtual environments to manage them.
pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"
```
- LMDeploy
You can check if the inference backend has been installed successfully with the following command. For more information, refer to the [official documentation](https://lmdeploy.readthedocs.io/en/latest/get_started.html)
```bash
lmdeploy chat internlm/internlm2_5-1_8b-chat --backend turbomind
```
- vLLM
You can check if the inference backend has been installed successfully with the following command. For more information, refer to the [official documentation](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
```bash
vllm serve facebook/opt-125m
```
### API
OpenCompass supports different commercial model API calls, which you can install via pip or by referring to the [API dependencies](https://github.com/open-compass/opencompass/blob/main/requirements/api.txt) for specific API model dependencies.
```bash
pip install "opencompass[api]"
# pip install openai # GPT-3.5-Turbo / GPT-4-Turbo / GPT-4 / GPT-4o (API)
# pip install anthropic # Claude (API)
# pip install dashscope # Qwen (API)
# pip install volcengine-python-sdk # ByteDance Volcano Engine (API)
# ...
```
### Datasets
The basic installation supports most fundamental datasets. For certain datasets (e.g., Alpaca-eval, Longbench, etc.), additional dependencies need to be installed.
You can install these through pip or refer to the [additional dependencies](https://github.com/open-compass/opencompass/blob/main/requirements/extra.txt) for specific dependencies.
```bash
pip install "opencompass[full]"
```
For HumanEvalX / HumanEval+ / MBPP+, you need to manually clone the Git repository and install it.
```bash
git clone --recurse-submodules git@github.com:open-compass/human-eval.git
cd human-eval
pip install -e .
pip install -e evalplus
```
Some agent evaluations require installing numerous dependencies, which may conflict with existing runtime environments. We recommend creating separate conda environments to manage these.
```bash
# T-Eval
pip install lagent==0.1.2
# CIBench
pip install -r requirements/agent.txt
```
# Dataset Preparation
The datasets supported by OpenCompass mainly include three parts:
1. Huggingface datasets: The [Huggingface Datasets](https://huggingface.co/datasets) provide a large number of datasets, which will **automatically download** when running with this option.
2. ModelScope Datasets: [ModelScope OpenCompass Dataset](https://modelscope.cn/organization/opencompass) supports automatic downloading of datasets from ModelScope.
To enable this feature, set the environment variable: `export DATASET_SOURCE=ModelScope`. The available datasets include (sourced from OpenCompassData-core.zip):
```plain
humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ceval, math, LCSTS, Xsum, winogrande, openbookqa, AGIEval, gsm8k, nq, race, siqa, mbpp, mmlu, hellaswag, ARC, BBH, xstory_cloze, summedits, GAOKAO-BENCH, OCNLI, cmnli
```
3. Custom dataset: OpenCompass also provides some Chinese custom **self-built** datasets. Please run the following command to **manually download and extract** them.
Running the following commands downloads the datasets and places them in the `${OpenCompass}/data` directory, which completes dataset preparation.
```bash
# Run in the OpenCompass directory
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```
If you need the more comprehensive dataset (~500M) provided by OpenCompass, you can download and `unzip` it using the following commands:
```bash
# For proxy and resumable downloads, try `aria2c -x16 -s16 -k1M "http://ghfast.top/https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-complete-20240207.zip" `
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-complete-20240207.zip
unzip OpenCompassData-complete-20240207.zip
cd ./data
find . -name "*.zip" -exec unzip "{}" \;
```
The list of datasets included in both `.zip` files can be found [here](https://github.com/open-compass/opencompass/releases/tag/0.2.2.rc1).
OpenCompass supports most of the datasets commonly used for performance comparison. Please refer to `configs/datasets` for the specific list of supported datasets.
For the next step, please read [Quick Start](./quick_start.md).
================================================
FILE: docs/en/get_started/quick_start.md
================================================
# Quick Start

## Overview
OpenCompass provides a streamlined workflow for evaluating a model, which consists of the following stages: **Configure** -> **Inference** -> **Evaluation** -> **Visualization**.
**Configure**: This is your starting point. Here, you'll set up the entire evaluation process, choosing the model(s) and dataset(s) to assess. You also have the option to select an evaluation strategy, the computation backend, and define how you'd like the results displayed.
**Inference & Evaluation**: OpenCompass efficiently manages the heavy lifting, conducting parallel inference and evaluation on your chosen model(s) and dataset(s). The **Inference** phase is all about producing outputs from your datasets, whereas the **Evaluation** phase measures how well these outputs align with the gold standard answers. This procedure is broken down into multiple "tasks" that run concurrently for greater efficiency, but be aware that working with limited computational resources might introduce unexpected overhead and result in generally slower evaluation. To understand this issue and know how to solve it, check out [FAQ: Efficiency](faq.md#efficiency).
**Visualization**: Once the evaluation is done, OpenCompass collates the results into an easy-to-read table and saves them as both CSV and TXT files. If you need real-time updates, you can activate lark reporting and get immediate status reports in your Lark clients.
Coming up, we'll walk you through the basics of OpenCompass, showcasing evaluations of pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winograd_wsc) benchmark tasks. Their configuration files can be found at [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).
Before running this experiment, please make sure you have installed OpenCompass locally. The experiment should run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to the other examples provided in the [configs directory](https://github.com/open-compass/opencompass/tree/main/configs).
## Configuring an Evaluation Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.
`````{tabs}
````{tab} Command Line (Custom HF Model)
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, for the `facebook/opt-125m` model, you can evaluate it with the following command:
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-type base \
--hf-path facebook/opt-125m
```
Note that in this way, OpenCompass only evaluates one model at a time, while other ways can evaluate multiple models at once.
```{caution}
`--hf-num-gpus` does not stand for the actual number of GPUs to use in evaluation, but the minimum required number of GPUs for this model. [More](faq.md#how-does-opencompass-allocate-gpus)
```
:::{dropdown} More detailed example
:animate: fade-in-slide-down
```bash
# Options:
#   --hf-type           HuggingFace model type, base or chat
#   --hf-path           HuggingFace model path
#   --tokenizer-path    HuggingFace tokenizer path (can be omitted if it is the same as the model path)
#   --tokenizer-kwargs  Arguments to construct the tokenizer
#   --model-kwargs      Arguments to construct the model
#   --max-seq-len       Maximum sequence length the model can accept
#   --max-out-len       Maximum number of tokens to generate
#   --min-out-len       Minimum number of tokens to generate
#   --batch-size        Batch size
#   --hf-num-gpus       Number of GPUs required to run the model
python run.py --datasets siqa_gen winograd_ppl \
    --hf-type base \
    --hf-path facebook/opt-125m \
    --tokenizer-path facebook/opt-125m \
    --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
    --model-kwargs device_map='auto' \
    --max-seq-len 2048 \
    --max-out-len 100 \
    --min-out-len 100 \
    --batch-size 64 \
    --hf-num-gpus 1
```
```{seealso}
For all HuggingFace related parameters supported by `run.py`, please read [Launching Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task).
```
:::
````
````{tab} Command Line
Users can combine the models and datasets they want to test using `--models` and `--datasets`.
```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```
The models and datasets are pre-stored in the form of configuration files in `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations using `tools/list_configs.py`.
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
:::{dropdown} More about `list_configs`
:animate: fade-in-slide-down
Running `python tools/list_configs.py llama mmlu` gives output like:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| ... | ... |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| ... | ... |
+-------------------+---------------------------------------------------+
```
Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that its prompts or evaluation methods are different.
:::
:::{dropdown} Model not on the list?
:animate: fade-in-slide-down
If you want to evaluate other models, please check out the "Command Line (Custom HF Model)" tab for the way to construct a custom HF model without a configuration file, or "Configuration File" tab to learn the general way to prepare your model configurations.
:::
````
````{tab} Configuration File
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. The configuration file is organized in Python format and must include the `datasets` and `models` fields.
The configuration used in this walkthrough is [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py). This configuration introduces the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#inheritance-mechanism) and combines the `datasets` and `models` fields in the required format.
```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.opt.hf_opt_125m import opt125m
    from .models.opt.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```
When running tasks, we just need to pass the path of the configuration file to `run.py`:
```bash
python run.py configs/eval_demo.py
```
:::{dropdown} More about `models`
:animate: fade-in-slide-down
OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](https://github.com/open-compass/opencompass/blob/main/configs/models/opt/hf_opt_350m.py) (`configs/models/opt/hf_opt_350m.py`):
```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceBaseModel`
from opencompass.models import HuggingFaceBaseModel

models = [
    # OPT-350M
    dict(
        type=HuggingFaceBaseModel,
        # Initialization parameters for `HuggingFaceBaseModel`
        path='facebook/opt-350m',
        # Below are common parameters for all models, not specific to HuggingFaceBaseModel
        abbr='opt-350m-hf',  # Model abbreviation
        max_out_len=1024,  # Maximum number of generated tokens
        batch_size=32,  # Batch size
        run_cfg=dict(num_gpus=1),  # The required GPU numbers for this model
    )
]
```
When using configurations, we can specify the relevant files through the command-line argument `--models` or import the model configurations into the `models` list in the configuration file using the inheritance mechanism.
```{seealso}
More information about model configuration can be found in [Prepare Models](../user_guides/models.md).
```
:::
:::{dropdown} More about `datasets`
:animate: fade-in-slide-down
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line or import related configurations in the configuration file via inheritance.
Below is a dataset-related configuration snippet from `configs/eval_demo.py`:
```python
from mmengine.config import read_base # Use mmengine.read_base() to read the base configuration

with read_base():
    # Directly read the required dataset configurations from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Read Winograd configuration, evaluated based on PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # Read SIQA configuration, evaluated based on generation

datasets = [*siqa_datasets, *winograd_datasets] # The final config needs to contain the required evaluation dataset list 'datasets'
```
Dataset configurations are typically of two types, `ppl` and `gen`, indicating the evaluation method used: `ppl` means discriminative evaluation and `gen` means generative evaluation.
Moreover, [configs/datasets/collections](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections) houses various dataset collections, making it convenient for comprehensive evaluations. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To replicate results, simply import that file, for example:
```bash
python run.py --models hf_llama_7b --datasets base_medium
```
```{seealso}
You can find more information from [Dataset Preparation](../user_guides/datasets.md).
```
:::
````
`````
```{warning}
OpenCompass usually assumes a network connection is available. If you encounter network issues or wish to run OpenCompass in an offline environment, please refer to [FAQ - Network - Q1](./faq.md#network) for solutions.
```
The following sections use the configuration-based method as an example to explain the other features.
## Launching Evaluation
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in `--debug` mode for the first run and check if there is any problem. In `--debug` mode, the tasks will be executed sequentially and output will be printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
If everything is fine, you should see "Starting inference process" on screen:
```text
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
Then you can press `ctrl+c` to interrupt the program, and run the following command in normal mode:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
In normal mode, the evaluation tasks will be executed in parallel in the background, and their output will be redirected to the output directory `outputs/demo/{TIMESTAMP}`. The progress bar on the frontend only indicates the number of completed tasks, regardless of their success or failure. **Any backend task failures will only trigger a warning message in the terminal.**
:::{dropdown} More parameters in `run.py`
:animate: fade-in-slide-down
Here are some parameters related to evaluation that can help you configure more efficient inference tasks based on your environment:
- `-w outputs/demo`: Work directory to save evaluation logs and results. In this case, the experiment result will be saved to `outputs/demo/{TIMESTAMP}`.
- `-r`: Reuse existing inference results, and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused.
- `--mode all`: Specify which stage of the task to run.
  - all: (Default) Perform a complete evaluation, including inference and evaluation.
  - infer: Perform inference on each dataset.
  - eval: Perform evaluation based on the inference results.
  - viz: Display evaluation results only.
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large, and using this parameter can split them into multiple sub-tasks to efficiently utilize resources. However, if the partition is too fine, the overall speed may be slower due to longer model loading times.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks. In a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks depends on the available GPU resources and may not be equal to this number.
If you are not performing the evaluation on your local machine but using a Slurm cluster, you can specify the following parameters:
- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.
```{seealso}
The entry point also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task) for details.
```
:::
## Visualizing Evaluation Results
After the evaluation is complete, the evaluation results table will be printed as follows:
```text
dataset version metric mode opt350m opt125m
--------- --------- -------- ------ --------- ---------
siqa e78df3 accuracy gen 21.55 12.44
winograd b6c7ed accuracy ppl 51.23 49.82
```
All run outputs will be directed to the `outputs/demo/` directory with the following structure:
```text
outputs/demo/
├── 20200220_120000
├── 20230220_183030 # one experiment per folder
│ ├── configs # Dumped config files for record. Multiple configs may be kept if different experiments have been re-run on the same experiment folder
│ ├── logs # log files for both inference and evaluation stages
│ │ ├── eval
│ │ └── infer
│ ├── predictions # Prediction results for each task
│ ├── results # Evaluation results for each task
│ └── summary # Summarized evaluation results for a single experiment
├── ...
```
The summarization process can be further customized in the configuration to output the averaged score of some benchmarks (MMLU, C-Eval, etc.).
More information about obtaining evaluation results can be found in [Results Summary](../user_guides/summarizer.md).
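As a rough illustration of the directory layout above, the sketch below picks the most recent experiment folder by its timestamp name. The helper name `latest_experiment_dir` and the `outputs/default` root are assumptions taken from the tree shown earlier; this is not an OpenCompass API.

```python
import re
from pathlib import Path
from typing import Optional

# Experiment folders are named like 20230220_183030 (YYYYMMDD_HHMMSS).
TIMESTAMP_RE = re.compile(r"^\d{8}_\d{6}$")


def latest_experiment_dir(root: str) -> Optional[Path]:
    """Return the newest timestamped sub-folder of `root`, or None if there is none."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    candidates = [
        p for p in root_path.iterdir()
        if p.is_dir() and TIMESTAMP_RE.match(p.name)
    ]
    # Fixed-width timestamp names sort chronologically as plain strings.
    return max(candidates, key=lambda p: p.name, default=None)


if __name__ == "__main__":
    latest = latest_experiment_dir("outputs/default")
    if latest is None:
        print("no experiment folders found")
    else:
        print("latest run:", latest)
        print("summary folder:", latest / "summary")
```

Sorting by folder name rather than by modification time keeps the result stable even if an older experiment folder is re-run or copied.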
## Additional Tutorials
To learn more about using OpenCompass, explore the following tutorials:
- [Prepare Datasets](../user_guides/datasets.md)
- [Prepare Models](../user_guides/models.md)
- [Task Execution and Monitoring](../user_guides/experimentation.md)
- [Understand Prompts](../prompt/overview.md)
- [Results Summary](../user_guides/summarizer.md)
- [Learn about Config](../user_guides/config.md)
================================================
FILE: docs/en/index.rst
================================================
Welcome to OpenCompass' documentation!
==========================================
Getting started with OpenCompass
-------------------------------
To help you quickly become familiar with OpenCompass, we recommend walking through the following documents in order:
- First read the GetStarted_ section to set up the environment and run a mini experiment.
- Then learn its basic usage through the UserGuides_.
- If you want to tune the prompts, refer to the Prompt_.
- If you want to customize some modules, like adding a new dataset or model, we have provided the AdvancedGuides_.
- There are more handy tools, such as prompt viewer and lark bot reporter, all presented in Tools_.
We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
.. _GetStarted:
.. toctree::
:maxdepth: 1
:caption: Get Started
get_started/installation.md
get_started/quick_start.md
get_started/faq.md
.. _UserGuides:
.. toctree::
:maxdepth: 1
:caption: User Guides
user_guides/framework_overview.md
user_guides/config.md
user_guides/datasets.md
user_guides/models.md
user_guides/evaluation.md
user_guides/experimentation.md
user_guides/metrics.md
user_guides/deepseek_r1.md
user_guides/interns1.md
.. _Prompt:
.. toctree::
:maxdepth: 1
:caption: Prompt
prompt/overview.md
prompt/prompt_template.md
prompt/meta_template.md
prompt/chain_of_thought.md
.. _AdvancedGuides:
.. toctree::
:maxdepth: 1
:caption: Advanced Guides
advanced_guides/new_dataset.md
advanced_guides/custom_dataset.md
advanced_guides/new_model.md
advanced_guides/evaluation_lmdeploy.md
advanced_guides/accelerator_intro.md
advanced_guides/math_verify.md
advanced_guides/llm_judge.md
advanced_guides/code_eval.md
advanced_guides/code_eval_service.md
advanced_guides/subjective_evaluation.md
advanced_guides/persistence.md
.. _Tools:
.. toctree::
:maxdepth: 1
:caption: Tools
tools.md
.. _Dataset List:
.. toctree::
:maxdepth: 1
:caption: Dataset List
dataset_statistics.md
.. _Notes:
.. toctree::
:maxdepth: 1
:caption: Notes
notes/contribution_guide.md
notes/academic.md
Indexes & Tables
==================
* :ref:`genindex`
* :ref:`search`
================================================
FILE: docs/en/notes/academic.md
================================================
# Guide to Reproducing CompassAcademic Leaderboard Results
To provide users with a quick and intuitive overview of the performance of mainstream open-source and commercial models on widely-used datasets, we maintain the [CompassAcademic Leaderboard](https://rank.opencompass.org.cn/leaderboard-llm-academic/?m=REALTIME) for LLMs on our official website, updating it typically every two weeks.
Given the continuous iteration of models and datasets, along with ongoing upgrades to OpenCompass, the configuration settings for the CompassAcademic leaderboard may evolve. Specifically, we adhere to the following update principles:
- Newly released models are promptly included, while models published six months to one year (or more) ago are removed from the leaderboard.
- New datasets are incorporated, while datasets nearing performance saturation are phased out.
- Existing evaluation results on the leaderboard are updated in sync with changes to the evaluation configuration.
To support rapid reproducibility, OpenCompass provides the real-time configuration files used in the academic leaderboard.
## CompassAcademic Leaderboard Reproduction
[eval_academic_leaderboard_REALTIME.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_academic_leaderboard_REALTIME.py) contains the configuration currently used for academic ranking evaluation. You can replicate the evaluation by following the steps below.
### 1: Model Configs
Firstly, modify the Model List code block in [eval_academic_leaderboard_REALTIME.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_academic_leaderboard_REALTIME.py) to include the model you wish to evaluate.
```python
# Models (add your models here)
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model
```
The original example calls an lmdeploy-based model configuration in OpenCompass.
You can also build your new model configuration based on [this document](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/models.html).
An example of a configuration that calls the deployed service of Qwen3-235B-A22B based on OpenAISDK is as follows:
```python
from opencompass.models import OpenAISDK
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
qwen3_235b_a22b_model = dict(
abbr="qwen_3_235b_a22b_thinking", # Used to identify the model configuration
key="YOUR_SERVE_API_KEY",
openai_api_base="YOUR_SERVE_API_URL",
type=OpenAISDK, # The model configuration types, commonly used such as OpenAISDK, TurboMindModelwithChatTemplate, HuggingFacewithChatTemplate
path="Qwen/Qwen3-235B-A22B",
temperature=0.6,
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
),
query_per_second=1,
max_out_len=32000,
max_seq_len=32768,
batch_size=8,
retry=10,
extra_body={
'chat_template_kwargs': {'enable_thinking': True},
}, # Additional model options, such as the Qwen3 series switch that controls whether the model thinks
pred_postprocessor=dict(type=extract_non_reasoning_content), # adding this pred_postprocessor can extract the non-reasoning content from models that output with a think tag
)
models = [
qwen3_235b_a22b_model,
]
```
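To make the role of such a post-processor more concrete, here is a minimal sketch of stripping reasoning content from a raw prediction. It is a hypothetical stand-in, not the implementation of `extract_non_reasoning_content`; the function name `strip_reasoning` and the assumption that reasoning is wrapped in `<think>...</think>` tags are illustrative only.

```python
import re

# Assumption: the model wraps its reasoning in <think>...</think> tags.
THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Drop <think>...</think> blocks and return the remaining answer text."""
    without_blocks = THINK_BLOCK_RE.sub("", text)
    # If generation was cut off mid-thought there may be an unclosed <think>;
    # in that case nothing after the tag is a final answer.
    without_blocks = without_blocks.split("<think>", 1)[0]
    return without_blocks.strip()


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
    print(strip_reasoning(raw))
```

Scoring only the text that remains after this step keeps intermediate reasoning from being mistaken for the final answer.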
Here are the commonly used parameters for reference.
- `max_seq_len` = 65536 or 32768
- `max_out_len` = 64000 or 32000
- `temperature` = 0.6
- `top_p` = 0.95
### 2: Verifier Configs
Complete your verifier model information in `judge_cfg`.
For detailed information about LLM verifiers, please refer to [this document](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/llm_judge.html).
At present, CompassAcademic uses [CompassVerifier-32B](https://huggingface.co/opencompass/CompassVerifier-32B). Here is a config example using OpenAISDK:
```python
from opencompass.models import OpenAISDK

judge_cfg = dict(
abbr='CompassVerifier',
type=OpenAISDK,
path='opencompass/CompassVerifier-32B',
key='YOUR_API_KEY',
openai_api_base='YOUR_API_BASE',
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]),
query_per_second=1,
batch_size=8,
temperature=0.001,
max_out_len=8192,
max_seq_len=32768,
mode='mid',
)
```
### 3: Execute evaluation
After completing the configuration file above, run the following command to start the evaluation:
```bash
opencompass examples/eval_academic_leaderboard_REALTIME.py
```
For more detailed CLI parameters, please refer to [this document](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/experimentation.html).
================================================
FILE: docs/en/notes/contribution_guide.md
================================================
# Contributing to OpenCompass
- [Contributing to OpenCompass](#contributing-to-opencompass)
- [What is PR](#what-is-pr)
- [Basic Workflow](#basic-workflow)
- [Procedures in detail](#procedures-in-detail)
- [1. Get the most recent codebase](#1-get-the-most-recent-codebase)
- [2. Checkout a new branch from `main` branch](#2-checkout-a-new-branch-from-main-branch)
- [3. Commit your changes](#3-commit-your-changes)
- [4. Push your changes to the forked repository and create a PR](#4-push-your-changes-to-the-forked-repository-and-create-a-pr)
- [5. Discuss and review your code](#5-discuss-and-review-your-code)
- [6. Merge your branch to `main` branch and delete the branch](#6--merge-your-branch-to-main-branch-and-delete-the-branch)
- [Code style](#code-style)
- [Python](#python)
- [About Contributing Test Datasets](#about-contributing-test-datasets)
Thanks for your interest in contributing to OpenCompass! All kinds of contributions are welcome, including but not limited to the following.
- Fix typos or bugs
- Add documentation or translate the documentation into other languages
- Add new features and components
## What is PR
`PR` is the abbreviation of `Pull Request`. Here's the definition of `PR` in the [official document](https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) of GitHub.
```
Pull requests let you tell others about changes you have pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.
```
## Basic Workflow
1. Get the most recent codebase
2. Checkout a new branch from `main` branch.
3. Commit your changes ([Don't forget to use pre-commit hooks!](#3-commit-your-changes))
4. Push your changes and create a PR
5. Discuss and review your code
6. Merge your branch to `main` branch
## Procedures in detail
### 1. Get the most recent codebase
- When you work on your first PR
Fork the OpenCompass repository: click the **fork** button at the top right corner of the GitHub page

Clone the forked repository to your local machine
```bash
git clone git@github.com:XXX/opencompass.git
```
Add the source repository as the `upstream` remote
```bash
git remote add upstream git@github.com:InternLM/opencompass.git
```
- After your first PR
Check out the `main` branch of your local repository and pull the latest changes from the source repository.
```bash
git checkout main
git pull upstream main
```
### 2. Checkout a new branch from `main` branch
```bash
git checkout main -b branchname
```
### 3. Commit your changes
- If you are a first-time contributor, please install and initialize pre-commit hooks from the repository root directory first.
```bash
pip install -U pre-commit
pre-commit install
```
- Commit your changes as usual. Pre-commit hooks will be triggered to stylize your code before each commit.
```bash
# coding
git add [files]
git commit -m 'messages'
```
```{note}
Sometimes your code may be changed by pre-commit hooks. In this case, please remember to re-stage the modified files and commit again.
```
### 4. Push your changes to the forked repository and create a PR
- Push the branch to your forked remote repository
```bash
git push origin branchname
```
- Create a PR

- Revise the PR message template to describe your motivation and the modifications made in this PR. You can also link the related issue to the PR manually in the PR message (for more information, check out the [official guidance](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue)).
- You can also ask a specific person to review the changes you've proposed.
### 5. Discuss and review your code
- Modify your code according to reviewers' suggestions and then push your changes.
### 6. Merge your branch to `main` branch and delete the branch
- After the PR is merged by the maintainer, you can delete the branch you created in your forked repository.
```bash
git branch -d branchname # delete local branch
git push origin --delete branchname # delete remote branch
```
## Code style
### Python
We adopt [PEP8](https://www.python.org/dev/peps/pep-0008/) as the preferred code style.
We use the following tools for linting and formatting:
- [flake8](https://github.com/PyCQA/flake8): A wrapper around some linter tools.
- [isort](https://github.com/timothycrosley/isort): A Python utility to sort imports.
- [yapf](https://github.com/google/yapf): A formatter for Python files.
- [codespell](https://github.com/codespell-project/codespell): A Python utility to fix common misspellings in text files.
- [mdformat](https://github.com/executablebooks/mdformat): Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
- [docformatter](https://github.com/myint/docformatter): A formatter to format docstring.
Style configurations of yapf and isort can be found in [setup.cfg](https://github.com/open-mmlab/OpenCompass/blob/main/setup.cfg).
## About Contributing Test Datasets
- Submitting Test Datasets
- Please implement logic for automatic dataset downloading in the code; or provide a method for obtaining the dataset in the PR. The OpenCompass maintainers will follow up accordingly. If the dataset is not yet public, please indicate so.
- Submitting Data Configuration Files
- Provide a README in the same directory as the data configuration. The README should include, but is not limited to:
- A brief description of the dataset
- The official link to the dataset
- Some test examples from the dataset
- Evaluation results of the dataset on relevant models
- Citation of the dataset
- (Optional) Summarizer of the dataset
- (Optional) If the testing process cannot be achieved simply by concatenating the dataset and model configuration files, a configuration file for conducting the test is also required.
- (Optional) If necessary, please add a description of the dataset in the relevant documentation sections. This helps users understand the testing scheme. You can refer to the following types of documents in OpenCompass:
- [Circular Evaluation](../advanced_guides/circular_eval.md)
- [Code Evaluation](../advanced_guides/code_eval.md)
- [Contamination Assessment](../advanced_guides/contamination_eval.md)
================================================
FILE: docs/en/notes/news.md
================================================
# News
- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
- **\[2024.04.30\]** We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on [external corpora](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Check out the [llm-compression](configs/eval_llm_compression.py) evaluation config now! 🔥🔥🔥
- **\[2024.04.29\]** We report the performance of several well-known LLMs on common benchmarks. See the [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥
- **\[2024.04.26\]** We deprecated the multi-modality evaluation function in OpenCompass; the related implementation has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). Welcome to use it! 🔥🔥🔥
- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py). Welcome to try it! 🔥🔥🔥
- **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py). Welcome to try them! 🔥🔥🔥
- **\[2024.02.29\]** We supported MT-Bench, AlpacaEval and AlignBench. More information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html).
- **\[2024.01.30\]** We released OpenCompass 2.0. Click [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information!
- **\[2024.01.17\]** We supported the evaluation of [InternLM2](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_keyset.py) and [InternLM2-Chat](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_chat_keyset.py), InternLM2 showed extremely strong performance in these tests, welcome to try!
- **\[2024.01.17\]** We supported the needle in a haystack test with multiple needles, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html#id8).
- **\[2023.12.28\]** We have enabled seamless evaluation of all models developed using [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory), a powerful toolkit for comprehensive LLM development.
- **\[2023.12.22\]** We have released [T-Eval](https://github.com/open-compass/T-Eval), a step-by-step evaluation benchmark to gauge your LLMs on tool utilization. Welcome to our [Leaderboard](https://open-compass.github.io/T-Eval/leaderboard.html) for more details!
- **\[2023.12.10\]** We have released [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), a toolkit for evaluating vision-language models (VLMs), which currently supports 20+ VLMs and 7 multi-modal benchmarks (including the MMBench series).
- **\[2023.12.10\]** We have supported Mistral AI's MoE LLM: **Mixtral-8x7B-32K**. Welcome to [MixtralKit](https://github.com/open-compass/MixtralKit) for more details about inference and evaluation.
- **\[2023.11.22\]** We have supported many API-based models, including **Baidu, ByteDance, Huawei, 360**. See the [Models](https://opencompass.readthedocs.io/en/latest/user_guides/models.html) section for more details.
- **\[2023.11.20\]** Thanks to [helloyongyang](https://github.com/helloyongyang) for supporting evaluation with [LightLLM](https://github.com/ModelTC/lightllm) as the backend. See [Evaluation With LightLLM](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lightllm.html) for more details.
- **\[2023.11.13\]** We are delighted to announce the release of OpenCompass v0.1.8. This version enables local loading of evaluation benchmarks, thereby eliminating the need for an internet connection. Please note that with this update, **you must re-download all evaluation datasets** to ensure accurate and up-to-date results.
- **\[2023.11.06\]** We have supported several API-based models, including **ChatGLM Pro@Zhipu, ABAB-Chat@MiniMax and Xunfei**. See the [Models](https://opencompass.readthedocs.io/en/latest/user_guides/models.html) section for more details.
- **\[2023.10.24\]** We release a new benchmark for evaluating LLMs’ capabilities of having multi-turn dialogues. Welcome to [BotChat](https://github.com/open-compass/BotChat) for more details.
- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM), welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md).
- **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.06\]** The [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass.
- **\[2023.08.25\]** The [**TigerBot**](https://github.com/TigerResearch/TigerBot) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.08.21\]** [**Lagent**](https://github.com/InternLM/lagent) has been released, which is a lightweight framework for building LLM-based agents. We are working with Lagent team to support the evaluation of general tool-use capability, stay tuned!
- **\[2023.08.18\]** We have supported evaluation for **multi-modality learning**, including **MMBench, SEED-Bench, COCO-Caption, Flickr-30K, OCR-VQA, ScienceQA** and so on. The leaderboard is on its way. Feel free to try multi-modality evaluation with OpenCompass!
- **\[2023.08.18\]** [Dataset card](https://opencompass.org.cn/dataset-detail/MMLU) is now online. New evaluation benchmarks are welcome to join OpenCompass!
- **\[2023.08.11\]** [Model comparison](https://opencompass.org.cn/model-compare/GPT-4,ChatGPT,LLaMA-2-70B,LLaMA-65B) is now online. We hope this feature offers deeper insights!
- **\[2023.08.11\]** We have supported [LEval](https://github.com/OpenLMLab/LEval).
- **\[2023.08.10\]** OpenCompass is compatible with [LMDeploy](https://github.com/InternLM/lmdeploy). Now you can follow this [instruction](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lmdeploy.html#) to evaluate the accelerated models provided by **Turbomind**.
- **\[2023.08.10\]** We have supported [Qwen-7B](https://github.com/QwenLM/Qwen-7B) and [XVERSE-13B](https://github.com/xverse-ai/XVERSE-13B) ! Go to our [leaderboard](https://opencompass.org.cn/leaderboard-llm) for more results! More models are welcome to join OpenCompass.
- **\[2023.08.09\]** Several new datasets (**CMMLU, TydiQA, SQuAD2.0, DROP**) are updated on our [leaderboard](https://opencompass.org.cn/leaderboard-llm)! More datasets are welcome to join OpenCompass.
- **\[2023.08.07\]** We have added a [script](tools/eval_mmbench.py) for users to evaluate the inference results of [MMBench](https://opencompass.org.cn/MMBench)-dev.
- **\[2023.08.05\]** We have supported [GPT-4](https://openai.com/gpt-4)! Go to our [leaderboard](https://opencompass.org.cn/leaderboard-llm) for more results! More models are welcome to join OpenCompass.
- **\[2023.07.27\]** We have supported [CMMLU](https://github.com/haonan-li/CMMLU)! More datasets are welcome to join OpenCompass.
================================================
FILE: docs/en/prompt/chain_of_thought.md
================================================
# Chain of Thought
## Background
In reasoning tasks, the Chain of Thought (CoT) method is an effective way to help LLMs deal with complex questions, for example math problems and relational inference. OpenCompass supports multiple types of CoT methods.

## 1. Zero Shot CoT
You can change the `PromptTemplate` of the dataset config by simply adding *Let's think step by step* to create a Zero-Shot CoT prompt for your evaluation:
```python
qa_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="Answer the question:\nQ: {question}?\nLet's think step by step:\n"
),
retriever=dict(type=ZeroRetriever)
)
```
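To see what the model ends up receiving, the sketch below fills the same template string with a made-up question. It uses plain `str.format` purely for illustration; the actual substitution performed by `PromptTemplate` may differ, and the sample row is hypothetical.

```python
# Same template string as in the config above.
template = "Answer the question:\nQ: {question}?\nLet's think step by step:\n"

# Hypothetical dataset row; the key matches the {question} placeholder.
# The question mark is omitted because the template appends one.
row = {"question": "If a train travels 60 km in 1.5 hours, what is its average speed"}

prompt = template.format(**row)
print(prompt)
```

Because the prompt ends with the trigger phrase, the model's continuation starts directly with its reasoning steps.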
## 2. Few Shot CoT
Few-shot CoT can make it easier for LLMs to follow your instructions and produce better answers. For few-shot CoT, add your CoT template to `PromptTemplate` as in the following config to create a one-shot prompt:
```python
qa_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=
'''Question: Mark's basketball team scores 25 2 pointers, 8 3 pointers and 10 free throws. Their opponents score double the 2 pointers but half the 3 pointers and free throws. What's the total number of points scored by both teams added together?
Let's think step by step
Answer:
Mark's team scores 25 2 pointers, meaning they scored 25*2= 50 points in 2 pointers.
His team also scores 8 3 pointers, meaning they scored 8*3= 24 points in 3 pointers
They scored 10 free throws, and free throws count as one point so they scored 10*1=10 points in free throws.
All together his team scored 50+24+10= 84 points
Mark's opponents scored double his team's number of 2 pointers, meaning they scored 50*2=100 points in 2 pointers.
His opponents scored half his team's number of 3 pointers, meaning they scored 24/2= 12 points in 3 pointers.
They also scored half Mark's team's points in free throws, meaning they scored 10/2=5 points in free throws.
All together Mark's opponents scored 100+12+5=117 points
The total score for the game is both team's scores added together, so it is 84+117=201 points
The answer is 201
Question: {question}\nLet's think step by step:\n{answer}
'''),
retriever=dict(type=ZeroRetriever)
)
```
## 3. Self-Consistency
Self-Consistency (SC), proposed in [this paper](https://arxiv.org/abs/2203.11171), samples multiple reasoning paths for a question and takes a majority vote over the generated answers. The method achieves high accuracy on reasoning tasks, but inference may consume more time and resources because several paths must be sampled for each question. In OpenCompass, you can implement SC by replacing `GenInferencer` with `SCInferencer` in the dataset configuration and setting the corresponding parameters, for example:
```python
# This SC gsm8k config can be found at: opencompass.configs.datasets.gsm8k.gsm8k_gen_a3e34a.py
gsm8k_infer_cfg = dict(
inferencer=dict(
type=SCInferencer, # Replace GenInferencer with SCInferencer.
generation_kwargs=dict(do_sample=True, temperature=0.7, top_k=40), # Sampling parameters so that the model generates varied outputs; currently only works for models loaded from HuggingFace.
infer_type='SC',
sc_size = SAMPLE_SIZE
)
)
gsm8k_eval_cfg = dict(sc_size=SAMPLE_SIZE)
```
```{note}
By default, OpenCompass picks the next token with argmax. Therefore, if the sampling parameters are not specified, the model's inference results will be identical each time, and multiple rounds of evaluation will be ineffective.
```
Here `SAMPLE_SIZE` is the number of reasoning paths in Self-Consistency; a higher value usually yields better performance. The following figure from the original SC paper demonstrates the relation between reasoning paths and performance in several reasoning tasks:

From the figure, it can be seen that in different reasoning tasks, performance tends to improve as the number of reasoning paths increases. However, for some tasks, increasing the number of reasoning paths may reach a limit, and further increasing the number of paths may not bring significant performance improvement. Therefore, it is necessary to conduct experiments and adjustments on specific tasks to find the optimal number of reasoning paths that best suit the task.
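The voting step at the heart of Self-Consistency can be sketched in a few lines. This is a minimal illustration, not the `SCInferencer` implementation; the function names and the canned answers standing in for a sampled model are assumptions.

```python
from collections import Counter
from typing import Callable, List


def self_consistency_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # Counter.most_common orders equal counts by first insertion (Python 3.7+).
    return counts.most_common(1)[0][0]


def run_self_consistency(sample_fn: Callable[[str], str], question: str, sc_size: int) -> str:
    """Sample `sc_size` reasoning paths with `sample_fn` and vote on the final answers."""
    answers = [sample_fn(question) for _ in range(sc_size)]
    return self_consistency_vote(answers)


if __name__ == "__main__":
    # Stand-in for a sampled model: returns canned final answers one by one.
    canned = iter(["18", "17", "18", "18", "20"])
    winner = run_self_consistency(lambda q: next(canned), "toy question", sc_size=5)
    print(winner)
```

With greedy decoding every sampled path would be identical and the vote would be trivial, which is why the sampling parameters matter.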
## 4. Tree-of-Thoughts
In contrast to the conventional CoT approach that considers only a single reasoning path, Tree-of-Thoughts (ToT) allows the language model to explore multiple diverse reasoning paths simultaneously. The model evaluates the reasoning process through self-assessment and makes global choices by conducting lookahead or backtracking when necessary. Specifically, this process is divided into the following four stages:
**1. Thought Decomposition**
Based on the nature of the problem, break down the problem into multiple intermediate steps. Each step can be a phrase, equation, or writing plan, depending on the nature of the problem.
**2. Thought Generation**
Assuming that solving the problem requires k steps, there are two methods to generate reasoning content:
- Independent sampling: For each state, the model independently extracts k reasoning contents from the CoT prompts, without relying on other reasoning contents.
- Sequential generation: Sequentially use "prompts" to guide the generation of reasoning content, where each reasoning content may depend on the previous one.
**3. Heuristic Evaluation**
Use heuristic methods to evaluate the contribution of each generated reasoning content to problem-solving. This self-evaluation is based on the model's self-feedback and involves designing prompts to have the model score multiple generated results.
**4. Search Algorithm Selection**
Based on the methods of generating and evaluating reasoning content, select an appropriate search algorithm. For example, you can use breadth-first search (BFS) or depth-first search (DFS) algorithms to systematically explore the thought tree, conducting lookahead and backtracking.
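The four stages can be sketched as a small breadth-first search. This is a toy illustration under stated assumptions, not the `ToTInferencer` implementation: the `propose` and `value` functions are plain Python stand-ins for model calls, and the task (pick three numbers that sum to a target) is hypothetical.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # a partial solution: the numbers chosen so far

# Toy task (hypothetical): choose 3 numbers from CHOICES that sum to TARGET.
CHOICES = (1, 3, 5, 7)
TARGET = 15


def propose(state: State) -> List[State]:
    """Thought generation: extend the state by one candidate step."""
    return [state + (c,) for c in CHOICES]


def value(state: State) -> float:
    """Heuristic evaluation: higher is better, penalising distance from TARGET."""
    return -abs(TARGET - sum(state))


def tot_bfs(
    initial: State,
    propose_fn: Callable[[State], List[State]],
    value_fn: Callable[[State], float],
    n_steps: int,
    n_select: int,
) -> State:
    """Breadth-first ToT: expand every kept state, score children, keep the best."""
    frontier = [initial]
    for _ in range(n_steps):
        children = [child for state in frontier for child in propose_fn(state)]
        if not children:
            break
        # Greedy selection: keep the n_select highest-valued states.
        children.sort(key=value_fn, reverse=True)
        frontier = children[:n_select]
    return max(frontier, key=value_fn)


if __name__ == "__main__":
    best = tot_bfs((), propose, value, n_steps=3, n_select=2)
    print(best, "sum =", sum(best))
```

Keeping more than one state per step (`n_select`) is what separates this from a single greedy chain: a branch that looks slightly worse early on can still be expanded later.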
In OpenCompass, ToT parameters need to be set according to the requirements. Below is an example configuration for the 24-Point game from the [official paper](https://arxiv.org/pdf/2305.10601.pdf). Currently, ToT inference is supported only with Huggingface models:
```python
# This ToT Game24 config can be found at: opencompass/configs/datasets/game24/game24_gen_8dfde3.py.
from opencompass.datasets import (Game24Dataset, game24_postprocess,
Game24Evaluator, Game24PromptWrapper)
generation_kwargs = dict(temperature=0.7)
game24_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template='{input}'), # Directly pass the input content, as the Prompt needs to be specified in steps
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=ToTInferencer, # Replace GenInferencer with ToTInferencer
generation_kwargs=generation_kwargs,
method_generate='propose', # Method for generating reasoning content, can be independent sampling (sample) or sequential generation (propose)
method_evaluate='value', # Method for evaluating reasoning content, can be voting (vote) or scoring (value)
method_select='greedy', # Method for selecting reasoning content, can be greedy (greedy) or random (sample)
n_evaluate_sample=3,
n_select_sample=5,
task_wrapper=dict(type=Game24PromptWrapper) # This Wrapper class includes the prompts for each step and methods for generating and evaluating reasoning content, needs customization according to the task
))
```
If you want to use the ToT method on a custom dataset, you'll need to make additional configurations in the `opencompass.datasets.YourDataConfig.py` file to set up the `YourDataPromptWrapper` class. This is required for handling the thought generation and heuristic evaluation step within the ToT framework. For reasoning tasks similar to the game 24-Point, you can refer to the implementation in `opencompass/datasets/game24.py` for guidance.
================================================
FILE: docs/en/prompt/meta_template.md
================================================
# Meta Template
## Background
In the Supervised Fine-Tuning (SFT) process of Large Language Models (LLMs), we often inject some predefined strings into the conversation according to actual requirements, in order to prompt the model to output content according to certain guidelines. For example, in some `chat` model fine-tuning, we may add system-level instructions at the beginning of each dialogue, and establish a format to represent the conversation between the user and the model. In a conversation, the model may expect the text format to be as follows:
```bash
Meta instruction: You are now a helpful and harmless AI assistant.
HUMAN: Hi!\n
Bot: Hello! How may I assist you?\n
```
During evaluation, we also need to enter questions according to the agreed format for the model to perform its best.
In addition, similar situations exist in API models. General API dialogue models allow users to pass in historical dialogues when calling, and some models also allow the input of SYSTEM level instructions. To better evaluate the ability of API models, we hope to make the data as close as possible to the multi-round dialogue template of the API model itself during the evaluation, rather than stuffing all the content into an instruction.
Therefore, we need to specify different parsing templates for different models. In OpenCompass, we call this set of parsing templates **Meta Template**. Meta Template is tied to the model's configuration and is combined with the dialogue template of the dataset during runtime to ultimately generate the most suitable prompt for the current model.
```python
# When specifying, just pass the meta_template field into the model
models = [
    dict(
        type='AnyModel',
        meta_template = ...,  # meta template
    )
]
```
Next, we will introduce how to configure Meta Template on two types of models.
You are recommended to read [here](./prompt_template.md#dialogue-prompt) for the basic syntax of the dialogue template before reading this chapter.
```{note}
In some cases (such as testing a base model), we don't need to inject any instructions into the normal dialogue, so the meta template can be left empty. The prompt received by the model is then defined only by the dataset configuration and is a regular string. If the dataset configuration uses a dialogue template, utterances from different roles will be concatenated with `\n`.
```
## Application on Language Models
The following figure shows several situations where the data is built into a prompt through the prompt template and meta template from the dataset in the case of 2-shot learning. Readers can use this figure as a reference to help understand the following sections.

We will explain how to define the meta template with several examples.
Suppose that according to the dialogue template of the dataset, the following dialogue was produced:
```python
PromptList([
dict(role='HUMAN', prompt='1+1=?'),
dict(role='BOT', prompt='2'),
dict(role='HUMAN', prompt='2+2=?'),
dict(role='BOT', prompt='4'),
])
```
We want to pass this dialogue to a model that has already gone through SFT. By the model's convention, each role's utterance begins with a role prefix ending in `:` and ends with a special token followed by `\n`. Here is the complete string the model expects to receive:
```Plain
: 1+1=?
: 2
: 2+2=?
: 4
```
In the meta template, we only need to abstract the format of each round of dialogue into the following configuration:
```python
# model meta template
meta_template = dict(
round=[
dict(role='HUMAN', begin=': ', end='\n'),
dict(role='BOT', begin=': ', end='\n'),
],
)
```
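To make the mechanics concrete, here is a minimal sketch of how a `round` definition could be applied to the dialogue above. It is not OpenCompass's actual implementation: the `render_rounds` helper and the `HUMAN: ` / `BOT: ` prefixes are hypothetical and chosen only for illustration.

```python
def render_rounds(prompt_list, meta_template):
    """Wrap each utterance with the begin/end strings of its role."""
    # Index the per-role formatting rules by role name.
    role_cfg = {item['role']: item for item in meta_template['round']}
    pieces = []
    for turn in prompt_list:
        cfg = role_cfg[turn['role']]
        pieces.append(cfg.get('begin', '') + turn['prompt'] + cfg.get('end', ''))
    return ''.join(pieces)


dialogue = [
    dict(role='HUMAN', prompt='1+1=?'),
    dict(role='BOT', prompt='2'),
    dict(role='HUMAN', prompt='2+2=?'),
    dict(role='BOT', prompt='4'),
]
# Hypothetical prefixes; a real model defines its own markers.
meta_template = dict(
    round=[
        dict(role='HUMAN', begin='HUMAN: ', end='\n'),
        dict(role='BOT', begin='BOT: ', end='\n'),
    ],
)
rendered = render_rounds(dialogue, meta_template)
print(rendered)
```

The dataset side only supplies the `prompt` of each turn, while the model side decides how every turn is framed.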
______________________________________________________________________
Some datasets may introduce SYSTEM-level roles:
```python
PromptList([
dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following math questions'),
dict(role='HUMAN', prompt='1+1=?'),
dict(role='BOT', prompt='2'),
dict(role='HUMAN', prompt='2+2=?'),
dict(role='BOT', prompt='4'),
])
```
Assuming the model also accepts the SYSTEM role, and expects the input to be:
```
: Solve the following math questions\n
: 1+1=?\n
: 2\n
: 2+2=?\n
: 4\n
end of conversation
```
We can put the definition of the SYSTEM role into `reserved_roles`. Roles in `reserved_roles` will not appear in regular conversations, but they allow the dialogue template of the dataset configuration to call them in `begin` or `end`.
```python
# model meta template
meta_template = dict(
    round=[
        dict(role='HUMAN', begin=': ', end='\n'),
        dict(role='BOT', begin=': ', end='\n'),
    ],
    reserved_roles=[dict(role='SYSTEM', begin=': ', end='\n'),],
)
```
If the model does not accept the SYSTEM role, it is not necessary to configure this item, and it can still run normally. In this case, the string received by the model becomes:
```
: Solve the following math questions\n
: 1+1=?\n
: 2\n
: 2+2=?\n
: 4\n
end of conversation
```
This is because in the predefined datasets in OpenCompass, each `SYSTEM` speech has a `fallback_role='HUMAN'`, that is, if the `SYSTEM` role in the meta template does not exist, the speaker will be switched to the `HUMAN` role.
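The fallback rule can be sketched as a small lookup. This is an illustrative assumption about the behavior described above, not the library's code; `resolve_role` and the role prefixes are hypothetical names.

```python
def resolve_role(turn, meta_template):
    """Return the formatting rule for a turn, honouring fallback_role."""
    # Collect every role the meta template knows about.
    known = {
        item['role']: item
        for key in ('round', 'reserved_roles')
        for item in meta_template.get(key, [])
    }
    if turn['role'] in known:
        return known[turn['role']]
    fallback = turn.get('fallback_role')
    if fallback in known:
        return known[fallback]
    raise KeyError(f"No template for role {turn['role']!r}")


system_turn = dict(role='SYSTEM', fallback_role='HUMAN',
                   prompt='Solve the following math questions')
rounds = [
    dict(role='HUMAN', begin='HUMAN: ', end='\n'),
    dict(role='BOT', begin='BOT: ', end='\n'),
]
with_system = dict(
    round=rounds,
    reserved_roles=[dict(role='SYSTEM', begin='SYSTEM: ', end='\n')],
)
without_system = dict(round=rounds)

role_with = resolve_role(system_turn, with_system)['role']
role_without = resolve_role(system_turn, without_system)['role']
print(role_with, role_without)
```

When the meta template reserves `SYSTEM`, the instruction keeps its own framing; otherwise it is framed like a `HUMAN` utterance.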
______________________________________________________________________
Some models may need to consider embedding other strings at the beginning or end of the conversation, such as system instructions:
```
Meta instruction: You are now a helpful and harmless AI assistant.
: Solve the following math questions\n
: 1+1=?\n
: 2\n
: 2+2=?\n
: 4\n
end of conversation
```
In this case, we can set these strings through the `begin` and `end` parameters of the meta template.
```python
meta_template = dict(
    round=[
        dict(role='HUMAN', begin=': ', end='\n'),
        dict(role='BOT', begin=': ', end='\n'),
    ],
    reserved_roles=[dict(role='SYSTEM', begin=': ', end='\n'),],
    begin="Meta instruction: You are now a helpful and harmless AI assistant.",
    end="end of conversation",
)
```
______________________________________________________________________
In **generative** task evaluation, we do not feed the answer to the model directly. Instead, we truncate the prompt: the preceding text is retained and the position of the model's answer is left blank.
```
Meta instruction: You are now a helpful and harmless AI assistant.
: Solve the following math questions\n
: 1+1=?\n
: 2\n
: 2+2=?\n
:
```
We only need to set the `generate` field in BOT's configuration to True, and OpenCompass will automatically leave the last utterance of BOT blank:
```python
# model meta template
meta_template = dict(
    round=[
        dict(role='HUMAN', begin=': ', end='\n'),
        dict(role='BOT', begin=': ', end='\n', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', begin=': ', end='\n'),],
    begin="Meta instruction: You are now a helpful and harmless AI assistant.",
    end="end of conversation",
)
```
Note that `generate` only affects generative inference. When performing discriminative inference, the prompt received by the model is still complete.
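The truncation behavior can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual OpenCompass logic: `build_gen_prompt`, the role prefixes, and the shortened meta instruction are hypothetical.

```python
def build_gen_prompt(prompt_list, meta_template):
    """Render a dialogue and cut it off where the model should start writing."""
    roles = {
        item['role']: item
        for key in ('round', 'reserved_roles')
        for item in meta_template.get(key, [])
    }
    text = meta_template.get('begin', '')
    for i, turn in enumerate(prompt_list):
        cfg = roles[turn['role']]
        is_last = i == len(prompt_list) - 1
        if is_last and cfg.get('generate', False):
            # Stop right after this role's `begin`; the model fills in the rest,
            # so neither the answer nor the global `end` string is appended.
            return text + cfg.get('begin', '')
        text += cfg.get('begin', '') + turn['prompt'] + cfg.get('end', '')
    return text + meta_template.get('end', '')


meta_template = dict(
    begin='Meta instruction: Be helpful.\n',
    round=[
        dict(role='HUMAN', begin='HUMAN: ', end='\n'),
        dict(role='BOT', begin='BOT: ', end='\n', generate=True),
    ],
    end='end of conversation',
)
dialogue = [
    dict(role='HUMAN', prompt='1+1=?'),
    dict(role='BOT', prompt='2'),
    dict(role='HUMAN', prompt='2+2=?'),
    dict(role='BOT', prompt='4'),
]
gen_prompt = build_gen_prompt(dialogue, meta_template)
print(gen_prompt)
```

Only the final `BOT` turn is cut; earlier `BOT` turns are in-context examples and stay complete.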
### Full Definition
```python
models = [
    dict(meta_template = dict(
            begin="Meta instruction: You are now a helpful and harmless AI assistant.",
            round=[
                dict(role='HUMAN', begin='HUMAN: ', end='\n'),  # begin and end can be a list of strings or integers.
                dict(role='THOUGHTS', begin='THOUGHTS: ', end='\n', prompt='None'),  # Here we can set the default prompt, which may be overridden by the specific dataset
                dict(role='BOT', begin='BOT: ', generate=True, end='\n'),
            ],
            end="end of conversation",
            reserved_roles=[dict(role='SYSTEM', begin='SYSTEM: ', end='\n'),],
            eos_token_id=10000,
        ),
    )
]
```
The `meta_template` is a dictionary that can contain the following fields:
- `begin`, `end`: (str, optional) The beginning and ending of the prompt, typically some system-level instructions.
- `round`: (list) The template format of each round of dialogue. The content of the prompt for each round of dialogue is controlled by the dialogue template configured in the dataset.
- `reserved_roles`: (list, optional) Specify roles that do not appear in `round` but may be used in the dataset configuration, such as the `SYSTEM` role.
- `eos_token_id`: (int, optional) Specifies the ID of the model's EOS token. If not set, it defaults to the EOS token ID in the tokenizer. Its main role is to trim the model's output in generative tasks, so it should generally be set to the ID of the first token of the `end` string of the item with `generate=True`.
The `round` of the `meta_template` specifies the format of each role's speech in a round of dialogue. It accepts a list of dictionaries, each dictionary's keys are as follows:
- `role` (str): The name of the role participating in the dialogue. This string does not affect the actual prompt.
- `begin`, `end` (str): Specifies the fixed beginning or end when this role speaks.
- `prompt` (str): The role's prompt. It is allowed to leave it blank in the meta template, but in this case, it must be specified in the prompt of the dataset configuration.
- `generate` (bool): When specified as True, this role is the one the model plays. In generation tasks, the prompt received by the model will be cut off at the `begin` of this role, and the remaining content will be filled by the model.
## Application to API Models
The meta template of the API model is similar to the meta template of the general model, but the configuration is simpler. Users can, as per their requirements, directly use one of the two configurations below to evaluate the API model in a multi-turn dialogue manner:
```python
# If the API model does not support system instructions
meta_template=dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True)
    ],
)

# If the API model supports system instructions
meta_template=dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True)
    ],
    reserved_roles=[
        dict(role='SYSTEM', api_role='SYSTEM'),
    ],
)
```
### Principle
Even though different API models accept different data structures, there are commonalities overall. Interfaces that accept dialogue history generally allow users to pass in prompts from the following three roles:
- User
- Robot
- System (optional)
For this reason, OpenCompass presets three `api_role` values for API models: `HUMAN`, `BOT`, and `SYSTEM`. In addition to regular strings, API models also accept an intermediate dialogue format represented by `PromptList`. The API model repackages the dialogue into a multi-turn format and sends it to the backend. To activate this feature, users need to map each `role` in the dataset prompt template to the corresponding `api_role` in the meta template above. The following figure illustrates the relationship between the input accepted by the API model and the Prompt Template and Meta Template.
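As a rough sketch of this repackaging step, the snippet below maps dataset roles to `api_role` values and drops the turn the model is expected to generate. The `to_api_messages` helper and the `{'role': ..., 'content': ...}` message shape are assumptions for illustration; each real API wrapper converts the `api_role` values into whatever structure its backend requires.

```python
def to_api_messages(prompt_list, meta_template):
    """Convert a dialogue into a list of role-tagged messages."""
    roles = {
        item['role']: item
        for key in ('round', 'reserved_roles')
        for item in meta_template.get(key, [])
    }
    messages = []
    for i, turn in enumerate(prompt_list):
        cfg = roles.get(turn['role'])
        if cfg is None:
            # Role unknown to the meta template: use the dataset's fallback.
            cfg = roles[turn['fallback_role']]
        is_last = i == len(prompt_list) - 1
        if is_last and cfg.get('generate', False):
            break  # this turn is what the API model should produce
        messages.append({'role': cfg['api_role'], 'content': turn['prompt']})
    return messages


meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
dialogue = [
    dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following math questions'),
    dict(role='HUMAN', prompt='1+1=?'),
    dict(role='BOT', prompt=''),
]
messages = to_api_messages(dialogue, meta_template)
print(messages)
```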

## Debugging
If you need to debug the prompt, it is recommended to use the `tools/prompt_viewer.py` script to preview the actual prompt received by the model after preparing the configuration file. Read [here](../tools.md#prompt-viewer) for more.
================================================
FILE: docs/en/prompt/overview.md
================================================
# Prompt Overview
The prompt is the input to the large language model (LLM), used to guide the model to generate text or to compute perplexity (PPL). The choice of prompt can significantly affect the measured accuracy of the evaluated model. The process of converting a dataset into a series of prompts is defined by templates.
In OpenCompass, we split the template into two parts: the data-side template and the model-side template. When evaluating a model, the data will pass through both the data-side template and the model-side template, ultimately transforming into the input required by the model.
The data-side template is referred to as [prompt_template](./prompt_template.md), which represents the process of converting the fields in the dataset into prompts.
The model-side template is referred to as [meta_template](./meta_template.md), which represents how the model transforms these prompts into its expected input.
We also offer some prompting examples regarding [Chain of Thought](./chain_of_thought.md).
================================================
FILE: docs/en/prompt/prompt_template.md
================================================
# Prompt Template
## Background
In language model evaluation, we often construct prompts from the original dataset according to certain rules to enable the model to answer questions as required.
Typically, we place instructions at the beginning of the prompt, followed by several in-context examples, and finally, we include the question. For example:
```text
Solve the following questions.
1+1=?
2
3+9=?
12
5+6=?
```
Extensive experiments have shown that even with the same original test questions, different ways of constructing the prompt can affect the model's performance. Factors that may influence this include:
- The composition of the prompt itself, including instructions, in-context examples, and the format of the question.
- The selection of in-context examples, including the number and method of selection.
- The manner in which the prompt is used. Should the model complete the prompt based on the given context, or should it choose the best prompt among the candidate prompts?
OpenCompass defines the prompt construction strategy in the `infer_cfg` section of the dataset configuration. A typical `infer_cfg` is shown below:
```python
infer_cfg = dict(
    ice_template=dict(  # Template used to construct In Context Examples (ice).
        type=PromptTemplate,
        template='{question}\n{answer}'
    ),
    prompt_template=dict(  # Template used to construct the main prompt.
        type=PromptTemplate,
        template='Solve the following questions.\n</E>{question}\n{answer}',
        ice_token="</E>"
    ),
    retriever=dict(type=FixKRetriever, fix_id_list=[0, 1]),  # Definition of how to retrieve in-context examples.
    inferencer=dict(type=GenInferencer),  # Method used to generate predictions.
)
```
In this document, we will mainly introduce the definitions of `ice_template`, `prompt_template`, and `inferencer`. For information on the `retriever`, please refer to other documents.
Let's start by introducing the basic syntax of the prompt.
## String-Based Prompt
String-based prompt is a classic form of template. Consider the following template:
```python
prompt_template=dict(
type=PromptTemplate,
template="{anything}\nQuestion: {question}\nAnswer: {answer}"
)
```
At runtime, the fields within the `{}` will be replaced with corresponding fields from the data sample. If a field does not exist in the data sample, it will be kept as is in the output.
For example, let's consider a data example as follows:
```python
example = {
'question': '1+1=?',
'answer': '2', # Assume the answer is in the reader_cfg.output_column
'irrelevant_infos': 'blabla',
}
```
After filling in the template, the result will be:
```text
{anything}
Question: 1+1=?
Answer:
```
As you can see, the actual answer for the question, represented by the field `answer`, does not appear in the generated result. This is because OpenCompass will mask fields that are written in `reader_cfg.output_column` to prevent answer leakage. For detailed explanations on `reader_cfg`, please refer to the relevant documentation on dataset configuration.
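The substitution and masking rules can be illustrated with a short sketch. It mimics the behavior described above under simplifying assumptions and is not the `PromptTemplate` implementation; `fill_template` is a hypothetical helper.

```python
import re


def fill_template(template, sample, output_column):
    """Fill {field} placeholders, masking the output column."""

    def substitute(match):
        field = match.group(1)
        if field == output_column:
            return ''                 # mask the answer to prevent leakage
        if field in sample:
            return str(sample[field])
        return match.group(0)         # unknown fields are kept as is

    return re.sub(r'\{(\w+)\}', substitute, template)


example = {
    'question': '1+1=?',
    'answer': '2',
    'irrelevant_infos': 'blabla',
}
filled = fill_template(
    '{anything}\nQuestion: {question}\nAnswer: {answer}',
    example,
    output_column='answer',
)
print(filled)
```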
## Dialogue-Based Prompt
In practical testing, making models perform simple completions may not effectively test the performance of chat-based models. Therefore, we prefer prompts that take the form of dialogues. Additionally, different models have varying definitions of dialogue formats. Hence, we need prompts generated from the dataset to be more versatile, and the specific prompts required by each model can be generated during testing.
To achieve this, OpenCompass extends the string-based prompt into the dialogue-based prompt. A dialogue-based prompt is more flexible, as it can be combined with different [meta_templates](./meta_template.md) on the model side to generate prompts in various dialogue formats. It is applicable to both base and chat models, but its definition is relatively complex.
Now, let's assume we have a data sample as follows:
```python
example = {
    'question': '1+1=?',
    'answer': '2',  # Assume the answer is in the reader_cfg.output_column
    'irrelevant_infos': 'blabla',
}
```
Next, let's showcase a few examples:
`````{tabs}
````{tab} Single-round Dialogue
```python
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role="HUMAN", prompt="Question: {question}"),
dict(role="BOT", prompt="Answer: {answer}"),
]
)
)
```
The intermediate result obtained by OpenCompass after filling the data into the template is:
```python
PromptList([
dict(role='HUMAN', prompt='Question: 1+1=?'),
dict(role='BOT', prompt='Answer: '),
])
```
````
````{tab} Multi-round Dialogue
```python
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role="HUMAN", prompt="Question: 2+2=?"),
dict(role="BOT", prompt="Answer: 4"),
dict(role="HUMAN", prompt="Question: 3+3=?"),
dict(role="BOT", prompt="Answer: 6"),
dict(role="HUMAN", prompt="Question: {question}"),
dict(role="BOT", prompt="Answer: {answer}"),
]
)
)
```
The intermediate result obtained by OpenCompass after filling the data into the template is:
```python
PromptList([
dict(role='HUMAN', prompt='Question: 2+2=?'),
dict(role='BOT', prompt='Answer: 4'),
dict(role='HUMAN', prompt='Question: 3+3=?'),
dict(role='BOT', prompt='Answer: 6'),
dict(role='HUMAN', prompt='Question: 1+1=?'),
dict(role='BOT', prompt='Answer: '),
])
```
````
````{tab} Dialogue with sys instruction
```python
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
],
round=[
dict(role="HUMAN", prompt="Question: {question}"),
dict(role="BOT", prompt="Answer: {answer}"),
]
)
)
```
The intermediate result obtained by OpenCompass after filling the data into the template is:
```python
PromptList([
dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
dict(role='HUMAN', prompt='Question: 1+1=?'),
dict(role='BOT', prompt='Answer: '),
])
```
During the processing of a specific meta template, if the definition includes the SYSTEM role, the template designated for the SYSTEM role will be used for processing. On the other hand, if the SYSTEM role is not defined, the template assigned to the fallback_role role will be utilized, which, in this example, corresponds to the HUMAN role.
````
`````
In dialogue-based templates, prompts are organized in the form of conversations between different roles (`role`). In the current predefined dataset configuration of OpenCompass, some commonly used roles in a prompt include:
- `HUMAN`: Represents a human, usually the one asking questions.
- `BOT`: Represents the language model, usually the one providing answers.
- `SYSTEM`: Represents the system, typically used at the beginning of prompts to give instructions.
Furthermore, unlike string-based templates, the prompts generated by dialogue-based templates are transformed into an intermediate structure called PromptList. This structure will be further combined with the model-side [meta_templates](./meta_template.md) to assemble the final prompt. If no meta template is specified, the prompts in the PromptList will be directly concatenated into a single string.
```{note}
The content within the PromptList in the example above is not the final input to the model and depends on the processing of the meta template. One potential source of misunderstanding is that in generative evaluations, the prompt of the last `BOT` role, `Answer: `, **will not** be inputted to the model. This is because API models generally cannot customize the initial part of model-generated responses. Therefore, this setting ensures consistency in the evaluation behavior between language models and API models. For more information, please refer to the documentation on [meta template](./meta_template.md).
```
The complete parameter descriptions are as follows:
- `begin`, `end`: (list, optional) The beginning and end of the prompt, typically containing system-level instructions. Each item inside can be **a dictionary or a string**.
- `round`: (list) The format of the dialogue in the template. Each item in the list must be a dictionary.
Each dictionary has the following parameters:
- `role` (str): The role name participating in the dialogue. It is used to associate with the names in meta_template but does not affect the actual generated prompt.
- `fallback_role` (str): The default role name to use in case the associated role is not found in the meta_template. Defaults to None.
- `prompt` (str): The dialogue content for the role.
## Prompt Templates and `inferencer`
Once we understand the basic definition of prompt templates, we also need to organize them according to the type of `inferencer`.
OpenCompass mainly supports two types of inferencers: `GenInferencer` and `PPLInferencer`, corresponding to two different inference methods.
`GenInferencer` corresponds to generative inference. During inference, the model is asked to continue generating text based on the input prompt. In this case, the `template` represents a single template for each sentence, for example:
`````{tabs}
````{group-tab} String-based Prompt
```python
prompt_template=dict(
type=PromptTemplate,
template='Solve the following questions.\n{question}\n{answer}'
)
```
````
````{group-tab} Dialogue-Based Prompt
```python
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
],
round=[
dict(role="HUMAN", prompt="{question}"),
dict(role="BOT", prompt="{answer}"),
]
)
)
```
````
`````
Then, the model's inference result will be a continuation of the concatenated string.
`PPLInferencer` corresponds to discriminative inference. During inference, the model is asked to compute the perplexity (PPL) of each input string, and the item with the lowest perplexity is selected as the model's inference result. In this case, `template` is a `dict` that maps each candidate label to its own template, for example:
`````{tabs}
````{group-tab} String-based Prompt
```python
prompt_template=dict(
    type=PromptTemplate,
    template={
        "A": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: A",
        "B": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: B",
        "C": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: C",
        "UNK": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: None of them is true.",
    }
)
```
````
````{group-tab} Dialogue-Based Prompt
```python
prompt_template=dict(
    type=PromptTemplate,
    template={
        "A": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: A"),
            ]
        ),
        "B": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: B"),
            ]
        ),
        "C": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: C"),
            ]
        ),
        "UNK": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: None of them is true."),
            ]
        ),
    }
)
```
````
`````
In this case, the model's inference result will be one of the four keys in the `template` ("A" / "B" / "C" / "UNK").
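The selection rule behind discriminative inference can be sketched as below. This is only an illustration of "pick the candidate with the lowest perplexity": `pick_lowest_ppl` is a hypothetical helper, and `fake_logprobs` is a stand-in for a real model's per-token log-probabilities.

```python
import math


def pick_lowest_ppl(candidates, token_logprobs_fn):
    """Score every filled-in candidate and return the lowest-perplexity label."""
    ppl = {}
    for label, text in candidates.items():
        logprobs = token_logprobs_fn(text)
        # Perplexity is exp of the negative mean token log-probability.
        ppl[label] = math.exp(-sum(logprobs) / len(logprobs))
    return min(ppl, key=ppl.get), ppl


def fake_logprobs(text):
    # Stand-in scorer: pretends the model finds "Answer: B" most likely.
    return [-0.1] * 3 if text.endswith('Answer: B') else [-2.0] * 3


question = 'Question: Which is true?\nA. 1>2\nB. 2>1\nC. 1=2\n'
candidates = {label: question + f'Answer: {label}' for label in ('A', 'B', 'C')}
best, scores = pick_lowest_ppl(candidates, fake_logprobs)
print(best)
```

Because each label has its own fully written-out prompt, the returned value is always one of the keys of `template`.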
## `ice_template` and `prompt_template`
In OpenCompass, for 0-shot evaluation, we usually only need to define the `prompt_template` field to complete prompt construction. However, for few-shot evaluation, we also need to define the `ice_template` field, which manages the prompt templates for the in-context examples used in in-context learning.
Both `ice_template` and `prompt_template` follow the same syntax and rules. The complete prompt construction process can be represented using the following pseudo-code:
```python
def build_prompt():
    ice = ice_template.format(*ice_example)
    prompt = prompt_template.replace(prompt_template.ice_token, ice).format(*prompt_example)
    return prompt
```
Now, let's assume there are two training samples (ex1, ex2) and one test sample (ex3):
```python
ex1 = {
    'question': '2+2=?',
    'answer': '4',
    'irrelevant_infos': 'blabla',
}
ex2 = {
    'question': '3+3=?',
    'answer': '6',
    'irrelevant_infos': 'blabla',
}
ex3 = {
    'question': '1+1=?',
    'answer': '2',  # Assume the answer is in the reader_cfg.output_column
    'irrelevant_infos': 'blabla',
}
```
Next, let's take a look at the actual effects of different prompt construction methods:
`````{tabs}
````{group-tab} String-based Prompt
Template configurations are as follows:
```python
infer_cfg=dict(
    ice_template=dict(
        type=PromptTemplate,
        template='{question}\n{answer}'
    ),
    prompt_template=dict(
        type=PromptTemplate,
        template='Solve the following questions.\n</E>{question}\n{answer}',
        ice_token='</E>',
    )
)
```
The resulting strings are as follows:
```text
Solve the following questions.
2+2=?
4
3+3=?
6
1+1=?
```
````
````{group-tab} Dialogue-Based Prompt
Template configurations are as follows:
```python
infer_cfg=dict(
    ice_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role="HUMAN", prompt="{question}"),
                dict(role="BOT", prompt="{answer}"),
            ]
        )
    ),
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin=[
                dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
                '</E>',
            ],
            round=[
                dict(role="HUMAN", prompt="{question}"),
                dict(role="BOT", prompt="{answer}"),
            ],
        ),
        ice_token='</E>',
    )
)
```
The intermediate results obtained by OpenCompass after filling the data into the templates are as follows:
```python
PromptList([
dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
dict(role='HUMAN', prompt='2+2=?'),
dict(role='BOT', prompt='4'),
dict(role='HUMAN', prompt='3+3=?'),
dict(role='BOT', prompt='6'),
dict(role='HUMAN', prompt='1+1=?'),
dict(role='BOT', prompt=''),
])
```
````
`````
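For the string-based case, the construction process can be written as a small runnable sketch. It is a simplification under stated assumptions, not the actual OpenCompass code: the `build_prompt` signature is hypothetical, the placeholder token `</E>` marks where the in-context examples go, and a trailing `\n` is added to the in-context template here to separate examples.

```python
def build_prompt(ice_template, prompt_template, ice_token,
                 ice_examples, test_example, output_column):
    """Assemble a few-shot prompt from string templates."""
    # In-context examples are rendered in full, answers included.
    ice = ''.join(ice_template.format(**ex) for ex in ice_examples)
    # The test sample's answer is masked so it cannot leak into the prompt.
    masked = {**test_example, output_column: ''}
    return prompt_template.replace(ice_token, ice).format(**masked)


train = [
    {'question': '2+2=?', 'answer': '4'},
    {'question': '3+3=?', 'answer': '6'},
]
test = {'question': '1+1=?', 'answer': '2'}

prompt = build_prompt(
    ice_template='{question}\n{answer}\n',
    prompt_template='Solve the following questions.\n</E>{question}\n{answer}',
    ice_token='</E>',
    ice_examples=train,
    test_example=test,
    output_column='answer',
)
print(prompt)
```

Note that this sketch substitutes the examples before calling `format`, so it would fail on examples that themselves contain braces; it is meant only to show the order of operations.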
### Abbreviated Usage
It is worth noting that, for the sake of simplicity in the configuration file, the `prompt_template` field can be omitted. When the `prompt_template` field is omitted, the `ice_template` will be used as the `prompt_template` as well, to assemble the complete prompt. The following two `infer_cfg` configurations are equivalent:
More generally, even in the case of 0-shot learning (i.e., when `retriever` is `ZeroRetriever`), this mechanism still applies. Therefore, the following configuration is also valid:
```python
datasets = [
dict(
infer_cfg=dict(
ice_template=dict(
type=PromptTemplate,
template="Q: {question}\nA: {answer}",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
),
]
```
## Usage Suggestion
It is suggested to use the [Prompt Viewer](../tools.md) tool to visualize the completed prompts, confirm the correctness of the templates, and ensure that the results meet expectations.
================================================
FILE: docs/en/statis.py
================================================
#! /usr/bin/env python
from pathlib import Path
import yaml
from tabulate import tabulate
OC_ROOT = Path(__file__).absolute().parents[2]
GITHUB_PREFIX = 'https://github.com/open-compass/opencompass/tree/main/'
DATASETZOO_TEMPLATE = """\
# Dataset Statistics
On this page, we have listed all the datasets supported by OpenCompass.
You can use sorting and search functions to find the dataset you need.
We provide recommended running configurations for each dataset,
and in some datasets also offer recommended configurations based on LLM Judge.
You can quickly start evaluation tasks based on the recommended configurations.
However, please note that these configurations may be updated over time.
"""
with open('dataset_statistics.md', 'w') as f:
    f.write(DATASETZOO_TEMPLATE)

load_path = str(OC_ROOT / 'dataset-index.yml')

with open(load_path, 'r') as f2:
    data_list = yaml.load(f2, Loader=yaml.FullLoader)
HEADER = ['name', 'category', 'paper', 'configpath', 'configpath_llmjudge']
recommanded_dataset_list = [
'ifeval', 'aime2024', 'bbh', 'bigcodebench', 'cmmlu', 'drop', 'gpqa',
'hellaswag', 'humaneval', 'korbench', 'livecodebench', 'math', 'mmlu',
'mmlu_pro', 'musr', 'math500'
]
def table_format(data_list):
    table_format_list = []
    for i in data_list:
        table_format_list_sub = []
        for j in i:
            if j in recommanded_dataset_list:
                link_token = '[link]('
            else:
                link_token = '[link(TBD)]('

            for index in HEADER:
                if index == 'paper':
                    table_format_list_sub.append('[link](' + i[j][index] + ')')
                elif index == 'configpath_llmjudge':
                    if i[j][index] == '':
                        table_format_list_sub.append(i[j][index])
                    elif isinstance(i[j][index], list):
                        sub_list_text = ''
                        for k in i[j][index]:
                            sub_list_text += (link_token + GITHUB_PREFIX + k +
                                              ') / ')
                        table_format_list_sub.append(sub_list_text[:-2])
                    else:
                        table_format_list_sub.append(link_token +
                                                     GITHUB_PREFIX +
                                                     i[j][index] + ')')
                elif index == 'configpath':
                    if isinstance(i[j][index], list):
                        sub_list_text = ''
                        for k in i[j][index]:
                            sub_list_text += (link_token + GITHUB_PREFIX + k +
                                              ') / ')
                        table_format_list_sub.append(sub_list_text[:-2])
                    else:
                        table_format_list_sub.append(link_token +
                                                     GITHUB_PREFIX +
                                                     i[j][index] + ')')
                else:
                    table_format_list_sub.append(i[j][index])
        table_format_list.append(table_format_list_sub)
    return table_format_list
data_format_list = table_format(data_list)
def generate_table(data_list, title=None):
    with open('dataset_statistics.md', 'a') as f:
        if title is not None:
            f.write(f'\n{title}')
        f.write("""\n```{table}\n:class: dataset\n""")
        header = [
            'Name', 'Category', 'Paper or Repository', 'Recommended Config',
            'Recommended Config (LLM Judge)'
        ]
        table_cfg = dict(tablefmt='pipe',
                         floatfmt='.2f',
                         numalign='right',
                         stralign='center')
        f.write(tabulate(data_list, header, **table_cfg))
        f.write('\n```\n')


generate_table(
    data_list=data_format_list,
    title='## Supported Dataset List',
)
================================================
FILE: docs/en/tools.md
================================================
# Useful Tools
## Prompt Viewer
This tool allows you to directly view the generated prompt without starting the full evaluation process. If the passed configuration is only a dataset configuration (such as `configs/datasets/nq/nq_gen.py`), it will display the original prompt defined in the dataset configuration. If it is a complete evaluation configuration (including the model and the dataset), it will display the prompt received by the selected model at runtime.
Running method:
```bash
python tools/prompt_viewer.py CONFIG_PATH [-n] [-a] [-p PATTERN]
```
- `-n`: Do not enter interactive mode, select the first model (if any) and dataset by default.
- `-a`: View the prompts received by all models and all dataset combinations in the configuration.
- `-p PATTERN`: Do not enter interactive mode, select all datasets that match the input regular expression.
## Case Analyzer (To be updated)
Based on existing evaluation results, this tool produces inference error samples and full samples with annotation information.
Running method:
```bash
python tools/case_analyzer.py CONFIG_PATH [-w WORK_DIR]
```
- `-w`: Work path, default is `'./outputs/default'`.
## Lark Bot
Users can configure the Lark bot to implement real-time monitoring of task status. Please refer to [this document](https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN?lang=zh-CN#7a28964d) for setting up the Lark bot.
Configuration method:
- Open the `configs/secrets.py` file, and add the following line to the file:
```python
lark_bot_url = 'YOUR_WEBHOOK_URL'
```
- Normally, the Webhook URL format is like https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx .
- Inherit this file in the complete evaluation configuration
- To avoid the bot sending messages frequently and causing disturbance, the running status will not be reported automatically by default. If necessary, you can start status reporting through `-l` or `--lark`:
```bash
python run.py configs/eval_demo.py -l
```
## API Model Tester
This tool can quickly test whether the functionality of the API model is normal.
Running method:
```bash
python tools/test_api_model.py [CONFIG_PATH] -n
```
## Prediction Merger
This tool can merge partitioned predictions.
Running method:
```bash
python tools/prediction_merger.py CONFIG_PATH [-w WORK_DIR]
```
- `-w`: Work path, default is `'./outputs/default'`.
## List Configs
This tool can list or search all available model and dataset configurations. It supports fuzzy search, making it convenient for use in conjunction with `run.py`.
Usage:
```bash
python tools/list_configs.py [PATTERN1] [PATTERN2] [...]
```
If executed without any parameters, it will list all configurations in the `configs/models` and `configs/datasets` directories by default.
Users can also pass any number of parameters. The script will list all configurations related to the provided strings, supporting fuzzy search and the use of the * wildcard. For example, the following command will list all configurations related to `mmlu` and `llama`:
```bash
python tools/list_configs.py mmlu llama
```
Its output could be:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| hf_llama2_7b | configs/models/hf_llama2_7b.py |
| hf_llama_13b | configs/models/hf_llama_13b.py |
| hf_llama_30b | configs/models/hf_llama_30b.py |
| hf_llama_65b | configs/models/hf_llama_65b.py |
| hf_llama_7b | configs/models/hf_llama_7b.py |
| llama2_13b_chat | configs/models/llama2_13b_chat.py |
| llama2_70b_chat | configs/models/llama2_70b_chat.py |
| llama2_7b_chat | configs/models/llama2_7b_chat.py |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| cmmlu_ppl | configs/datasets/cmmlu/cmmlu_ppl.py |
| cmmlu_ppl_fd1f2f | configs/datasets/cmmlu/cmmlu_ppl_fd1f2f.py |
| mmlu_gen | configs/datasets/mmlu/mmlu_gen.py |
| mmlu_gen_23a9a9 | configs/datasets/mmlu/mmlu_gen_23a9a9.py |
| mmlu_gen_5d1409 | configs/datasets/mmlu/mmlu_gen_5d1409.py |
| mmlu_gen_79e572 | configs/datasets/mmlu/mmlu_gen_79e572.py |
| mmlu_gen_a484b3 | configs/datasets/mmlu/mmlu_gen_a484b3.py |
| mmlu_ppl | configs/datasets/mmlu/mmlu_ppl.py |
| mmlu_ppl_ac766d | configs/datasets/mmlu/mmlu_ppl_ac766d.py |
+-------------------+---------------------------------------------------+
```
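The matching behavior described above can be approximated with the standard library. The following is a minimal sketch, not the implementation used by `tools/list_configs.py`: the helper `match_configs` is hypothetical, and it assumes that a pattern without `*` is treated as a substring match.

```python
from fnmatch import fnmatch


def match_configs(names, patterns):
    """Return the config names that match at least one pattern.

    Assumption of this sketch: a pattern without '*' is a substring match.
    """
    if not patterns:
        # No pattern given: list everything.
        return sorted(names)
    matched = []
    for name in names:
        for pattern in patterns:
            glob = pattern if '*' in pattern else f'*{pattern}*'
            if fnmatch(name, glob):
                matched.append(name)
                break
    return sorted(matched)


names = ['hf_llama2_7b', 'hf_opt_125m', 'mmlu_gen', 'cmmlu_ppl', 'siqa_gen']
print(match_configs(names, ['mmlu', 'llama']))
```

Note that `mmlu` also matches `cmmlu_ppl`, which mirrors the sample output above where the `cmmlu` configurations are listed alongside the `mmlu` ones.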
## Dataset Suffix Updater
This tool quickly updates the suffixes of configuration files under the `configs/datasets` directory so that they follow the naming convention based on the prompt hash.
How to run:
```bash
python tools/update_dataset_suffix.py
```
================================================
FILE: docs/en/user_guides/config.md
================================================
# Learn About Config
OpenCompass uses the OpenMMLab modern style configuration files. If you are familiar with the OpenMMLab style
configuration files, you can directly refer to
[A Pure Python style Configuration File (Beta)](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta)
to understand the differences between the new-style and original configuration files. If you have not
encountered OpenMMLab style configuration files before, I will explain the usage of configuration files using
a simple example. Make sure you have installed the latest version of MMEngine to support the
new-style configuration files.
## Basic Format
OpenCompass configuration files are in Python format, following basic Python syntax. Each configuration item
is specified by defining variables. For example, when defining a model, we use the following configuration:
```python
# model_cfg.py
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        path='huggyllama/llama-7b',
        model_kwargs=dict(device_map='auto'),
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        max_out_len=50,
        run_cfg=dict(num_gpus=8, num_procs=1),
    )
]
```
When reading the configuration file, use `Config.fromfile` from MMEngine for parsing:
```python
>>> from mmengine.config import Config
>>> cfg = Config.fromfile('./model_cfg.py')
>>> print(cfg.models[0])
{'type': HuggingFaceCausalLM, 'path': 'huggyllama/llama-7b', 'model_kwargs': {'device_map': 'auto'}, ...}
```
## Inheritance Mechanism
OpenCompass configuration files use Python's import mechanism for file inheritance. Note that when inheriting
configuration files, we need to use the `read_base` context manager.
```python
# inherit.py
from mmengine.config import read_base

with read_base():
    from .model_cfg import models  # Inherits the 'models' from model_cfg.py
```
Parse the configuration file using `Config.fromfile`:
```python
>>> from mmengine.config import Config
>>> cfg = Config.fromfile('./inherit.py')
>>> print(cfg.models[0])
{'type': HuggingFaceCausalLM, 'path': 'huggyllama/llama-7b', 'model_kwargs': {'device_map': 'auto'}, ...}
```
## Evaluation Configuration Example
```python
# configs/llama7b.py
from mmengine.config import read_base

with read_base():
    # Read the required dataset configurations directly from the preset dataset configurations
    from .datasets.piqa.piqa_ppl import piqa_datasets
    from .datasets.siqa.siqa_gen import siqa_datasets

# Concatenate the datasets to be evaluated into the datasets field
datasets = [*piqa_datasets, *siqa_datasets]

# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        # Initialization parameters for `HuggingFaceCausalLM`
        path='huggyllama/llama-7b',
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        # Common parameters for all models, not specific to HuggingFaceCausalLM's initialization parameters
        abbr='llama-7b',  # Model abbreviation for result display
        max_out_len=100,  # Maximum number of generated tokens
        batch_size=16,
        run_cfg=dict(num_gpus=1),  # Run configuration for specifying resource requirements
    )
]
```
## Dataset Configuration File Example
In the above example configuration file, we directly inherit the dataset-related configurations. Next, we will
use the PIQA dataset configuration file as an example to demonstrate the meanings of each field in the dataset
configuration file. If you do not intend to modify the prompt for model testing or add new datasets, you can
skip this section.
The PIQA dataset [configuration file](https://github.com/open-compass/opencompass/blob/main/configs/datasets/piqa/piqa_ppl_1cf9f0.py) is as follows.
It is a configuration for evaluating based on perplexity (PPL) and does not use In-Context Learning.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset

# Reading configurations
# The loaded dataset is usually organized as dictionaries, specifying the input fields used to form the prompt
# and the output field used as the answer in each sample
piqa_reader_cfg = dict(
    input_columns=['goal', 'sol1', 'sol2'],
    output_column='label',
    test_split='validation',
)

# Inference configurations
piqa_infer_cfg = dict(
    # Prompt generation configuration
    prompt_template=dict(
        type=PromptTemplate,
        # Prompt template, the template format matches the inferencer type specified later
        # Here, to calculate PPL, we need to specify the prompt template for each answer
        template={
            0: 'The following makes sense: \nQ: {goal}\nA: {sol1}\n',
            1: 'The following makes sense: \nQ: {goal}\nA: {sol2}\n'
        }),
    # In-Context example configuration, specifying `ZeroRetriever` here, which means not using in-context example.
    retriever=dict(type=ZeroRetriever),
    # Inference method configuration
    #   - PPLInferencer uses perplexity (PPL) to obtain answers
    #   - GenInferencer uses the model's generated results to obtain answers
    inferencer=dict(type=PPLInferencer))

# Metric configuration, using Accuracy as the evaluation metric
piqa_eval_cfg = dict(evaluator=dict(type=AccEvaluator))

# Dataset configuration, where all the above variables are parameters for this configuration
# It is a list used to specify the configurations of different evaluation subsets of a dataset.
piqa_datasets = [
    dict(
        type=HFDataset,
        path='piqa',
        reader_cfg=piqa_reader_cfg,
        infer_cfg=piqa_infer_cfg,
        eval_cfg=piqa_eval_cfg)
]
```
For detailed configuration of the **Prompt generation configuration**, you can refer to the [Prompt Template](../prompt/prompt_template.md).
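To make the PPL-based configuration above more concrete, here is a minimal sketch of the selection rule: each label's template is filled in, scored, and the label with the lowest perplexity is taken as the prediction. The helper names and the log-probability values are made up for illustration; how `PPLInferencer` computes and normalizes perplexity internally is not shown in this document and may differ.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the tokens of one prompt."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def predict_label(logprobs_by_label):
    """Pick the label whose filled-in template has the lowest perplexity."""
    return min(logprobs_by_label,
               key=lambda label: perplexity(logprobs_by_label[label]))


# Made-up per-token log-probabilities for the two PIQA templates (labels 0 and 1).
scores = {
    0: [-0.9, -1.2, -0.7],
    1: [-0.4, -0.5, -0.3],
}
print(predict_label(scores))
```

The predicted label is then compared with the `output_column` (`label`) by the accuracy evaluator.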
## Advanced Evaluation Configuration
In OpenCompass, we support configuration options such as task partitioner and runner for more flexible and
efficient utilization of computational resources.
By default, we use size-based partitioning for inference tasks. You can specify the sample number threshold
for task partitioning using `--max-partition-size` when starting the task. Additionally, we use local
resources for inference and evaluation tasks by default. If you want to use Slurm cluster resources, you can
use the `--slurm` parameter and the `--partition` parameter to specify the Slurm runner backend when starting
the task.
Furthermore, if the above functionalities do not meet your requirements for task partitioning and runner
backend configuration, you can provide more detailed configurations in the configuration file. Please refer to
[Efficient Evaluation](./evaluation.md) for more information.
================================================
FILE: docs/en/user_guides/corebench.md
================================================
# Performance of Common Benchmarks
We have identified several well-known benchmarks for evaluating large language models (LLMs), and provide detailed performance results of famous LLMs on these datasets.
| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct(lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------ | --------------------------- |
| MMLU | - | naive_average | gen | 83.6 | 84.2 | 84.6 | 80.5 | 77.2 |
| CMMLU | - | naive_average | gen | 71.9 | 72.4 | 74.2 | 70.1 | 59.7 |
| CEval-Test | - | naive_average | gen | 69.7 | 70.5 | 71.7 | 66.9 | 58.7 |
| GaokaoBench | - | weighted_average | gen | 74.8 | 76.0 | 74.2 | 67.8 | 60.0 |
| Triviaqa_wiki(1shot) | 01cf41 | score | gen | 73.1 | 82.9 | 82.4 | 89.8 | 89.7 |
| NQ_open(1shot) | eaf81e | score | gen | 27.9 | 30.4 | 39.4 | 40.1 | 46.8 |
| Race-High | 9a54b6 | accuracy | gen | 89.3 | 89.6 | 90.8 | 89.4 | 84.8 |
| WinoGrande | 6447e6 | accuracy | gen | 80.7 | 83.3 | 84.1 | 69.7 | 76.6 |
| HellaSwag | e42710 | accuracy | gen | 92.7 | 93.5 | 94.6 | 87.7 | 86.1 |
| BBH | - | naive_average | gen | 82.7 | 78.5 | 78.5 | 80.5 | 79.1 |
| GSM-8K | 1d7fe4 | accuracy | gen | 80.5 | 79.7 | 87.7 | 90.2 | 88.3 |
| Math | 393424 | accuracy | gen | 61.9 | 71.2 | 60.2 | 47.1 | 50 |
| TheoremQA | ef26ca | accuracy | gen | 28.4 | 23.3 | 29.6 | 25.4 | 13 |
| HumanEval | 8e312c | humaneval_pass@1 | gen | 74.4 | 82.3 | 76.2 | 72.6 | 72.0 |
| MBPP(sanitized) | 1e1056 | score | gen | 78.6 | 77.0 | 76.7 | 71.6 | 68.9 |
| GPQA_diamond | 4baadb | accuracy | gen | 40.4 | 48.5 | 46.5 | 38.9 | 36.4 |
| IFEval | 3321a3 | Prompt-level-strict-accuracy | gen | 71.9 | 79.9 | 80.0 | 77.1 | 65.8 |
================================================
FILE: docs/en/user_guides/datasets.md
================================================
# Configure Datasets
This tutorial mainly focuses on selecting datasets supported by OpenCompass and preparing their config files. Please make sure you have downloaded the datasets following the steps in [Dataset Preparation](../get_started/installation.md#dataset-preparation).
## Directory Structure of Dataset Configuration Files
First, let's introduce the structure under the `configs/datasets` directory in OpenCompass, as shown below:
```
configs/datasets/
├── agieval
├── apps
├── ARC_c
├── ...
├── CLUE_afqmc # dataset
│ ├── CLUE_afqmc_gen_901306.py # different version of config
│ ├── CLUE_afqmc_gen.py
│ ├── CLUE_afqmc_ppl_378c5b.py
│ ├── CLUE_afqmc_ppl_6507d7.py
│ ├── CLUE_afqmc_ppl_7b0c1e.py
│ └── CLUE_afqmc_ppl.py
├── ...
├── XLSum
├── Xsum
└── z_bench
```
In the `configs/datasets` directory, all datasets are laid out flat, and each dataset's folder contains multiple configuration files for that dataset.
Dataset configuration files are named `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` is a configuration for the `CLUE_afqmc` dataset, which belongs to the Chinese general ability category; its evaluation method is `gen` (generative evaluation) and its prompt version number is `db509b`. Similarly, `CLUE_afqmc_ppl_00b348.py` uses the `ppl` evaluation method (discriminative evaluation) with prompt version number `00b348`.
In addition, files without a version number, such as `CLUE_afqmc_gen.py`, point to the latest prompt configuration file for that evaluation method, which is usually the most accurate prompt.
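The naming convention can be illustrated with a small parser. This is only a sketch: `parse_config_name` is a hypothetical helper, and the pattern assumes that the evaluation method is either `gen` or `ppl` and that the prompt version is six hexadecimal characters, as in the file names shown above.

```python
import re

# Assumed pattern: {dataset name}_{gen|ppl}[_{6 hex chars}].py
CONFIG_NAME = re.compile(
    r'(?P<dataset>.+)_(?P<method>gen|ppl)(?:_(?P<version>[0-9a-f]{6}))?\.py')


def parse_config_name(filename):
    """Split a dataset config file name into (dataset, method, version).

    Returns None when the name does not follow the assumed convention.
    The version is None for files that point to the latest prompt.
    """
    match = CONFIG_NAME.fullmatch(filename)
    if match is None:
        return None
    return match['dataset'], match['method'], match['version']


print(parse_config_name('CLUE_afqmc_gen_db509b.py'))
print(parse_config_name('CLUE_afqmc_ppl.py'))
```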
## Dataset Selection
In each dataset configuration file, the dataset will be defined in the `{}_datasets` variable, such as `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_db509b.py`.
```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```
And `cmnli_datasets` in `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`.
```python
cmnli_datasets = [
    dict(
        type=HFDataset,
        abbr='cmnli',
        path='json',
        split='train',
        data_files='./data/CLUE/cmnli/cmnli_public/dev.json',
        reader_cfg=cmnli_reader_cfg,
        infer_cfg=cmnli_infer_cfg,
        eval_cfg=cmnli_eval_cfg)
]
```
Take these two datasets as examples. To evaluate both at the same time, create a new configuration file in the `configs` directory and use the import mechanism of the `mmengine` configuration to assemble the dataset part of the evaluation script, as shown below:
```python
from mmengine.config import read_base

with read_base():
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_db509b import afqmc_datasets
    from .datasets.CLUE_cmnli.CLUE_cmnli_ppl_b78ad4 import cmnli_datasets

datasets = []
datasets += afqmc_datasets
datasets += cmnli_datasets
```
Users can combine configuration files for different abilities, datasets, and evaluation methods to assemble the dataset part of the evaluation script according to their needs.
For information on how to start an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.
### Multiple Evaluations on the Dataset
In the dataset configuration, you can set the parameter `n` to perform multiple evaluations on the same dataset and return the average metrics, for example:
```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        n=10,  # Perform 10 evaluations
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```
Additionally, for binary evaluation metrics (such as accuracy, pass-rate, etc.), you can also set the parameter `k` in conjunction with `n` for [G-Pass@k](http://arxiv.org/abs/2412.13147) evaluation. The formula for G-Pass@k is:
```{math}
\text{G-Pass@}k_\tau=E_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],
```
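where $n$ is the number of evaluations, $c$ is the number of those $n$ runs that passed or were correct, $k$ is the number of runs drawn from them, and $\tau$ is the fraction of the $k$ drawn runs that must be correct. An example configuration is as follows: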
```python
aime2024_datasets = [
    dict(
        abbr='aime2024',
        type=Aime2024Dataset,
        path='opencompass/aime2024',
        k=[2, 4],  # Return results for G-Pass@2 and G-Pass@4
        n=12,  # 12 evaluations
        ...
    )
]
```
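The term inside the expectation of the G-Pass@k formula can be computed directly for a single question. The sketch below is a plain transcription of that formula using `math.comb`; the function name `g_pass_at_k` is hypothetical and this is not the code OpenCompass uses internally. It assumes $0 < \tau \le 1$. The dataset-level metric is the mean of this value over all questions.

```python
from math import ceil, comb


def g_pass_at_k(n, c, k, tau):
    """G-Pass@k_tau for one question with c correct answers out of n runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError('expected 0 <= c <= n and 1 <= k <= n')
    start = ceil(tau * k)
    # Terms with j > k are zero, so the sum can stop at min(c, k).
    stop = min(c, k)
    hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(start, stop + 1))
    return hits / comb(n, k)


# 12 runs, 6 of them correct, G-Pass@4 with tau = 0.5:
# (15*15 + 20*6 + 15*1) / 495 = 360 / 495
print(g_pass_at_k(n=12, c=6, k=4, tau=0.5))
```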
================================================
FILE: docs/en/user_guides/deepseek_r1.md
================================================
# Tutorial for Evaluating Reasoning Models
OpenCompass provides an evaluation tutorial for DeepSeek R1 series reasoning models (mathematical datasets).
- At the model level, we recommend using the sampling approach to reduce repetitions caused by greedy decoding
- For datasets with limited samples, we employ multiple evaluation runs and take the average
- For answer validation, we utilize LLM-based verification to reduce misjudgments from rule-based evaluation
## Installation and Preparation
Please follow OpenCompass's installation guide.
## Evaluation Configuration Setup
We provide example configurations in `examples/eval_deepseek_r1.py`. Below is the configuration explanation:
### Configuration Interpretation
#### 1. Dataset and Validator Configuration
```python
# Configuration supporting multiple runs (example)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets
from opencompass.models import OpenAISDK

datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

# LLM validator configuration. Users need to deploy API services via LMDeploy/vLLM/SGLang or use OpenAI-compatible endpoints
verifier_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',  # Replace with actual path
    key='YOUR_API_KEY',  # Use real API key
    openai_api_base=['http://your-api-endpoint'],  # Replace with API endpoint
    query_per_second=16,
    batch_size=1024,
    temperature=0.001,
    max_out_len=16384
)

# Apply validator to all datasets
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
#### 2. Model Configuration
We provide an example that uses LMDeploy as the inference backend for the reasoning model. Users can modify `path` (i.e., the HuggingFace path).
```python
# LMDeploy model configuration example
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768
        ),
        max_seq_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    ),
    # Extendable 14B/32B configurations...
]
```
#### 3. Evaluation Process Configuration
```python
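from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask))
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)))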
```
#### 4. Summary Configuration
```python
# Multiple runs results average configuration
summary_groups = [
    {
        'name': 'AIME2024-Aveage8',
        'subsets': [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
    },
    # Other dataset average configurations...
]

summarizer = dict(
    dataset_abbrs=[
        ['AIME2024-Aveage8', 'naive_average'],
        # Other dataset metrics...
    ],
    summary_groups=summary_groups
)

# Work directory configuration
work_dir = "outputs/deepseek_r1_reasoning"
```
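As an illustration of what the `naive_average` entry of a summary group amounts to, the sketch below averages the per-run accuracies listed in `subsets`. The `naive_average` function and the accuracy values are made up for this example; the summarizer's actual implementation is not reproduced here.

```python
from statistics import mean


def naive_average(scores, subsets):
    """Average the scores of the listed (dataset abbr, metric) pairs."""
    return mean(scores[(abbr, metric)] for abbr, metric in subsets)


subsets = [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
# Made-up accuracies for the eight runs, for illustration only.
made_up = [50.0, 60.0, 55.0, 65.0, 50.0, 60.0, 55.0, 45.0]
scores = {(abbr, metric): value
          for (abbr, metric), value in zip(subsets, made_up)}
print(naive_average(scores, subsets))
```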
## Evaluation Execution
### Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using a total of 1 GPU
```bash
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
```
Evaluation logs will be output in the command line.
### Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using a total of 8 GPUs
Modify the `infer` configuration in the configuration file and set `num_worker` to 8:
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask))
)
```
At the same time, remove the `--debug` parameter from the evaluation command:
```bash
opencompass examples/eval_deepseek_r1.py --dump-eval-details
```
In this mode, OpenCompass will use multithreading to start `$num_worker` tasks. Specific logs will not be displayed in the command line; instead, detailed evaluation logs will be written under `$work_dir`.
### Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using a total of 8 GPUs
Note that in the model configuration, `num_gpus` in `run_cfg` needs to be set to 2 (if using an inference backend, parameters such as `tp` in LMDeploy also need to be modified accordingly to 2), and at the same time, set `num_worker` in the `infer` configuration to 4
```python
models += [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-14b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=128,
        run_cfg=dict(num_gpus=2),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    ),
]
```
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask))
)
```
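The GPU totals in the three scenarios follow from multiplying the GPUs needed per model copy (`num_gpus` in `run_cfg`) by the number of workers (`num_worker`). A tiny sketch of that arithmetic, with a hypothetical helper name and assuming every worker loads its own copy of the model:

```python
def total_gpus(num_gpus_per_model, num_worker):
    """GPUs occupied when every worker loads its own copy of the model."""
    return num_gpus_per_model * num_worker


# (num_gpus, num_worker) for the three scenarios described in this section.
scenarios = {'scenario 1': (1, 1), 'scenario 2': (1, 8), 'scenario 3': (2, 4)}
for name, (num_gpus, num_worker) in scenarios.items():
    print(name, total_gpus(num_gpus, num_worker))
```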
### Evaluation Results
The evaluation results are displayed as follows:
```bash
dataset                             version    metric         mode      deepseek-r1-distill-qwen-7b-turbomind
----------------------------------  ---------  -------------  ------  ---------------------------------------
MATH                                -          -              -
AIME2024-Aveage8                    -          naive_average  gen                                       56.25
```
## Performance Baseline
Since the model uses sampling for decoding and the AIME dataset is small, results may still fluctuate by 1-3 points even when averaging over 8 evaluations.
| Model | Dataset | Metric | Value |
| ---------------------------- | -------- | -------- | ----- |
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |
================================================
FILE: docs/en/user_guides/evaluation.md
================================================
# Efficient Evaluation
OpenCompass supports custom task partitioners (`Partitioner`), which enable flexible division of evaluation tasks. In conjunction with `Runner`, which controls the platform for task execution, such as a local machine or a cluster, OpenCompass can distribute large evaluation tasks to a vast number of computing nodes. This helps utilize computational resources efficiently and significantly accelerates the evaluation process.
By default, OpenCompass hides these details from users and automatically selects the recommended execution strategies. But users can still customize these strategies of the workflows according to their needs, just by adding the `infer` and/or `eval` fields to the configuration file:
```python
from opencompass.partitioners import SizePartitioner, NaivePartitioner
from opencompass.runners import LocalRunner, SlurmRunner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=5000),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=64,
        task=dict(type=OpenICLInferTask),
        retry=5),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=32,
        task=dict(type=OpenICLEvalTask)),
)
```
The example above demonstrates the way to configure the execution strategies for the inference and evaluation stages. At the inference stage, the task will be divided into several sub-tasks, each of 5000 samples, and then submitted to the Slurm cluster for execution, where there are at most 64 tasks running in parallel. At the evaluation stage, each single model-dataset pair forms a task, and 32 processes are launched locally to compute the metrics.
The following sections will introduce the involved modules in detail.
## Task Partition (Partitioner)
Due to the long inference time of large language models and the vast amount of evaluation datasets, serial execution of a single evaluation task can be quite time-consuming. OpenCompass allows custom task partitioners (`Partitioner`) to divide large evaluation tasks into numerous independent smaller tasks, thus fully utilizing computational resources via parallel execution. Users can configure the task partitioning strategies for the inference and evaluation stages via `infer.partitioner` and `eval.partitioner`. Below, we will introduce all the partitioning strategies supported by OpenCompass.
### `NaivePartitioner`
This partitioner dispatches each combination of a model and dataset as an independent task. It is the most basic partitioning strategy and does not have any additional parameters.
```python
from opencompass.partitioners import NaivePartitioner

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    # ...
)
```
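Conceptually, this strategy is a Cartesian product of models and datasets. The sketch below illustrates the idea with a hypothetical `naive_partition` helper and placeholder abbreviations; it is not the `NaivePartitioner` source code.

```python
from itertools import product


def naive_partition(models, datasets):
    """Create one independent task per (model, dataset) combination."""
    return [dict(model=m, dataset=d) for m, d in product(models, datasets)]


tasks = naive_partition(['llama-7b', 'opt-350m'], ['piqa', 'siqa'])
for task in tasks:
    print(task)
```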
### `SizePartitioner`
```{warning}
For now, this partitioner is not suitable for evaluation tasks (`OpenICLEvalTask`).
```
This partitioner estimates the inference cost (time) of a dataset according to its size, multiplied by an empirical expansion coefficient. It then creates tasks by splitting larger datasets and merging smaller ones to ensure the inference costs of each sub-task are as equal as possible.
The commonly used parameters for this partitioner are as follows:
```python
from opencompass.partitioners import SizePartitioner

infer = dict(
    partitioner=dict(
        type=SizePartitioner,
        max_task_size=2000,  # (int) Maximum size of each task
        gen_task_coef=20,  # (int) Expansion coefficient for generative tasks
    ),
    # ...
)
```
`SizePartitioner` estimates the inference cost of a dataset based on the type of the inference task and selects different expansion coefficients accordingly. For generative tasks, such as those using `GenInferencer`, a larger `gen_task_coef` is set; for discriminative tasks, like those using `PPLInferencer`, the number of labels in the prompt is used.
```{note}
Currently, this partitioning strategy is still rather crude and does not accurately reflect the computational difference between generative and discriminative tasks. We look forward to the community proposing better partitioning strategies :)
```
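The cost estimate described for `SizePartitioner` can be sketched as follows. This is a simplified illustration under stated assumptions, not the partitioner's actual code: the helper names are hypothetical, the cost is taken to be the sample count times a factor (`gen_task_coef` for generative tasks, the number of labels for PPL tasks), and only the splitting of large datasets is shown, not the merging of small ones.

```python
def estimate_cost(num_samples, is_generative, gen_task_coef=20, num_labels=2):
    """Estimated inference cost: dataset size times an expansion factor."""
    factor = gen_task_coef if is_generative else num_labels
    return num_samples * factor


def split_ranges(num_samples, factor, max_task_size):
    """Split a dataset into [start, end) ranges whose cost fits max_task_size."""
    step = max(1, max_task_size // factor)
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


# 1000 generative samples cost 20000, so they are split into 10 sub-tasks.
print(estimate_cost(1000, is_generative=True))
print(split_ranges(1000, factor=20, max_task_size=2000))
# 1000 PPL samples with 2 labels cost 2000 and fit into a single task.
print(split_ranges(1000, factor=2, max_task_size=2000))
```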
## Execution Backend (Runner)
In a multi-card, multi-machine cluster environment, if we want to implement parallel execution of multiple tasks, we usually need to rely on a cluster management system (like Slurm) for task allocation and scheduling. In OpenCompass, task allocation and execution are uniformly handled by the Runner. Currently, it supports both Slurm and PAI-DLC scheduling backends, and also provides a `LocalRunner` to directly launch tasks on the local machine.
### `LocalRunner`
`LocalRunner` is the most basic runner; it runs tasks in parallel on the local machine.
```python
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    # ...
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,  # Maximum number of processes to run in parallel
        task=dict(type=OpenICLInferTask),  # Task to be run
    )
)
```
```{note}
The actual number of running tasks is limited by both the available GPU resources and the number of workers.
```
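The note above can be read as a simple upper bound. The following sketch is an assumption about how the two limits combine, not a description of `LocalRunner` internals; `max_concurrent_tasks` and its parameters are hypothetical names.

```python
def max_concurrent_tasks(max_num_workers, available_gpus, gpus_per_task):
    """Upper bound on the number of tasks that can run at the same time."""
    if gpus_per_task <= 0:
        # Tasks that need no GPU are limited only by the worker count.
        return max_num_workers
    return min(max_num_workers, available_gpus // gpus_per_task)


# 16 workers are allowed, but 8 GPUs with 2 GPUs per task leave room for 4 tasks.
print(max_concurrent_tasks(max_num_workers=16, available_gpus=8, gpus_per_task=2))
```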
### `SlurmRunner`
`SlurmRunner` submits tasks to run on the Slurm cluster. The commonly used configuration fields are as follows:
```python
from opencompass.runners import SlurmRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    # ...
    runner=dict(
        type=SlurmRunner,
        task=dict(type=OpenICLInferTask),  # Task to be run
        max_num_workers=16,  # Maximum number of concurrent tasks
        retry=2,  # Retry count for failed tasks, can prevent accidental errors
    ),
)
```
### `DLCRunner`
`DLCRunner` submits tasks to run on Alibaba's Deep Learning Center (DLC). This Runner depends on `dlc`. Firstly, you need to prepare `dlc` in the environment:
```bash
cd ~
wget https://dlc-cli.oss-cn-zhangjiakou.aliyuncs.com/light/binary/linux/amd64/dlc
chmod +x ./dlc
sudo ln -rs dlc /usr/local/bin
./dlc config
```
Fill in the necessary information according to the prompts and get the `dlc` configuration file (like `/user/.dlc/config`) to complete the preparation. Then, specify the `DLCRunner` configuration in the configuration file as per the format:
```python
from opencompass.runners import DLCRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    # ...
    runner=dict(
        type=DLCRunner,
        task=dict(type=OpenICLInferTask),  # Task to be run
        max_num_workers=16,  # Maximum number of concurrent tasks
        aliyun_cfg=dict(
            bashrc_path="/user/.bashrc",  # Path to the bashrc for initializing the running environment
            conda_env_name='opencompass',  # Conda environment for OpenCompass
            dlc_config_path="/user/.dlc/config",  # Configuration file for dlc
            workspace_id='ws-xxx',  # DLC workspace ID
            worker_image='xxx',  # Image url for running tasks
        ),
        retry=2,  # Retry count for failed tasks, can prevent accidental errors
    ),
)
```
## Task
A Task is a fundamental module in OpenCompass, a standalone script that executes the computationally intensive operations. Each task is designed to load a configuration file to determine parameter settings, and it can be executed in two distinct ways:
1. Instantiate a Task object, then call `task.run()`.
2. Call the `get_command` method, passing in the config path and a command template string that contains `{task_cmd}` as a placeholder (e.g. `srun {task_cmd}`). The returned string is the full command and can be executed directly.
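The `{task_cmd}` placeholder mechanism can be illustrated with plain string substitution. This is a sketch only: `build_command` and the task command string are hypothetical, and the real `get_command` method builds the task command itself from the config path.

```python
def build_command(template, task_cmd):
    """Fill the {task_cmd} placeholder of a command template."""
    if '{task_cmd}' not in template:
        raise ValueError('template must contain the {task_cmd} placeholder')
    # str.replace leaves any other braces in the template untouched.
    return template.replace('{task_cmd}', task_cmd)


# Hypothetical task command, wrapped by a Slurm-style template.
print(build_command('srun {task_cmd}', 'python infer_task.py tmp_cfg.py'))
```

Wrapping the task command in a template lets the same task run locally (template `{task_cmd}`) or through a scheduler (template `srun {task_cmd}`) without changing the task itself.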
As of now, OpenCompass supports the following task types:
- `OpenICLInferTask`: Perform LM Inference task based on OpenICL framework.
- `OpenICLEvalTask`: Perform LM Evaluation task based on OpenEval framework.
In the future, more task types will be supported.
================================================
FILE: docs/en/user_guides/experimentation.md
================================================
# Task Execution and Monitoring
## Launching an Evaluation Task
The program entry for the evaluation task is `run.py`. The usage is as follows:
```shell
python run.py $EXP {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--debug] [-m MODE] [-r [REUSE]] [-w WORKDIR] [-l] [--dry-run] [--dump-eval-details]
```
Task Configuration (`$EXP`):
- `run.py` accepts a .py configuration file as task-related parameters, which must include the `datasets` and `models` fields.
```bash
python run.py configs/eval_demo.py
```
- If no configuration file is provided, users can also specify models and datasets using `--models MODEL1 MODEL2 ...` and `--datasets DATASET1 DATASET2 ...`:
```bash
python run.py --models hf_opt_350m hf_opt_125m --datasets siqa_gen winograd_ppl
```
- For HuggingFace related models, users can also define a model quickly in the command line through HuggingFace parameters and then specify datasets using `--datasets DATASET1 DATASET2 ...`.
```bash
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path huggyllama/llama-7b
```
Complete HuggingFace parameter descriptions:
- `--hf-path`: HuggingFace model path
- `--peft-path`: PEFT model path
- `--tokenizer-path`: HuggingFace tokenizer path (if it's the same as the model path, it can be omitted)
- `--model-kwargs`: Parameters for constructing the model
- `--tokenizer-kwargs`: Parameters for constructing the tokenizer
- `--max-out-len`: Maximum generated token count
- `--max-seq-len`: Maximum sequence length the model can accept
- `--batch-size`: Batch size
- `--hf-num-gpus`: Number of GPUs required to run the model. Please note that this parameter is only used to determine the number of GPUs required to run the model, and does not affect the actual number of GPUs used for the task. Refer to [Efficient Evaluation](./evaluation.md) for more details.
Starting Methods:
- Running on local machine: `run.py $EXP`.
- Running with slurm: `run.py $EXP --slurm -p $PARTITION_name`.
- Running with dlc: `run.py $EXP --dlc --aliyun-cfg $AliYun_Cfg`
- Customized starting: `run.py $EXP`. Here, $EXP is the configuration file which includes the `eval` and `infer` fields. For detailed configurations, please refer to [Efficient Evaluation](./evaluation.md).
The parameter explanation is as follows:
- `-p`: Specify the slurm partition;
- `-q`: Specify the slurm quotatype (default is None), with optional values being reserved, auto, spot. This parameter may only be used in some slurm variants;
- `--debug`: When enabled, inference and evaluation tasks will run in single-process mode, and output will be echoed in real-time for debugging;
- `-m`: Running mode, default is `all`. It can be specified as `infer` to only run inference and obtain output results; if there are already model outputs in `{WORKDIR}`, it can be specified as `eval` to only run evaluation and obtain evaluation results; if the evaluation results are ready, it can be specified as `viz` to only run visualization, which summarizes the results in tables; if specified as `all`, a full run will be performed, which includes inference, evaluation, and visualization.
- `-r`: Reuse existing inference results, and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused.
- `-w`: Specify the working path, default is `./outputs/default`.
- `-l`: Enable status reporting via Lark bot.
- `--dry-run`: When enabled, inference and evaluation tasks will be dispatched but won't actually run for debugging.
- `--dump-eval-details`: Enabled by default. Evaluation output under the `results` folder will include more details, such as the correctness of each sample. Set `--dump-eval-details False` to disable it.
Using run mode `-m all` as an example, the overall execution flow is as follows:
1. Read the configuration file, parse out the model, dataset, evaluator, and other configuration information
2. The evaluation task mainly includes three stages: inference `infer`, evaluation `eval`, and visualization `viz`. After task division by Partitioner, they are handed over to Runner for parallel execution. Individual inference and evaluation tasks are abstracted into `OpenICLInferTask` and `OpenICLEvalTask` respectively.
3. Once the inference and evaluation stages finish, the visualization stage reads the evaluation results in `results/` and generates a summary table.
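The relationship between the `-m` value and the stages that run can be sketched as follows. This is a minimal illustration only: `STAGES` and `stages_for_mode` are hypothetical names, not part of the OpenCompass code base.

```python
from typing import List

# Hypothetical helper: maps the `-m` value to the stages that would run.
STAGES = ("infer", "eval", "viz")


def stages_for_mode(mode: str = "all") -> List[str]:
    """Return the ordered stages executed for a given run mode."""
    if mode == "all":
        return list(STAGES)
    if mode in STAGES:
        return [mode]
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("all", "infer", "eval", "viz"):
    print(f"-m {mode:<5} -> {stages_for_mode(mode)}")
```

Each single-stage mode assumes that the outputs of the earlier stages already exist in the working directory.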
## Task Monitoring: Lark Bot
Users can enable real-time monitoring of task status by setting up a Lark bot. Please refer to [this document](https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN?lang=zh-CN#7a28964d) for setting up the Lark bot.
Configuration method:
1. Open the `configs/lark.py` file, and add the following line:
```python
lark_bot_url = 'YOUR_WEBHOOK_URL'
```
Typically, the Webhook URL is formatted like this: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx .
2. Inherit this file in the complete evaluation configuration:
```python
from mmengine.config import read_base

with read_base():
    from .lark import lark_bot_url
```
3. To avoid frequent messages from the bot becoming a nuisance, status updates are not automatically reported by default. You can start status reporting using `-l` or `--lark` when needed:
```bash
python run.py configs/eval_demo.py -p {PARTITION} -l
```
## Run Results
All run results will be placed in `outputs/default/` directory by default, the directory structure is shown below:
```
outputs/default/
├── 20200220_120000
├── ...
├── 20230220_183030
│   ├── configs
│   ├── logs
│   │   ├── eval
│   │   └── infer
│   ├── predictions
│   │   └── MODEL1
│   └── results
│       └── MODEL1
```
Each timestamp contains the following content:
- configs folder, which stores the configuration files corresponding to each run with this timestamp as the output directory;
- logs folder, which stores the output log files of the inference and evaluation phases, each folder will store logs in subfolders by model;
- predictions folder, which stores the inferred json results, with a model subfolder;
- results folder, which stores the evaluated json results, with a model subfolder.
Also, when `-r` is given without a timestamp, the newest folder (the last one after sorting by name) is selected as the output directory.
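The folder selection for `-r` can be sketched as below. This is an illustrative sketch, not the actual implementation: `pick_run_dir` is a hypothetical helper, and it relies on the assumption that timestamp folder names sort chronologically when sorted as strings.

```python
import os
import tempfile
from typing import Optional


def pick_run_dir(work_dir: str, timestamp: Optional[str] = None) -> str:
    """Return the run folder to reuse: the given timestamp, else the newest one."""
    if timestamp is not None:
        return os.path.join(work_dir, timestamp)
    runs = sorted(
        d for d in os.listdir(work_dir)
        if os.path.isdir(os.path.join(work_dir, d))
    )
    if not runs:
        raise FileNotFoundError(f"no previous runs under {work_dir}")
    return os.path.join(work_dir, runs[-1])


# Demo on a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as work_dir:
    for name in ("20200220_120000", "20230220_183030"):
        os.makedirs(os.path.join(work_dir, name))
    print(os.path.basename(pick_run_dir(work_dir)))  # 20230220_183030
```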
## Introduction of Summarizer (to be updated)
================================================
FILE: docs/en/user_guides/framework_overview.md
================================================
# Overview
## Evaluation Targets
The primary evaluation targets of this algorithm library are large language models. We introduce specific model types for evaluation using the large language model as an example.
- Base Model: Typically obtained through training on massive textual data in a self-supervised manner (e.g., OpenAI's GPT-3, Meta's LLaMA). These models usually have powerful text continuation capabilities.
- Chat Model: Often built upon the base model and refined through instruction fine-tuning or human preference alignment (e.g., OpenAI's ChatGPT, Shanghai AI Lab's InternLM). These models can understand human instructions and have strong conversational skills.
## Tool Architecture

- Model Layer: This encompasses the primary model categories involved in large model evaluations. OpenCompass focuses on base models and chat models for in-depth evaluations.
- Capability Layer: OpenCompass evaluates models based on general capabilities and special features. In terms of general capabilities, models are evaluated on language, knowledge, understanding, reasoning, safety, and other dimensions. In terms of special capabilities, evaluations are based on long texts, code, tools, and knowledge enhancement.
- Method Layer: OpenCompass uses both objective and subjective evaluation methods. Objective evaluations can quickly assess a model's capability in tasks with definite answers (like multiple choice, fill in the blanks, closed-ended questions), while subjective evaluations measure user satisfaction with the model's replies. OpenCompass uses both model-assisted subjective evaluations and human feedback-driven subjective evaluations.
- Tool Layer: OpenCompass offers extensive functionalities for automated, efficient evaluations of large language models. This includes distributed evaluation techniques, prompt engineering, integration with evaluation databases, leaderboard publishing, report generation, and many more features.
## Capability Dimensions
### Design Philosophy
To accurately, comprehensively, and systematically assess the capabilities of large language models, OpenCompass takes a general AI perspective, integrating cutting-edge academic advancements and industrial best practices to propose an evaluation system tailored for real-world applications. OpenCompass's capability dimensions cover both general capabilities and special features.
### General Capabilities
General capabilities encompass examination, knowledge, language, understanding, reasoning, and safety, forming a comprehensive evaluation system across these six dimensions.
#### Examination Capability
This dimension aims to provide evaluation support from the perspective of human development, borrowing the classification logic from pedagogy. The core idea revolves around mandatory education, higher education, and vocational training, creating a comprehensive academic capability evaluation approach.
#### Knowledge Capability
Knowledge capability gauges the model's grasp of various knowledge types, including but not limited to general world knowledge and domain-specific expertise. The goal is for the model to answer a wide range of knowledge-based questions accurately and comprehensively.
#### Reasoning Capability
Reasoning is a crucial dimension for general AI. This evaluates the model's reasoning skills, including but not limited to mathematical computation, logical reasoning, causal inference, code generation and modification, and more.
#### Understanding Capability
This dimension evaluates the model's comprehension of text, including:
- Rhetorical techniques understanding and analysis: Grasping various rhetorical techniques used in text and analyzing and interpreting them.
- Text content summarization: Summarizing and extracting information from given content.
- Content creation: Open-ended or semi-open-ended content creation based on given themes or requirements.
#### Language Capability
This dimension evaluates the model's prior language knowledge, which includes but is not limited to:
- Word recognition and generation: Understanding language at the word level and tasks like word recognition, classification, definition, and generation.
- Grammar understanding and correction: Grasping grammar within the text and identifying and correcting grammatical errors.
- Cross-language translation: Translating given source language into target languages, assessing multilingual capabilities of current large models.
#### Safety Capability
In conjunction with the technical features of large language models, OpenCompass assesses the legality, compliance, and safety of model outputs, aiding the development of safe and responsible large models. This capability includes but is not limited to:
- Fairness
- Legality
- Harmlessness
- Ethical considerations
- Privacy protection
## Evaluation Methods
OpenCompass adopts a combination of objective and subjective evaluations. For capability dimensions and scenarios with definite answers, a comprehensive assessment of model capabilities is conducted using a well-constructed test set. For open-ended or semi-open-ended questions and model safety issues, a combination of objective and subjective evaluation methods is used.
### Objective Evaluation
For objective questions with standard answers, we can compare the discrepancy between the model's output and the standard answer using quantitative indicators. Given the high freedom in outputs of large language models, during evaluation, it's essential to standardize and design its inputs and outputs to minimize the influence of noisy outputs, ensuring a more comprehensive and objective assessment.
To better elicit the model's abilities in the evaluation domain and guide the model to output answers following specific templates, OpenCompass employs prompt engineering and in-context learning for objective evaluations.
In practice, we usually adopt the following two methods to evaluate model outputs:
- **Discriminative Evaluation**: This approach combines questions with candidate answers, calculates the model's perplexity on all combinations, and selects the answer with the lowest perplexity as the model's final output.
- **Generative Evaluation**: Used for generative tasks like language translation, code generation, logical analysis, etc. The question is used as the model's original input, leaving the answer area blank for the model to fill in. Post-processing of the output is often required to ensure it meets dataset requirements.
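The discriminative approach can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of each "question + candidate answer" sequence are already available; the numbers below are made up, and `perplexity` and `pick_answer` are hypothetical helpers rather than OpenCompass functions.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Exponential of the negative mean log-probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_answer(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate label whose sequence has the lowest perplexity."""
    return min(candidates, key=lambda label: perplexity(candidates[label]))


# Made-up per-token log-probabilities for each "question + candidate" sequence.
scores = {
    "A": [-2.1, -1.8, -2.5],
    "B": [-0.4, -0.6, -0.5],
    "C": [-1.2, -1.9, -1.1],
}
print(pick_answer(scores))  # B
```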
### Subjective Evaluation (Upcoming)
Language expression is lively and varied, and many scenarios and capabilities can't be judged solely by objective indicators. For evaluations like model safety and language capabilities, subjective evaluations based on human feelings better reflect the model's actual capabilities and align more with real-world applications.
OpenCompass's subjective evaluation approach relies on test subjects' personal judgments to assess chat-capable large language models. In practice, we pre-construct a set of subjective test questions based on model capabilities and present different replies from various models to the same question to subjects, collecting their subjective scores. Given the high cost of subjective testing, this approach also uses high-performing large language models to simulate human subjective scoring. Actual evaluations will combine real human expert subjective evaluations with model-based subjective scores.
In conducting subjective evaluations, OpenCompass uses both the **Single Model Reply Satisfaction Statistics** and **Multiple Model Satisfaction Comparison** methods.
================================================
FILE: docs/en/user_guides/interns1.md
================================================
# Tutorial for Evaluating Intern-S1
OpenCompass now provides the necessary configs for evaluating Intern-S1. Please perform the following steps to initiate the evaluation of Intern-S1.
## Model Download and Deployment
Intern-S1 has been open-sourced and can be downloaded from [Huggingface](https://huggingface.co/internlm/Intern-S1).
After downloading the model, it is recommended to deploy it as an API service.
You can deploy it with LMDeploy, vLLM, or SGLang by following [this page](https://github.com/InternLM/Intern-S1/blob/main/README.md#Serving).
## Evaluation Configs
### Model Configs
We provide a config example in `opencompass/configs/models/interns1/intern_s1.py`.
Please make the changes according to your needs.
```python
models = [
    dict(
        abbr="intern-s1",
        key="YOUR_API_KEY",  # Fill in your API KEY here
        openai_api_base="YOUR_API_BASE",  # Fill in your API BASE here
        type=OpenAISDK,
        path="internlm/Intern-S1",
        temperature=0.7,
        meta_template=api_meta_template,
        query_per_second=1,
        batch_size=8,
        max_out_len=64000,
        max_seq_len=65536,
        openai_extra_kwargs={
            'top_p': 0.95,
        },
        retry=10,
        extra_body={
            "chat_template_kwargs": {"enable_thinking": True}  # Control the thinking mode when deploying the model based on vllm or sglang
        },
        pred_postprocessor=dict(type=extract_non_reasoning_content),  # Extract non-reasoning contents when opening the thinking mode
    ),
]
```
### Dataset Configs
We provide a config for datasets used for evaluating Intern-S1 in `examples/eval_bench_intern_s1.py`.
You can also add other datasets as needed.
In addition, you need to add the configuration of the LLM Judger in this config file, as shown in the following example:
```python
judge_cfg = dict(
    abbr='YOUR_JUDGE_MODEL',
    type=OpenAISDK,
    path='YOUR_JUDGE_MODEL_PATH',
    key='YOUR_API_KEY',
    openai_api_base='YOUR_API_BASE',
    meta_template=dict(
        round=[
            dict(role='HUMAN', api_role='HUMAN'),
            dict(role='BOT', api_role='BOT', generate=True),
        ]),
    query_per_second=1,
    batch_size=1,
    temperature=0.001,
    max_out_len=8192,
    max_seq_len=32768,
    mode='mid',
)
```
## Start Evaluation
After completing the above configuration,
enter the following command to start the evaluation:
```bash
opencompass examples/eval_bench_intern_s1.py
```
================================================
FILE: docs/en/user_guides/metrics.md
================================================
# Metric Calculation
In the evaluation phase, we typically select the corresponding evaluation metric strategy based on the characteristics of the dataset itself. The main criterion is the **type of standard answer**, generally including the following types:
- **Choice**: Common in classification tasks, judgment questions, and multiple-choice questions. Currently, this type of question dataset occupies the largest proportion, with datasets such as MMLU, CEval, etc. Accuracy is usually used as the evaluation standard-- `ACCEvaluator`.
- **Phrase**: Common in Q&A and reading comprehension tasks. This type of dataset mainly includes CLUE_CMRC, CLUE_DRCD, DROP datasets, etc. Matching rate is usually used as the evaluation standard--`EMEvaluator`.
- **Sentence**: Common in translation and generating pseudocode/command-line tasks, mainly including Flores, Summscreen, Govrepcrs, Iwslt2017 datasets, etc. BLEU (Bilingual Evaluation Understudy) is usually used as the evaluation standard--`BleuEvaluator`.
- **Paragraph**: Common in text summary generation tasks, commonly used datasets mainly include Lcsts, TruthfulQA, Xsum datasets, etc. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually used as the evaluation standard--`RougeEvaluator`.
- **Code**: Common in code generation tasks, commonly used datasets mainly include Humaneval, MBPP datasets, etc. Execution pass rate and `pass@k` are usually used as the evaluation standard. At present, OpenCompass supports `MBPPEvaluator` and `HumanEvalEvaluator`.
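To make two of these metrics concrete, here is a minimal sketch of accuracy and of the commonly used unbiased `pass@k` estimator (`n` samples generated per problem, `c` of them correct). The function names are hypothetical and this is not the code inside `ACCEvaluator` or `HumanEvalEvaluator`; it only illustrates what the numbers mean.

```python
from math import comb
from typing import List


def accuracy(preds: List[str], refs: List[str]) -> float:
    """Percentage of predictions equal to the reference answer."""
    correct = sum(p == r for p, r in zip(preds, refs))
    return 100.0 * correct / len(refs)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))  # 75.0
print(pass_at_k(n=10, c=3, k=1))  # about 0.3
```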
There is also a type of **scoring-type** evaluation task without standard answers, such as judging whether the output of a model is toxic, which can directly use the related API service for scoring. At present, it supports `ToxicEvaluator`, and currently, the realtoxicityprompts dataset uses this evaluation method.
## Supported Evaluation Metrics
Currently, in OpenCompass, commonly used Evaluators are mainly located in the [`opencompass/openicl/icl_evaluator`](https://github.com/open-compass/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder. There are also some dataset-specific indicators that are placed in parts of [`opencompass/datasets`](https://github.com/open-compass/opencompass/tree/main/opencompass/datasets). Below is a summary:
| Evaluation Strategy | Evaluation Metrics | Common Postprocessing Method | Datasets |
| --------------------- | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
| `ACCEvaluator` | Accuracy | `first_capital_postprocess` | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
| `EMEvaluator` | Match Rate | None, dataset-specific | drop, CLUE_CMRC, CLUE_DRCD |
| `BleuEvaluator` | BLEU | None, `flores` | flores, iwslt2017, summscreen, govrepcrs |
| `RougeEvaluator` | ROUGE | None, dataset-specific | truthfulqa, Xsum, XLSum |
| `JiebaRougeEvaluator` | ROUGE | None, dataset-specific | lcsts |
| `HumanEvalEvaluator` | pass@k | `humaneval_postprocess` | humaneval |
| `MBPPEvaluator` | Execution Pass Rate | None | mbpp |
| `ToxicEvaluator` | PerspectiveAPI | None | realtoxicityprompts |
| `AGIEvalEvaluator` | Accuracy | None | agieval |
| `AUCROCEvaluator` | AUC-ROC | None | jigsawmultilingual, civilcomments |
| `MATHEvaluator` | Accuracy | `math_postprocess` | math |
| `MccEvaluator` | Matthews Correlation | None | -- |
| `SquadEvaluator` | F1-scores | None | -- |
## How to Configure
The evaluation metric configuration is generally placed in the dataset configuration file, and the final `xxdataset_eval_cfg` is passed to the dataset as its `eval_cfg` instantiation parameter.
Below is the definition of `govrepcrs_eval_cfg`, and you can refer to [configs/datasets/govrepcrs](https://github.com/open-compass/opencompass/tree/main/configs/datasets/govrepcrs).
```python
from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import GovRepcrsDataset
from opencompass.utils.text_postprocessors import general_cn_postprocess

govrepcrs_reader_cfg = dict(...)  # reader settings omitted here
govrepcrs_infer_cfg = dict(...)  # inference settings omitted here

# Configuration of evaluation metrics
govrepcrs_eval_cfg = dict(
    evaluator=dict(type=BleuEvaluator),  # Use the common translation evaluator BleuEvaluator
    pred_role='BOT',  # Accept 'BOT' role output
    pred_postprocessor=dict(type=general_cn_postprocess),  # Postprocessing of prediction results
    dataset_postprocessor=dict(type=general_cn_postprocess))  # Postprocessing of dataset standard answers

govrepcrs_datasets = [
    dict(
        type=GovRepcrsDataset,  # Dataset class name
        path='./data/govrep/',  # Dataset path
        abbr='GovRepcrs',  # Dataset alias
        reader_cfg=govrepcrs_reader_cfg,  # Dataset reading configuration: split, columns, etc.
        infer_cfg=govrepcrs_infer_cfg,  # Dataset inference configuration, mainly related to the prompt
        eval_cfg=govrepcrs_eval_cfg)  # Dataset evaluation configuration: metric plus pre- and postprocessing
]
```
================================================
FILE: docs/en/user_guides/models.md
================================================
# Prepare Models
To support the evaluation of new models in OpenCompass, there are several ways:
1. HuggingFace-based models
2. API-based models
3. Custom models
## HuggingFace-based Models
In OpenCompass, we support constructing evaluation models directly from HuggingFace's
`AutoModel.from_pretrained` and `AutoModelForCausalLM.from_pretrained` interfaces. If the model to be
evaluated follows the typical generation interface of HuggingFace models, there is no need to write code. You
can simply specify the relevant configurations in the configuration file.
Here is an example configuration file for a HuggingFace-based model:
```python
# Use `HuggingFace` to evaluate models supported by AutoModel.
# Use `HuggingFaceCausalLM` to evaluate models supported by AutoModelForCausalLM.
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        # Parameters for `HuggingFaceCausalLM` initialization.
        path='huggyllama/llama-7b',
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        batch_padding=False,
        # Common parameters shared by various models, not specific to `HuggingFaceCausalLM` initialization.
        abbr='llama-7b',  # Model abbreviation used for result display.
        max_out_len=100,  # Maximum number of generated tokens.
        batch_size=16,  # The size of a batch during inference.
        run_cfg=dict(num_gpus=1),  # Run configuration to specify resource requirements.
    )
]
```
Explanation of some of the parameters:
- `batch_padding=False`: If set to False, each sample in a batch will be inferred individually. If set to True,
a batch of samples will be padded and inferred together. For some models, such padding may lead to
unexpected results. If the model being evaluated supports sample padding, you can set this parameter to True
to speed up inference.
- `padding_side='left'`: Perform padding on the left side. Not all models support padding, and padding on the
right side may interfere with the model's output.
- `truncation_side='left'`: Perform truncation on the left side. The input prompt for evaluation usually
consists of both the in-context examples prompt and the input prompt. If the right side of the input prompt
is truncated, it may cause the input of the generation model to be inconsistent with the expected format.
Therefore, if necessary, truncation should be performed on the left side.
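The effect of the truncation side can be seen in a small sketch on a list of placeholder tokens. `truncate` is a hypothetical helper written for illustration; real tokenizers apply the same idea to token IDs.

```python
from typing import List


def truncate(tokens: List[str], max_len: int, side: str = "left") -> List[str]:
    """Drop tokens from the chosen side until at most max_len remain."""
    if max_len <= 0:
        return []
    if len(tokens) <= max_len:
        return tokens
    return tokens[-max_len:] if side == "left" else tokens[:max_len]


# In-context examples come first; the actual question comes last.
prompt = ["ex1_q", "ex1_a", "ex2_q", "ex2_a", "question", "Answer:"]
print(truncate(prompt, 3, side="left"))   # ['ex2_a', 'question', 'Answer:']
print(truncate(prompt, 3, side="right"))  # ['ex1_q', 'ex1_a', 'ex2_q']
```

Left truncation discards the oldest in-context examples but keeps the question and the answer cue, whereas right truncation would remove the part the model is expected to continue.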
During evaluation, OpenCompass will instantiate the evaluation model based on the `type` and the
initialization parameters specified in the configuration file. Other parameters are used for inference,
summarization, and other processes related to the model. For example, in the above configuration file, we will
instantiate the model as follows during evaluation:
```python
model = HuggingFaceCausalLM(
    path='huggyllama/llama-7b',
    tokenizer_path='huggyllama/llama-7b',
    tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
    max_seq_len=2048,
)
```
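The split between initialization parameters and common parameters can be sketched as follows. This is a simplified illustration, not how OpenCompass builds models internally: `ToyModel`, `COMMON_KEYS`, and `build_model` are hypothetical, and the set of common keys is assumed from the example configuration above.

```python
# Toy stand-in for a model class; not the real HuggingFaceCausalLM.
class ToyModel:
    def __init__(self, path, max_seq_len=2048):
        self.path = path
        self.max_seq_len = max_seq_len


# Assumed set of "common" keys, taken from the example config above.
COMMON_KEYS = {"abbr", "max_out_len", "batch_size", "run_cfg"}


def build_model(cfg: dict):
    """Instantiate cfg['type'] with every key that is not a common parameter."""
    cfg = dict(cfg)  # shallow copy so the caller's config is untouched
    model_cls = cfg.pop("type")
    init_kwargs = {k: v for k, v in cfg.items() if k not in COMMON_KEYS}
    return model_cls(**init_kwargs)


model = build_model(dict(
    type=ToyModel,
    path="huggyllama/llama-7b",
    max_seq_len=2048,
    abbr="llama-7b",
    max_out_len=100,
    batch_size=16,
    run_cfg=dict(num_gpus=1),
))
print(model.path, model.max_seq_len)  # huggyllama/llama-7b 2048
```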
## API-based Models
Currently, OpenCompass supports API-based model inference for the following:
- OpenAI (`opencompass.models.OpenAI`)
- ChatGLM (`opencompass.models.ZhiPuAI`)
- ABAB-Chat from MiniMax (`opencompass.models.MiniMax`)
- XunFei from XunFei (`opencompass.models.XunFei`)
Let's take the OpenAI configuration file as an example to see how API-based models are used in the
configuration file.
```python
from opencompass.models import OpenAI
models = [
    dict(
        type=OpenAI,  # Using the OpenAI model
        # Parameters for `OpenAI` initialization
        path='gpt-4',  # Specify the model type
        key='YOUR_OPENAI_KEY',  # OpenAI API Key
        max_seq_len=2048,  # The max input number of tokens
        # Common parameters shared by various models, not specific to `OpenAI` initialization.
        abbr='GPT-4',  # Model abbreviation used for result display.
        max_out_len=512,  # Maximum number of generated tokens.
        batch_size=1,  # The size of a batch during inference.
        run_cfg=dict(num_gpus=0),  # Resource requirements (no GPU needed)
    ),
]
```
We have provided several examples for API-based models. Please refer to
```bash
configs
├── eval_zhipu.py
├── eval_xunfei.py
└── eval_minimax.py
```
## Custom Models
If the above methods do not support your model evaluation requirements, you can refer to
[Supporting New Models](../advanced_guides/new_model.md) to add support for new models in OpenCompass.
================================================
FILE: docs/en/user_guides/summarizer.md
================================================
# Results Summary
After the evaluation is complete, the results need to be printed on the screen or saved. This process is controlled by the summarizer.
```{note}
If the summarizer appears in the overall config, all the evaluation results will be output according to the following logic.
If the summarizer does not appear in the overall config, the evaluation results will be output in the order they appear in the `dataset` config.
```
## Example
A typical summarizer configuration file is as follows:
```python
summarizer = dict(
    dataset_abbrs=[
        'race',
        'race-high',
        'race-middle',
    ],
    summary_groups=[
        {'name': 'race', 'subsets': ['race-high', 'race-middle']},
    ]
)
```
The output is:
```text
dataset version metric mode internlm-7b-hf
----------- --------- ------------- ------ ----------------
race - naive_average ppl 76.23
race-high 0c332f accuracy ppl 74.53
race-middle 0c332f accuracy ppl 77.92
```
The summarizer tries to read the evaluation scores from the `{work_dir}/results/` directory, using the `models` and `datasets` in the config as the full set. It then displays them in the order of the `summarizer.dataset_abbrs` list. The summarizer also tries to compute aggregated metrics using `summarizer.summary_groups`. The aggregated metric `name` is generated only if all values in `subsets` exist, so if some scores are missing, the aggregated metric will also be missing. If a score can't be obtained by the above methods, the summarizer puts `-` in the respective cell of the table.
In addition, the output consists of multiple columns:
- The `dataset` column corresponds to the `summarizer.dataset_abbrs` configuration.
- The `version` column is the hash value of the dataset, which considers the dataset's evaluation method, prompt words, output length limit, etc. Users can verify whether two evaluation results are comparable using this column.
- The `metric` column indicates the evaluation method of this metric. For details, see [metrics](./metrics.md).
- The `mode` column indicates how the inference result is obtained. Possible values are `ppl` / `gen`. For items in `summarizer.summary_groups`, if the methods of obtaining `subsets` are consistent, its value will be the same as subsets, otherwise it will be `mixed`.
- The subsequent columns represent different models.
## Field Description
The fields of summarizer are explained as follows:
- `dataset_abbrs`: (list, optional) Display list items. If omitted, all evaluation results will be output.
- `summary_groups`: (list, optional) Configuration for aggregated metrics.
The fields in `summary_groups` are:
- `name`: (str) Name of the aggregated metric.
- `subsets`: (list) Names of the metrics that are aggregated. Note that each entry can be either an original `dataset_abbr` or the name of another aggregated metric.
- `weights`: (list, optional) Weights of the metrics being aggregated. If omitted, the default is to use unweighted averaging.
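The aggregation behavior described by these fields can be sketched as below. This is a minimal illustration under assumptions: `aggregate` is a hypothetical helper, and the weighted form (sum of score times weight, divided by the sum of weights) is assumed rather than taken from the summarizer source.

```python
from typing import Dict, List, Optional


def aggregate(scores: Dict[str, float], subsets: List[str],
              weights: Optional[List[float]] = None) -> Optional[float]:
    """Average the subset scores; return None if any subset score is missing."""
    if any(name not in scores for name in subsets):
        return None
    if weights is None:
        weights = [1.0] * len(subsets)  # unweighted average by default
    total = sum(scores[name] * w for name, w in zip(subsets, weights))
    return total / sum(weights)


scores = {"race-high": 74.53, "race-middle": 77.92}
group = {"name": "race", "subsets": ["race-high", "race-middle"]}

value = aggregate(scores, group["subsets"])
print(group["name"], "-" if value is None else f"{value:.3f}")

# A missing subset score means the aggregated metric is missing too.
print(aggregate({"race-high": 74.53}, group["subsets"]))  # None
```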
Please note that we have stored the summary groups of datasets like MMLU, C-Eval, etc., under the `configs/summarizers/groups` path. It's recommended to consider using them first.
================================================
FILE: docs/zh_cn/.readthedocs.yaml
================================================
version: 2
# Set the version of Python and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.8"
formats:
  - epub
sphinx:
  configuration: docs/zh_cn/conf.py
python:
  install:
    - requirements: requirements/docs.txt
================================================
FILE: docs/zh_cn/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/zh_cn/_static/css/readthedocs.css
================================================
.header-logo {
background-image: url("../image/logo.svg");
background-size: 275px 80px;
height: 80px;
width: 275px;
}
@media screen and (min-width: 1100px) {
.header-logo {
top: -25px;
}
}
pre {
white-space: pre;
}
@media screen and (min-width: 2000px) {
.pytorch-content-left {
width: 1200px;
margin-left: 30px;
}
article.pytorch-article {
max-width: 1200px;
}
.pytorch-breadcrumbs-wrapper {
width: 1200px;
}
.pytorch-right-menu.scrolling-fixed {
position: fixed;
top: 45px;
left: 1580px;
}
}
article.pytorch-article section code {
padding: .2em .4em;
background-color: #f3f4f7;
border-radius: 5px;
}
/* Disable the change in tables */
article.pytorch-article section table code {
padding: unset;
background-color: unset;
border-radius: unset;
}
table.autosummary td {
width: 50%
}
img.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
article.pytorch-article p.rubric {
font-weight: bold;
}
================================================
FILE: docs/zh_cn/_static/js/custom.js
================================================
var collapsedSections = ['数据集统计'];
$(document).ready(function () {
$('.dataset').DataTable({
"stateSave": false,
"lengthChange": false,
"pageLength": 20,
"order": [],
"language": {
"info": "显示 _START_ 至 _END_ 条目(总计 _TOTAL_ )",
"infoFiltered": "(筛选自 _MAX_ 条目)",
"search": "搜索:",
"zeroRecords": "没有找到任何条目",
"paginate": {
"next": "下一页",
"previous": "上一页"
},
}
});
});
================================================
FILE: docs/zh_cn/_templates/404.html
================================================
{% extends "layout.html" %}
{% block body %}
Page Not Found
The page you are looking for cannot be found.
If you just switched documentation versions, it is likely that the page you were on is moved. You can look for it in
the content table left, or go to the homepage.
{% endblock %}
================================================
FILE: docs/zh_cn/_templates/autosummary/class.rst
================================================
.. role:: hidden
    :class: hidden-section
.. currentmodule:: {{ module }}


{{ name | underline}}

.. autoclass:: {{ name }}
    :members:

..
   autogenerated from _templates/autosummary/class.rst
   note it does not have :inherited-members:
================================================
FILE: docs/zh_cn/_templates/callable.rst
================================================
.. role:: hidden
    :class: hidden-section
.. currentmodule:: {{ module }}


{{ name | underline}}

.. autoclass:: {{ name }}
    :members:
    :special-members: __call__

..
   autogenerated from _templates/callable.rst
   note it does not have :inherited-members:
================================================
FILE: docs/zh_cn/advanced_guides/accelerator_intro.md
================================================
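# Accelerate Evaluation Inference with vLLM or LMDeploy
## Background
During an OpenCompass evaluation, the Huggingface transformers library is used for inference by default. This is a very general solution, but in some cases a more efficient inference method is needed to speed up the process, such as vLLM or LMDeploy.
- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving, featuring state-of-the-art serving throughput, efficient memory management with PagedAttention, continuous batching of requests, fast model execution with CUDA/HIP graphs, quantization techniques (such as GPTQ, AWQ, SqueezeLLM, and FP8 KV Cache), and optimized CUDA kernels.
## Preparation for Acceleration
First, check whether the model you want to evaluate supports inference acceleration with vLLM or LMDeploy. Second, make sure vLLM or LMDeploy is installed. Refer to their official documentation for installation details; the following methods are provided for reference:
### LMDeploy Installation
Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):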
```bash
pip install lmdeploy
```
### vLLM Installation
Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
```bash
pip install vllm
```
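## Using vLLM or LMDeploy During Evaluation
### Method 1: Change the Inference Backend with a Command-Line Argument
OpenCompass offers one-click evaluation acceleration: during evaluation it can automatically convert Huggingface transformers models into vLLM or LMDeploy models. Below is sample code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model: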
```python
# eval_gsm8k.py
from mmengine.config import read_base

with read_base():
    # Select a dataset list
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
    # Select a model of interest
    from ..models.hf_llama.hf_llama3_8b_instruct import models
```
Here, `hf_llama3_8b_instruct` is the original Huggingface model configuration, with the following content:
```python
from opencompass.models import HuggingFacewithChatTemplate
models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
        stop_words=['<|end_of_text|>', '<|eot_id|>'],
    )
]
```
To evaluate the GSM8k dataset with the default Huggingface version of the Llama3-8b-instruct model, run:
```bash
python run.py config/eval_gsm8k.py
```
To accelerate the evaluation with vLLM or LMDeploy, use the following command:
```bash
python run.py config/eval_gsm8k.py -a vllm
```
or
```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```
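### Method 2: Accelerate evaluation by deploying an inference acceleration service API
OpenCompass also supports accelerating evaluation through a deployed vLLM or LMDeploy inference service API. The reference steps are as follows:
1. Install the openai package:
```bash
pip install openai
```
2. Deploy the vLLM or LMDeploy inference service API. Refer to their official documentation for deployment details; LMDeploy is used as the example below: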
```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```
The arguments accepted by api_server at startup can be listed with `lmdeploy serve api_server -h`. For example, `--tp` sets tensor parallelism, `--session-len` sets the maximum context window length for inference, and `--cache-max-entry-count` adjusts the proportion of memory used by the k/v cache.
3. Once the service is deployed, modify the evaluation script so that the path in the model configuration points to the deployed service, as follows:
```python
from opencompass.models import OpenAISDK

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAISDK,
        key='EMPTY',  # API key
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name used when requesting the service
        tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct',  # Tokenizer name or path used for requests; if None, the default tokenizer gpt-4 is used
        rpm_verbose=True,  # Whether to print the request rate
        meta_template=api_meta_template,  # Request template
        query_per_second=1,  # Request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Sampling temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```
## Acceleration Results and Performance Comparison
The table below compares accelerated evaluation of the Llama-3-8B-Instruct model on the GSM8k dataset using vLLM or LMDeploy on a single A800 GPU:
| Inference backend | Accuracy | Inference time (min:sec) | Speedup (relative to Huggingface) |
| ----------------- | -------- | ------------------------ | --------------------------------- |
| Huggingface       | 74.22    | 24:26                    | 1.0                               |
| LMDeploy          | 73.69    | 11:15                    | 2.2                               |
| vLLM              | 72.63    | 07:52                    | 3.1                               |
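As a quick check of the speedup column, the ratios can be recomputed from the reported inference times. This is a minimal sketch; the helper name `to_seconds` is introduced here only for illustration, and the timings are copied from the table above.

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a 'minutes:seconds' string such as '24:26' into seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


# Inference times copied from the comparison table.
timings = {"Huggingface": "24:26", "LMDeploy": "11:15", "vLLM": "07:52"}

baseline = to_seconds(timings["Huggingface"])
# Speedup = baseline time / backend time, rounded to one decimal place.
speedups = {name: round(baseline / to_seconds(t), 1) for name, t in timings.items()}
print(speedups)
```

The rounded ratios agree with the values listed in the table.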
================================================
FILE: docs/zh_cn/advanced_guides/circular_eval.md
================================================
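# Circular Evaluation
## Background
For multiple-choice questions, an LLM picking the correct option does not necessarily mean that it truly understood the question and reasoned its way to the answer; it may simply have guessed correctly. To tell these two cases apart, and to reduce the LLM's bias toward particular options, we can use circular evaluation (CircularEval). A multiple-choice question is augmented by shuffling its options; if the LLM answers every augmented variant correctly, the question is counted as correct in the circular-evaluation sense.
## Adding Your Own Circular Evaluation Dataset
Generally, to evaluate a dataset in circular mode, its loading and evaluation methods must be overridden, which requires changes both in the OpenCompass main library and in the configuration file. We use C-Eval as the example below.
OpenCompass main library: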
```python
from opencompass.datasets.ceval import CEvalDataset
from opencompass.datasets.circular import CircularDatasetMeta


class CircularCEvalDataset(CEvalDataset, metaclass=CircularDatasetMeta):
    # The dataset class being overridden
    dataset_class = CEvalDataset
    # If the original load method returns a DatasetDict, the splits that need circular evaluation. CEvalDataset loads [dev, val, test]; only val and test need circular evaluation, dev does not
    default_circular_splits = ['val', 'test']
    # List of keys to be shuffled
    default_option_keys = ['A', 'B', 'C', 'D']
    # If the content of answer_key is one of ['A', 'B', 'C', 'D'] and denotes the correct answer, this field tells how to update the correct answer after the options are shuffled. Mutually exclusive with default_answer_key_switch_method
    default_answer_key = 'answer'
    # If the content of answer_key is not one of ['A', 'B', 'C', 'D'], a function can be used to specify the correct answer after shuffling. Mutually exclusive with default_answer_key
    # def default_answer_key_switch_method(item, circular_pattern):
    #     # item is the original data item
    #     # circular_pattern is a tuple giving the option order after shuffling, e.g. ('D', 'A', 'B', 'C') means the original option A becomes D, the original option B becomes A, and so on
    #     item['answer'] = circular_pattern['ABCD'.index(item['answer'])]
    #     return item
```
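`CircularCEvalDataset` accepts a `circular_pattern` parameter with two possible values:
- `circular`: single-cycle rotation. This is the default. ABCD is expanded into ABCD, BCDA, CDAB, DABC, 4 variants in total
- `all_possible`: all permutations. ABCD is expanded into ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, ..., 24 variants in total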
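The two expansion modes can be illustrated with a few lines of standalone Python. This is only a sketch of the orderings described above, not the OpenCompass implementation; the function name `expand_options` is hypothetical.

```python
from itertools import permutations


def expand_options(keys, pattern="circular"):
    """Return the option orderings produced by a given circular_pattern value."""
    keys = list(keys)
    if pattern == "circular":
        # Rotate the option list one position at a time: ABCD, BCDA, CDAB, DABC.
        return [tuple(keys[i:] + keys[:i]) for i in range(len(keys))]
    if pattern == "all_possible":
        # Every permutation of the options: 4! = 24 orderings for ABCD.
        return list(permutations(keys))
    raise ValueError(f"unknown pattern: {pattern}")


print(expand_options("ABCD", "circular"))
print(len(expand_options("ABCD", "all_possible")))
```

With four options, `circular` yields 4 orderings and `all_possible` yields 24, so the latter costs six times as many inference calls per question.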
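We also provide a `CircularEvaluator` to replace `AccEvaluator`. It accepts `circular_pattern` as well, and the value should match the one used above. It produces the following metrics:
- `acc_{origin|circular|all_possible}`: treats each shuffled variant as a separate question and computes accuracy
- `perf_{origin|circular|all_possible}`: following the circular logic, a question counts as correct only if all of its shuffled variants are answered correctly; accuracy is computed on that basis
- `more_{num}_{origin|circular|all_possible}`: following the circular logic, a question counts as correct if at least num of its shuffled variants are answered correctly; accuracy is computed on that basis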
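To make the difference between the three metric families concrete, here is a simplified sketch that computes them from per-variant correctness flags. It assumes every question has the same number of shuffled variants; the function name `circular_metrics`, the input layout, and the toy data are all illustrative assumptions, and the real `CircularEvaluator` may differ in details.

```python
def circular_metrics(results, nums=(1, 2, 3)):
    """results maps a question id to a list of booleans, one per shuffled variant."""
    n_questions = len(results)
    n_variants = sum(len(flags) for flags in results.values())
    metrics = {
        # acc: every shuffled variant is scored as an independent question.
        "acc": 100 * sum(sum(flags) for flags in results.values()) / n_variants,
        # perf: a question is correct only if all of its variants are correct.
        "perf": 100 * sum(all(flags) for flags in results.values()) / n_questions,
    }
    for num in nums:
        # more_{num}: a question is correct if at least num variants are correct.
        metrics[f"more_{num}"] = (
            100 * sum(sum(flags) >= num for flags in results.values()) / n_questions
        )
    return metrics


# Toy data (made up for illustration): three questions, four rotations each.
toy = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, True],
    "q3": [False, False, False, True],
}
print(circular_metrics(toy))
```

In the toy data, per-variant accuracy is 8 out of 12, but only one question out of three survives the stricter `perf` criterion, which is the gap circular evaluation is designed to expose.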
OpenCompass configuration file:
```python
from mmengine.config import read_base
from opencompass.datasets.circular import CircularCEvalDataset, CircularEvaluator

with read_base():
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets

for d in ceval_datasets:
    # Override the load method
    d['type'] = CircularCEvalDataset
    # Rename to distinguish it from the non-circular version
    d['abbr'] = d['abbr'] + '-circular-4'
    # Override the evaluation method
    d['eval_cfg']['evaluator'] = {'type': CircularEvaluator}

# After the steps above, each dataset looks like this:
# dict(
#     type=CircularCEvalDataset,
#     path='./data/ceval/formal_ceval',  # unchanged
#     name='computer_network',  # unchanged
#     abbr='ceval-computer_network-circular-4',
#     reader_cfg=dict(...),  # unchanged
#     infer_cfg=dict(...),  # unchanged
#     eval_cfg=dict(evaluator=dict(type=CircularEvaluator), ...),
# )
```
In addition, for a clearer presentation of circular-evaluation results, consider using the following summarizer:
```python
from mmengine.config import read_base
from opencompass.summarizers import CircularSummarizer

with read_base():
    from ...summarizers.groups.ceval import ceval_summary_groups

new_summary_groups = []
for item in ceval_summary_groups:
    new_summary_groups.append(
        {
            'name': item['name'] + '-circular-4',
            'subsets': [i + '-circular-4' for i in item['subsets']],
        }
    )

summarizer = dict(
    type=CircularSummarizer,
    # Select which metrics to display
    metric_types=['acc_origin', 'perf_circular'],
    dataset_abbrs=[
        'ceval-circular-4',
        'ceval-humanities-circular-4',
        'ceval-stem-circular-4',
        'ceval-social-science-circular-4',
        'ceval-other-circular-4',
    ],
    summary_groups=new_summary_groups,
)
```
For more complex evaluation examples, see this sample code: https://github.com/open-compass/opencompass/tree/main/examples/eval_circular.py
================================================
FILE: docs/zh_cn/advanced_guides/code_eval.md
================================================
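# Code Evaluation Tutorial
This tutorial uses `humaneval` and `mbpp` as examples to show how to evaluate a model's coding ability.
## pass@1
If you only need to generate a single response per problem to evaluate pass@1, you can directly use [configs/datasets/humaneval/humaneval_gen_8e312c.py](https://github.com/open-compass/opencompass/blob/main/configs/datasets/humaneval/humaneval_gen_8e312c.py) and [configs/datasets/mbpp/deprecated_mbpp_gen_1e1056.py](https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/deprecated_mbpp_gen_1e1056.py), and follow the general [Quick Start tutorial](../get_started/quick_start.md).
For multilingual evaluation, see the [multilingual code evaluation tutorial](./code_eval_service.md).
## pass@k
If you need to generate multiple responses per example to evaluate pass@k, there are two cases to consider. Here we use 10 responses as the example:
### Typical Case
Most models support the `num_return_sequences` parameter in HF generation, which can be used directly to obtain multiple responses. See the following configuration file.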
```python
from mmengine.config import read_base
from opencompass.datasets import MBPPDatasetV2, MBPPPassKEvaluator
from opencompass.models import HuggingFaceCausalLM

with read_base():
    from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
    from .datasets.mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets

mbpp_datasets[0]['type'] = MBPPDatasetV2
mbpp_datasets[0]['eval_cfg']['evaluator']['type'] = MBPPPassKEvaluator
mbpp_datasets[0]['reader_cfg']['output_column'] = 'test_column'

datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        # ... (other model fields, elided in the original)
        generation_kwargs=dict(
            num_return_sequences=10,
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
        ),
        # ... (other model fields, elided in the original)
    )
]
```
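For `mbpp`, the dataset and the evaluation both need changes, so the `type`, `eval_cfg.evaluator.type`, and `reader_cfg.output_column` fields are modified together to meet the new requirements.
The model's responses also need randomness, so `generation_kwargs` must be set accordingly. Note that `num_return_sequences` sets the number of responses.
Note: `num_return_sequences` must be greater than or equal to k, because pass@k is itself a probability estimate.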
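The requirement that the number of samples be at least k follows from how pass@k is usually estimated. The sketch below uses the unbiased estimator introduced with HumanEval, `1 - C(n-c, k) / C(n, k)`, where n is the number of generated samples and c the number that pass. It is an assumption that OpenCompass's evaluators follow exactly this definition; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the tests."""
    if k > n:
        # With fewer than k samples the estimator is undefined.
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples per problem, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 10))
```

With `num_return_sequences=10`, this estimator can report pass@1 through pass@10 from the same set of generations.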
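For details, see the following configuration file:
[examples/eval_code_passk.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk.py)
### Models That Do Not Support Multiple Responses
This applies to some poorly designed APIs and to HF models with missing functionality. In this case, the dataset has to be repeated to achieve the multi-response effect. See the following configuration file.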
```python
from mmengine.config import read_base
from opencompass.datasets import MBPPDatasetV2, MBPPPassKEvaluator
from opencompass.models import HuggingFaceCausalLM

with read_base():
    from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
    from .datasets.mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets

humaneval_datasets[0]['abbr'] = 'openai_humaneval_pass10'
humaneval_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['abbr'] = 'mbpp_pass10'
mbpp_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['type'] = MBPPDatasetV2
mbpp_datasets[0]['eval_cfg']['evaluator']['type'] = MBPPPassKEvaluator
mbpp_datasets[0]['reader_cfg']['output_column'] = 'test_column'

datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        # ... (other model fields, elided in the original)
        generation_kwargs=dict(
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
        ),
        # ... (other model fields, elided in the original)
    )
]
```
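Because the dataset's prompts are not modified, the corresponding fields must be replaced in order to repeat the dataset.
The following fields need to be changed:
- `num_repeats`: the number of times the dataset is repeated
- `abbr`: the dataset abbreviation should preferably change along with the repeat count, because the dataset size changes; this prevents potential problems caused by a mismatch with the values in `.cache/dataset_size.json`.
For `mbpp`, modify the `type`, `eval_cfg.evaluator.type`, and `reader_cfg.output_column` fields in the same way.
The model's responses also need randomness, so `generation_kwargs` must be set accordingly.
For details, see the following configuration file:
[examples/eval_code_passk_repeat_dataset.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_code_passk_repeat_dataset.py)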
================================================
FILE: docs/zh_cn/advanced_guides/code_eval_service.md
================================================
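# Code Evaluation Docker Tutorial
To evaluate the coding ability of LLMs, we need a separate evaluation environment, so that erroneous code is not executed in the development environment where it could cause irreparable damage. The code evaluation service currently used by OpenCompass is the [code-evaluator](https://github.com/open-compass/code-evaluator) project. The rest of this tutorial covers evaluation with this service under different requirements.
1. humaneval-x
A multi-programming-language dataset: [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x)
Dataset [download link](https://github.com/THUDM/CodeGeeX2/tree/main/benchmark/humanevalx): download the files for the languages you want to evaluate (××.jsonl.gz) and place them in the `./data/humanevalx` folder.
Currently supported languages: `python`, `cpp`, `go`, `java`, `js`.
2. DS1000
A Python dataset covering multiple libraries: [ds1000](https://github.com/xlang-ai/DS-1000)
Dataset [download link](https://github.com/xlang-ai/DS-1000/blob/main/ds1000_data.zip)
Currently supported libraries: `Pandas`, `Numpy`, `Tensorflow`, `Scipy`, `Sklearn`, `Pytorch`, `Matplotlib`.
## Starting the Code Evaluation Service
1. Make sure docker is installed; see the [docker installation guide](https://docs.docker.com/engine/install/)
2. Clone the code evaluation service project and build the docker image
Choose the Dockerfile for the dataset you need, substituting `humanevalx` or `ds1000` for `{your-dataset}` in the command below.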
```shell
git clone https://github.com/open-compass/code-evaluator.git
docker build -t code-eval-{your-dataset}:latest -f docker/{your-dataset}/Dockerfile .
```
3. Create the container with the following command
```shell
# Run in the foreground and print logs
docker run -it -p 5000:5000 code-eval-{your-dataset}:latest python server.py
# Run in the background
# docker run -itd -p 5000:5000 code-eval-{your-dataset}:latest python server.py
# Use a different port
# docker run -itd -p 5001:5001 code-eval-{your-dataset}:latest python server.py --port 5001
```
**Note:**
- If you hit timeouts when evaluating Go, create the container with the following command instead
```shell
docker run -it -p 5000:5000 -e GO111MODULE=on -e GOPROXY=https://goproxy.io code-eval-{your-dataset}:latest python server.py
```
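4. To make sure the service is reachable, use the following commands to check connectivity between the inference environment and the evaluation service. (Skip this step if inference and code evaluation run on the same host.)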
```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
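Where `telnet` is not installed, the same reachability check can be done from Python with the standard library. This is a minimal sketch; the function name `can_reach` is hypothetical, and the host and port below are placeholders for your own service address.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with your_service_ip_address and port.
    print(can_reach("localhost", 5000))
```

A `True` result only shows that the port accepts connections; it does not confirm that the evaluation service behind it is healthy.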
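## Local Code Evaluation
When model inference and the code evaluation service run on the same host or within the same local network, inference and evaluation can be run directly. **Note: DS1000 is not yet supported in this mode; use remote evaluation instead.**
### Configuration File
We provide a [configuration file](https://github.com/open-compass/opencompass/blob/main/examples/eval_codegeex2.py) for evaluating humaneval-x with codegeex2 as a reference.
The dataset and related post-processing configuration can be found at this [link](https://github.com/open-compass/opencompass/tree/main/configs/datasets/humanevalx). Pay attention to the evaluator field in humanevalx_eval_cfg_dict.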
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator

humanevalx_reader_cfg = dict(
    input_columns=['prompt'], output_column='task_id', train_split='test')

humanevalx_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template='{prompt}'),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=1024))

humanevalx_eval_cfg_dict = {
    lang: dict(
        evaluator=dict(
            type=HumanevalXEvaluator,
            language=lang,
            ip_address="localhost",  # replace to your code_eval_server ip_address, port
            port=5000),  # refer to https://github.com/open-compass/code-evaluator to launch a server
        pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # do not support rust now
}

humanevalx_datasets = [
    dict(
        type=HumanevalXDataset,
        abbr=f'humanevalx-{lang}',
        language=lang,
        path='./data/humanevalx',
        reader_cfg=humanevalx_reader_cfg,
        infer_cfg=humanevalx_infer_cfg,
        eval_cfg=humanevalx_eval_cfg_dict[lang])
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
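### Launching the Task
See the [Quick Start tutorial](../get_started.html)
## Remote Code Evaluation
When model inference and the code evaluation service run on different machines that cannot reach each other, run model inference first and collect the inference results. The configuration file and inference workflow from the tutorial above can be reused.
### Collecting Inference Results (Humanevalx only)
OpenCompass provides the `collect_code_preds.py` script in `tools` to post-process and collect inference results. You only need to pass the configuration file used to launch the task and specify the working directory of the task to reuse; this option works the same as `-r` in `run.py`. For details, see the [documentation](https://opencompass.readthedocs.io/zh-cn/latest/get_started/quick_start.html#id4).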
```shell
python tools/collect_code_preds.py [config] [-r latest]
```
The collected results are saved in the working directory specified by `-r`, with the following directory structure:
```
workdir/humanevalx
├── codegeex2-6b
│ ├── humanevalx_cpp.json
│ ├── humanevalx_go.json
│ ├── humanevalx_java.json
│ ├── humanevalx_js.json
│ └── humanevalx_python.json
├── CodeLlama-13b
│ ├── ...
├── CodeLlama-13b-Instruct
│ ├── ...
├── CodeLlama-13b-Python
│ ├── ...
├── ...
```
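Before submitting files to the evaluation service, it can help to enumerate what was collected. The sketch below walks the directory layout shown above and groups result files by model and language; the function name `list_collected_results` is hypothetical, and the layout is assumed to match the tree exactly.

```python
from pathlib import Path

PREFIX = "humanevalx_"


def list_collected_results(workdir):
    """Map model name -> {language: path} for files under <workdir>/humanevalx."""
    root = Path(workdir) / "humanevalx"
    results = {}
    # Each model has its own folder containing humanevalx_<lang>.json files.
    for path in sorted(root.glob(f"*/{PREFIX}*.json")):
        model = path.parent.name
        language = path.stem[len(PREFIX):]
        results.setdefault(model, {})[language] = path
    return results
```

Calling `list_collected_results("workdir")` on the tree above would return one entry per model folder, each mapping language names such as `python` or `cpp` to the corresponding JSON file.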
For DS1000, you only need the prediction files generated by `opencompass`.
### Code Evaluation
#### The following applies to Humanevalx only
With the code evaluation service running, submit a request with `curl`:
```shell
curl -X POST -F 'file=@{result_absolute_path}' -F 'dataset={dataset/language}' {your_service_ip_address}:{your_service_port}/evaluate
```
For example:
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' localhost:5000/evaluate
```
The result:
```
"{\"pass@1\": 37.19512195121951%}"
```
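We additionally provide a `with-prompt` option (default True). Some models (such as WizardCoder) generate complete code, so the prompt and prediction do not need to be concatenated; in that case, evaluate with the following command.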
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' -H 'with-prompt: False' localhost:5000/evaluate
```
#### The following applies to DS1000 only
With the code evaluation service running, submit a request with `curl`:
```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' localhost:5000/evaluate
```
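DS1000 supports an additional debug parameter. Note that enabling it produces a large amount of log output.
- `full`: additionally prints, for each failed sample, the raw prediction, the post-processed prediction, the executed program, and the final error.
- `half`: additionally prints, for each failed sample, the executed program and the final error.
- `error`: additionally prints, for each failed sample, the final error.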
```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' -F 'debug=error' localhost:5000/evaluate
```
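The degree of parallelism can also be controlled by setting `num_workers` in the same way.
## Advanced Tutorial
Beyond evaluating the supported `humanevalx` dataset, users may also have the following needs:
### Supporting New Datasets
See the [tutorial on supporting new datasets](./new_dataset.md)
### Modifying Post-processing
1. For local evaluation, the post-processing method can be modified following the post-processing section of the tutorial on supporting new datasets;
2. For remote evaluation, modify the post-processing part of `tools/collect_code_preds.py`;
3. The code evaluation service also contains some post-processing that can be modified; see the next section for details.
### Debugging the Code Evaluation Service
When supporting a new dataset or modifying post-processing, you may need to change the code evaluation service itself. Modify the following parts as needed:
1. Remove the part of the `Dockerfile` that installs `code-evaluator`, and mount `code-evaluator` when starting the container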
```shell
docker run -it -p 5000:5000 -v /local/path/of/code-evaluator:/workspace/code-evaluator code-eval:latest bash
```
2. Install and start the code evaluation service. You can then modify the local `code-evaluator` code as needed for debugging
```shell
cd code-evaluator && pip install -r requirements.txt
python server.py
```
================================================
FILE: docs/zh_cn/advanced_guides/compassbench_intro.md
================================================
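# Introduction to CompassBench
## CompassBench 2.0 v1.3
CompassBench (the official in-house leaderboard) has gone through multiple iterations. Starting in July 2024, OpenCompass publishes the evaluation rules (evaluation configuration files) and sample dataset files of the in-house leaderboard, to help the community better understand its evaluation logic and methods.
### Capability Dimensions
The August 2024 leaderboard covers the following capability dimensions:
| Capability | Task description | Evaluation method | Sample data |
| ---------- | ---------------- | ----------------- | ----------- |
| Language | Evaluates the model on tasks such as information extraction, content summarization, dialogue, and creative writing | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/language |
| Reasoning | Evaluates the model on everyday reasoning tasks such as logical reasoning, commonsense reasoning, and table reasoning | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/reasoning |
| Knowledge | Evaluates the model's knowledge across science, engineering, the humanities, and the social sciences | Objective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/knowledge |
| Math | Evaluates the model on numerical computation and on math problems at high-school and university level | Objective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/math |
| Code | Evaluates the model on code generation, code completion, code commenting, code refactoring, code rewriting, and comprehensive computer-knowledge Q&A | Objective + subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/code |
| Instruction following | Evaluates whether the model can accurately follow complex instructions in tasks based on language, reasoning, knowledge, and so on | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/instruct |
| Agent | Evaluates the model's ability in complex tool calling, and in using a code interpreter for data science, math, and similar scenarios | Objective | https://github.com/open-compass/T-Eval https://github.com/open-compass/CIBench |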
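### Evaluation Methods
- Objective evaluation uses 0-shot + CoT.
  - OpenCompass has substantially optimized post-processing for objective questions and constrains the answer format in the prompt at evaluation time. If an answer cannot be extracted because the model failed to follow instructions, the response is counted as wrong
  - The math and agent question types are similar to the given sample data, but the real evaluation data differs from the open-source data
- Subjective evaluation uses LLM-based judging.
  - We provide scoring guidance for every question.
  - The win rate of the model under test is measured against a reference response, on five levels
    - `A++`: Response A is much better than Response B.
    - `A+`: Response A is slightly better than Response B.
    - `A=B`: Response A and Response B are of the same quality.
    - `B+`: Response B is slightly better than Response A.
    - `B++`: Response B is much better than Response A.
  - Subjective evaluation configuration file
    - [Sample evaluation configuration](https://github.com/open-compass/opencompass/blob/main/configs/eval_compassbench_v1_3_subjective.py)
  - Subjective evaluation prompt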
```
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the \
responses generated by two AI models.
We will provide you with the user query and a pair of AI-generated \
responses (Response A and Response B).
You should first read the user query and the conversation history \
carefully for analyzing the task, and then evaluate the quality of the \
responses based on and rules provided below.
# Conversation between User and AI
## User Query
<|begin_of_query|>
{question}
<|end_of_query|>
## Response A
<|begin_of_response_A|>
{prediction}
<|end_of_response_A|>
## Response B
<|begin_of_response_B|>
{prediction2}
<|end_of_response_B|>
# Evaluation
## Checklist
<|begin_of_checklist|>
{checklist}
<|end_of_checklist|>
Please use this checklist to guide your evaluation, but do not limit your \
assessment to the checklist.
## Rules
You should compare the above two responses based on your analysis of the \
user queries and the conversation history.
You should first write down your analysis and the checklist that you used \
for the evaluation, and then provide your assessment according to the \
checklist.
There are five choices to give your final assessment: ["A++", "A+", \
"A=B", "B+", "B++"], which correspond to the following meanings:
- `A++`: Response A is much better than Response B.
- `A+`: Response A is only slightly better than Response B.
- `A=B`: Response A and B are of the same quality. Please use this \
choice sparingly.
- `B+`: Response B is only slightly better than Response A.
- `B++`: Response B is much better than Response A.
## Output Format
First, please output your analysis for each model response, and \
then summarize your assessment to three aspects: "reason A=B", \
"reason A>B", and "reason B>A", and finally make your choice for \
the final assessment.
Please provide your evaluation results in the following json \
format by filling in the placeholders in []:
{
"analysis of A": "[analysis of Response A]",
"analysis of B": "[analysis of Response B]",
"reason of A=B": "[where Response A and B perform equally well]",
"reason of A>B": "[where Response A is better than Response B]",
"reason of B>A": "[where Response B is better than Response A]",
"choice": "[A++ or A+ or A=B or B+ or B++]",
}
# 指令
您是一位专业评估专家。您的任务是评估两个AI模型生成回答的质量。
我们将为您提供用户问题及一对AI生成的回答(回答A和回答B)。
您应当首先仔细阅读用户问题,然后根据以下提供的规则评估回答的质量。
# 用户与AI之间的对话
## 用户问题
<|begin_of_query|>
{question}
<|end_of_query|>
## 回答A
<|begin_of_response_A|>
{prediction}
<|end_of_response_A|>
## 回答B
<|begin_of_response_B|>
{prediction2}
<|end_of_response_B|>
# 评估
## 检查清单
<|begin_of_checklist|>
{checklist}
<|end_of_checklist|>
请参考此检查清单来评估回答的质量,但不要局限于此检查清单。
## 规则
您应当基于用户查询,分析比较上述两种回答。
您应当基于检查清单写下您的分析,然后提供您的评价。
有五个选项供您做出最终评估:["A++", "A+", "A=B", "B+", "B++"],它们对应如下含义:
- `A++`:回答A远胜于回答B。
- `A+`:回答A略优于回答B。
- `A=B`:回答A和回答B质量相同。请谨慎使用此选项。
- `B+`:回答B略优于回答A。
- `B++`:回答B远胜于回答A。
## 输出格式
首先,请输出您对每个模型回答的分析,
然后总结您的评估到三个方面:"A=B的理由","A优于B的理由",和 "B优于A的理由",
最后做出您对最终评估的选择。
请按照以下json格式提供您的评估结果,通过填充[]中的占位符:
{
"回答A的分析": "[回答A的分析]",
"回答B的分析": "[回答B的分析]",
"A=B的理由": "[A和B回答差不多的理由]",
"A优于B的理由": "[回答A优于B的理由]",
"B优于A的理由": "[回答B优于A的理由]",
"choice": "[A++ or A+ or A=B or B+ or B++]",
}
```
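The output format in the prompt ends the JSON object with a trailing comma, which strict JSON parsers reject, so a tolerant extraction of the `choice` field may be more practical than `json.loads`. The sketch below shows one way to do this with a regular expression; it is an illustrative assumption, not the post-processing OpenCompass actually uses, and the names `extract_choice` and `tally` are hypothetical.

```python
import re
from collections import Counter

# Longer labels come first so that "A++" is not matched as "A+".
CHOICE_PATTERN = re.compile(
    r'"choice"\s*:\s*"\[?\s*(A\+\+|A\+|A=B|B\+\+|B\+)\s*\]?"'
)


def extract_choice(judge_reply: str):
    """Return one of the five verdict labels, or None if none is found."""
    match = CHOICE_PATTERN.search(judge_reply)
    return match.group(1) if match else None


def tally(judge_replies):
    """Count how often each verdict (or None for unparseable replies) occurs."""
    return Counter(extract_choice(reply) for reply in judge_replies)


# Made-up judge replies for illustration.
replies = [
    '{"analysis of A": "...", "choice": "A+",}',
    '{"回答A的分析": "...", "choice": "[B++]",}',
    "The judge produced no verdict.",
]
print(tally(replies))
```

Because both the English and the Chinese prompt use the key `choice`, the same pattern covers either language. How the five levels are converted into a win rate is not specified here, so the sketch stops at counting verdicts.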
================================================
FILE: docs/zh_cn/advanced_guides/compassbench_v2_0.md
================================================
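# Introduction to CompassBench 2.0
## Introduction to v1.0
To support OpenCompass's annual leaderboard, this document gives an overall introduction to CompassBench.
This evaluation covers tasks in language, knowledge, creative writing, reasoning, math, code, long-context, and agent capabilities. Task descriptions and sample questions are provided below.
- The evaluation combines subjective and objective methods, designed specifically for each task.
- Reasoning, math, code, and agent tasks use Few-shot + CoT.
- For fill-in-the-blank questions, few-shot examples and output-format constraints in the prompt help with answer extraction.
- For multiple-choice questions, the same question is asked in varied forms to reduce the effect of randomness.
- For open-ended questions, the same question is sampled multiple times and scored along multiple dimensions.
> OpenCompass has substantially optimized post-processing for objective questions and constrains the answer format in the prompt at evaluation time. If an answer cannot be extracted because the model failed to follow instructions, the response is counted as wrong. OpenCompass will add an instruction-following evaluation in the next round.
| Capability | Task | Description | Sample question |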
| ---- | ---- | ---- | ---- |
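| Language | Information extraction | Information extraction means extracting specific types of information from text. This kind of task is typically used in scenarios such as structured data processing, knowledge graphs, and question-answering systems. | ```"question": "野马队在分区轮以 23–16 击败了匹兹堡钢人队,在比赛的最后三分钟拿下 11 分。然后他们在美式足球联合会 (AFC) 锦标赛上以 20–18 击败了第 49 届超级碗卫冕冠军新英格兰爱国者队,在比赛还剩 17 秒 时拦截了新英格兰队的两分转换传球。尽管曼宁在本赛季的拦截上有问题,但他在两场季后赛中未投任何球。\n野马队在 AFC 锦标赛中打败了谁?"``` |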
| 语言 | 意图识别 | 意图识别是对用户输入的文本或语音进行分析,判断其意图或需求。这类任务应用于智能客服、语音助手、聊天机器人等场景。 | ```"question": "中国文化的天人合一思想\n中西文化的基本差异之一就是,在人与自然的关系问题上,中国文化比较重视人与自然的和谐统一,而西方文化则强调,人要征服自然、改造自然才能求得自己的生存和发展。中国文化的这种特色,有时通过“天人合一”的命题表述出来。中国古代思想家一般都反对把天与人割裂开来、对立起来,而主张天人协调、天人合一。\n天人合一问题,就其理论实质而言,是关于人与自然的统一问题,或者说是自然界和精神的统一问题。应当承认,中国传统文化中的天人合一思想,内容十分复杂,其中既有正确的观点,也有错误的观点,我们必须实事求是地予以分析。但是,从文化的民族性以及对民族文化的推进作用和深远影响看,我们应当大胆肯定。中国古代思想家关于天人合一的思想,其最基本的涵义,就是充分肯定自然界和精神的统一,关注人类行为与自然界的协调问题。从这个意思上说,天人合一思想的,是非常有价值的。\n恩格斯对自然和精神的统一问题,有过一系列精辟的论述。他说:“我们一天天地学会更加正确地理解自然规律,学会认识我们对于自然界的惯常行程的干涉所引起的比较近或比较远的影响。”他还说:“自然界和精神是统一的。自然界不能是无理性的……而理性是不能和自然界矛盾的。”“思维规律和自然规律,只要它们被正确地认识,必然是互相一致的。”恩格斯的这些论述,深刻地揭示了自然和精神统一问题的丰富内涵。根据恩格斯的这些论述,考察中国古代的天人合一思想,不难看出,这种思想有着深刻的合理性。\n中国古代的天人合一思想,强调人与自然的统一,人的行为与自然的协调,道德理性与自然理性的一致,充分显示了中国古代思想家对于主客体之间、主观能动性和客观规律之间关系的辩证思考。根据这种思想,人不能违背自然规律,不能超越自然界的承受力去改造自然、征服自然、破坏自然,而只能在顺从自然规律的条件下去利用自然、调整自然,使之更符合人类的需要,也使自然界的万物都能生长发展。另一方面,自然界也不是主宰人其社会的神秘力量,而是可以认识、可以为我所用的客观对象。这种思想长期实践的结果,是达到自然界与人的统一,人的精神、行为与外在自然的统一,自我身心平衡与自然环境平衡的统一,以及由于这些统一而达到的天道与人道的统一,从而实现完满和谐的精神追求。中国文化的天人合一思想,对于解决当今世界由于工业化和无限制地征服自然而带来的自然环境被污染、生态平衡遭破坏等问题,具有重要的启迪意义;对于我们今天正在进行的社会主义现代化建设,更有着防患于未然的重大现实意义。\n(选自张岱年等主编的《中国文化概论》,有删改)\n根据原文提供的信息,下列推断不正确的一项是","A": "对人与自然关系的认识,中国古代天人合一思想有优于西方文化的地方。","B": "现代人重视和研究天人合一思想,是基于对现实及发展问题的思考。", "C": "肯定天人合一思想的合理性,并不意味着对其思想内容的全盘接受。", "D": "以天人合一思想为指导,可解决当今世界因工业化带来的各种社会问题。",``` |
| Language | Sentiment analysis | Sentiment analysis identifies and analyzes the sentiment or emotion in text. This kind of task can be used to analyze sentiment orientation, for example analyzing user comments on social media to understand attitudes toward news or events. | ```"question": "请问以下评价是正面评价还是负面评价?\n大众点评网的霸王餐,200份华辉拉肠双人试吃,员村一店是已经有经营两年以上的,年前装修过,干净齐整,下单的服务员亲切有礼,可能我是第一个用代码验证的,中间拖了点时间去验证,幸好周日10点左右没有平时的多人。拉肠一如既往的滑,皮蛋瘦肉粥很绵,皮蛋瘦肉超多,肉肠是一底带肉一底斋肠,以前没吃过鸡蛋肠觉得6蚊不太划算,现在发现是有三底肠粉的哦,不太喜欢吃肉的可以试下,很饱肚,鼓油是吃过这么多家肠粉店味道调得最好的。","A": "正面评价", "B": "负面评价"```|
| Language | Content summarization | Content summarization compresses a longer text into a short summary. This kind of task suits situations where the core content of a document must be grasped quickly, such as news headlines and email summaries | ```联合国减灾办公室负责人格拉瑟。联合国减灾办公室2016年2月11日联合国减灾办公室今天表示,2015年是有记录以来最热的一个年份,在这一年当中,自然灾害影响了近1亿人口。减灾办公室呼吁各国采取行动,应对气候变化,在最大程度上做出努力,防止和减少灾害的发生。联合国减灾办公室所公布的最新数据显示,在过去一年当中,受到灾害影响最重的国家都在亚洲,它们是中国、印度、菲律宾和印度尼西亚。自然灾害共导致2万2000人死亡,带来的经济损失约合660亿美元。然而,尽管这一数字惊人,但却低于1400亿的10年平均数字。其中的部分原因是各国政府采取了更好的防范措施。数据显示,2015年有5000万人深受旱灾之苦,增幅达40%。联合国减灾办公室负责人格拉瑟表示,2015年是记载中最热的一个年份,成因是气候变化和厄尔尼诺天气现象。他指出,最令人感到不安的一个趋势是2015年有记录的主要干旱增加了一倍。他强调,数据表明,减少温室气体排放和适应气候变化对于减少灾害风险至关重要。```|
| 语言 | 内容评价 | 内容评价是对文本的质量、价值或观点进行判断和评价的任务。这类任务可用于评论筛选、观点挖掘等场景。 | ```"question": "以下是一个问题以及针对该问题的两个答案,哪个答案更好?\n问题:创建一篇1000字的非剽窃新闻文章,关于任天堂将于2月8日星期三播出新的任天堂直面会,承诺将公布即将推出的Switch游戏的新细节。2月的任天堂直面会将在东部时间下午5点/太平洋时间下午2点,在公司的YouTube频道上直播。\n\n任天堂表示,星期三的任天堂直面会将持续“大约”40分钟,并将重点放在即将在2023年上半年推出的Nintendo Switch游戏上。\n\n任天堂宣布的Nintendo Switch游戏阵容包括《星之卡比:梦之泉豪华版》,这是2011年Wii游戏的重制版;《魔兵雅各:樱花与失落的恶魔》,这是一个以童话为灵感的冒险故事,主角是年轻的魔兵雅各;《塞尔达传说:王国之泪》,是《荒野之息》的续作;以及《皮克敏4》,这是一款长期运行的策略/动作系列的新作,仍然有些神秘。\n\n此外,任天堂的计划中,但日期尚未确定的还有《超级战争大战1+2:重启营》,这是两款Game Boy Advance时代的《超级战争》游戏的重制版。经过长时间的延迟,Switch版的《超级战争》合集据传很快就要推出。当然,任天堂在2023年还有其他一些大计划,包括2月17日在好莱坞环球影城开设的超级任天堂乐园主题公园,以及4月7日上映的《超级马里奥兄弟》电影。格斗游戏确实很酷,但可能会让人望而生畏。你可能会看到铁拳或街霸这样的竞技世界,然后想,“我可能很菜!而且没有正确的装备!而且我不懂这些术语!”但是停下来。你可以进入格斗游戏。对Pat Gill来说,这段旅程真的很有意义。以下是他如何拥抱格斗游戏,以及你也可以。\n答案 1:任天堂粉丝们正急切地期待着最新的任天堂直面发布会,该发布会定于2月8日星期三播出。该发布会将在公司的YouTube频道上以东部时间下午5点/太平洋时间下午2点的直播形式进行,承诺提供有关任天堂Switch即将推出的游戏的令人兴奋的新细节。\n\n据任天堂称,直面发布会将持续大约40分钟,专门关注2023年上半年将为Switch发布的游戏。已经宣布的一些游戏标题包括《星之卡比:梦之泉DX》、《魔兽传说:樱花与失落的恶魔》、《塞尔达传说:王国的泪滴》和《皮克敏4》。\n\n任天堂日程表上最受期待的游戏之一是《超级战争大战1+2:重启训练营》,这是两款经典Game Boy Advance游戏的重制版。尽管尚未正式宣布发行日期,但有传言称游戏将在经过长时间延迟后很快推出。\n\n除了游戏阵容外,任天堂还计划在2023年推出许多其他令人兴奋的项目,包括2月17日在好莱坞环球影城开放的超级任天堂乐园主题公园,以及4月7日上映的《超级马里奥兄弟》电影。\n\n对于格斗游戏爱好者来说,任天堂直面发布会承诺将是一个令人兴奋的活动,将提供任天堂Switch最新和最伟大游戏的一瞥。无论您是经验丰富的玩家还是刚刚开始,直面发布会中肯定会有适合每个人的内容。因此,请务必在2月8日星期三收看,以获取来自任天堂的所有最新消息和更新!\n答案 2:任天堂宣布将于2月8日星期三下午5点EST/下午2点PST,在公司的YouTube频道上播出新的任天堂Direct演示。演示将持续大约40分钟,并将提供有关2023年上半年即将推出的任天堂Switch游戏的新细节。\n\n公司宣布的任天堂Switch阵容包括《星之卡比:梦之返乡豪华版》,这是2011年Wii游戏的重制版;《魔兽师起源:赛瑞莎与失落的恶魔》,一个以童话为灵感的冒险,主角是年轻的贝约奈塔;《塞尔达传说:王国之泪》,《荒野之息》的续作;以及《皮克敏4》,这是一款长寿的策略/动作系列的新作,仍然有些神秘。\n\n此外,任天堂还有一些其他大计划,包括在2023年2月17日在好莱坞环球影城开设超级任天堂乐园主题公园,以及于4月7日上映《超级马里奥兄弟电影》。\n\n格斗游戏是一种受欢迎的游戏类型,可能是一种令人望而生畏的爱好。然而,人们是可以享受格斗游戏的,Pat Gill就是如何拥抱这种爱好的一个很好的例子。他从一个初学者开始,发现这是一段有意义的旅程。只要有正确的心态和资源,任何人都可以参与格斗游戏,并享受它们所提供的刺激和竞争。" ``` |
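| Language | Multilingual translation | Multilingual translation converts text in one language into text in another. This kind of task suits cross-language communication, online translation, and similar scenarios. |```"question": "Translate the following sentence from English to French: \"He [Wales] basically lied to us from the start. First, by acting as if this was for legal reasons. Second, by pretending he was listening to us, right up to his art deletion."```|
| Language | Understanding of traditional Chinese culture | Traditional Chinese culture covers the study of ancient Chinese literature, art, philosophy, history, and related fields | ``` "question": "王实甫在《西厢记》中写道:“淋漓襟袖啼红泪,比司马青衫更湿”,其中“司马青衫”指的是什么"``` |
| Language | Chinese semantic understanding | Chinese semantic understanding involves understanding the semantic relations among words, phrases, and sentences in text, including but not limited to synonymy, antonymy, whole-part relations, and modification relations. |``` "question": "“繁荣”与以下哪个词具有近义关系?", "A": "盛世", "B": "荣誉", "C": "繁花", "D": "昌盛"```|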
| 语言 | 多轮对话 | 评价模型能否在多轮对话中保持上下文一致性和连贯性的能力,评估模型是否能够理解并记住对话的上下文信息,记住之前的对话内容。 |```[{'role': 'user','content': '我在做一项关于智能手机市场的研究,需要整理一些数据成 Markdown 表格。数据包括品牌名称、市场份额和热销型号。品牌有苹果、三星和华为。苹果的市场份额是30%,热销型号是iPhone 13;三星市场份额是25%,热销型号是Galaxy S21;华为市场份额是20%,热销型号是Mate 40。请帮我做一个表格。'},{'role': 'user','content': '看起来不错,不过我希望表格中的市场份额列展示为百分比和实际销量。苹果的销量是8000万部,三星是6000万部,华为是5000万部。'}, {'role': 'user', 'content': '很好。现在请把表格的标题中文改成英文,并且各列改成对齐方式:品牌列左对齐,市场份额列居中对齐,热销型号列右对齐。'},{'role': 'user', 'content': '可以,我注意到我们可能需要添加一列来表示这些品牌的总收入,苹果为500亿美元,三星为400亿美元,华为为350亿美元。此外,请按市场销量对行进行排序。'}]```|
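| Knowledge | Everyday common sense | Tests popular knowledge that ordinary people of normal intelligence in society all or generally possess | ```"question": "世界四大文明古国有哪些?```|
| Knowledge | Natural sciences (science) | Specific sciences about natural phenomena that study the essence and laws of nature (science): including but not limited to mathematics, physics, chemistry, biology, and astronomy | ```"question": "群的研究对象是什么?"``` |
| Knowledge | Natural sciences (engineering) | Specific sciences about natural phenomena that study the essence and laws of nature (engineering): including but not limited to computer science, medicine, architecture, materials science, mechanics, surveying, meteorology, and environmental science | ```"question": "下列关于信息安全的说法,正确的是( )。", "options": ["打开朋友转发的网页链接一定是安全的", "安装了杀毒软件后电脑就不会感染病毒", "数据加密是一种提高信息安全性的有效措施", "手机指纹识别技术能确保手机所有信息的安全"]``` |
| Knowledge | Social sciences | Specific sciences that study social phenomena and seek to reveal the essence and laws of society, such as economics, political science, military science, sociology, management, and education. The social sciences mainly study the organization and structure, systems and relations, functions and efficiency, and order and norms of human society, and use this knowledge to provide the knowledge, theory, and means for the orderly management and efficient operation of human society | ```"question": "为了避免资金供应短缺和倒闭,企业经营者需要做什么?"``` |
| Knowledge | Humanities | Concerns reflection on and emotional experience of human questions: the exploration of ideas, concepts, knowledge, and theories that revolve around the human inner world and spiritual life. It takes humanity itself, especially the inner emotional world, as its research focus, and human development and self-improvement as the starting point and goal of scholarly inquiry. Includes but is not limited to literature, history, philosophy, art, and language | ```"question": "光绪二十四年(1898)五月,维新派代表人物康有为从“中体西用”的角度论述了科举制度改革的必要性。这表明他( )", "options": ["在戊戌变法初期思想趋于保守", "认同洋务派的“中体西用”思想", "在教育改革方面与洋务派观点一致", "所说的“体”和“用”与洋务派不同"]``` |
| Creative writing | Content expansion | Starting from a given title or outline, enrich the content with added details, descriptions, and explanations to make it fuller and more expressive. This approach is mainly used for literary writing such as essays and novels, and for practical texts such as academic papers and reports | ```请根据我给出的[外星人入侵、核弹、流亡]这些关键词来撰写一篇[科幻]题材的短篇故事。 \n故事需要拥有[引人入胜]的开头以及[反转]的结局,故事线[跌宕起伏]。\n注意请使用[刘慈欣]的写作风格为我撰写这篇故事。减少赘述,内容中不要有重复或意思相近的段落,大约800字``` |
| Creative writing | Content continuation | Starting from an existing text, continue writing what follows. This approach is mainly used for narrative texts such as novels and stories. The continuation should usually stay consistent with the style, plot, and characters of the original text, and requires strong imagination and creativity from the author. | ```题目:《新型能源技术在工业生产中的应用与效益》随着能源需求的不断增长和传统能源的有限性,新型能源技术在工业领域的应用备受瞩目。本文将着重探讨新型能源技术对工业生产的潜在影响,以及其在提高生产效益和减少环境影响方面的作用。请按照以上题目和摘要,完成一篇不少于1000字的论文``` |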
| 创作 | 内容改写 | 不改变原文主题和基本结构的前提下,对文本进行一定程度的修改、重组和优化。这种方法主要用于修改学术论文、报告、文章等。内容改写的目的是提高文本的表达能力、逻辑性和可读性,同时避免重复。 | ```请帮我总结一封电子邮件的内容,总结需要包含以下四个部分:\n【重要性】根据内容判断事项是否重要,结果包含重要、不重要\n【紧急性】根据内容判断事项是否紧急,结果包含紧急、不紧急\n【核心内容】使用一句简短的话总结邮件最核心的内容。\n【需要回复内容】请判断邮件中哪些内容需要获得我的回复/确认,以列表形式呈现。\n 接下来,请根据下面邮件的内容,进行摘要:\n亲爱的全体员工:\n为了改善大家的身心健康,增强工作效率,公司特别安排了一场瑜伽兴趣培训,现将培训内容通知如下:\n日期及时间:8月15日(周六)上午9:00至11:00\n地点:公司三楼活动室(面积120平米,可容纳30人参加培训)\n培训内容:\n专业瑜伽教练将为大家进行基础的瑜伽技能和健康知识培训。 瑜伽是一种低强度有氧运动,适合各年龄层人群。它能够通过姿势练习、呼吸技巧等,改善身体的柔韧性和平衡感,帮助人体各系统更好地运行,有效减压提神。\n本次培训重点讲解:\n1)基本的瑜伽哲学及其健康效果介绍\n2)冥想和呼吸技巧演练\n3)10多个常见的基础瑜伽姿势示范及练习(包括猿人式、波浪式、斜 Supported Headstand 等)\n4)瑜伽练习时需要注意的安全事项\n5)瑜伽适宜穿戴的服装和个人物品\n6)参与培训后如何延续瑜伽运动\n培训具体流程:\n9:00-9:30 瑜伽基本概念介绍\n9:30-10:10 练习冥想、呼吸及基础姿势\n10:10-10:30 小休10分钟\n10:30-11:00 继续练习高难度姿势并解答问题\n如有意参加本次瑜伽兴趣培训,请于8月10日前用邮件或电话方式告知我们,我方将安排培训。\n若您有任何问题或建议,也欢迎与我联系。感谢您的收听与参与。```|
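| Reasoning | Logical reasoning | A comprehensive test of several common logical reasoning modes, such as deduction, induction, and abduction. | ```"question": "在接下来的文本中,符号 -> 代表着一个简单的数学运算。\n695 - 472 -> 229\n222 - 62 -> 166\n689 - 439 -> ?",```|
| Reasoning | Commonsense reasoning | Commonsense reasoning is the process of making reasonable inferences and judgments about things based on knowledge and experience accumulated in daily life. It involves understanding common things, phenomena, and regularities, and reaching reasonable conclusions through comprehensive analysis. | ```"question": "美即好效应,指对一个外表英俊漂亮的人,人们很容易误认为他或她的其他方面也很不错。根据上述定义,下列哪项属于美即好效应?( )", "A": "外表英俊漂亮的人在应聘中更受招聘者的青睐", "B": "小芳认为自己的女儿是幼儿园中最漂亮的孩子", "C": "人们常说女孩因为可爱而美丽并非因为美丽而可爱", "D": "购物网站上有一个漂亮的模特往往会提高产品的销量"``` |
| Math | Elementary math | Math ability at the primary-education level (primary-school math) | ```"question": "小芳手上有40元。她的爸爸又给了她100元。她花了30元买了一条牛仔裤,又花了20元买了一个包。那么小芳还剩下多少钱呢?"```|
| Math | Intermediate math | Math ability at the secondary-education level (middle-school and high-school math) | ```"question": "某地开展建设绿色家园活动,活动期间,计划每天种植相同数量的树木.该活动开始后,实际每天比原计划每天多植树$50$棵,实际植树$400$棵所需时间与原计划植树$300$棵所需时间相同.设实际每天植树$x$棵,则下列方程正确的是( )", "options": ["$\\frac{{400}}{{x-50}}=\\frac{{300}}{x}$", "$\\frac{{300}}{{x-50}}=\\frac{{400}}{x}$", "$\\frac{{400}}{{x+50}}=\\frac{{300}}{x}$", "$\\frac{{300}}{{x+50}}=\\frac{{400}}{x}$"]```|
| Math | Advanced math | Math ability at the higher-education level (university and graduate math) | ```"question": "已知有向曲线 $L$ 为球面 $x^2+y^2+z^2=2x$ 与平面 $2x-z-1=0$ 的交线,从 $z$ 轴正向往 $z$ 轴负向看去为逆时针方向,计算曲线积分$\\int_L(6xyz-yz^2)dx+2x^2zdy+xyzdz$.", "options": [ "$\\frac{4\\pi}{7\\sqrt5}$", "$\\frac{3\\pi}{7\\sqrt5}$", "$\\frac{3\\pi}{5\\sqrt5}$", "$\\frac{4\\pi}{5\\sqrt5}$"]``` |
| Code | Code understanding | The input is a user's requirement in text or a partial piece of code. Tests the model's logical reasoning and code generation ability, and its command of various programming languages. Content includes but is not limited to: algorithm and data-structure skills, programming-language syntax, and cross-language translation | ```"question": "编写一个 Python 函数,用于检查两个数字是否仅在一个位置上不同。"```|
| Code | Code analysis | Tests the model's ability to understand and analyze code: given a piece of code, analyze its intent, check coding conventions, check for errors, and so on | ```"question":"\n\ndef truncate_number(number: float) -> float:\n \"\"\" 给定一个正的浮点数,可以将其分解为整数部分(小于给定数字的最大整数)和小数部分(余数部分总是小于1)。\n\n 返回该数字的小数部分。\n >>> truncate_number(3.5)\n 0.5\n \"\"\"",``` |
| Long context | Long-context understanding and reasoning | Tests the model's understanding and reasoning at different context lengths (2k, 4k, 8k, 16k, 32k) | Omitted |
| Agent | Task planning | Based on the user's goal and the available tools, the agent decomposes the task sensibly, arranges the execution order and strategy of subtasks, designs and plans the execution path, and chooses an appropriate strategy. | Omitted |
| Agent | Tool calling | Evaluates whether the model can call the appropriate API accurately and pass parameters correctly when calling it | Omitted |
| Agent | Reflection | Evaluates whether the model can reflect and re-plan the task path when a subtask fails | Omitted |
| Agent | Task execution summary | Evaluates whether the model can summarize and analyze subtask results, accomplish the original task goal, and output a reply that correctly follows the instructions | Omitted |
| Agent | Multi-turn interaction | Evaluates the model's ability in multi-turn complex tool calling, and whether it can accurately understand intent across multiple turns | Omitted |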
================================================
FILE: docs/zh_cn/advanced_guides/contamination_eval.md
================================================
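# Data Contamination Assessment
**Data contamination** refers to the phenomenon where data intended for downstream test tasks appears in the training data of a large language model (LLM), inflating metrics on downstream tasks (e.g., summarization, natural language inference, text classification) so that they no longer reflect the model's true generalization ability.
Since the source of contamination lies in the LLM's training data, the most direct way to detect it is to match the test data against the training data and report how much of the two overlaps. The classic GPT-3 [paper](https://arxiv.org/pdf/2005.14165.pdf) reports this in Table C.1.
However, today's open-source community often releases only model weights, not training datasets. In that situation, there is no widely accepted method for determining whether contamination exists or how severe it is. OpenCompass offers two possible solutions.
## Contamination Annotation Based on Self-built In-distribution Data
Following the method described in Section 5.2 of [Skywork](https://arxiv.org/pdf/2310.19341.pdf), we directly use the dataset [mock_gsm8k_test](https://huggingface.co/datasets/Skywork/mock_gsm8k_test) that Skywork uploaded to HuggingFace.
In this method, the authors used GPT-4 to synthesize a batch of data similar in style to the original GSM8K, then used the model to compute perplexity on the GSM8K training set (train), the GSM8K test set (test), and the GSM8K reference set (ref). Because the GSM8K reference set was newly generated, the authors consider that it cannot belong to any training set of any model, i.e., it is clean. The authors argue that:
- If the perplexity on the test set is much lower than on the reference set, the test set may have appeared in the model's training stage;
- If the perplexity on the training set is much lower than on the test set, the model may have overfitted the training set.
The following configuration file can be used as a reference: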
```python
from mmengine.config import read_base

with read_base():
    from .datasets.gsm8k_contamination.gsm8k_contamination_ppl_ecdd22 import gsm8k_datasets  # Contains the train, test, and reference sets
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model  # Models under review
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model

datasets = [*gsm8k_datasets]
models = [*hf_qwen_7b_model, *hf_yi_6b_model]
```
Sample output:
```text
dataset version metric mode internlm-7b-hf qwen-7b-hf yi-6b-hf chatglm3-6b-base-hf qwen-14b-hf baichuan2-13b-base-hf internlm-20b-hf aquila2-34b-hf ...
--------------- --------- ----------- ------- ---------------- ------------ ---------- --------------------- ------------- ----------------------- ----------------- ---------------- ...
gsm8k-train-ppl 0b8e46 average_ppl unknown 1.5 0.78 1.37 1.16 0.5 0.76 1.41 0.78 ...
gsm8k-test-ppl 0b8e46 average_ppl unknown 1.56 1.33 1.42 1.3 1.15 1.13 1.52 1.16 ...
gsm8k-ref-ppl f729ba average_ppl unknown 1.55 1.2 1.43 1.35 1.27 1.19 1.47 1.35 ...
```
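The two rules of thumb from the Skywork method can be applied to the sample output mechanically. The source does not quantify "much lower", so the `margin` threshold below is an arbitrary, hypothetical value chosen only for illustration, as is the function name `flag_contamination`; the perplexities are copied from the sample output above.

```python
def flag_contamination(train_ppl, test_ppl, ref_ppl, margin=0.2):
    """Apply the two perplexity heuristics with a hypothetical margin."""
    return {
        # Test perplexity far below the clean reference set suggests leakage.
        "test_may_be_in_training_data": ref_ppl - test_ppl > margin,
        # Train perplexity far below test perplexity suggests overfitting.
        "train_may_be_overfitted": test_ppl - train_ppl > margin,
    }


# (train, test, ref) average perplexities from the sample output.
sample = {
    "internlm-7b-hf": (1.5, 1.56, 1.55),
    "qwen-7b-hf": (0.78, 1.33, 1.2),
    "qwen-14b-hf": (0.5, 1.15, 1.27),
}

for model, (train, test, ref) in sample.items():
    print(model, flag_contamination(train, test, ref))
```

The flags depend entirely on the chosen margin, so they should be read as a prompt for closer inspection rather than as a verdict.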
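This method currently supports only the GSM8K dataset; we welcome community contributions of more datasets.
If you use this method, please add the following citations: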
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
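## Contamination Data Annotation Based on Classic Pre-trained Sets
Thanks to [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) and @liyucheng09 for providing this method.
In this method, the authors search for the samples of a test dataset (for example, C-Eval, ARC, and HellaSwag) using the Common Crawl database and the Bing search engine, and then label each test sample as clean, question-contaminated, or both question- and answer-contaminated.
At test time, OpenCompass reports the accuracy or perplexity of ceval on the subsets formed by these three labels. In general, accuracy increases in the order: clean, question-contaminated, both question- and answer-contaminated. The authors argue that:
- If the three subsets perform similarly, the model's contamination on that test set is mild; otherwise it is more severe.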
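The comparison above can be made concrete by measuring how far each contaminated subset sits above the clean subset. The helper below is a hypothetical sketch (the name `contamination_gap` and the choice to report plain differences are assumptions, not something the source or OpenCompass defines).

```python
def contamination_gap(clean, input_contaminated, input_and_label_contaminated):
    """Return the accuracy gap of each contaminated subset relative to the clean subset."""
    return {
        "input_gap": round(input_contaminated - clean, 2),
        "input_and_label_gap": round(input_and_label_contaminated - clean, 2),
    }


# Illustrative accuracies (in percent) for the three subsets of one benchmark
print(contamination_gap(67.32, 71.01, 81.17))
```

Small gaps suggest mild contamination under the authors' reasoning; how large a gap counts as "severe" is left to the reader.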
The following configuration file can be used as a reference ([link](https://github.com/open-compass/opencompass/blob/main/examples/eval_contamination.py)):
```python
from mmengine.config import read_base

with read_base():
    from .datasets.ceval.ceval_clean_ppl import ceval_datasets  # ceval dataset with contamination labels
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model  # models under review
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model
    from .summarizers.contamination import ceval_summarizer as summarizer  # output formatting

datasets = [*ceval_datasets]
models = [*hf_yi_6b_model, *hf_qwen_7b_model]
```
A sample output looks like this:
```text
dataset version mode yi-6b-hf - - qwen-7b-hf - - ...
---------------------------------------------- --------- ------ ---------------- ----------------------------- --------------------------------------- ---------------- ----------------------------- --------------------------------------- ...
- - - accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated ...
...
ceval-humanities - ppl 74.42 75.00 82.14 67.44 50.00 70.54 ...
ceval-stem - ppl 53.70 57.14 85.61 47.41 52.38 67.63 ...
ceval-social-science - ppl 81.60 84.62 83.09 76.00 61.54 72.79 ...
ceval-other - ppl 72.31 73.91 75.00 58.46 39.13 61.88 ...
ceval-hard - ppl 44.35 37.50 70.00 41.13 25.00 30.00 ...
ceval - ppl 67.32 71.01 81.17 58.97 49.28 67.82 ...
```
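This approach currently supports only the C-Eval, MMLU, HellaSwag, and ARC datasets. [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) also covers CSQA and WinoGrande, but these are not yet implemented in OpenCompass. Community contributions of more datasets are welcome.
If you use this method, please cite: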
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@article{Li2023AnOS,
title={An Open Source Data Contamination Report for Llama Series Models},
author={Yucheng Li},
journal={ArXiv},
year={2023},
volume={abs/2310.17589},
url={https://api.semanticscholar.org/CorpusID:264490711}
}
```
================================================
FILE: docs/zh_cn/advanced_guides/custom_dataset.md
================================================
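# Quick Dataset Evaluation
OpenCompass provides two paths for quickly evaluating user-provided data: a data format protocol based on ChatMLDataset and one based on CustomDataset.
Compared with the full dataset integration process in [new_dataset.md](./new_dataset.md), these two paths are quicker and let you go straight to the evaluation stage without adding new configuration files. If you need customized reading, inference, or evaluation, we still recommend adding the dataset through the full integration process.
## Data Format Protocol and Quick Evaluation Based on ChatMLDataset
OpenCompass recently introduced an evaluation mode based on the ChatML dialogue template. Users provide a dataset `.jsonl` file that follows the ChatML dialogue template, configure the dataset information in much the same way as a model, and can then start the evaluation directly.
### Format Requirements for Dataset Files
This evaluation method supports only dataset files in `.jsonl` format, and every record must follow the format below.
A text-only dataset with a simple structure: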
```jsonl
{
    "question": [
        {
            "role": "system",  # the system message is optional
            "content": Str
        },
        {
            "role": "user",
            "content": Str
        }
    ],
    "answer": [
        Str
    ]
}
{
    ...
}
...
```
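A dataset for more complex cases such as multi-turn or multimodal data (OpenCompass does not yet support multimodal evaluation, so this template is for reference only):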
```jsonl
{
    "question": [
        {
            "role": "system",
            "content": Str,
        },
        {
            "role": "user",
            "content": Str or List
            [
                {
                    "type": Str,  # "image"
                    "image_url": Str,
                },
                ...
                {
                    "type": Str,  # "text"
                    "text": Str,
                },
            ]
        },
        {
            "role": "assistant",
            "content": Str
        },
        {
            "role": "user",
            "content": Str or List
        },
        ...
    ],
    "answer": [
        Str,
        Str,
        ...
    ]
}
{
    ...
}
...
```
When reading a `.jsonl` file, `ChatMLDataset` uses the `pydantic` library to perform a simple format check on the file.
You can use `tools/chatml_format_test.py` to check your data file.
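To get a feel for what such a check involves, here is a minimal standard-library sketch. It is not the `pydantic` schema used by `ChatMLDataset`, nor the logic of `tools/chatml_format_test.py`; the function name `check_chatml_record` and the exact set of rules are assumptions derived from the templates in this section.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def check_chatml_record(record):
    """Return a list of problems found in one parsed ChatML record."""
    problems = []
    question = record.get("question")
    if not isinstance(question, list) or not question:
        problems.append("'question' must be a non-empty list of messages")
        question = []
    for i, msg in enumerate(question):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has an unexpected role")
        if not isinstance(msg.get("content"), (str, list)):
            problems.append(f"message {i} needs 'content' as a string or list")
    answer = record.get("answer")
    if not isinstance(answer, list) or not all(isinstance(a, str) for a in answer):
        problems.append("'answer' must be a list of strings")
    return problems


line = '{"question": [{"role": "user", "content": "1+1=?"}], "answer": ["2"]}'
print(check_chatml_record(json.loads(line)))  # an empty list means no problems found
```

Each line of a `.jsonl` file would be parsed with `json.loads` and passed through the check separately.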
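After checking the data, add a configuration dictionary named `chatml_datasets` to the run configuration file so that the data file is converted into an OpenCompass dataset at run time. For example: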
```python
chatml_datasets = [
    dict(
        abbr='YOUR_DATASET_NAME',
        path='YOUR_DATASET_PATH',
        evaluator=dict(
            type='cascade_evaluator',
            rule_evaluator=dict(
                type='math_evaluator',
            ),
            llm_evaluator=dict(
                type='llm_evaluator',
                prompt="YOUR_JUDGE_PROMPT",
                judge_cfg=dict(),  # YOUR Judge Model Config
            )
        ),
        n=1,  # Repeat Number
    ),
]
```
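The ChatML module currently provides four preset evaluators: `mcq_rule_evaluator` (for multiple-choice questions), `math_evaluator` (for LaTeX mathematical formulas), `llm_evaluator` (for open-ended questions or questions whose answers are hard to extract), and `cascade_evaluator` (an evaluation mode that cascades a rule-based evaluator and an LLM evaluator).
If you plan to use a ChatML-based dataset over the long term, you can add its configuration to `opencompass/configs/chatml_datasets`.
`examples/eval_chat_datasets.py` also gives an evaluation example that uses this kind of dataset configuration.
## Data Format Protocol and Quick Evaluation Based on CustomDataset
(This module is no longer updated, but it can still be used when you need, for example, to run a quick evaluation from the command line.)
The CustomDataset data format protocol supports two task types: multiple choice (`mcq`) and question answering (`qa`). `mcq` supports both `ppl` and `gen` inference; `qa` supports `gen` inference.
### Dataset Format
Datasets in `.jsonl` and `.csv` formats are supported.
#### Multiple Choice (`mcq`)
For multiple-choice (`mcq`) data, the default fields are:
- `question`: The stem of the multiple-choice question.
- `A`, `B`, `C`, ...: Single uppercase letters denote the options; their number is not limited. By default, only consecutive letters starting from `A` are parsed as options.
- `answer`: The correct answer, whose value must be one of the options used above, such as `A`, `B`, or `C`.
Non-default fields are read in but not used by default. To use them, specify them in the `.meta.json` file.
An example in `.jsonl` format: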
```jsonl
{"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B"}
{"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A"}
{"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B"}
{"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C"}
```
An example in `.csv` format:
```csv
question,A,B,C,answer
127+545+588+620+556+199=,2632,2635,2645,B
735+603+102+335+605=,2376,2380,2410,B
506+346+920+451+910+142+659+850=,4766,4774,4784,C
504+811+870+445=,2615,2630,2750,B
```
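#### Question Answering (`qa`)
For question-answering (`qa`) data, the default fields are:
- `question`: The stem of the question.
- `answer`: The correct answer. It may be missing, which means the dataset has no reference answer.
Non-default fields are read in but not used by default. To use them, specify them in the `.meta.json` file.
An example in `.jsonl` format: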
```jsonl
{"question": "752+361+181+933+235+986=", "answer": "3448"}
{"question": "712+165+223+711=", "answer": "1811"}
{"question": "921+975+888+539=", "answer": "3323"}
{"question": "752+321+388+643+568+982+468+397=", "answer": "4519"}
```
An example in `.csv` format:
```csv
question,answer
123+147+874+850+915+163+291+604=,3967
149+646+241+898+822+386=,3142
332+424+582+962+735+798+653+214=,4700
649+215+412+495+220+738+989+452=,4170
```
### Command Line
A custom dataset can be evaluated directly from the command line.
```bash
python run.py \
--models hf_llama2_7b \
--custom-dataset-path xxx/test_mcq.csv \
--custom-dataset-data-type mcq \
--custom-dataset-infer-method ppl
```
```bash
python run.py \
--models hf_llama2_7b \
--custom-dataset-path xxx/test_qa.jsonl \
--custom-dataset-data-type qa \
--custom-dataset-infer-method gen
```
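In most cases, `--custom-dataset-data-type` and `--custom-dataset-infer-method` can be omitted, and OpenCompass sets them according to the following logic:
- If options such as `A`, `B`, `C` can be parsed from the dataset file, the dataset is treated as `mcq`; otherwise it is treated as `qa`.
- The default `infer_method` is `gen`.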
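The defaulting logic above, together with the rule that options are consecutive letters starting from `A`, can be sketched as follows. This is an illustration rather than OpenCompass's actual parser; `parse_options` and `infer_settings` are hypothetical names.

```python
import string


def parse_options(record):
    """Collect consecutive option letters starting from 'A'."""
    options = []
    for letter in string.ascii_uppercase:
        if letter not in record:
            break  # stop at the first missing letter
        options.append(letter)
    return options


def infer_settings(record):
    """Guess data_type and infer_method from the fields of one record."""
    options = parse_options(record)
    return {
        "data_type": "mcq" if options else "qa",
        "infer_method": "gen",  # the default regardless of data type
        "options": options,
    }


print(infer_settings({"question": "1+1=", "A": "2", "B": "3", "answer": "A"}))
print(infer_settings({"question": "1+1=", "answer": "2"}))
```

Note that under this reading a record with fields `A`, `B`, and `D` would yield only `A` and `B` as options, because `C` breaks the consecutive run.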
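### Configuration File
In the original configuration file, simply add new items to the `datasets` variable. Custom datasets can also be mixed with regular datasets.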
```python
datasets = [
{"path": "xxx/test_mcq.csv", "data_type": "mcq", "infer_method": "ppl"},
{"path": "xxx/test_qa.jsonl", "data_type": "qa", "infer_method": "gen"},
]
```
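### Supplementary Dataset Information: `.meta.json`
OpenCompass tries to parse the input dataset file by default, so in most cases a `.meta.json` file is **not** needed. However, if the dataset's field names differ from the defaults, or if you need a custom prompt, specify this in a `.meta.json` file.
A file describing how the dataset should be used is placed in the same directory as the dataset and named as the dataset file name plus `.meta.json`. An example file structure: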
```tree
.
├── test_mcq.csv
├── test_mcq.csv.meta.json
├── test_qa.jsonl
└── test_qa.jsonl.meta.json
```
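The file may contain the following fields:
- `abbr` (str): Abbreviation of the dataset, used as its ID.
- `data_type` (str): Type of the dataset; either `mcq` or `qa`.
- `infer_method` (str): Inference method; either `ppl` or `gen`.
- `human_prompt` (str): User prompt template used to generate the prompt. Variables in the template are wrapped in `{}`, such as `{question}` and `{opt1}`. Ignored if `template` is present.
- `bot_prompt` (str): Bot prompt template used to generate the prompt. Variables in the template are wrapped in `{}`, such as `{answer}`. Ignored if `template` is present.
- `template` (str or dict): Question template used to generate the prompt. Variables in the template are wrapped in `{}`, such as `{question}` and `{opt1}`. For the syntax, see the description of `infer_cfg['prompt_template']['template']` [here](../prompt/prompt_template.md).
- `input_columns` (list): List of input fields, used when reading the data.
- `output_column` (str): Output field, used when reading the data.
- `options` (list): List of options, used when reading the data; effective only when `data_type` is `mcq`.
For example: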
```json
{
"human_prompt": "Question: 127 + 545 + 588 + 620 + 556 + 199 =\nA. 2632\nB. 2635\nC. 2645\nAnswer: Let's think step by step, 127 + 545 + 588 + 620 + 556 + 199 = 672 + 588 + 620 + 556 + 199 = 1260 + 620 + 556 + 199 = 1880 + 556 + 199 = 2436 + 199 = 2635. So the answer is B.\nQuestion: {question}\nA. {A}\nB. {B}\nC. {C}\nAnswer: ",
"bot_prompt": "{answer}"
}
```
Or:
```json
{
    "template": "Question: {my_question}\nX. {X}\nY. {Y}\nZ. {Z}\nW. {W}\nAnswer:",
    "input_columns": ["my_question", "X", "Y", "Z", "W"],
    "output_column": "my_answer"
}
```
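If you generate datasets programmatically, the companion `.meta.json` file can be written the same way. The sketch below uses only the standard library; `write_meta` is a hypothetical helper, and the field values are illustrative, following the naming rule and field list described in this section.

```python
import json
import tempfile
from pathlib import Path


def write_meta(dataset_path, meta):
    """Write `<dataset file name>.meta.json` next to the dataset file."""
    dataset_path = Path(dataset_path)
    meta_path = dataset_path.with_name(dataset_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")
    return meta_path


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "test_mcq.csv"
    csv_path.write_text("my_question,X,Y,my_answer\n1+1=,2,3,X\n", encoding="utf-8")
    meta_path = write_meta(csv_path, {
        "data_type": "mcq",
        "infer_method": "gen",
        "template": "Question: {my_question}\nX. {X}\nY. {Y}\nAnswer:",
        "input_columns": ["my_question", "X", "Y"],
        "output_column": "my_answer",
    })
    print(meta_path.name)  # test_mcq.csv.meta.json
```

Writing with `json.dumps` also avoids hand-editing mistakes such as trailing commas, which make a JSON file invalid.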
================================================
FILE: docs/zh_cn/advanced_guides/evaluation_lightllm.md
================================================
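# Evaluating Lightllm Models
We support evaluating large language models that use [Lightllm](https://github.com/ModelTC/lightllm) for inference. Developed by SenseTime, Lightllm is a Python-based LLM inference and serving framework known for its lightweight design, easy extensibility, and high performance, and it supports a wide range of large models. Users can run model inference through Lightllm and deploy it locally as a service. During evaluation, OpenCompass sends data to Lightllm through its API and processes the returned results. OpenCompass has been adapted to Lightllm, and this tutorial describes how to use OpenCompass to evaluate models that use Lightllm as the inference backend.
## Environment Setup
### Install OpenCompass
Follow the OpenCompass [installation guide](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) to install the library and prepare the datasets.
### Install Lightllm
Follow the [Lightllm homepage](https://github.com/ModelTC/lightllm) to install Lightllm. Take care to align the versions of the relevant dependencies, especially transformers.
## Evaluation
We use the evaluation of llama2-7B on humaneval as an example.
### Step 1: Deploy the Model Locally as a Service with Lightllm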
```shell
python -m lightllm.server.api_server --model_dir /path/llama2-7B \
--host 0.0.0.0 \
--port 1030 \
--nccl_port 2066 \
--max_req_input_len 4096 \
--max_req_total_len 6144 \
--tp 1 \
--trust_remote_code \
--max_total_token_num 120000
```
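**Note:** The `tp` value in the command above sets the number of GPUs used for tensor-parallel inference, which is useful for larger models.
**Note:** `max_total_token_num` in the command above affects throughput during testing. Set it according to the documentation on the [Lightllm homepage](https://github.com/ModelTC/lightllm). As long as GPU memory is not exhausted, a larger value is usually better.
**Note:** To start multiple Lightllm services on the same machine, assign each one a different `port` and `nccl_port`.
You can use the following Python script to quickly check whether the service has started successfully: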
```python
import json

import requests

# The port must match the --port value used when launching the Lightllm service
url = 'http://localhost:1030/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'What is AI?',
    'parameters': {
        'do_sample': False,
        'ignore_eos': False,
        'max_new_tokens': 1024,
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)
```
### Step 2: Evaluate the Model with OpenCompass
```shell
python run.py examples/eval_lightllm.py
```
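Once the model has finished inference and metric computation, the evaluation results are available.
**Note:** The url configured in `eval_lightllm.py` must match the service address from the previous step.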
================================================
FILE: docs/zh_cn/advanced_guides/evaluation_lmdeploy.md
================================================
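# Accelerating Evaluation with LMDeploy
We support using [LMDeploy](https://github.com/InternLM/lmdeploy) as the inference acceleration engine when evaluating large language models. LMDeploy is a complete toolkit for compressing, deploying, and serving models, covering both LLM and VLM tasks, with excellent inference performance. This tutorial describes how to use LMDeploy to accelerate model evaluation.
## Environment Setup
### Install OpenCompass
Follow the OpenCompass [installation guide](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) to install the library and prepare the datasets.
### Install LMDeploy
Install LMDeploy with pip (Python 3.8+):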
```shell
pip install lmdeploy
```
The prebuilt LMDeploy package is compiled against CUDA 12 by default. To install LMDeploy under CUDA 11+, run the following commands:
```shell
export LMDEPLOY_VERSION=0.6.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
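## Evaluation
To evaluate a model, prepare an evaluation configuration that specifies the datasets, the model, and the inference parameters.
Taking [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) as an example, the configuration is as follows: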
```python
# configure the dataset
from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_a58960 import \
        gsm8k_datasets
    # and output the results in a chosen format
    from .summarizers.medium import summarizer

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

# configure lmdeploy
from opencompass.models import TurboMindModelwithChatTemplate

# configure the model
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='internlm2-chat-7b-lmdeploy',
        # model path, which can be the address of a model repository on the Hugging Face Hub or a local path
        path='internlm/internlm2-chat-7b',
        # inference backend of LMDeploy. It can be either 'turbomind' or 'pytorch'.
        # If the model is not supported by 'turbomind', it will fallback to
        # 'pytorch'
        backend='turbomind',
        # For the detailed engine config and generation config, please refer to
        # https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/messages.py
        engine_config=dict(tp=1),
        gen_config=dict(do_sample=False),
        # the max size of the context window
        max_seq_len=7168,
        # the max number of new tokens
        max_out_len=1024,
        # the max number of prompts that LMDeploy receives
        # in `generate` function
        batch_size=5000,
        run_cfg=dict(num_gpus=1),
    )
]
```
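Save the configuration above to a file, for example "configs/eval_internlm2_lmdeploy.py". Then run the following command from the OpenCompass project directory to obtain the evaluation results: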
```shell
python run.py configs/eval_internlm2_lmdeploy.py -w outputs
```
================================================
FILE: docs/zh_cn/advanced_guides/llm_judge.md
================================================
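# LLM as a Judge
## Introduction
The GenericLLMEvaluator component is particularly suited to scenarios that are hard to judge well with rule-based methods such as regular expressions, for example:
- The model outputs the content of an option rather than its label
- Datasets that require factual judgment
- Open-ended answers that require complex understanding and reasoning
- Judgments that would require designing a large number of rules
OpenCompass provides the GenericLLMEvaluator component to implement LLM-as-a-judge evaluation.
## Dataset Format
Datasets for LLM judging should be in JSON Lines (.jsonl) or CSV format. Each entry should contain at least:
- A question or task
- A reference or gold answer
- (The model's prediction is generated during evaluation)
A JSONL example: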
```json
{"problem": "法国的首都是什么?", "answer": "巴黎"}
```
A CSV example:
```csv
problem,answer
"法国的首都是什么?","巴黎"
```
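## Configuration
### Using an LLM Judge from the Command Line
Some datasets in OpenCompass already include an LLM judge configuration.
You need a model service, either an official API such as those from OpenAI or DeepSeek, or one started locally with a tool such as LMDeploy, vLLM, or SGLang.
You can then set the environment variables for the judge service with the following commands and evaluate the model: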
```bash
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1
```
Note that OpenCompass uses these three environment variables by default, but they have no effect if you configure the judge service through a configuration file.
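The precedence rule can be pictured with a short sketch. This is not OpenCompass's internal logic: `resolve_judge_cfg` and the output keys `model`, `api_key`, and `api_base` are hypothetical names chosen for illustration; only the three environment variable names come from this guide.

```python
import os


def resolve_judge_cfg(config_judge_cfg=None, env=None):
    """Pick judge settings: a non-empty config entry wins, otherwise use the environment."""
    env = os.environ if env is None else env
    if config_judge_cfg:  # settings from a configuration file take precedence
        return dict(config_judge_cfg)
    return {
        "model": env.get("OC_JUDGE_MODEL"),
        "api_key": env.get("OC_JUDGE_API_KEY"),
        "api_base": env.get("OC_JUDGE_API_BASE"),
    }


# An explicit mapping stands in for the real environment in this example.
example_env = {
    "OC_JUDGE_MODEL": "Qwen/Qwen2.5-32B-Instruct",
    "OC_JUDGE_API_KEY": "sk-1234",
    "OC_JUDGE_API_BASE": "http://172.30.56.1:4000/v1",
}
print(resolve_judge_cfg(env=example_env))
print(resolve_judge_cfg({"path": "local-judge"}, env=example_env))
```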
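### Using an LLM Judge from a Configuration File
To set up LLM judge evaluation for a dataset, configure three main components:
1. Dataset reader configuration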
```python
reader_cfg = dict(
    input_columns=['problem'],  # name of the question column
    output_column='answer'  # name of the reference answer column
)
```
2. Inference configuration
```python
infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}',  # template used to prompt the model
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)
```
3. Evaluation configuration with the LLM judge
```python
eval_cfg = dict(
    evaluator=dict(
        type=GenericLLMEvaluator,  # use an LLM as the evaluator
        prompt_template=dict(
            type=PromptTemplate,
            template=dict(
                begin=[
                    dict(
                        role='SYSTEM',
                        fallback_role='HUMAN',
                        prompt="You are an assistant responsible for evaluating the correctness and quality of model outputs.",
                    )
                ],
                round=[
                    dict(role='HUMAN', prompt=YOUR_JUDGE_TEMPLATE),  # template for the judge
                ],
            ),
        ),
        dataset_cfg=dict(
            type=CustomDataset,
            path='path/to/your/dataset',
            file_name='your_dataset.jsonl',
            reader_cfg=reader_cfg,
        ),
        judge_cfg=YOUR_JUDGE_MODEL_CONFIG,  # configuration of the judge model
        dict_postprocessor=dict(type=generic_llmjudge_postprocess),  # post-processor for the judge output
    ),
)
```
## Using CustomDataset with GenericLLMEvaluator
The following shows how to set up a complete LLM judge evaluation configuration:
```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer

# Import the judge model configuration
with read_base():
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
        models as judge_model,
    )

# Define the judge template
JUDGE_TEMPLATE = """
Please evaluate whether the following response correctly answers the question.
Question: {problem}
Reference answer: {answer}
Model response: {prediction}

Is the model response correct? If correct, answer "A"; if incorrect, answer "B".
""".strip()

# Dataset reader configuration
reader_cfg = dict(input_columns=['problem'], output_column='answer')

# Inference configuration for the model being evaluated
infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

# Evaluation configuration with the LLM judge
eval_cfg = dict(
    evaluator=dict(
        type=GenericLLMEvaluator,
        prompt_template=dict(
            type=PromptTemplate,
            template=dict(
                begin=[
                    dict(
                        role='SYSTEM',
                        fallback_role='HUMAN',
                        prompt="You are an assistant responsible for evaluating the correctness and quality of model outputs.",
                    )
                ],
                round=[
                    dict(role='HUMAN', prompt=JUDGE_TEMPLATE),
                ],
            ),
        ),
        dataset_cfg=dict(
            type=CustomDataset,
            path='path/to/your/dataset',
            file_name='your_dataset.jsonl',
            reader_cfg=reader_cfg,
        ),
        judge_cfg=judge_model[0],
        dict_postprocessor=dict(type=generic_llmjudge_postprocess),
    ),
    pred_role='BOT',
)

# Dataset configuration
datasets = [
    dict(
        type=CustomDataset,
        abbr='my-dataset',
        path='path/to/your/dataset',
        file_name='your_dataset.jsonl',
        reader_cfg=reader_cfg,
        infer_cfg=infer_cfg,
        eval_cfg=eval_cfg,
    )
]

# Configuration of the model being evaluated
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='model-to-evaluate',
        path='path/to/your/model',
        # ... other model configuration
    )
]

# Output directory
work_dir = './outputs/llm_judge_eval'
```
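## GenericLLMEvaluator
GenericLLMEvaluator is designed for evaluating model outputs with an LLM as the judge. Its main features are:
1. Flexible prompt templates for guiding the judge
2. Support for a variety of judge models (local or API-based)
3. Custom evaluation criteria through prompt engineering
4. Post-processing of the judge's output to extract structured evaluations
**Important**: The generic judge template currently supports only the output format "A" (correct) or "B" (incorrect). Other output formats (such as "correct" or "incorrect") are not supported, because the post-processing function `generic_llmjudge_postprocess` is designed specifically to parse this format.
The evaluator works as follows:
1. Take the original question, the reference answer, and the model's prediction
2. Format them into a prompt for the judge model
3. Parse the judge's response to determine the result (looking for "A" or "B")
4. Aggregate the results over the whole dataset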
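Steps 3 and 4 can be sketched as below. This is a deliberately simplified stand-in, not the real `generic_llmjudge_postprocess`: the names `parse_verdict` and `accuracy`, the regular expression, and the choice to count unparseable replies as wrong are all assumptions.

```python
import re


def parse_verdict(judge_response):
    """Return 'A', 'B', or None for one judge reply (simplified stand-in)."""
    match = re.search(r"\b([AB])\b", judge_response)
    return match.group(1) if match else None


def accuracy(judge_responses):
    """Percentage of replies judged correct ('A'); unparseable replies count as wrong."""
    verdicts = [parse_verdict(r) for r in judge_responses]
    correct = sum(v == "A" for v in verdicts)
    return 100.0 * correct / len(verdicts) if verdicts else 0.0


print(accuracy(["A", "B", "A", "A"]))  # 75.0
```

A judge that replies with free-form text can easily defeat such a simple pattern, which is one reason the judge template should ask for a bare "A" or "B".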
To see detailed evaluation results, add `--dump-eval-details` to the command line when starting the task.
Example evaluation output:
```python
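{
    'accuracy': 75.0,  # percentage of responses judged correct
    'details': [
        {
            'origin_prompt': """
            Please evaluate whether the following response correctly answers the question.
            Question: What is the capital of France?
            Reference answer: Paris
            Model response: The capital of France is Paris.
            Is the model response correct? If correct, answer "A"; if incorrect, answer "B".""",
            'gold': 'Paris',
            'prediction': 'A',
        },
        # ... more results
    ]
}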
```
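## Cascade Evaluator (CascadeEvaluator)
OpenCompass also provides `CascadeEvaluator`, which combines the strengths of rule-based evaluation and LLM evaluation. It has two modes:
1. **Cascade mode (parallel=False)**: All samples are first evaluated by the rule-based evaluator, and only those it marks as incorrect are sent to the LLM judge for re-evaluation. This keeps accuracy while reducing reliance on the LLM judge, lowering evaluation cost and time.
2. **Parallel mode (parallel=True)**: All samples are evaluated by both the rule-based evaluator and the LLM judge, and a sample is considered correct if either evaluator marks it correct. This makes the evaluation more lenient but may cost more, because every sample requires LLM evaluation.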
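The difference between the two modes can be sketched in a few lines. This is an illustration of the logic described above, not the `CascadeEvaluator` implementation: `cascade_evaluate`, `rule_check`, and `llm_check` are hypothetical, with the two checks modeled as plain callables that return a boolean.

```python
def cascade_evaluate(samples, rule_check, llm_check, parallel=False):
    """Combine a rule-based check and an LLM check over `samples`."""
    rule_correct = llm_evaluated = llm_correct = final_correct = 0
    for sample in samples:
        rule_ok = bool(rule_check(sample))
        rule_correct += rule_ok
        llm_ok = False
        if parallel or not rule_ok:
            # Parallel mode judges every sample; cascade mode only rule failures
            llm_ok = bool(llm_check(sample))
            llm_evaluated += 1
            llm_correct += llm_ok
        final_correct += rule_ok or llm_ok
    total = len(samples)
    return {
        "total_samples": total,
        "rule_correct": rule_correct,
        "llm_evaluated": llm_evaluated,
        "llm_correct": llm_correct,
        "final_correct": final_correct,
        "final_accuracy": 100.0 * final_correct / total if total else 0.0,
        "parallel_mode": parallel,
    }


samples = list(range(10))
print(cascade_evaluate(samples, lambda x: x < 7, lambda x: x < 9))
print(cascade_evaluate(samples, lambda x: x < 7, lambda x: x < 9, parallel=True))
```

In this toy run both modes reach the same final count, but cascade mode calls the LLM check on 3 samples while parallel mode calls it on all 10, which is where the cost difference comes from.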
### Configuring CascadeEvaluator
An example `CascadeEvaluator` configuration:
```python
# Define the rule-based evaluator
rule_evaluator = dict(type=MATHVerifyEvaluator)

# Define the LLM judge
llm_judge_evaluator = dict(
    type=GenericLLMEvaluator,
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin=[
                dict(
                    role='SYSTEM',
                    fallback_role='HUMAN',
                    prompt="You are an assistant responsible for evaluating the correctness and quality of model outputs.",
                )
            ],
            round=[
                dict(role='HUMAN', prompt=YOUR_JUDGE_TEMPLATE),
            ],
        ),
    ),
    dataset_cfg=dict(
        type=YourDataset,
        path='path/to/your/dataset',
        reader_cfg=reader_cfg,
    ),
    judge_cfg=dict(),  # the judge model can be configured through environment variables
)

# Configure the cascade evaluator (cascade mode)
cascade_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=False  # cascade mode
)

# For parallel mode, set parallel=True
parallel_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=True  # parallel mode
)

# Use the cascade evaluator in the dataset evaluation configuration
eval_cfg = dict(evaluator=cascade_evaluator)
```
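### Evaluation Results
The cascade evaluator outputs detailed evaluation statistics, including:
- Accuracy of the rule-based evaluation
- Accuracy of the LLM evaluation (on the samples that failed the rule-based evaluation)
- The final combined accuracy
Example output: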
```python
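{
    'accuracy': 85.0,  # final accuracy
    'cascade_stats': {
        'total_samples': 100,
        'rule_correct': 70,  # number of samples the rule-based evaluation marked correct
        'rule_accuracy': 70.0,  # accuracy of the rule-based evaluation
        'llm_evaluated': 30,  # number of samples evaluated by the LLM (in cascade mode, those that failed the rule-based evaluation)
        'llm_correct': 15,  # number of samples the LLM marked correct
        'llm_accuracy': 50.0,  # accuracy of the LLM evaluation
        'final_correct': 85,  # final number of correct samples
        'final_accuracy': 85.0,  # final accuracy
        'parallel_mode': False,  # whether parallel mode was used
    },
    'details': [
        # detailed evaluation result for each sample
    ]
}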
```
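The cascade evaluator is particularly suited to:
1. Scenarios that need to balance evaluation cost and accuracy
2. Cases where a rule-based evaluator is available but may be imperfect
3. Evaluation tasks that need more precise judgment of edge cases
## Complete Examples
To learn more about the generic LLM judge, see `eval_llm_judge.py` in the examples directory, which shows how to evaluate math problems with an LLM judge.
To learn more about the cascade evaluator, see `eval_cascade_evaluator.py` in the examples directory, which shows how to evaluate math problems with the cascade evaluator.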
================================================
FILE: docs/zh_cn/advanced_guides/longeval.md
================================================
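# Long-Context Evaluation Guide
## Introduction
Although large language models (LLMs) such as GPT-4 have shown clear advantages on natural language tasks, most current open-source models can only handle texts of up to a few thousand tokens. This limits their ability to read books, write summaries, and perform other tasks that require long inputs. To examine how models perform on long texts, we use two long-context datasets, [L-Eval](https://github.com/OpenLMLab/LEval) and [LongBench](https://github.com/THUDM/LongBench).
## Existing Algorithms and Models
When processing long inputs, inference time cost and catastrophic forgetting are the two main challenges for large models. A large body of recent work aims to extend the context length of models, focusing on three directions:
- Attention mechanisms. These methods mostly aim to reduce the computational cost of query-key pairs, but they may affect performance on downstream tasks.
- Input methods. Some studies split long inputs into chunks or feed parts of existing text segments to the model repeatedly to improve long-text handling, but these methods work only for some tasks and are hard to adapt to diverse downstream tasks.
- Position encoding. This line of work includes RoPE, ALiBi, and position interpolation, which have shown good length extrapolation. These methods have been used to train long-context models such as ChatGLM2-6b-32k and LongChat-32k.
First, we introduce some popular position encoding algorithms.
### RoPE
RoPE is a position embedding method that injects positional information into the Transformer. It encodes absolute positions with a rotation matrix while incorporating explicit relative position dependencies into the self-attention formulation. The figure below shows an example of the RoPE mechanism.
More generally, this mechanism still applies even in the 0-shot learning case (that is, when `retriever` is `ZeroRetriever`). The following configuration is therefore also valid: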
```python
datasets = [
    dict(
        infer_cfg=dict(
            ice_template=dict(
                type=PromptTemplate,
                template="Q: {question}\nA: {answer}",
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer),
        )
    ),
]
```
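## Usage Suggestions
We recommend using the [Prompt Viewer](../tools.md) tool to visualize the assembled prompt and confirm that the template is correct and the result is as expected.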
================================================
FILE: docs/zh_cn/statis.py
================================================
#! /usr/bin/env python
from pathlib import Path

import yaml
from tabulate import tabulate

OC_ROOT = Path(__file__).absolute().parents[2]
GITHUB_PREFIX = 'https://github.com/open-compass/opencompass/tree/main/'
# Page header written to the generated (Chinese) statistics page
DATASETZOO_TEMPLATE = """\
# 数据集统计

在本页面中,我们列举了OpenCompass所支持的所有数据集。

你可以使用排序和搜索功能找到需要的数据集。

我们对每一个数据集都给出了推荐的运行配置,部分数据集中还提供了基于LLM Judge的推荐配置。

你可以基于推荐配置快速启动评测。但请注意,推荐配置可能随时间推移被更新。

"""

with open('dataset_statistics.md', 'w') as f:
    f.write(DATASETZOO_TEMPLATE)

load_path = str(OC_ROOT / 'dataset-index.yml')

with open(load_path, 'r') as f2:
    data_list = yaml.load(f2, Loader=yaml.FullLoader)

HEADER = ['name', 'category', 'paper', 'configpath', 'configpath_llmjudge']

recommanded_dataset_list = [
    'ifeval', 'aime2024', 'bbh', 'bigcodebench', 'cmmlu', 'drop', 'gpqa',
    'hellaswag', 'humaneval', 'korbench', 'livecodebench', 'math', 'mmlu',
    'mmlu_pro', 'musr', 'math500'
]


def table_format(data_list):
    """Convert the dataset index entries into rows of Markdown table cells."""
    table_format_list = []
    for i in data_list:
        table_format_list_sub = []
        for j in i:
            # Recommended datasets get a plain link label; others are marked TBD
            if j in recommanded_dataset_list:
                link_token = '[链接]('
            else:
                link_token = '[链接(TBD)]('

            for index in HEADER:
                if index == 'paper':
                    table_format_list_sub.append('[链接](' + i[j][index] + ')')
                elif index == 'configpath_llmjudge':
                    if i[j][index] == '':
                        table_format_list_sub.append(i[j][index])
                    elif isinstance(i[j][index], list):
                        sub_list_text = ''
                        for k in i[j][index]:
                            sub_list_text += (link_token + GITHUB_PREFIX + k +
                                              ') / ')
                        table_format_list_sub.append(sub_list_text[:-2])
                    else:
                        table_format_list_sub.append(link_token +
                                                     GITHUB_PREFIX +
                                                     i[j][index] + ')')
                elif index == 'configpath':
                    if isinstance(i[j][index], list):
                        sub_list_text = ''
                        for k in i[j][index]:
                            sub_list_text += (link_token + GITHUB_PREFIX + k +
                                              ') / ')
                        table_format_list_sub.append(sub_list_text[:-2])
                    else:
                        table_format_list_sub.append(link_token +
                                                     GITHUB_PREFIX +
                                                     i[j][index] + ')')
                else:
                    table_format_list_sub.append(i[j][index])
            table_format_list.append(table_format_list_sub)
    return table_format_list


data_format_list = table_format(data_list)


def generate_table(data_list, title=None):
    """Append a pipe-format table (with optional title) to the statistics page."""
    with open('dataset_statistics.md', 'a') as f:
        if title is not None:
            f.write(f'\n{title}')
        f.write("""\n```{table}\n:class: dataset\n""")
        header = ['数据集名称', '数据集类型', '原文或资源地址', '推荐配置', '推荐配置(基于LLM评估)']
        table_cfg = dict(tablefmt='pipe',
                         floatfmt='.2f',
                         numalign='right',
                         stralign='center')
        f.write(tabulate(data_list, header, **table_cfg))
        f.write('\n```\n')


generate_table(
    data_list=data_format_list,
    title='## 支持数据集列表',
)
================================================
FILE: docs/zh_cn/tools.md
================================================
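# Useful Tools
## Prompt Viewer
This tool lets you view the generated prompts directly, without launching the full evaluation pipeline. If the config passed in is only a dataset config (such as `configs/datasets/nq/nq_gen_3dcea1.py`), it shows the original prompts defined in the dataset config. If it is a complete evaluation config (with both models and datasets), it shows the prompts the selected model actually receives at run time.
Usage: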
```bash
python tools/prompt_viewer.py CONFIG_PATH [-n] [-a] [-p PATTERN]
```
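- `-n`: Do not enter interactive mode; select the first model (if any) and the first dataset by default.
- `-a`: View the prompts received by all model and dataset combinations in the config.
- `-p PATTERN`: Do not enter interactive mode; select all datasets that match the given regular expression.
## Case Analyzer
Based on existing evaluation results, this tool produces the samples with inference errors as well as the full set of samples with annotation information.
Usage: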
```bash
python tools/case_analyzer.py CONFIG_PATH [-w WORK_DIR]
```
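- `-w`: Working directory; defaults to `'./outputs/default'`.
## Lark Bot
Users can configure a Lark bot to monitor task status in real time. For how to set up the bot, see [this document](https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN?lang=zh-CN#7a28964d).
Configuration:
- Open the `configs/secrets.py` file and add the following line: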
```python
lark_bot_url = 'YOUR_WEBHOOK_URL'
```
A Webhook URL usually looks like https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx .
- Inherit this file in the complete evaluation config:
```python
_base_ = [
'secrets.py',
...
]
```
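See `configs/eval.py` for an example.
- To keep the bot from sending messages too often, status is not reported automatically by default. When needed, enable status reporting with `-l` or `--lark`: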
```bash
python run.py configs/eval_demo.py -l
```
## API Model Tester
This tool quickly tests whether an API model works properly.
Usage:
```bash
python tools/test_api_model.py [CONFIG_PATH] -n
```
## Prediction Merger
This tool merges the sharded inference results produced by the `partitioner`.
Usage:
```bash
python tools/prediction_merger.py CONFIG_PATH [-w WORK_DIR]
```
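- `-w`: Working directory; defaults to `'./outputs/default'`.
## List Configs
This tool lists or searches all available model and dataset configs. It supports fuzzy search, which makes it convenient to use together with `run.py`.
Usage: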
```bash
python tools/list_configs.py [PATTERN1] [PATTERN2] [...]
```
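When run without arguments, it lists all model and dataset configs under `configs/models` and `configs/datasets`.
You can also pass any number of arguments, and the script lists all configs related to the given strings, with support for fuzzy search and `*` wildcards. For example, the following command lists all configs related to `mmlu` and `llama`: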
```bash
python tools/list_configs.py mmlu llama
```
Its output may look like this:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| hf_llama2_7b | configs/models/hf_llama2_7b.py |
| hf_llama_13b | configs/models/hf_llama_13b.py |
| hf_llama_30b | configs/models/hf_llama_30b.py |
| hf_llama_65b | configs/models/hf_llama_65b.py |
| hf_llama_7b | configs/models/hf_llama_7b.py |
| llama2_13b_chat | configs/models/llama2_13b_chat.py |
| llama2_70b_chat | configs/models/llama2_70b_chat.py |
| llama2_7b_chat | configs/models/llama2_7b_chat.py |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| cmmlu_ppl | configs/datasets/cmmlu/cmmlu_ppl.py |
| cmmlu_ppl_fd1f2f | configs/datasets/cmmlu/cmmlu_ppl_fd1f2f.py |
| mmlu_gen | configs/datasets/mmlu/mmlu_gen.py |
| mmlu_gen_23a9a9 | configs/datasets/mmlu/mmlu_gen_23a9a9.py |
| mmlu_gen_5d1409 | configs/datasets/mmlu/mmlu_gen_5d1409.py |
| mmlu_gen_79e572 | configs/datasets/mmlu/mmlu_gen_79e572.py |
| mmlu_gen_a484b3 | configs/datasets/mmlu/mmlu_gen_a484b3.py |
| mmlu_ppl | configs/datasets/mmlu/mmlu_ppl.py |
| mmlu_ppl_ac766d | configs/datasets/mmlu/mmlu_ppl_ac766d.py |
+-------------------+---------------------------------------------------+
```
## Dataset Suffix Updater
This tool quickly updates the suffixes of the config files under `configs/datasets` so that they follow the prompt-hash naming convention.
Usage:
```bash
python tools/update_dataset_suffix.py
```
================================================
FILE: docs/zh_cn/user_guides/config.md
================================================
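# Learn About Config
OpenCompass uses the new-style OpenMMLab configuration files. If you are already familiar with OpenMMLab-style configuration files, you can go straight to
[A Pure Python Style Configuration File (Beta)](https://mmengine.readthedocs.io/zh_CN/latest/advanced_tutorials/config.html#python-beta)
to learn how the new style differs from the original one. If you have not worked with OpenMMLab-style configuration files before,
the simple example below introduces how they are used. Make sure you have installed the latest version of MMEngine, which supports the new-style configuration files.
## Basic Format
OpenCompass configuration files are written in Python and follow basic Python syntax. Each configuration item is specified by defining a variable.
For example, a model is defined with the following configuration: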
```python
# model_cfg.py
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        path='huggyllama/llama-7b',
        model_kwargs=dict(device_map='auto'),
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        max_out_len=50,
        run_cfg=dict(num_gpus=8, num_procs=1),
    )
]
```
Configuration files are parsed with `Config.fromfile` from MMEngine.
```python
>>> from mmengine.config import Config
>>> cfg = Config.fromfile('./model_cfg.py')
>>> print(cfg.models[0])
{'type': HuggingFaceCausalLM, 'path': 'huggyllama/llama-7b', 'model_kwargs': {'device_map': 'auto'}, ...}
```
## Inheritance Mechanism
OpenCompass configuration files use Python's import mechanism for inheritance. Note that
the `read_base` context manager must be used when inheriting configuration files.
```python
# inherit.py
from mmengine.config import read_base

with read_base():
    from .model_cfg import models  # models from model_cfg.py is inherited into this config
```
Parse the configuration file with `Config.fromfile`:
```python
>>> from mmengine.config import Config
>>> cfg = Config.fromfile('./inherit.py')
>>> print(cfg.models[0])
{'type': HuggingFaceCausalLM, 'path': 'huggyllama/llama-7b', 'model_kwargs': {'device_map': 'auto'}, ...}
```
## Evaluation Configuration Example
```python
# configs/llama7b.py
from mmengine.config import read_base

with read_base():
    # Read the required dataset configs directly from the preset dataset configs
    from .datasets.piqa.piqa_ppl import piqa_datasets
    from .datasets.siqa.siqa_gen import siqa_datasets

# Concatenate the datasets to be evaluated into the datasets field
datasets = [*piqa_datasets, *siqa_datasets]

# Use HuggingFaceCausalLM to evaluate models supported by HuggingFace's AutoModelForCausalLM
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        # Initialization parameters of HuggingFaceCausalLM
        path='huggyllama/llama-7b',
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        # Parameters required for every model type; not initialization parameters of HuggingFaceCausalLM
        abbr='llama-7b',  # model abbreviation, used when displaying results
        max_out_len=100,  # maximum number of generated tokens
        batch_size=16,  # batch size
        run_cfg=dict(num_gpus=1),  # run configuration, specifying resource requirements
    )
]
```
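## Dataset Configuration File Example
In the example configuration above, the dataset configs are obtained through inheritance. Next,
we use the PIQA dataset configuration file as an example to explain the meaning of each field in a dataset configuration file.
If you do not plan to modify the prompts used for testing or to add new datasets, you can skip this section.
The PIQA dataset [configuration file](https://github.com/open-compass/opencompass/blob/main/configs/datasets/piqa/piqa_ppl_1cf9f0.py)
is shown below. It is a configuration that evaluates with PPL (perplexity) and does not use In-Context Learning.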
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset

# Reader configuration
# A loaded dataset usually organizes samples as dictionaries. Specify the input fields
# used to build the prompt and the output field used as the answer.
piqa_reader_cfg = dict(
    input_columns=['goal', 'sol1', 'sol2'],
    output_column='label',
    test_split='validation',
)

# Inference configuration
piqa_infer_cfg = dict(
    # Prompt generation configuration
    prompt_template=dict(
        type=PromptTemplate,
        # Prompt template; its form must match the inferencer type specified below
        # To compute PPL, a prompt template is needed for each candidate answer
        template={
            0: 'The following makes sense: \nQ: {goal}\nA: {sol1}\n',
            1: 'The following makes sense: \nQ: {goal}\nA: {sol2}\n'
        }),
    # In-context example configuration; `ZeroRetriever` means no in-context examples are used
    retriever=dict(type=ZeroRetriever),
    # Inference method configuration
    #   - PPLInferencer obtains the answer using PPL (perplexity)
    #   - GenInferencer obtains the answer from the model's generated output
    inferencer=dict(type=PPLInferencer))

# Evaluation configuration, using Accuracy as the metric
piqa_eval_cfg = dict(evaluator=dict(type=AccEvaluator))

# Dataset configuration; the variables above are its parameters
# It is a list that specifies the config of each evaluation subset of a dataset.
piqa_datasets = [
    dict(
        type=HFDataset,
        path='piqa',
        reader_cfg=piqa_reader_cfg,
        infer_cfg=piqa_infer_cfg,
        eval_cfg=piqa_eval_cfg)
]
```
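For details on the **prompt generation configuration**, see [Prompt Template](../prompt/prompt_template.md).
## Advanced Evaluation Configuration
OpenCompass supports configuration items such as the task partitioner (Partitioner) and the run backend (Runner),
which allow computing resources to be used more flexibly and efficiently.
By default, inference tasks are partitioned by sample count. You can use
`--max-partition-size` when launching a task to set the sample-count threshold for partitioning. Inference and evaluation tasks also run on local resources by default.
To use Slurm cluster resources, pass the `--slurm` and `--partition` options at launch to select the Slurm run backend.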
进一步地,如果以上功能无法满足你的任务划分和运行后端配置需求,你可以在配置文件中进行更详细的配置。
参见[数据分片](./evaluation.md)。
================================================
FILE: docs/zh_cn/user_guides/corebench.md
================================================
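# Performance on Major Datasets
We selected several well-known benchmarks for evaluating large language models (LLMs) and provide detailed performance results of major LLMs on these datasets.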
| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct(lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------ | --------------------------- |
| MMLU | - | naive_average | gen | 83.6 | 84.2 | 84.6 | 80.5 | 77.2 |
| CMMLU | - | naive_average | gen | 71.9 | 72.4 | 74.2 | 70.1 | 59.7 |
| CEval-Test | - | naive_average | gen | 69.7 | 70.5 | 71.7 | 66.9 | 58.7 |
| GaokaoBench | - | weighted_average | gen | 74.8 | 76.0 | 74.2 | 67.8 | 60.0 |
| Triviaqa_wiki(1shot) | 01cf41 | score | gen | 73.1 | 82.9 | 82.4 | 89.8 | 89.7 |
| NQ_open(1shot) | eaf81e | score | gen | 27.9 | 30.4 | 39.4 | 40.1 | 46.8 |
| Race-High | 9a54b6 | accuracy | gen | 89.3 | 89.6 | 90.8 | 89.4 | 84.8 |
| WinoGrande | 6447e6 | accuracy | gen | 80.7 | 83.3 | 84.1 | 69.7 | 76.6 |
| HellaSwag | e42710 | accuracy | gen | 92.7 | 93.5 | 94.6 | 87.7 | 86.1 |
| BBH | - | naive_average | gen | 82.7 | 78.5 | 78.5 | 80.5 | 79.1 |
| GSM-8K | 1d7fe4 | accuracy | gen | 80.5 | 79.7 | 87.7 | 90.2 | 88.3 |
| Math | 393424 | accuracy | gen | 61.9 | 71.2 | 60.2 | 47.1 | 50 |
| TheoremQA | ef26ca | accuracy | gen | 28.4 | 23.3 | 29.6 | 25.4 | 13 |
| HumanEval | 8e312c | humaneval_pass@1 | gen | 74.4 | 82.3 | 76.2 | 72.6 | 72.0 |
| MBPP(sanitized) | 1e1056 | score | gen | 78.6 | 77.0 | 76.7 | 71.6 | 68.9 |
| GPQA_diamond | 4baadb | accuracy | gen | 40.4 | 48.5 | 46.5 | 38.9 | 36.4 |
| IFEval | 3321a3 | Prompt-level-strict-accuracy | gen | 71.9 | 79.9 | 80.0 | 77.1 | 65.8 |
================================================
FILE: docs/zh_cn/user_guides/datasets.md
================================================
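# Configure Datasets
This tutorial focuses on how to select and configure the datasets you need. Make sure you have downloaded the datasets following the steps in [Dataset Preparation](../get_started/installation.md#数据集准备).
## Directory Structure of Dataset Config Files
First, a brief look at the structure of the OpenCompass `configs/datasets` directory: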
```text
configs/datasets/
├── agieval
├── apps
├── ARC_c
├── ...
├── CLUE_afqmc # Dataset
│ ├── CLUE_afqmc_gen_901306.py # Dataset config files for different versions
│ ├── CLUE_afqmc_gen.py
│ ├── CLUE_afqmc_ppl_378c5b.py
│ ├── CLUE_afqmc_ppl_6507d7.py
│ ├── CLUE_afqmc_ppl_7b0c1e.py
│ └── CLUE_afqmc_ppl.py
├── ...
├── XLSum
├── Xsum
└── z_bench
```
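Under `configs/datasets`, all datasets are laid out flat, and each dataset's folder contains multiple dataset configs.
Config file names follow the pattern `{dataset name}_{evaluation method}_{prompt version}.py`. For example, `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` is a config for the `CLUE_afqmc` dataset, which falls under Chinese general ability; its evaluation method is `gen` (generative evaluation) and its prompt version is `db509b`. Likewise, `CLUE_afqmc_ppl_00b348.py` uses the `ppl` (discriminative) evaluation method with prompt version `00b348`.
Files without a version number, such as `CLUE_afqmc_gen.py`, point to the latest prompt config for that evaluation method, which is usually the prompt with the highest accuracy.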
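The naming convention can be illustrated with a small parser. This is only a sketch and not part of OpenCompass: `parse_config_name` and `DatasetConfigName` are hypothetical names, and the pattern assumes the evaluation method is `gen` or `ppl` and the prompt version is a 6-character hexadecimal hash, as in the file names shown here.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the evaluation method is `gen` or `ppl`, and the prompt version
# is a 6-character hexadecimal hash, matching the file names in this section.
_CONFIG_NAME = re.compile(
    r'^(?P<dataset>.+)_(?P<mode>gen|ppl)(?:_(?P<version>[0-9a-f]{6}))?\.py$')


class DatasetConfigName(NamedTuple):
    dataset: str
    mode: str
    version: Optional[str]  # None for the unversioned "latest" file


def parse_config_name(filename: str) -> DatasetConfigName:
    """Split a dataset config file name into dataset, mode, and version."""
    match = _CONFIG_NAME.match(filename)
    if match is None:
        raise ValueError(f'unrecognized config file name: {filename}')
    return DatasetConfigName(**match.groupdict())


print(parse_config_name('CLUE_afqmc_gen_db509b.py'))
print(parse_config_name('CLUE_afqmc_ppl.py'))
```

Config names that use other evaluation-method tokens would need the pattern extended.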
## Dataset Selection
In each dataset config file, the datasets are defined in a `{}_datasets` variable, for example `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` below.
```python
afqmc_datasets = [
dict(
abbr="afqmc-dev",
type=AFQMCDatasetV2,
path="./data/CLUE/AFQMC/dev.json",
reader_cfg=afqmc_reader_cfg,
infer_cfg=afqmc_infer_cfg,
eval_cfg=afqmc_eval_cfg,
),
]
```
And `cmnli_datasets` in `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`:
```python
cmnli_datasets = [
dict(
type=HFDataset,
abbr='cmnli',
path='json',
split='train',
data_files='./data/CLUE/cmnli/cmnli_public/dev.json',
reader_cfg=cmnli_reader_cfg,
infer_cfg=cmnli_infer_cfg,
eval_cfg=cmnli_eval_cfg)
]
```
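Taking these two datasets as an example, if you want to evaluate both at the same time, you can create a new config file under the `configs` directory. We use the direct-import mechanism of `mmengine` configs to build the dataset part of the parameters, as shown below: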
```python
from mmengine.config import read_base
with read_base():
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_db509b import afqmc_datasets
    from .datasets.CLUE_cmnli.CLUE_cmnli_ppl_b78ad4 import cmnli_datasets
datasets = []
datasets += afqmc_datasets
datasets += cmnli_datasets
```
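Users can choose config files for different abilities, datasets, and evaluation methods as needed to build the dataset part of an evaluation script.
For how to launch an evaluation task and how to evaluate a custom dataset, refer to the related documentation.
### Repeated Evaluation of a Dataset
In a dataset config, you can set the parameter `n` to evaluate the same dataset multiple times; the averaged metric is returned. For example: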
```python
afqmc_datasets = [
dict(
abbr="afqmc-dev",
type=AFQMCDatasetV2,
path="./data/CLUE/AFQMC/dev.json",
n=10, # Run the evaluation 10 times
reader_cfg=afqmc_reader_cfg,
infer_cfg=afqmc_infer_cfg,
eval_cfg=afqmc_eval_cfg,
),
]
```
In addition, for binary metrics (such as accuracy and pass rate), you can set the parameter `k` together with `n` to run a [G-Pass@k](http://arxiv.org/abs/2412.13147) evaluation. G-Pass@k is computed as:
```{math}
\text{G-Pass@}k_\tau=E_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],
```
where $n$ is the number of evaluation runs and $c$ is the number of runs out of $n$ that passed or were correct. A config example:
```python
aime2024_datasets = [
dict(
abbr='aime2024',
type=Aime2024Dataset,
path='opencompass/aime2024',
k=[2, 4], # Return the G-Pass@2 and G-Pass@4 results
n=12, # 12 evaluation runs
...
)
]
```
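As a sanity check on the formula, the per-question term can be computed directly with binomial coefficients. The sketch below is an independent illustration, not the OpenCompass implementation; the function names are hypothetical, and the sum is capped at `min(c, k)` because a draw of `k` runs cannot contain more than `k` correct ones.

```python
from math import ceil, comb
from typing import Sequence


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """G-Pass@k_tau for one question: n runs, c of them correct, k draws."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError('require 0 <= c <= n and 1 <= k <= n')
    lower = ceil(tau * k)
    # j counts correct runs among the k draws, so it is bounded by both c and k.
    upper = min(c, k)
    return sum(comb(c, j) * comb(n - c, k - j)
               for j in range(lower, upper + 1)) / comb(n, k)


def dataset_g_pass_at_k(correct_counts: Sequence[int], n: int, k: int,
                        tau: float) -> float:
    """Average the per-question values (the expectation over the data)."""
    return sum(g_pass_at_k(n, c, k, tau)
               for c in correct_counts) / len(correct_counts)


# A question answered correctly in 2 of 4 runs, with k=2 draws:
print(g_pass_at_k(n=4, c=2, k=2, tau=1.0))  # both draws must be correct
print(g_pass_at_k(n=4, c=2, k=2, tau=0.5))  # at least one draw must be correct
```

With `tau=1.0` every drawn run must be correct, so the value drops quickly as soon as a question is not solved consistently.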
================================================
FILE: docs/zh_cn/user_guides/deepseek_r1.md
================================================
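# Tutorial for Evaluating Strong Reasoning Models
OpenCompass provides an evaluation tutorial (on math datasets) for the DeepSeek R1 series of reasoning models.
- At the model level, we recommend sampling-based decoding to reduce the heavy repetition caused by greedy decoding.
- At the dataset level, for benchmarks with a small amount of data, we run the evaluation multiple times and take the average.
- At the answer-verification level, to reduce misjudgments from rule-based evaluation, we uniformly use LLM-based verification.
## Installation and Preparation
Please install OpenCompass following the OpenCompass installation guide.
## Build the Evaluation Config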
We provide an example config in `examples/eval_deepseek_r1.py`; the evaluation config is explained below.
### Evaluation Config Walkthrough
#### 1. Dataset and Verifier Config
```python
# Dataset config that supports multiple runs (example)
from opencompass.models import OpenAISDK
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets
datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)
# Set up the LLM verifier. You need to start an API server in advance with a tool such as LMDeploy/vLLM/SGLang,
# or directly use a model service compatible with the OpenAI standard interface
verifier_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',  # Replace with the actual path
    key='YOUR_API_KEY',  # Replace with a real API key
    openai_api_base=['http://your-api-endpoint'],  # Replace with the API address
    query_per_second=16,
    batch_size=1024,
    temperature=0.001,
    max_out_len=16384
)
# Apply the verifier to all datasets
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
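#### 2. Model Config
We provide an evaluation example that uses LMDeploy as the inference backend. Users can switch to a different model by modifying `path` (the HF path).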
```python
# LMDeploy model config example
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-7b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768
),
max_seq_len=32768,
batch_size=64,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
# Can be extended with 14B/32B configs...
]
```
#### 3. Evaluation Pipeline Config
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)))
# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)))
```
#### 4. Result Summary Config
```python
# Config for averaging the results of multiple runs
summary_groups = [
    {
        'name': 'AIME2024-Aveage8',
        'subsets': [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
    },
    # Averaging configs for other datasets...
]
summarizer = dict(
    dataset_abbrs=[
        ['AIME2024-Aveage8', 'naive_average'],
        # Metrics for other datasets...
    ],
    summary_groups=summary_groups
)
# Work directory
work_dir = "outputs/deepseek_r1_reasoning"
```
## Run the Evaluation
### Scenario 1: model loaded on 1 GPU, data evaluated by 1 worker, 1 GPU in total
```bash
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
```
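The evaluation log is printed to the command line.
### Scenario 2: model loaded on 1 GPU, data evaluated by 8 workers, 8 GPUs in total
You need to modify the infer config in the config file and set num_worker to 8: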
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)))
```
Also remove the `--debug` flag from the evaluation command:
```bash
opencompass examples/eval_deepseek_r1.py --dump-eval-details
```
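In this mode, OpenCompass launches `$num_worker` tasks using multithreading. Detailed logs are not shown on the command line; they are written under `$work_dir` instead.
### Scenario 3: model loaded on 2 GPUs, data evaluated by 4 workers, 8 GPUs in total
Note that `num_gpus` in the model config's `run_cfg` must be set to 2 (if an inference backend is used, its parameters must be changed accordingly; for example, `tp` in LMDeploy must be set to 2). Also change `num_worker` in the `infer` config to 4.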
```python
models += [
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-14b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768),
max_seq_len=32768,
max_out_len=32768,
batch_size=128,
run_cfg=dict(num_gpus=2),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
]
```
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)))
```
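The GPU totals quoted in the three scenarios follow from a simple product. The sketch below only illustrates that arithmetic, assuming each worker loads its own copy of the model; `total_gpus` is a hypothetical helper and not an OpenCompass function.

```python
def total_gpus(num_gpus_per_model: int, num_worker: int) -> int:
    """GPUs needed when each worker loads its own copy of the model (assumption)."""
    return num_gpus_per_model * num_worker


# (run_cfg.num_gpus, partitioner.num_worker) for the three scenarios above
scenarios = {'scenario 1': (1, 1), 'scenario 2': (1, 8), 'scenario 3': (2, 4)}
for name, (gpus, workers) in scenarios.items():
    print(f'{name}: {total_gpus(gpus, workers)} GPU(s)')
```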
### Evaluation Results
The evaluation results are shown below:
```bash
dataset                             version    metric         mode    deepseek-r1-distill-qwen-7b-turbomind
----------------------------------  ---------  -------------  ------  ---------------------------------------
MATH                                -          -              -
AIME2024-Aveage8                    -          naive_average  gen     56.25
```
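## Performance Baseline Reference
Because the model decodes with sampling and AIME has a small amount of data, scores can still fluctuate by 1-3 points even when averaging over 8 runs.
| Model                        | Dataset  | Metric   | Value |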
| ---------------------------- | -------- | -------- | ---- |
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |
================================================
FILE: docs/zh_cn/user_guides/evaluation.md
================================================
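# Data Partitioning
OpenCompass supports custom task partitioners (`Partitioner`) for flexibly splitting evaluation jobs, together with `Runner`, which controls the platform on which tasks execute, such as the local machine or a cluster. Combining the two, OpenCompass can split a large evaluation job across many compute nodes, making efficient use of computing resources and greatly speeding up evaluation.
By default, OpenCompass hides these details from users and automatically selects the recommended execution strategy. Users can still customize the strategy to their needs by adding the `infer` and/or `eval` fields to the config file: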
```python
from opencompass.partitioners import SizePartitioner, NaivePartitioner
from opencompass.runners import LocalRunner, SlurmRunner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=5000),
runner=dict(
type=SlurmRunner,
max_num_workers=64,
task=dict(type=OpenICLInferTask),
retry=5),
)
eval = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=32,
task=dict(type=OpenICLEvalTask)),
)
```
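The example above shows how to configure the execution strategy for the inference and evaluation stages. In the inference stage, the work is split into subtasks of 5000 samples each, which are then submitted to the Slurm cluster for execution, with at most 64 tasks running in parallel. In the evaluation stage, each model-dataset pair forms one task, and 32 processes are launched locally to compute the metrics.
The following sections describe the modules involved in detail.
## Task Partitioning (Partitioner)
Because LLM inference is slow and evaluation datasets are large, running an evaluation job serially is often very time-consuming.
OpenCompass supports custom task partitioners (`Partitioner`) that split a large evaluation job into many independent small tasks according to different strategies, so that running them in parallel makes full use of computing resources. Users can configure the partitioning strategy for the inference and evaluation stages through `infer.partitioner` and `eval.partitioner`. Below, we introduce all the partitioning strategies supported in OpenCompass.
### `NaivePartitioner`
This partitioner dispatches each model-dataset combination as an independent task. It is the most basic partitioning strategy and takes no additional parameters.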

```python
from opencompass.partitioners import NaivePartitioner
infer = dict(
partitioner=dict(type=NaivePartitioner)
# ...
)
```
### `NumWorkerPartitioner`
```{warning}
This partitioner is currently not applicable to evaluation-stage tasks (`OpenICLEvalTask`).
```
```{note}
This partitioner is currently the default for the inference stage.
```
```{warning}
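Because of how it is implemented, if you need to resume inference from a checkpoint, do not change the value of `num_split` (or, if `num_split` is `None`, do not change the value of `num_worker`).
```
This partitioner splits each dataset into `num_split` parts and then distributes these parts evenly across `num_worker` tasks. The number of tasks is expected to equal the number of workers actually running.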


```python
from opencompass.partitioners import NumWorkerPartitioner
infer = dict(
partitioner=dict(
type=NumWorkerPartitioner,
num_worker=16, # Number of tasks after partitioning / expected number of workers
num_split=None, # Number of parts each dataset is split into. If None, num_worker is used.
min_task_size=16, # Minimum number of data items per part
),
# ...
)
```
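The split-then-distribute idea behind `NumWorkerPartitioner` can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's actual algorithm: `split_range` and `assign_to_workers` are hypothetical helpers, parts are contiguous index ranges, and parts are handed to tasks round-robin, whereas the real partitioner may balance differently.

```python
from typing import Dict, List, Optional, Tuple

Part = Tuple[str, Tuple[int, int]]  # (dataset name, [start, end) index range)


def split_range(size: int, num_split: int,
                min_task_size: int = 16) -> List[Tuple[int, int]]:
    """Split [0, size) into at most num_split parts of >= min_task_size items."""
    num_split = max(1, min(num_split, size // min_task_size))
    step, extra = divmod(size, num_split)
    parts, start = [], 0
    for i in range(num_split):
        end = start + step + (1 if i < extra else 0)
        parts.append((start, end))
        start = end
    return parts


def assign_to_workers(dataset_sizes: Dict[str, int], num_worker: int,
                      num_split: Optional[int] = None,
                      min_task_size: int = 16) -> List[List[Part]]:
    """Distribute the parts of every dataset round-robin over num_worker tasks."""
    num_split = num_split or num_worker  # None falls back to num_worker
    tasks: List[List[Part]] = [[] for _ in range(num_worker)]
    index = 0
    for name, size in dataset_sizes.items():
        for part in split_range(size, num_split, min_task_size):
            tasks[index % num_worker].append((name, part))
            index += 1
    return tasks


for i, task in enumerate(assign_to_workers({'a': 100, 'b': 40}, num_worker=4)):
    print(i, task)
```

In this sketch a dataset smaller than `num_split * min_task_size` is split into fewer parts, which mirrors the purpose of `min_task_size` in the config above.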
### `SizePartitioner`
```{warning}
This partitioner is currently not applicable to evaluation-stage tasks (`OpenICLEvalTask`).
```
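This partitioner estimates the inference cost (time) of a dataset from its size multiplied by an expansion coefficient. It then creates tasks by splitting large datasets and merging small ones, keeping the inference cost of the subtasks as equal as possible.

Commonly used parameters of this partitioner: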
```python
from opencompass.partitioners import SizePartitioner
infer = dict(
partitioner=dict(
type=SizePartitioner,
max_task_size=2000, # Maximum size of a single task (int)
gen_task_coef=20, # Expansion coefficient for generative tasks (int)
),
# ...
)
```
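When estimating the inference cost of a dataset, `SizePartitioner` chooses the expansion coefficient according to the type of inference task. For generative tasks, such as those using `GenInferencer`, it uses the relatively large `gen_task_coef`; for discriminative tasks, such as those using `PPLInferencer`, it uses the number of labels in the prompt.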
```{note}
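The current implementation of this partitioning strategy is still rather rough and does not accurately reflect the difference in computation between generative and discriminative tasks. We look forward to better partitioning strategies from the community :)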
```
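The cost estimate described for `SizePartitioner` can be written out as a small calculation. This is a sketch of the stated rule only (size times an expansion coefficient); `estimate_cost` and `num_chunks` are hypothetical helpers, the `'gen'`/`'ppl'` strings stand in for the inferencer types, and the merging of small datasets is not shown.

```python
from math import ceil


def estimate_cost(num_samples: int, inferencer: str, num_labels: int = 1,
                  gen_task_coef: int = 20) -> int:
    """Estimated inference cost: dataset size times an expansion coefficient."""
    if inferencer == 'gen':    # generative task, e.g. GenInferencer
        coef = gen_task_coef
    elif inferencer == 'ppl':  # discriminative task, e.g. PPLInferencer
        coef = num_labels      # one forward pass per candidate label
    else:
        raise ValueError(f'unknown inferencer kind: {inferencer}')
    return num_samples * coef


def num_chunks(cost: int, max_task_size: int = 2000) -> int:
    """Number of subtasks needed so that no subtask exceeds max_task_size."""
    return max(1, ceil(cost / max_task_size))


gen_cost = estimate_cost(1000, 'gen')                # 1000 samples, generative
ppl_cost = estimate_cost(1838, 'ppl', num_labels=2)  # 1838 samples, 2 labels
print(gen_cost, num_chunks(gen_cost))
print(ppl_cost, num_chunks(ppl_cost))
```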
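## Execution Backend (Runner)
In a multi-GPU, multi-machine cluster environment, running multiple tasks in parallel usually relies on a cluster management system (such as Slurm) to allocate and schedule tasks. In OpenCompass, task allocation and execution are handled uniformly by the Runner. Two scheduling backends, Slurm and PAI-DLC, are currently supported, and `LocalRunner` is retained for launching tasks directly on the local machine.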
### `LocalRunner`
`LocalRunner` is the most basic runner; it runs tasks in parallel on the local machine.
```python
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
infer = dict(
# ...
runner=dict(
type=LocalRunner,
max_num_workers=16, # Maximum number of processes running in parallel
task=dict(type=OpenICLInferTask), # Task to run
)
)
```
```{note}
The actual number of running tasks is limited by the available GPU resources and by `max_num_workers`.
```
### `SlurmRunner`
`SlurmRunner` submits tasks to a Slurm cluster. Commonly used config fields:
```python
from opencompass.runners import SlurmRunner
from opencompass.tasks import OpenICLInferTask
infer = dict(
# ...
runner=dict(
type=SlurmRunner,
task=dict(type=OpenICLInferTask), # Task to run
max_num_workers=16, # Maximum number of tasks evaluated concurrently
retry=2, # Number of retries for a failed task; helps avoid unexpected errors
),
)
```
### `DLCRunner`
`DLCRunner` submits tasks to Alibaba Deep Learning Center (DLC). This runner depends on dlc. First, prepare dlc in your environment:
```bash
cd ~
wget https://dlc-cli.oss-cn-zhangjiakou.aliyuncs.com/light/binary/linux/amd64/dlc
chmod +x ./dlc
sudo ln -rs dlc /usr/local/bin
./dlc config
```
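Fill in the required information as prompted to obtain the dlc config file (for example, /user/.dlc/config). This completes the preparation. Then specify the `DLCRunner` configuration in the config file using the following format: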
```python
from opencompass.runners import DLCRunner
from opencompass.tasks import OpenICLInferTask
infer = dict(
# ...
runner=dict(
type=DLCRunner,
task=dict(type=OpenICLInferTask), # Task to run
max_num_workers=16, # Maximum number of tasks evaluated concurrently
aliyun_cfg=dict(
bashrc_path="/user/.bashrc", # Path to the bashrc used to initialize the runtime environment
conda_env_name='opencompass', # Conda environment for OpenCompass
dlc_config_path="/user/.dlc/config", # dlc config file
workspace_id='ws-xxx', # DLC workspace ID
worker_image='xxx', # Image URL for running the task
),
retry=2, # Number of retries for a failed task; helps avoid unexpected errors
),
)
```
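## Task
A Task is a basic module in OpenCompass. It is a standalone script that performs compute-intensive operations. Each task determines its parameter settings from a config file and can be executed in two different ways:
1. Instantiate a task object, then call the `task.run()` method.
2. Call the `get_command` method, passing in the config path and a command template string containing the `{task_cmd}` placeholder (for example, `srun {task_cmd}`). The returned command string is the complete command and can be executed directly.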
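The `{task_cmd}` placeholder in the second approach behaves like ordinary string templating. The sketch below shows only that substitution step; `build_command` is a hypothetical helper and the task command string is made up, so it does not reproduce what `get_command` actually generates.

```python
def build_command(template: str, task_cmd: str) -> str:
    """Wrap a task command with a launcher via the {task_cmd} placeholder."""
    if '{task_cmd}' not in template:
        raise ValueError('template must contain the {task_cmd} placeholder')
    return template.format(task_cmd=task_cmd)


# Hypothetical task command; the real command string is produced by the task.
command = build_command('srun {task_cmd}', 'python my_task.py path/to/config.py')
print(command)
```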
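OpenCompass currently supports the following task types:
- `OpenICLInferTask`: runs language model (LM) inference tasks based on the OpenICL framework.
- `OpenICLEvalTask`: runs language model (LM) evaluation tasks based on the OpenEval framework.
OpenCompass will support more task types in the future.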
================================================
FILE: docs/zh_cn/user_guides/experimentation.md
================================================
# Task Execution and Monitoring
## Launching an Evaluation Task
The entry point for evaluation tasks is `run.py`. Usage:
```shell
python run.py $EXP {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--debug] [-m MODE] [-r [REUSE]] [-w WORKDIR] [-l] [--dry-run] [--dump-eval-details]
```
Task configuration (`$EXP`):
- `run.py` accepts a .py config file as the task parameters; it must contain the `datasets` and `models` fields.
```bash
python run.py configs/eval_demo.py
```
- If no config file is passed, you can also specify models and datasets with `--models MODEL1 MODEL2 ...` and `--datasets DATASET1 DATASET2 ...`:
```bash
python run.py --models hf_opt_350m hf_opt_125m --datasets siqa_gen winograd_ppl
```
- For HuggingFace models, you can also quickly define a model on the command line with the HuggingFace arguments, then define the datasets with `--datasets DATASET1 DATASET2 ...`.
```bash
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path huggyllama/llama-7b
```
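The full list of HuggingFace arguments:
- `--hf-path`: HuggingFace model path
- `--peft-path`: PEFT model path
- `--tokenizer-path`: HuggingFace tokenizer path (can be omitted if it is the same as the model path)
- `--model-kwargs`: arguments for constructing the model
- `--tokenizer-kwargs`: arguments for constructing the tokenizer
- `--max-out-len`: maximum number of generated tokens
- `--max-seq-len`: maximum sequence length the model can accept
- `--batch-size`: batch size
- `--hf-num-gpus`: number of GPUs required to run the model
Launch methods:
- Run on the local machine: `run.py $EXP`.
- Run with srun: `run.py $EXP --slurm -p $PARTITION_name`.
- Run with dlc: `run.py $EXP --dlc --aliyun-cfg $AliYun_Cfg`
- Customized launch: `run.py $EXP`. Here $EXP is a config file that contains the `eval` and `infer` fields; for details, see [Data Partitioning](./evaluation.md).
The arguments are explained below:
- `-p`: specifies the Slurm partition;
- `-q`: specifies the Slurm quotatype (default None); one of reserved, auto, spot. This argument may apply only to some Slurm variants;
- `--debug`: when enabled, inference and evaluation tasks run in single-process mode and output is echoed in real time, which is convenient for debugging;
- `-m`: run mode, default `all`. `infer` runs only inference and produces the model outputs; if model outputs already exist in `{WORKDIR}`, `eval` runs only evaluation and produces the evaluation results; if individual evaluation results already exist in `results/`, `viz` runs only visualization; `all` runs both inference and evaluation.
- `-r`: reuse existing inference results. If followed by a timestamp, the results under that timestamp in the work directory are reused; otherwise the latest results in the specified work directory are reused.
- `-w`: specifies the work directory, default `./outputs/default`
- `-l`: enables status reporting through the Lark (Feishu) bot.
- `--dry-run`: when enabled, inference and evaluation tasks are dispatched but not actually run, which is convenient for debugging;
- `--dump-eval-details`: enabled by default. The evaluation results under `results` will include more detailed information, such as whether each sample was answered correctly. To disable it, set `--dump-eval-details False`.
Taking run mode `-m all` as an example, the overall flow is:
1. Read the config file and parse the model, dataset, evaluator, and other configuration information.
2. An evaluation job consists of three stages: inference (`infer`), evaluation (`eval`), and visualization (`viz`). Inference and evaluation are split into tasks by the Partitioner and then handed to the Runner for parallel execution. Individual inference and evaluation tasks are abstracted as `OpenICLInferTask` and `OpenICLEvalTask`.
3. After both stages finish, the visualization stage reads the evaluation results in `results/` and generates a visualization report.
## Task Monitoring: Lark Bot
Users can monitor task status in real time by configuring a Lark (Feishu) bot. For how to set up the bot, [see here](https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN?lang=zh-CN#7a28964d).
Configuration:
1. Open the `configs/lark.py` file and add the following line: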
```python
lark_bot_url = 'YOUR_WEBHOOK_URL'
```
A webhook URL usually looks like https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx .
2. Inherit this file in the full evaluation config:
```python
from mmengine.config import read_base
with read_base():
from .lark import lark_bot_url
```
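3. To keep the bot from sending messages too frequently, runtime status is not reported automatically by default. When needed, enable status reporting with `-l` or `--lark`: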
```bash
python run.py configs/eval_demo.py -p {PARTITION} -l
```
## Run Results
All run results are placed under the `outputs/default/` directory by default. The directory structure is as follows:
```text
outputs/default/
├── 20200220_120000
├── ...
├── 20230220_183030
│ ├── configs
│ ├── logs
│ │ ├── eval
│ │ └── infer
│ ├── predictions
│ │ └── MODEL1
│ └── results
│ └── MODEL1
```
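Each timestamp directory contains the following:
- the configs folder, which stores the config file for each run that used this timestamp as its output directory;
- the logs folder, which stores the output logs of the inference and evaluation stages; within each subfolder, logs are stored per model;
- the predictions folder, which stores the inference results as JSON, with one subfolder per model;
- the results folder, which stores the evaluation results as JSON, with one subfolder per model.
In addition, when `-r` is given without a timestamp, the latest folder (by sort order) is selected as the output directory.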
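Selecting the latest run by sort order can be sketched as follows. It assumes run folders are named `YYYYMMDD_HHMMSS` as in the tree above, so that lexicographic order matches chronological order; `latest_run` is a hypothetical helper, not the function `run.py` uses.

```python
import re
from typing import Iterable, Optional

# Assumption: run folders are named YYYYMMDD_HHMMSS, as in the tree above.
_TIMESTAMP = re.compile(r'^\d{8}_\d{6}$')


def latest_run(dir_names: Iterable[str]) -> Optional[str]:
    """Return the newest timestamp folder name, or None if there is none."""
    stamps = [name for name in dir_names if _TIMESTAMP.match(name)]
    # Fixed-width timestamps sort chronologically when compared as strings.
    return max(stamps) if stamps else None


print(latest_run(['20200220_120000', '20230220_183030', 'notes']))
```

On a real work directory, the names could come from `os.listdir('outputs/default')`.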
## Introduction to the Summarizer (to be updated)
================================================
FILE: docs/zh_cn/user_guides/framework_overview.md
================================================
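# Overview
## Evaluation Targets
The main evaluation targets of this library are large language models and large multimodal models. We use large language models as the example to introduce the specific model types evaluated.
- Base models: models typically obtained by self-supervised training on massive text data (such as OpenAI's GPT-3 and Meta's LLaMA); they usually have strong text-continuation ability.
- Chat models: models typically obtained from a base model through instruction fine-tuning or human preference alignment (such as OpenAI's ChatGPT and Shanghai AI Laboratory's InternLM); they can understand human instructions and have strong conversational ability.
## Tool Architecture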

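- Model layer: the main model types involved in LLM evaluation. OpenCompass focuses on base models and chat models.
- Capability layer: OpenCompass designs its evaluation dimensions around general capabilities and specialized capabilities. General capabilities are evaluated along dimensions such as language, knowledge, understanding, reasoning, and safety. Specialized capabilities are evaluated along dimensions such as long context, code, tool use, and knowledge enhancement.
- Method layer: OpenCompass uses both objective and subjective evaluation. Objective evaluation conveniently assesses a model on tasks with definite answers (such as multiple choice, fill-in-the-blank, and closed-ended question answering). Subjective evaluation assesses users' real satisfaction with model responses; OpenCompass uses both model-assisted subjective evaluation and subjective evaluation based on human feedback.
- Tool layer: OpenCompass provides rich functionality to support efficient, automated evaluation of large language models, including distributed evaluation, prompt engineering, integration with evaluation databases, leaderboard publishing, and evaluation report generation.
## Capability Dimensions
### Design Approach
To evaluate the capabilities of large language models accurately, comprehensively, and systematically, OpenCompass starts from the perspective of artificial general intelligence and combines cutting-edge academic progress with industry best practice to propose a capability evaluation system oriented toward real-world applications. The OpenCompass capability dimensions cover two parts: general capabilities and specialized capabilities.
### General Capabilities
General capabilities cover six dimensions (comprehensive subject ability, knowledge, language, understanding, reasoning, and safety) that together form a well-rounded model capability evaluation system.
#### Comprehensive Subject Ability
This dimension takes the perspective of human development and borrows the classification logic of education to support model evaluation at the level of comprehensive subject ability. The core idea is to classify subjects at each level, from compulsory education through higher education and vocational education, to build a complete subject-ability evaluation scheme.
#### Knowledge
Knowledge measures how well a model has mastered various kinds of knowledge, including but not limited to common sense and professional domain knowledge. This capability expects the model to answer knowledge questions accurately and completely.
#### Reasoning
Reasoning is an important capability dimension of artificial general intelligence. This dimension systematically evaluates a model's reasoning ability, including but not limited to mathematical calculation, logical reasoning, causal inference, and code generation and modification.
#### Understanding
Understanding evaluates a model's comprehension of text, including but not limited to:
- Understanding and analysis of rhetorical devices: understanding the rhetorical devices used in a text and being able to analyze and explain them.
- Text summarization: summarizing given content and extracting information from it.
- Text creation: open-ended or semi-open-ended content creation around a given topic or requirement.
#### Language
Language evaluates a model's performance on linguistic priors. This dimension includes but is not limited to:
- Word understanding and generation: understanding language at the word level and completing tasks such as word recognition and classification, word meaning explanation, and word generation.
- Grammar understanding and correction: understanding the grammar in a text and being able to identify and correct grammatical errors.
- Cross-language translation: translating a given source language into a target language, evaluating the capabilities of existing large models in the multilingual dimension.
#### Safety
Building on the technical characteristics of large language models, OpenCompass designs dimensions to evaluate whether model outputs are legal, compliant, safe, and harmless, supporting the development of safe and responsible large models. This dimension includes but is not limited to:
- Fairness
- Legality
- Harmlessness
- Ethics and morality
- Privacy protection
## Evaluation Methods
OpenCompass combines objective and subjective evaluation. For capability dimensions and scenarios with definite answers, model capability is assessed comprehensively by constructing rich and thorough evaluation sets. For open-ended or semi-open-ended questions that reflect model capability, and for model safety questions, a combination of subjective and objective evaluation is used.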
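### Objective Evaluation
For objective questions with standard answers, we can use quantitative metrics to compare the model's output with the standard answer and measure the model's performance from the result. At the same time, because LLM output is highly free-form, the inputs and outputs need to be constrained and designed during evaluation to minimize the effect of noisy output, so that the model's capability can be assessed more completely and objectively.
To better elicit the model's ability in the tested domain and guide it to output answers following a given template, OpenCompass uses prompt engineering and in-context learning for objective evaluation.
In practice, objective evaluation usually assesses model outputs in one of the following two ways:
- **Discriminative evaluation**: the question is combined with each candidate answer, the model's perplexity is computed on every combination, and the answer with the lowest perplexity is selected as the model's final output. For example, if the model's perplexity is 0.1 on `question? answer1` and 0.2 on `question? answer2`, `answer1` is chosen as the model's output.
- **Generative evaluation**: this method is mainly used for generation tasks such as translation, program generation, and logical analysis questions. In practice, the question is used as the model's raw input and the answer area is left blank for the model to complete. The output usually also needs postprocessing to ensure that it meets the dataset's requirements.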
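The selection rule used in discriminative evaluation can be sketched in a few lines. This is a minimal illustration rather than the `PPLInferencer` implementation: `perplexity` and `choose_by_ppl` are hypothetical helpers, perplexity is taken as the exponential of the negative mean token log-probability, and the log-probabilities below are made up.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity as exp of the negative mean natural-log token probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_by_ppl(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate whose 'question + answer' text has the lowest PPL."""
    return min(candidates, key=lambda label: perplexity(candidates[label]))


# Made-up per-token log-probabilities for two 'question? answer' combinations.
scores = {
    'answer1': [-0.2, -0.1, -0.3],
    'answer2': [-1.0, -0.8, -1.2],
}
for label, logprobs in scores.items():
    print(label, round(perplexity(logprobs), 3))
print('chosen:', choose_by_ppl(scores))
```

In a real run, the log-probabilities would come from the model scoring each filled-in prompt template.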
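### Subjective Evaluation (coming soon)
Language is vivid and varied, and many scenarios and capabilities cannot be assessed with objective metrics. For evaluations such as model safety and language ability, evaluation based mainly on human subjective judgment better reflects the model's real capability and is closer to how large models are actually used.
The subjective evaluation scheme adopted by OpenCompass relies on the subjective judgment of human participants to evaluate conversational large language models. In practice, we build a set of subjective test questions in advance based on the model's capability dimensions, present different models' responses to the same question to participants, and collect their ratings based on subjective impressions. Because subjective testing is expensive, this scheme also uses high-performing large language models to simulate humans in subjective scoring. In actual evaluation, subjective evaluation by real human experts is combined with model-based subjective scoring to assess model capability.
When carrying out subjective evaluation, OpenCompass uses two approaches: **single-model response satisfaction statistics** and **multi-model satisfaction comparison**.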
================================================
FILE: docs/zh_cn/user_guides/interns1.md
================================================
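# Intern-S1 Evaluation Tutorial
OpenCompass now provides the model and dataset configs needed to evaluate Intern-S1. Follow the steps below in order to start the evaluation.
## Model Download and Deployment
The Intern-S1 model weights are open source; get them from [Huggingface](https://huggingface.co/internlm/Intern-S1).
After downloading the model, we recommend deploying it as an API service. You can deploy it with LMDeploy, vLLM, or SGLang as described on [this page](https://github.com/InternLM/Intern-S1/blob/main/README.md#Serving).
## Evaluation Configuration
### Model Configuration
We provide an example config that calls the model through OpenAISDK in `opencompass/configs/models/interns1/intern_s1.py`. Adjust it to your needs.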
```python
models = [
dict(
abbr="intern-s1",
key="YOUR_API_KEY", # Fill in the API key of the model service here
openai_api_base="YOUR_API_BASE", # Fill in the API base of the model service here
type=OpenAISDK,
path="internlm/Intern-S1",
temperature=0.7,
meta_template=api_meta_template,
query_per_second=1,
batch_size=8,
max_out_len=64000,
max_seq_len=65536,
openai_extra_kwargs={
'top_p': 0.95,
},
retry=10,
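extra_body={
"chat_template_kwargs": {"enable_thinking": True} # After deploying the service with vllm or sglang, use this switch to control the model's thinking mode
},
pred_postprocessor=dict(type=extract_non_reasoning_content), # With thinking mode on, add this to strip the thinking content during evaluation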
),
]
```
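### Dataset Configuration
We provide the dataset configs used to evaluate Intern-S1 in `examples/eval_bench_intern_s1.py`. You can also add other datasets as needed.
In addition, you need to add the LLM judge configuration to that config file. An example: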
```python
judge_cfg = dict(
abbr='YOUR_JUDGE_MODEL',
type=OpenAISDK,
path='YOUR_JUDGE_MODEL_PATH',
key='YOUR_API_KEY',
openai_api_base='YOUR_API_BASE',
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]),
query_per_second=1,
batch_size=1,
temperature=0.001,
max_out_len=8192,
max_seq_len=32768,
mode='mid',
)
```
## Launch the Evaluation
After completing the configuration above, run the following command to start the evaluation:
```bash
opencompass examples/eval_bench_intern_s1.py
```
================================================
FILE: docs/zh_cn/user_guides/metrics.md
================================================
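# Evaluation Metrics
In the evaluation stage, the evaluation strategy is generally chosen according to the characteristics of the dataset itself. The main criterion is the **type of the standard answer**, which usually falls into one of the following types:
- **Choice**: common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, including MMLU, CEval, and others. The usual metric is accuracy, via `AccEvaluator`.
- **Phrase**: common in question answering and reading comprehension tasks. Datasets of this type mainly include CLUE_CMRC, CLUE_DRCD, DROP, and others. The usual metric is exact match rate, via `EMEvaluator`.
- **Sentence**: common in translation and in generating pseudocode or command lines, mainly including Flores, Summscreen, Govrepcrs, Iwslt2017, and others. The usual metric is BLEU (Bilingual Evaluation Understudy), via `BleuEvaluator`.
- **Paragraph**: common in text summarization tasks. Commonly used datasets include Lcsts, TruthfulQA, Xsum, and others. The usual metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), via `RougeEvaluator`.
- **Code**: common in code generation tasks. Commonly used datasets include Humaneval, MBPP, and others. The usual metrics are execution pass rate and `pass@k`. OpenCompass currently supports `MBPPEvaluator` and `HumanEvalEvaluator`.
There is also a class of **scoring-type** evaluation tasks without standard answers, such as judging whether a model's output is toxic. These can be scored directly with a related API service. `ToxicEvaluator` is currently supported, and the realtoxicityprompts dataset uses this evaluation method.
## Supported Evaluation Metrics
In OpenCompass, commonly used evaluators are mainly located in the [`opencompass/openicl/icl_evaluator`](https://github.com/open-compass/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder, and some dataset-specific metrics are located in certain files under [`opencompass/datasets`](https://github.com/open-compass/opencompass/tree/main/opencompass/datasets). A summary: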
| Evaluator             | Evaluation Strategy  | Common Postprocessing       | Datasets                                                             |
| --------------------- | -------------------- | --------------------------- | -------------------------------------------------------------------- |
| `AccEvaluator`        | Accuracy             | `first_capital_postprocess` | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
| `EMEvaluator`         | Exact match rate     | None, dataset_specification | drop, CLUE_CMRC, CLUE_DRCD                                           |
| `BleuEvaluator`       | BLEU                 | None, `flores`              | flores, iwslt2017, summscreen, govrepcrs                             |
| `RougeEvaluator`      | ROUGE                | None, dataset_specification | truthfulqa, Xsum, XLSum                                              |
| `JiebaRougeEvaluator` | ROUGE                | None, dataset_specification | lcsts                                                                |
| `HumanEvalEvaluator`  | pass@k               | `humaneval_postprocess`     | humaneval                                                            |
| `MBPPEvaluator`       | Execution pass rate  | None                        | mbpp                                                                 |
| `ToxicEvaluator`      | PerspectiveAPI       | None                        | realtoxicityprompts                                                  |
| `AGIEvalEvaluator`    | Accuracy             | None                        | agieval                                                              |
| `AUCROCEvaluator`     | AUC-ROC              | None                        | jigsawmultilingual, civilcomments                                    |
| `MATHEvaluator`       | Accuracy             | `math_postprocess`          | math                                                                 |
| `MccEvaluator`        | Matthews Correlation | None                        | --                                                                   |
| `SquadEvaluator`      | F1-scores            | None                        | --                                                                   |
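To make the first row of the table concrete, the sketch below pairs a capital-letter postprocessing step with an accuracy computation. Both functions are simplified stand-ins written for illustration (assumptions, not the library's `first_capital_postprocess` or `AccEvaluator`), and the raw outputs are made up.

```python
from typing import List


def first_capital(text: str) -> str:
    """Return the first uppercase ASCII letter in text, or '' if there is none."""
    for ch in text:
        if 'A' <= ch <= 'Z':
            return ch
    return ''


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions that exactly match their reference."""
    if not references or len(predictions) != len(references):
        raise ValueError('predictions and references must be non-empty and aligned')
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100 * correct / len(references)


raw_outputs = ['B. because it is heavier', 'C', 'answer: a']  # made-up outputs
predictions = [first_capital(text) for text in raw_outputs]
print(predictions)
print(accuracy(predictions, ['B', 'C', 'A']))
```

A rule this simple also picks up the capital at the start of an ordinary word, such as the "T" in "The answer is B", which is one reason prompts are designed to make the model lead with the option letter.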
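## How to Configure
The evaluation configuration is usually placed in the dataset config file, and the final xxdataset_eval_cfg is passed to the dataset as `eval_cfg`, one of its instantiation parameters.
Below is the definition of `govrepcrs_eval_cfg`; for details, see [configs/datasets/govrepcrs](https://github.com/open-compass/opencompass/tree/main/configs/datasets/govrepcrs).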
```python
from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import GovRepcrsDataset
from opencompass.utils.text_postprocessors import general_cn_postprocess
govrepcrs_reader_cfg = dict(.......)
govrepcrs_infer_cfg = dict(.......)
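# Evaluation metric configuration
govrepcrs_eval_cfg = dict(
evaluator=dict(type=BleuEvaluator), # Use BleuEvaluator, the evaluator commonly used for translation
pred_role='BOT', # Accept the output of the 'BOT' role
pred_postprocessor=dict(type=general_cn_postprocess), # Postprocessing of the predictions
dataset_postprocessor=dict(type=general_cn_postprocess)) # Postprocessing of the dataset's reference answers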
govrepcrs_datasets = [
dict(
type=GovRepcrsDataset, # Dataset class name
path='./data/govrep/', # Dataset path
abbr='GovRepcrs', # Dataset alias
reader_cfg=govrepcrs_reader_cfg, # Dataset reader config: which split, columns, etc. to read
infer_cfg=govrepcrs_infer_cfg, # Dataset inference config, mainly prompt related
eval_cfg=govrepcrs_eval_cfg) # Dataset evaluation config: metrics plus pre- and postprocessing
]
```
================================================
FILE: docs/zh_cn/user_guides/models.md
================================================
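# Prepare Models
There are several ways to support the evaluation of a new model in OpenCompass:
1. HuggingFace-based models
2. API-based models
3. Custom models
## HuggingFace-based Models
In OpenCompass, we support building evaluation models directly from HuggingFace's `AutoModel.from_pretrained` and
`AutoModelForCausalLM.from_pretrained` interfaces. If the model to be evaluated follows the usual generation interface of HuggingFace models,
no code needs to be written; simply specify the relevant settings in the config file.
Below is an example HuggingFace model config file: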
```python
# Use `HuggingFace` to evaluate models supported by AutoModel in HuggingFace
# Use `HuggingFaceCausalLM` to evaluate models supported by AutoModelForCausalLM in HuggingFace
from opencompass.models import HuggingFaceCausalLM
models = [
    dict(
        type=HuggingFaceCausalLM,
        # The following are initialization parameters of `HuggingFaceCausalLM`
        path='huggyllama/llama-7b',
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        batch_padding=False,
        # The following parameters are common to all models; they are not initialization parameters of `HuggingFaceCausalLM`
        abbr='llama-7b',  # Model abbreviation, used when displaying results
        max_out_len=100,  # Maximum number of generated tokens
        batch_size=16,  # Batch size
        run_cfg=dict(num_gpus=1),  # Run configuration, used to specify resource requirements
    )
]
```
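Notes on some of the parameters above:
- `batch_padding=False`: if False, the samples in a batch are inferred one by one; if True, the samples in a batch are padded
and inferred together as one batch. For some models, such padding may lead to unexpected results. If the evaluated model supports padding,
you can set this parameter to True to speed up inference.
- `padding_side='left'`: pad on the left. Not all models support padding, and padding on the right may interfere with the model's output.
- `truncation_side='left'`: truncate on the left. The evaluation prompt usually consists of the in-context example prompts followed by the input prompt.
Truncating the input prompt on the right may make the model's input deviate from the expected format, so truncate on the left when truncation is necessary.
During evaluation, OpenCompass instantiates the model using the `type` and the initialization parameters in the config file.
The other parameters are model-related settings used during inference, summarization, and other steps. For the config above, the following instantiation is performed during evaluation: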
```python
model = HuggingFaceCausalLM(
path='huggyllama/llama-7b',
tokenizer_path='huggyllama/llama-7b',
tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
max_seq_len=2048,
)
```
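The split between initialization parameters and framework-level parameters can be illustrated with a toy builder. This is only a sketch under explicit assumptions: `DummyModel`, `RUN_LEVEL_KEYS`, and `build_model` are hypothetical, and OpenCompass's real model construction does not necessarily work this way.

```python
class DummyModel:
    """Stand-in for a model class; it only stores its init arguments."""

    def __init__(self, path: str, max_seq_len: int = 2048):
        self.path = path
        self.max_seq_len = max_seq_len


# Assumption: these keys are consumed by the framework, not by the model class.
RUN_LEVEL_KEYS = {'type', 'abbr', 'max_out_len', 'batch_size', 'run_cfg'}


def build_model(cfg: dict):
    """Instantiate cfg['type'] with the entries that are init parameters."""
    model_cls = cfg['type']
    init_kwargs = {k: v for k, v in cfg.items() if k not in RUN_LEVEL_KEYS}
    return model_cls(**init_kwargs)


cfg = dict(
    type=DummyModel,
    path='huggyllama/llama-7b',
    max_seq_len=2048,
    abbr='llama-7b',
    max_out_len=100,
    batch_size=16,
    run_cfg=dict(num_gpus=1),
)
model = build_model(cfg)
print(type(model).__name__, model.path, model.max_seq_len)
```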
## API-based Models
OpenCompass currently supports inference with the following API-based models:
- OpenAI (`opencompass.models.OpenAI`)
- ChatGLM@Zhipu Qingyan (`opencompass.models.ZhiPuAI`)
- ABAB-Chat@MiniMax (`opencompass.models.MiniMax`)
- XunFei@iFLYTEK (`opencompass.models.XunFei`)
Below, we use the OpenAI config file as an example to show how to use an API-based model in a config file.
```python
from opencompass.models import OpenAI
models = [
dict(
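type=OpenAI, # Use the OpenAI model
# The following are initialization parameters of `OpenAI`
path='gpt-4', # Model type
key='YOUR_OPENAI_KEY', # OpenAI API key
max_seq_len=2048, # Maximum input length
# The following parameters are common to all models; they are not initialization parameters of `OpenAI`
abbr='GPT-4', # Model abbreviation
run_cfg=dict(num_gpus=0), # Resource requirements (no GPU needed)
max_out_len=512, # Maximum generation length
batch_size=1, # Batch size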
),
]
```
We also provide evaluation examples for API models; see:
```bash
configs
├── eval_zhipu.py
├── eval_xunfei.py
└── eval_minimax.py
```
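## Custom Models
If the approaches above cannot support your model evaluation needs, refer to [Supporting New Models](../advanced_guides/new_model.md) to add support for a new model in OpenCompass.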
================================================
FILE: docs/zh_cn/user_guides/summarizer.md
================================================
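# Results Summary
After the evaluation finishes, the results need to be printed to the screen or saved. This process is controlled by the summarizer.
```{note}
If a summarizer appears in the config, the evaluation results are output according to the logic described below.
If no summarizer appears in the config, the evaluation results are output in the order in which they appear in `dataset`.
```
## Example
A typical summarizer config looks like this: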
```python
summarizer = dict(
dataset_abbrs = [
'race',
'race-high',
'race-middle',
],
summary_groups=[
{'name': 'race', 'subsets': ['race-high', 'race-middle']},
]
)
```
The output is as follows:
```text
dataset version metric mode internlm-7b-hf
----------- --------- ------------- ------ ----------------
race - naive_average ppl 76.23
race-high 0c332f accuracy ppl 74.53
race-middle 0c332f accuracy ppl 77.92
```
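The summarizer takes the `models` and `datasets` in the config as the full set, tries to read the evaluation scores under `{work_dir}/results/`, and displays them in the order of the `summarizer.dataset_abbrs` list. In addition, the summarizer tries to compute aggregate metrics through `summarizer.summary_groups`. The metric named by `name` is generated if and only if all values in `subsets` exist; in other words, if some numbers are missing, the aggregate metric is missing as well. If a score cannot be obtained in either of these two ways, the summarizer shows `-` in the corresponding cell of the table.
The output has multiple columns:
- The `dataset` column corresponds one-to-one with the `summarizer.dataset_abbrs` config.
- The `version` column is the hash of the dataset. The hash takes into account the dataset template's evaluation method, prompt, output length limit, and other information. Users can use this column to check whether two evaluation results are comparable.
- The `metric` column is the evaluation method of the metric; see [metrics](./metrics.md) for details.
- The `mode` column indicates how the inference result was obtained; possible values are `ppl` / `gen`. For items in `summarizer.summary_groups`, if all `subsets` were obtained in the same way, the value matches that of the `subsets`; otherwise it is `mixed`.
- Each of the remaining columns represents one model.
## Full Field Description
The summarizer fields are described below:
- `dataset_abbrs`: (list, optional) the items to display. If omitted, all evaluation results are output.
- `summary_groups`: (list, optional) aggregate metric configuration.
The fields in `summary_groups` are described below:
- `name`: (str) name of the aggregate metric.
- `subsets`: (list) names of the metrics being aggregated. Note that an entry can be not only an original `dataset_abbr` but also the name of another aggregate metric.
- `weights`: (list, optional) weights of the metrics being aggregated. If omitted, an unweighted average is used by default.
Note that the `configs/summarizers/groups` directory contains summary groups for datasets such as MMLU and C-Eval; we recommend using them first.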
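The aggregation rules described above can be sketched as follows. This is an illustration, not the summarizer's code: `summarize_group` and `group_mode` are hypothetical helpers, and treating `weights` as a normalized weighted mean is an assumption about how weighting is applied.

```python
from typing import Dict, Optional, Sequence


def summarize_group(scores: Dict[str, float], subsets: Sequence[str],
                    weights: Optional[Sequence[float]] = None) -> Optional[float]:
    """Average the subset scores; return None if any subset score is missing."""
    if any(name not in scores for name in subsets):
        return None  # shown as `-` in the result table
    if weights is None:
        weights = [1.0] * len(subsets)  # unweighted (naive) average
    # Assumption: weights are normalized by their sum.
    return sum(scores[name] * w for name, w in zip(subsets, weights)) / sum(weights)


def group_mode(modes: Dict[str, str], subsets: Sequence[str]) -> str:
    """Mode of a summary group: the shared subset mode, or 'mixed' otherwise."""
    unique = {modes[name] for name in subsets}
    return unique.pop() if len(unique) == 1 else 'mixed'


scores = {'race-high': 74.53, 'race-middle': 77.92}
modes = {'race-high': 'ppl', 'race-middle': 'ppl'}
subsets = ['race-high', 'race-middle']
print(summarize_group(scores, subsets), group_mode(modes, subsets))
print(summarize_group(scores, ['race-high', 'race-low']))
```

Because the inputs here are the already-rounded scores from the example table, the printed average can differ in the last digit from the value the summarizer reports.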
================================================
FILE: examples/eval_OlympiadBench.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.OlympiadBench.OlympiadBench_0shot_gen_be8b13 import olympiadbench_datasets
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import models as lmdeploy_qwen2_5_7b_instruct_model
    from opencompass.configs.summarizers.OlympiadBench import summarizer
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
from opencompass.runners import LocalRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask)
),
)
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(
type=LocalRunner,
max_num_workers=256,
task=dict(type=OpenICLEvalTask)
),
)
work_dir = 'outputs/debug/OlympiadBench'
================================================
FILE: examples/eval_PMMEval.py
================================================
from mmengine.config import read_base
from opencompass.models import HuggingFacewithChatTemplate
with read_base():
    # from opencompass.configs.datasets.PMMEval.flores_gen import PMMEval_flores_datasets
    # from opencompass.configs.datasets.PMMEval.humanevalxl_gen import PMMEval_HumanEvalXL_datasets
    # from opencompass.configs.datasets.PMMEval.mgsm_gen import PMMEval_MGSM_datasets
    # from opencompass.configs.datasets.PMMEval.mhellaswag_gen import PMMEval_MHellaswag_datasets
    # from opencompass.configs.datasets.PMMEval.mifeval_gen import PMMEval_MIFEval_datasets
    # from opencompass.configs.datasets.PMMEval.mlogiqa_gen import PMMEval_MLogiQA_datasets
    # from opencompass.configs.datasets.PMMEval.mmmlu_gen import PMMEval_MMMLU_datasets
    # from opencompass.configs.datasets.PMMEval.xnli import PMMEval_XNLI_datasets
    from opencompass.configs.datasets.PMMEval.pmmeval_gen import \
        PMMEval_datasets
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import \
        models
    from opencompass.configs.summarizers.PMMEval import summarizer
# datasets = PMMEval_flores_datasets
# datasets = PMMEval_HumanEvalXL_datasets
# datasets = PMMEval_MGSM_datasets
# datasets = PMMEval_MHellaswag_datasets
# datasets = PMMEval_MIFEval_datasets
# datasets = PMMEval_MLogiQA_datasets
# datasets = PMMEval_MMMLU_datasets
# datasets = PMMEval_XNLI_datasets
datasets = PMMEval_datasets
================================================
FILE: examples/eval_ProcessBench.py
================================================
from mmengine.config import read_base
from opencompass.models import VLLMwithChatTemplate
with read_base():
    from opencompass.configs.datasets.ProcessBench.processbench_gen import processbench_datasets as processbench_datasets
    from opencompass.configs.models.qwen2_5.vllm_qwen2_5_7b_instruct import models as vllm_qwen2_5_7b_instruct_model
    from opencompass.configs.models.qwen2_5.vllm_qwen2_5_14b_instruct import models as vllm_qwen2_5_14b_instruct_model
    from opencompass.configs.models.qwen2_5.vllm_qwen2_5_32b_instruct import models as vllm_qwen2_5_32b_instruct_model
    from opencompass.configs.models.qwen2_5.vllm_qwen2_5_72b_instruct import models as vllm_qwen2_5_72b_instruct_model
    # from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import models as lmdeploy_qwen2_5_7b_instruct_model
    # from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import models as lmdeploy_qwen2_5_72b_instruct_model
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
from opencompass.runners import LocalRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask)
),
)
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(
type=LocalRunner,
max_num_workers=256,
task=dict(type=OpenICLEvalTask)
),
)
work_dir = 'outputs/ProcessBench'
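# The `locals()` scans above gather every imported list by name suffix
# (`_datasets`, `_model`). The sketch below shows the same idiom on a plain
# dict. `collect_by_suffix` and the sample namespace are hypothetical
# stand-ins for illustration and are not part of OpenCompass.

```python
def collect_by_suffix(namespace, suffix):
    """Concatenate every list in `namespace` whose key ends with `suffix`."""
    return sum((v for k, v in namespace.items() if k.endswith(suffix)), [])


# Hypothetical stand-in for the names that `read_base()` imports bring in.
namespace = {
    'gsm8k_datasets': [{'abbr': 'gsm8k'}],
    'math_datasets': [{'abbr': 'math'}],
    'demo_model': [{'abbr': 'demo-model'}],
    'LCBCodeGeneration_dataset': {'abbr': 'lcb'},  # singular suffix: not matched
}

datasets = collect_by_suffix(namespace, '_datasets')
models = collect_by_suffix(namespace, '_model')
print([d['abbr'] for d in datasets], [m['abbr'] for m in models])
```

# A name that does not end with the exact suffix is skipped, which is why
# some configs append such an entry to `datasets` explicitly.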
================================================
FILE: examples/eval_TheoremQA.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import \
        TheoremQA_datasets as datasets
    from opencompass.configs.models.hf_internlm.hf_internlm2_20b import \
        models as hf_internlm2_20b_model
    from opencompass.configs.models.hf_internlm.hf_internlm2_math_20b import \
        models as hf_internlm2_math_20b_model
    from opencompass.configs.models.mistral.hf_mistral_7b_v0_1 import \
        models as hf_mistral_7b_v0_1_model
    from opencompass.configs.models.mistral.hf_mistral_7b_v0_2 import \
        models as hf_mistral_7b_v0_2_model
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
work_dir = 'outputs/TheoremQA-5shot'
# dataset    version    metric    mode    mistral-7b-v0.1-hf    mistral-7b-v0.2-hf    internlm2-20b-hf    internlm2-math-20b-hf
# ---------  ---------  --------  ------  --------------------  --------------------  ------------------  -----------------------
# TheoremQA  6f0af8     score     gen     18.00                 16.75                 25.87               30.88
================================================
FILE: examples/eval_academic_leaderboard_202407.py
================================================
import os.path as osp
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Datasets Part
    ## Core Set
    # ## Examination
    # ## Reasoning
    from opencompass.configs.datasets.bbh.bbh_gen_4a31fa import bbh_datasets
    from opencompass.configs.datasets.cmmlu.cmmlu_0shot_cot_gen_305931 import \
        cmmlu_datasets
    from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
        gpqa_datasets
    # ## Coding
    from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import \
        humaneval_datasets
    # ## Instruction Following
    from opencompass.configs.datasets.IFEval.IFEval_gen_3321a3 import \
        ifeval_datasets
    # ## Math
    from opencompass.configs.datasets.math.math_0shot_gen_393424 import \
        math_datasets
    from opencompass.configs.datasets.mmlu.mmlu_openai_simple_evals_gen_b618ea import \
        mmlu_datasets
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
        mmlu_pro_datasets
    from opencompass.configs.summarizers.groups.bbh import bbh_summary_groups
    from opencompass.configs.summarizers.groups.cmmlu import \
        cmmlu_summary_groups
    # Summarizer
    from opencompass.configs.summarizers.groups.mmlu import mmlu_summary_groups
    from opencompass.configs.summarizers.groups.mmlu_pro import \
        mmlu_pro_summary_groups

    # Model List
    # from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import models as lmdeploy_qwen2_1_5b_instruct_model
    # from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import models as hf_internlm2_5_7b_chat_model
    # from opencompass.configs.models.openbmb.hf_minicpm_2b_sft_bf16 import models as hf_minicpm_2b_sft_bf16_model
    # from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import models as hf_yi_1_5_6b_chat_model
    # from opencompass.configs.models.gemma.hf_gemma_2b_it import models as hf_gemma_2b_it_model
    # from opencompass.configs.models.yi.hf_yi_1_5_34b_chat import models as hf_yi_1_5_34b_chat_model
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
# with read_base():
core_summary_groups = [
{
'name':
'core_average',
'subsets': [
['mmlu', 'accuracy'],
['mmlu_pro', 'accuracy'],
# ['cmmlu', 'naive_average'],
['cmmlu', 'accuracy'],
['bbh', 'score'],
['math', 'accuracy'],
['openai_humaneval', 'humaneval_pass@1'],
['GPQA_diamond', 'accuracy'],
['IFEval', 'Prompt-level-strict-accuracy'],
],
},
]
summarizer = dict(
dataset_abbrs=[
['core_average', 'naive_average'],
['mmlu', 'accuracy'],
['mmlu_pro', 'accuracy'],
['cmmlu', 'accuracy'],
['bbh', 'score'],
['math', 'accuracy'],
['openai_humaneval', 'humaneval_pass@1'],
['GPQA_diamond', 'accuracy'],
['IFEval', 'Prompt-level-strict-accuracy'],
'',
['mmlu', 'accuracy'],
['mmlu-stem', 'accuracy'],
['mmlu-social-science', 'accuracy'],
['mmlu-humanities', 'accuracy'],
['mmlu-other', 'accuracy'],
'',
['mmlu_pro', 'accuracy'],
['mmlu_pro_math', 'accuracy'],
['mmlu_pro_physics', 'accuracy'],
['mmlu_pro_chemistry', 'accuracy'],
['mmlu_pro_law', 'accuracy'],
['mmlu_pro_engineering', 'accuracy'],
['mmlu_pro_other', 'accuracy'],
['mmlu_pro_economics', 'accuracy'],
['mmlu_pro_health', 'accuracy'],
['mmlu_pro_psychology', 'accuracy'],
['mmlu_pro_business', 'accuracy'],
['mmlu_pro_biology', 'accuracy'],
['mmlu_pro_philosophy', 'accuracy'],
['mmlu_pro_computer_science', 'accuracy'],
['mmlu_pro_history', 'accuracy'],
'',
['cmmlu', 'accuracy'],
['cmmlu-stem', 'accuracy'],
['cmmlu-social-science', 'accuracy'],
['cmmlu-humanities', 'accuracy'],
['cmmlu-other', 'accuracy'],
['cmmlu-china-specific', 'accuracy'],
'',
['bbh', 'extract_rate'],
['math', 'extract_rate'],
# ['openai_humaneval', 'extract_rate'],
['GPQA_diamond', 'extract_rate'],
# ['IFEval', 'extract_rate'],
'',
['mmlu', 'extract_rate'],
['mmlu-stem', 'extract_rate'],
['mmlu-social-science', 'extract_rate'],
['mmlu-humanities', 'extract_rate'],
['mmlu-other', 'extract_rate'],
'',
['mmlu_pro', 'extract_rate'],
['mmlu_pro_math', 'extract_rate'],
['mmlu_pro_physics', 'extract_rate'],
['mmlu_pro_chemistry', 'extract_rate'],
['mmlu_pro_law', 'extract_rate'],
['mmlu_pro_engineering', 'extract_rate'],
['mmlu_pro_other', 'extract_rate'],
['mmlu_pro_economics', 'extract_rate'],
['mmlu_pro_health', 'extract_rate'],
['mmlu_pro_psychology', 'extract_rate'],
['mmlu_pro_business', 'extract_rate'],
['mmlu_pro_biology', 'extract_rate'],
['mmlu_pro_philosophy', 'extract_rate'],
['mmlu_pro_computer_science', 'extract_rate'],
['mmlu_pro_history', 'extract_rate'],
'',
['cmmlu', 'extract_rate'],
['cmmlu-stem', 'extract_rate'],
['cmmlu-social-science', 'extract_rate'],
['cmmlu-humanities', 'extract_rate'],
['cmmlu-other', 'extract_rate'],
['cmmlu-china-specific', 'extract_rate'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Local Runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask)),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
base_exp_dir = 'outputs/corebench_v1_9/'
work_dir = osp.join(base_exp_dir, 'chat_objective')
================================================
FILE: examples/eval_academic_leaderboard_202412.py
================================================
import os.path as osp
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner, VOLCRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Datasets Part
    # Knowledge
    # Math
    from opencompass.configs.datasets.aime2024.aime2024_gen_6e39a4 import \
        aime2024_datasets
    from opencompass.configs.datasets.bbh.bbh_0shot_nocot_gen_925fc4 import \
        bbh_datasets
    # General Reasoning
    from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
        gpqa_datasets
    from opencompass.configs.datasets.humaneval.humaneval_openai_sample_evals_gen_dcae0e import \
        humaneval_datasets
    # Instruction Following
    from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
        ifeval_datasets
    from opencompass.configs.datasets.livecodebench.livecodebench_gen_a4f90b import \
        LCBCodeGeneration_dataset
    from opencompass.configs.datasets.math.math_prm800k_500_0shot_cot_gen import \
        math_datasets
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
        mmlu_pro_datasets
    # Model List
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as hf_internlm2_5_7b_chat_model
    # Summary Groups
    from opencompass.configs.summarizers.groups.bbh import bbh_summary_groups
    from opencompass.configs.summarizers.groups.mmlu_pro import \
        mmlu_pro_summary_groups
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
# Only take LCB generation for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')),
[]) + [LCBCodeGeneration_dataset]
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
core_summary_groups = [
{
'name':
'core_average',
'subsets': [
['IFEval', 'Prompt-level-strict-accuracy'],
['bbh', 'naive_average'],
['math_prm800k_500', 'accuracy'],
['aime2024', 'accuracy'],
['GPQA_diamond', 'accuracy'],
['mmlu_pro', 'naive_average'],
['openai_humaneval', 'humaneval_pass@1'],
['lcb_code_generation', 'pass@1'],
],
},
]
summarizer = dict(
dataset_abbrs=[
['core_average', 'naive_average'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['bbh', 'naive_average'],
['GPQA_diamond', 'accuracy'],
'',
'Math Calculation',
['math_prm800k_500', 'accuracy'],
['aime2024', 'accuracy'],
'',
'Knowledge',
['mmlu_pro', 'naive_average'],
'',
'Code',
['openai_humaneval', 'humaneval_pass@1'],
['lcb_code_generation', 'pass@1'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Local Runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask),
),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
work_dir = './outputs/oc_academic_202412'
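# The `core_average` group above lists (dataset, metric) pairs and is reported
# under the `naive_average` metric. The sketch below assumes that
# `naive_average` means an unweighted mean over the listed subsets; that is a
# reading of the metric name, not something this file confirms. The helper
# name and the scores are hypothetical placeholders, not evaluation results.

```python
def naive_average(scores, subsets):
    """Unweighted mean of scores[dataset][metric] over the (dataset, metric) pairs."""
    values = [scores[dataset][metric] for dataset, metric in subsets]
    return sum(values) / len(values)


# Placeholder numbers chosen only to make the arithmetic easy to follow.
placeholder_scores = {
    'IFEval': {'Prompt-level-strict-accuracy': 80.0},
    'bbh': {'naive_average': 70.0},
}
subsets = [
    ['IFEval', 'Prompt-level-strict-accuracy'],
    ['bbh', 'naive_average'],
]

core_average = naive_average(placeholder_scores, subsets)
print(core_average)
```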
================================================
FILE: examples/eval_academic_leaderboard_202502.py
================================================
# flake8: noqa
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner, VOLCRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Datasets Part
    # Knowledge
    # Math
    from opencompass.configs.datasets.aime2024.aime2024_0shot_nocot_genericllmeval_academic_gen import \
        aime2024_datasets
    from opencompass.configs.datasets.bbh.bbh_0shot_nocot_academic_gen import \
        bbh_datasets
    # General Reasoning
    from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
        gpqa_datasets
    from opencompass.configs.datasets.humaneval.humaneval_openai_sample_evals_gen_dcae0e import \
        humaneval_datasets
    # Instruction Following
    from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
        ifeval_datasets
    from opencompass.configs.datasets.livecodebench.livecodebench_gen_a4f90b import \
        LCBCodeGeneration_dataset
    from opencompass.configs.datasets.math.math_prm800k_500_0shot_cot_gen import \
        math_datasets
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
        mmlu_pro_datasets
    # Model List
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as hf_internlm2_5_7b_chat_model
    # Summary Groups
    from opencompass.configs.summarizers.groups.bbh import bbh_summary_groups
    from opencompass.configs.summarizers.groups.mmlu_pro import \
        mmlu_pro_summary_groups
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
# Only take LCB generation for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')),
[]) + [LCBCodeGeneration_dataset]
# LLM judge config: using LLM to evaluate predictions
judge_cfg = dict()
for dataset in datasets:
    dataset['infer_cfg']['inferencer']['max_out_len'] = 32768
    if 'judge_cfg' in dataset['eval_cfg']['evaluator']:
        dataset['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
core_summary_groups = [
{
'name':
'core_average',
'subsets': [
['IFEval', 'Prompt-level-strict-accuracy'],
['bbh', 'naive_average'],
['math_prm800k_500', 'accuracy'],
['aime2024', 'accuracy'],
['GPQA_diamond', 'accuracy'],
['mmlu_pro', 'naive_average'],
['openai_humaneval', 'humaneval_pass@1'],
['lcb_code_generation', 'pass@1'],
],
},
]
summarizer = dict(
dataset_abbrs=[
['core_average', 'naive_average'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['bbh', 'naive_average'],
['GPQA_diamond', 'accuracy'],
'',
'Math Calculation',
['math_prm800k_500', 'accuracy'],
['aime2024', 'accuracy'],
'',
'Knowledge',
['mmlu_pro', 'naive_average'],
'',
'Code',
['openai_humaneval', 'humaneval_pass@1'],
['lcb_code_generation', 'pass@1'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Local Runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask),
),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
work_dir = './outputs/oc_academic_202502'
================================================
FILE: examples/eval_academic_leaderboard_REALTIME.py
================================================
# flake8: noqa
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Datasets
    from opencompass.configs.datasets.aime2025.aime2025_llmjudge_academic import \
        aime2025_datasets
    from opencompass.configs.datasets.gpqa.gpqa_cascade_eval_academic import \
        gpqa_datasets
    from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
        ifeval_datasets
    from opencompass.configs.datasets.livecodebench.livecodebench_v6_academic import \
        LCBCodeGeneration_dataset
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
        mmlu_pro_datasets
    from opencompass.configs.datasets.HLE.hle_llmverify_academic import \
        hle_datasets
    # Summary Groups
    from opencompass.configs.summarizers.groups.mmlu_pro import \
        mmlu_pro_summary_groups
    # Models (add your models here)
    # from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
    #     models as hf_internlm2_5_7b_chat_model
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
# Only take LCB generation for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')),
[]) + [LCBCodeGeneration_dataset]
# LLM judge config: using LLM to evaluate predictions
judge_cfg = dict()
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
    if 'llm_evaluator' in item['eval_cfg']['evaluator'].keys() and 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
        item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
core_summary_groups = [
{
'name':
'core_average',
'subsets': [
['IFEval', 'Prompt-level-strict-accuracy'],
['hle_llmjudge', 'accuracy'],
['aime2025_repeat_32', 'accuracy (32 runs average)'],
['GPQA_diamond_repeat_4', 'accuracy (4 runs average)'],
['mmlu_pro', 'naive_average'],
['lcb_code_generation_repeat_6', 'pass@1 (6 runs average)'],
],
},
]
summarizer = dict(
dataset_abbrs=[
['core_average', 'naive_average'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['hle_llmjudge', 'accuracy'],
['GPQA_diamond_repeat_4', 'accuracy (4 runs average)'],
'',
'Math Calculation',
['aime2025_repeat_32', 'accuracy (32 runs average)'],
'',
'Knowledge',
['mmlu_pro', 'naive_average'],
'',
'Code',
['lcb_code_generation_repeat_6', 'pass@1 (6 runs average)'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# infer with local runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask),
),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
work_dir = './outputs/oc_academic_202507'
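# The judge loop in PART 1 overwrites `judge_cfg` wherever an evaluator already
# declares one, either directly or under a nested `llm_evaluator`. In this file
# `judge_cfg` is an empty `dict()`, so a real judge model still has to be
# filled in. The sketch below replays that loop on plain dicts. The helper
# name, the sample datasets and the judge entry are hypothetical.

```python
def apply_judge_cfg(datasets, judge_cfg):
    """Set `judge_cfg` on every evaluator (or nested llm_evaluator) that declares one."""
    for item in datasets:
        evaluator = item['eval_cfg']['evaluator']
        if 'judge_cfg' in evaluator:
            evaluator['judge_cfg'] = judge_cfg
        nested = evaluator.get('llm_evaluator')
        if isinstance(nested, dict) and 'judge_cfg' in nested:
            nested['judge_cfg'] = judge_cfg
    return datasets


# Hypothetical judge entry; replace with a real model config in practice.
judge = dict(abbr='hypothetical-judge')

sample_datasets = [
    {'eval_cfg': {'evaluator': {'judge_cfg': {}}}},                     # direct judge
    {'eval_cfg': {'evaluator': {'llm_evaluator': {'judge_cfg': {}}}}},  # nested judge
    {'eval_cfg': {'evaluator': {'type': 'rule-based'}}},                # no judge: untouched
]

patched = apply_judge_cfg(sample_datasets, judge)
```

# Evaluators without a `judge_cfg` key are left alone, so rule-based datasets
# are unaffected by the patch.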
================================================
FILE: examples/eval_academic_telechat_thinking.py
================================================
# flake8: noqa
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Datasets
    from opencompass.configs.datasets.aime2025.aime2025_llmjudge_academic import \
        aime2025_datasets
    from opencompass.configs.datasets.gpqa.gpqa_cascade_eval_academic import \
        gpqa_datasets
    from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
        ifeval_datasets
    from opencompass.configs.datasets.livecodebench.livecodebench_v6_academic import \
        LCBCodeGeneration_dataset
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
        mmlu_pro_datasets
    from opencompass.configs.datasets.HLE.hle_llmverify_academic import \
        hle_datasets
    # Summary Groups
    from opencompass.configs.summarizers.groups.mmlu_pro import \
        mmlu_pro_summary_groups
    # Models (add your models here)
    from opencompass.configs.models.telechat.telechat_thinking_streaming_v1 import models as telechat_thinking_streaming_v1_model
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
# Only take LCB generation for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')),
[]) + [LCBCodeGeneration_dataset]
# LLM judge config: using LLM to evaluate predictions
judge_cfg = dict()
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
    if 'llm_evaluator' in item['eval_cfg']['evaluator'].keys() and 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
        item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
core_summary_groups = [
{
'name':
'core_average',
'subsets': [
['IFEval', 'Prompt-level-strict-accuracy'],
['hle_llmjudge', 'accuracy'],
['aime2025_repeat_32', 'accuracy (32 runs average)'],
['GPQA_diamond_repeat_4', 'accuracy (4 runs average)'],
['mmlu_pro', 'naive_average'],
['lcb_code_generation_repeat_6', 'pass@1 (6 runs average)'],
],
},
]
summarizer = dict(
dataset_abbrs=[
['core_average', 'naive_average'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['hle_llmjudge', 'accuracy'],
['GPQA_diamond_repeat_4', 'accuracy (4 runs average)'],
'',
'Math Calculation',
['aime2025_repeat_32', 'accuracy (32 runs average)'],
'',
'Knowledge',
['mmlu_pro', 'naive_average'],
'',
'Code',
['lcb_code_generation_repeat_6', 'pass@1 (6 runs average)'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Inference configuration using a local runner
# - Partitioner: NumWorkerPartitioner splits tasks across 16 workers
# - Runner: LocalRunner executes tasks locally, with a maximum of 4 concurrent workers
# - Task type: OpenICLInferTask
# - Each worker thread processes a batch of 8 samples
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=16),
runner=dict(
type=LocalRunner,
max_num_workers=4,
task=dict(type=OpenICLInferTask),
),
)
# Evaluation configuration using a local runner
# - Partitioner: NaivePartitioner groups up to 16 model-dataset pairs into each task (n=16)
# - Runner: LocalRunner executes tasks locally, with a maximum of 4 concurrent workers
# - Task type: OpenICLEvalTask
# - Each worker thread processes a batch of 8 samples
eval = dict(
partitioner=dict(type=NaivePartitioner, n=16),
runner=dict(type=LocalRunner,
max_num_workers=4,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
work_dir = './outputs/eval_TeleChat_thinking'
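# The infer block above uses two separate knobs: `num_worker=16` on the
# partitioner and `max_num_workers=4` on the runner. The sketch below is a
# simplified model of that split, assuming the first controls how many shards
# the work is cut into and the second caps how many run at once. It is not
# OpenCompass's actual partitioning algorithm; the round-robin split, the
# thread pool and both helper names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(items, num_worker):
    """Cut `items` into `num_worker` shards by taking every num_worker-th element."""
    return [items[i::num_worker] for i in range(num_worker)]


def run_shards(shards, max_num_workers, fn):
    """Apply `fn` to each shard with at most `max_num_workers` running concurrently."""
    with ThreadPoolExecutor(max_workers=max_num_workers) as pool:
        return list(pool.map(fn, shards))


samples = list(range(40))                       # stand-in for 40 evaluation samples
shards = split_round_robin(samples, num_worker=16)
sizes = run_shards(shards, max_num_workers=4, fn=len)
print(len(shards), sum(sizes))
```

# With 16 shards and a cap of 4, the shards are processed in waves rather than
# all at once, which is the practical effect of setting the two values apart.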
================================================
FILE: examples/eval_alaya.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.agieval.agieval_gen import \
        agieval_datasets
    from opencompass.configs.datasets.bbh.bbh_gen import bbh_datasets
    from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
    from opencompass.configs.datasets.cmmlu.cmmlu_gen import cmmlu_datasets
    from opencompass.configs.datasets.mmlu.mmlu_gen import mmlu_datasets
    from opencompass.configs.models.alaya.alaya import models
datasets = [
*bbh_datasets, *ceval_datasets, *cmmlu_datasets, *agieval_datasets,
*mmlu_datasets
]
================================================
FILE: examples/eval_api_demo.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.demo.demo_gsm8k_chat_gen import \
        gsm8k_datasets
    from opencompass.configs.datasets.demo.demo_math_chat_gen import \
        math_datasets
    from opencompass.configs.models.openai.gpt_4o_2024_05_13 import \
        models as gpt4
datasets = gsm8k_datasets + math_datasets
models = gpt4
================================================
FILE: examples/eval_attack.py
================================================
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLAttackTask
with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.promptbench.promptbench_wnli_gen_50662f import \
        wnli_datasets
    from opencompass.configs.models.qwen.hf_qwen2_1_5b import models
datasets = wnli_datasets
# Run the whole dataset in a single task, i.e. use `NaivePartitioner` only
# Use `OpenICLAttackTask` to run an attack experiment
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLAttackTask)),
)
attack = dict(
attack='textfooler',
query_budget=100,
prompt_topk=1,
)
================================================
FILE: examples/eval_babilong.py
================================================
from mmengine.config import read_base
with read_base():
    # Models
    # Datasets
    from opencompass.configs.datasets.babilong.babilong_0k_gen import \
        babiLong_0k_datasets
    from opencompass.configs.datasets.babilong.babilong_4k_gen import \
        babiLong_4k_datasets
    from opencompass.configs.datasets.babilong.babilong_16k_gen import \
        babiLong_16k_datasets
    from opencompass.configs.datasets.babilong.babilong_32k_gen import \
        babiLong_32k_datasets
    from opencompass.configs.datasets.babilong.babilong_128k_gen import \
        babiLong_128k_datasets
    from opencompass.configs.datasets.babilong.babilong_256k_gen import \
        babiLong_256k_datasets
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as lmdeploy_internlm2_5_7b_chat_model
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
        models as lmdeploy_llama3_1_8b_instruct_model
    from opencompass.configs.models.mistral.lmdeploy_ministral_8b_instruct_2410 import \
        models as lmdeploy_ministral_8b_instruct_2410_model
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import \
        models as lmdeploy_qwen2_5_7b_instruct_model
    from opencompass.configs.summarizers.groups.babilong import \
        babilong_summary_groups
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for model in models:
    model['engine_config']['session_len'] = 1024 * 1024
    model['max_seq_len'] = 1024 * 1024
    model['engine_config']['tp'] = 4
    model['run_cfg']['num_gpus'] = 4
summarizer = dict(
dataset_abbrs=[
'babilong_0k',
'babilong_4k',
'babilong_16k',
'babilong_32k',
'babilong_128k',
'babilong_256k',
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
work_dir = './outputs/babilong'
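# The `for model in models` loop above rewrites each imported model config in
# place so that every model runs with a 1024 * 1024 token session and 4-way
# tensor parallelism on 4 GPUs. The sketch below applies the same overrides to
# a plain dict. The helper name and the sample model entry are hypothetical.

```python
def extend_context(models, session_len, tensor_parallel):
    """Override context length and GPU settings on each model config, in place."""
    for model in models:
        model['engine_config']['session_len'] = session_len
        model['max_seq_len'] = session_len
        model['engine_config']['tp'] = tensor_parallel
        model['run_cfg']['num_gpus'] = tensor_parallel
    return models


# Hypothetical model entry with short-context, single-GPU starting values.
sample_models = [
    dict(
        abbr='hypothetical-model',
        engine_config=dict(session_len=8192, tp=1),
        max_seq_len=8192,
        run_cfg=dict(num_gpus=1),
    )
]

extend_context(sample_models, session_len=1024 * 1024, tensor_parallel=4)
print(sample_models[0]['engine_config'], sample_models[0]['run_cfg'])
```

# Keeping `tp` and `num_gpus` equal, as the loop above does, means the runner
# reserves as many GPUs as the engine shards the model across.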
================================================
FILE: examples/eval_base_demo.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.demo.demo_gsm8k_base_gen import \
        gsm8k_datasets
    from opencompass.configs.datasets.demo.demo_math_base_gen import \
        math_datasets
    from opencompass.configs.models.hf_internlm.hf_internlm2_1_8b import \
        models as hf_internlm2_1_8b_models
    from opencompass.configs.models.qwen.hf_qwen2_1_5b import \
        models as hf_qwen2_1_5b_models
datasets = gsm8k_datasets + math_datasets
models = hf_qwen2_1_5b_models + hf_internlm2_1_8b_models
================================================
FILE: examples/eval_bench_intern_s1.py
================================================
# flake8: noqa
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Datasets
    from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets
    from opencompass.configs.datasets.gpqa.gpqa_cascade_eval_gen_772ea0 import (
        gpqa_datasets,
    )
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_nocot_genericllmeval_gen_08c1de import (
        mmlu_pro_datasets,
    )
    from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import (
        ifeval_datasets,
    )
    from opencompass.configs.datasets.SmolInstruct.smolinstruct_0shot_instruct_gen import (
        smolinstruct_datasets_0shot_instruct as smolinstruct_datasets,
    )
    from opencompass.configs.datasets.ChemBench.ChemBench_llmjudge_gen_c584cf import (
        chembench_datasets,
    )
    from opencompass.configs.datasets.matbench.matbench_llm_judge_gen_0e9276 import (
        matbench_datasets,
    )
    from opencompass.configs.datasets.ProteinLMBench.ProteinLMBench_llmjudge_gen_a67965 import (
        proteinlmbench_datasets,
    )
    # Summary Groups
    from opencompass.configs.summarizers.groups.mmlu_pro import (
        mmlu_pro_summary_groups,
    )
    # Models
    from opencompass.configs.models.interns1.intern_s1 import \
        models as interns1_model
#######################################################################
# PART 1 Datasets List #
#######################################################################
# Datasets list for evaluation: collect every imported `*_datasets` list
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')),
[])
# LLM judge config: using LLM to evaluate predictions
judge_cfg = dict()
for item in datasets:
    item['infer_cfg']['inferencer']['max_out_len'] = 65536
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
    if 'llm_evaluator' in item['eval_cfg']['evaluator'].keys() and 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
        item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
)
summary_groups.extend(
[
{
'name': 'ChemBench',
'subsets': [
'ChemBench_Name_Conversion',
'ChemBench_Property_Prediction',
'ChemBench_Mol2caption',
'ChemBench_Caption2mol',
'ChemBench_Product_Prediction',
'ChemBench_Retrosynthesis',
'ChemBench_Yield_Prediction',
'ChemBench_Temperature_Prediction',
],
},
]
)
summarizer = dict(
dataset_abbrs=[
'Knowledge',
['mmlu_pro', 'accuracy'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['GPQA_diamond', 'accuracy'],
'',
'Math Calculation',
['aime2025', 'accuracy'],
'',
'Academic',
['ChemBench', 'naive_average'],
['ProteinLMBench', 'accuracy'],
'',
'SmolInstruct',
['NC-I2F-0shot-instruct', 'score'],
['NC-I2S-0shot-instruct', 'score'],
['NC-S2F-0shot-instruct', 'score'],
['NC-S2I-0shot-instruct', 'score'],
['PP-ESOL-0shot-instruct', 'score'],
['PP-Lipo-0shot-instruct', 'score'],
['PP-BBBP-0shot-instruct', 'accuracy'],
['PP-ClinTox-0shot-instruct', 'accuracy'],
['PP-HIV-0shot-instruct', 'accuracy'],
['PP-SIDER-0shot-instruct', 'accuracy'],
['MC-0shot-instruct', 'score'],
['MG-0shot-instruct', 'score'],
['FS-0shot-instruct', 'score'],
['RS-0shot-instruct', 'score'],
'',
['matbench_expt_gap', 'mae'],
['matbench_steels', 'mae'],
['matbench_expt_is_metal', 'accuracy'],
['matbench_glass', 'accuracy'],
'',
],
summary_groups=summary_groups,
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# infer with local runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask),
),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
work_dir = './outputs/oc_bench_intern_s1'
================================================
FILE: examples/eval_bluelm_32k_lveval.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.lveval.lveval import \
        LVEval_datasets as datasets
    from opencompass.configs.models.bluelm.hf_bluelm_7b_chat_32k import models
    from opencompass.configs.summarizers.lveval import summarizer
models[0]['path'] = '/path/to/your/huggingface_models/BlueLM-7B-Chat-32K'
models[0][
'tokenizer_path'] = '/path/to/your/huggingface_models/BlueLM-7B-Chat-32K'
models[0]['max_seq_len'] = 32768
models[0]['generation_kwargs'] = dict(do_sample=False)
models[0]['mode'] = 'mid' # truncate in the middle
================================================
FILE: examples/eval_cascade_evaluator.py
================================================
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import (
GenericLLMEvaluator,
CascadeEvaluator,
MATHVerifyEvaluator,
)
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import (
MATHDataset,
math_postprocess_v2,
normalize_final_answer,
)
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Models
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
        models as lmdeploy_qwen2_5_7b_instruct_model,
    )
reader_cfg = dict(input_columns=['problem'], output_column='solution')
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
########################## Evaluator #################################
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
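<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n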
Judging the correctness of candidates' answers:
""".strip()
llm_judge_evaluator = dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MATHDataset,
path='opencompass/math',
file_name='test_prm800k_500.json',
),
judge_cfg=dict(),
)
rule_evaluator = dict(type=MATHVerifyEvaluator)
cascade_evaluator = dict(
    type=CascadeEvaluator,
    llm_evaluator=llm_judge_evaluator,
    rule_evaluator=rule_evaluator,
    parallel=False,
)
########################## #################################
eval_cfg = dict()
# eval_cfg['evaluator'] = rule_evaluator
# eval_cfg['evaluator'] = llm_judge_evaluator
eval_cfg['evaluator'] = cascade_evaluator
math_datasets = [
dict(
abbr='math_prm800k_500',
type=MATHDataset,
path='opencompass/math',
file_name='test_prm800k_500.json',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
]
datasets = math_datasets
models = lmdeploy_qwen2_5_7b_instruct_model
work_dir = 'math_prm800k_500_cascade_evaluator'
================================================
FILE: examples/eval_charm_mem.py
================================================
from mmengine.config import read_base
from opencompass.models import OpenAI
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import CharmMemSummarizer
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
with read_base():
    from opencompass.configs.datasets.CHARM.charm_memory_gen_bbbd53 import \
        charm_memory_datasets as datasets
# ------>>>>>> https://arxiv.org/abs/2403.14112
# from opencompass.configs.models.openai.gpt_3_5_turbo_1106 import models as gpt_3_5_turbo_1106_model
# from opencompass.configs.models.openai.gpt_4_1106_preview import models as gpt_4_1106_preview_model
# from opencompass.configs.models.hf_llama.hf_llama2_7b_chat import models as llama2_7b_chat_model
# from opencompass.configs.models.hf_llama.hf_llama2_13b_chat import models as llama2_13b_chat_model
# from opencompass.configs.models.hf_llama.hf_llama2_70b_chat import models as llama2_70b_chat_model
# from opencompass.configs.models.vicuna.hf_vicuna_7b_v15_16k import models as vicuna_7b_v15_16k_model
# from opencompass.configs.models.vicuna.hf_vicuna_13b_v15_16k import models as vicuna_13b_v15_16k_model
# from opencompass.configs.models.chatglm.hf_chatglm3_6b_32k import models as chatglm3_6b_32k_model
# from opencompass.configs.models.baichuan.hf_baichuan2_7b_chat import models as baichuan2_7b_chat_model # need torch 2.1
# from opencompass.configs.models.baichuan.hf_baichuan2_13b_chat import models as baichuan2_13b_chat_model # need torch 2.1
# from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import models as hf_internlm2_chat_7b_model
# from opencompass.configs.models.hf_internlm.hf_internlm2_chat_20b import models as hf_internlm2_chat_20b_model
# from opencompass.configs.models.yi.hf_yi_6b_chat import models as yi_6b_chat_model
# from opencompass.configs.models.yi.hf_yi_34b_chat import models as yi_34b_chat_model
# from opencompass.configs.models.deepseek.hf_deepseek_7b_chat import models as deepseek_7b_chat_model
# from opencompass.configs.models.deepseek.hf_deepseek_67b_chat import models as deepseek_67b_chat_model
# from opencompass.configs.models.qwen.hf_qwen_7b_chat import models as qwen_7b_chat_model
# from opencompass.configs.models.qwen.hf_qwen_14b_chat import models as qwen_14b_chat_model
# from opencompass.configs.models.qwen.hf_qwen_72b_chat import models as qwen_72b_chat_model
# <<<<<<------ https://arxiv.org/abs/2403.14112
# from opencompass.configs.models.openai.gpt_3_5_turbo_0125 import models as gpt_3_5_turbo_0125_model
# from opencompass.configs.models.openai.gpt_4o_2024_05_13 import models as gpt_4o_2024_05_13_model
# from opencompass.configs.models.gemini.gemini_1_5_flash import models as gemini_1_5_flash_model
# from opencompass.configs.models.gemini.gemini_1_5_pro import models as gemini_1_5_pro_model
# from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import models as lmdeploy_llama3_8b_instruct_model
# from opencompass.configs.models.hf_llama.lmdeploy_llama3_70b_instruct import models as lmdeploy_llama3_70b_instruct_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b import models as lmdeploy_internlm2_chat_1_8b_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import models as lmdeploy_internlm2_chat_7b_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_20b import models as lmdeploy_internlm2_chat_20b_model
# from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import models as yi_1_5_6b_chat_model
# from opencompass.configs.models.yi.hf_yi_1_5_34b_chat import models as yi_1_5_34b_chat_model
# from opencompass.configs.models.deepseek.hf_deepseek_v2_chat import models as deepseek_v2_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_1_8b_chat import models as qwen1_5_1_8b_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_7b_chat import models as qwen1_5_7b_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_14b_chat import models as qwen1_5_14b_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_72b_chat import models as qwen1_5_72b_chat_model
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
## ------------- JudgeLLM Configuration
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
judge_models = [
dict(
abbr='GPT-3.5-turbo-0125',
type=OpenAI,
path='gpt-3.5-turbo-0125',
key='ENV',
meta_template=api_meta_template,
query_per_second=16,
max_out_len=2048,
max_seq_len=2048,
batch_size=8,
temperature=0,
)
]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(
type=SubjectiveSizePartitioner,
max_task_size=1000,
mode='singlescore',
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=2,
task=dict(type=SubjectiveEvalTask)),
)
summarizer = dict(type=CharmMemSummarizer)
work_dir = './outputs/CHARM_mem/chat/'
================================================
FILE: examples/eval_charm_rea.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.CHARM.charm_reason_gen_f8fca2 import \
        charm_reason_datasets as datasets
# ------>>>>>> https://arxiv.org/abs/2403.14112
# from opencompass.configs.models.openai.gpt_3_5_turbo_1106 import models as gpt_3_5_turbo_1106_model
# from opencompass.configs.models.openai.gpt_4_1106_preview import models as gpt_4_1106_preview_model
# from opencompass.configs.models.hf_llama.hf_llama2_7b_chat import models as llama2_7b_chat_model
# from opencompass.configs.models.hf_llama.hf_llama2_13b_chat import models as llama2_13b_chat_model
# from opencompass.configs.models.hf_llama.hf_llama2_70b_chat import models as llama2_70b_chat_model
# from opencompass.configs.models.vicuna.hf_vicuna_7b_v15_16k import models as vicuna_7b_v15_16k_model
# from opencompass.configs.models.vicuna.hf_vicuna_13b_v15_16k import models as vicuna_13b_v15_16k_model
# from opencompass.configs.models.chatglm.hf_chatglm3_6b_32k import models as chatglm3_6b_32k_model
# from opencompass.configs.models.baichuan.hf_baichuan2_7b_chat import models as baichuan2_7b_chat_model # need torch 2.1
# from opencompass.configs.models.baichuan.hf_baichuan2_13b_chat import models as baichuan2_13b_chat_model # need torch 2.1
# from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import models as hf_internlm2_chat_7b_model
# from opencompass.configs.models.hf_internlm.hf_internlm2_chat_20b import models as hf_internlm2_chat_20b_model
# from opencompass.configs.models.yi.hf_yi_6b_chat import models as yi_6b_chat_model
# from opencompass.configs.models.yi.hf_yi_34b_chat import models as yi_34b_chat_model
# from opencompass.configs.models.deepseek.hf_deepseek_7b_chat import models as deepseek_7b_chat_model
# from opencompass.configs.models.deepseek.hf_deepseek_67b_chat import models as deepseek_67b_chat_model
# from opencompass.configs.models.qwen.hf_qwen_7b_chat import models as qwen_7b_chat_model
# from opencompass.configs.models.qwen.hf_qwen_14b_chat import models as qwen_14b_chat_model
# from opencompass.configs.models.qwen.hf_qwen_72b_chat import models as qwen_72b_chat_model
# <<<<<<------ https://arxiv.org/abs/2403.14112
# from opencompass.configs.models.openai.gpt_3_5_turbo_0125 import models as gpt_3_5_turbo_0125_model
# from opencompass.configs.models.openai.gpt_4o_2024_05_13 import models as gpt_4o_2024_05_13_model
# from opencompass.configs.models.gemini.gemini_1_5_flash import models as gemini_1_5_flash_model
# from opencompass.configs.models.gemini.gemini_1_5_pro import models as gemini_1_5_pro_model
# from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import models as lmdeploy_llama3_8b_instruct_model
# from opencompass.configs.models.hf_llama.lmdeploy_llama3_70b_instruct import models as lmdeploy_llama3_70b_instruct_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b import models as lmdeploy_internlm2_chat_1_8b_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import models as lmdeploy_internlm2_chat_7b_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_20b import models as lmdeploy_internlm2_chat_20b_model
# from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import models as yi_1_5_6b_chat_model
# from opencompass.configs.models.yi.hf_yi_1_5_34b_chat import models as yi_1_5_34b_chat_model
# from opencompass.configs.models.deepseek.hf_deepseek_v2_chat import models as deepseek_v2_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_1_8b_chat import models as qwen1_5_1_8b_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_7b_chat import models as qwen1_5_7b_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_14b_chat import models as qwen1_5_14b_chat_model
# from opencompass.configs.models.qwen.hf_qwen1_5_72b_chat import models as qwen1_5_72b_chat_model
    from opencompass.configs.summarizers.charm_reason import summarizer
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
work_dir = './outputs/CHARM_rea/chat/'
# dataset version metric mode internlm2-chat-7b-turbomind
# ------------------------------------------------------------- --------- ------------- ------ -----------------------------
# charm-reason-Direct - naive_average gen 49.51
# charm-reason-ZH-CoT - naive_average gen 61.33
# charm-reason-EN-CoT - naive_average gen 54.55
# charm-reason-XLT - naive_average gen 58.46
# charm-reason-Translate-EN - naive_average gen 56.15
# - - - -
# charm-reason-Chinese_Direct - naive_average gen 47.14
# charm-reason-Chinese_ZH-CoT - naive_average gen 58.40
# charm-reason-Chinese_EN-CoT - naive_average gen 48.31
# charm-reason-Chinese_XLT - naive_average gen 53.57
# charm-reason-Chinese_Translate-EN - naive_average gen 48.21
# charm-reason-Global_Direct - naive_average gen 51.88
# charm-reason-Global_ZH-CoT - naive_average gen 64.26
# charm-reason-Global_EN-CoT - naive_average gen 60.79
# charm-reason-Global_XLT - naive_average gen 63.36
# charm-reason-Global_Translate-EN - naive_average gen 64.10
================================================
FILE: examples/eval_chat_agent.py
================================================
from lagent import ReAct
from lagent.agents.react import ReActProtocol
from mmengine.config import read_base
from opencompass.lagent.actions.python_interpreter import PythonInterpreter
from opencompass.models.lagent import LagentAgent
from opencompass.models.openai_api import OpenAI
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    from opencompass.configs.datasets.gsm8k.gsm8k_agent_gen_be1606 import \
        gsm8k_datasets
    from opencompass.configs.datasets.math.math_agent_gen_af2293 import \
        math_datasets
    from opencompass.configs.datasets.MathBench.mathbench_agent_gen_568903 import \
        mathbench_agent_datasets
    from opencompass.configs.summarizers.math_agent import summarizer
datasets = []
datasets += gsm8k_datasets
datasets += math_datasets
datasets += mathbench_agent_datasets
system_prompt = """You are a helpful assistant which use tools to solve mathematical reasoning questions. The code must be a function, and the function name must be 'solution'. For mathematics, please use code tool to calculate. The example format is as follows:
```
def solution():
    variable_names_with_real_meaning = func(variable)
    return variable_names_with_real_meaning
```"""
protocol = dict(
type=ReActProtocol,
action=dict(role='ACTION', begin='Tool:', end='\n'),
action_input=dict(role='ARGS', begin='Tool Input:', end='\n'),
finish=dict(role='FINISH', begin='FinalAnswer:', end='\n'),
call_protocol=system_prompt,
)
models = [
dict(
abbr='gpt-3.5-react',
type=LagentAgent,
agent_type=ReAct,
max_turn=3,
llm=dict(
type=OpenAI,
path='gpt-3.5-turbo',
key='ENV',
query_per_second=1,
max_seq_len=4096,
),
actions=[
dict(type=PythonInterpreter),
],
protocol=protocol,
batch_size=1,
),
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=1000),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_chat_agent_baseline.py
================================================
from mmengine.config import read_base
from opencompass.models.openai_api import OpenAI
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_d6de81 import \
        gsm8k_datasets
    from opencompass.configs.datasets.math.math_gen_1ed9c2 import math_datasets
    from opencompass.configs.datasets.MathBench.mathbench_gen import \
        mathbench_datasets
    from opencompass.configs.summarizers.math_baseline import summarizer
datasets = []
datasets += gsm8k_datasets
datasets += math_datasets
datasets += mathbench_datasets
models = [
dict(
abbr='gpt-3.5-react',
type=OpenAI,
path='gpt-3.5-turbo',
key='ENV',
query_per_second=1,
max_seq_len=4096,
batch_size=1,
),
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=1000),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_chat_demo.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.demo.demo_gsm8k_chat_gen import \
        gsm8k_datasets
    from opencompass.configs.datasets.demo.demo_math_chat_gen import \
        math_datasets
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_1_8b import \
        models as hf_internlm2_chat_1_8b_models
    from opencompass.configs.models.qwen.hf_qwen2_1_5b_instruct import \
        models as hf_qwen2_1_5b_instruct_models
datasets = gsm8k_datasets + math_datasets
models = hf_qwen2_1_5b_instruct_models + hf_internlm2_chat_1_8b_models
================================================
FILE: examples/eval_chat_last.py
================================================
from mmengine.config import read_base
from opencompass.models.openai_api import OpenAI
from opencompass.openicl import ChatInferencer
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
        gsm8k_datasets as datasets
models = [
dict(
abbr='gpt-3.5',
type=OpenAI,
path='gpt-3.5-turbo',
key='ENV',
max_out_len=100,
max_seq_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]
for dataset in datasets:
    # Use ChatInferencer instead of GenInferencer
    dataset['infer_cfg']['inferencer'] = dict(type=ChatInferencer)
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=1000),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_chatml_datasets.py
================================================
# flake8: noqa
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
    # Models (add your models here)
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as hf_internlm2_5_7b_chat_model
    # Datasets
    from opencompass.configs.chatml_datasets.MaScQA.MaScQA_gen import datasets as MaScQA_chatml
    from opencompass.configs.chatml_datasets.CPsyExam.CPsyExam_gen import datasets as CPsyExam_chatml
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
chatml_datasets = sum(
(v for k, v in locals().items() if k.endswith('_chatml')),
[],
)
# Your Judge Model Configs Here
judge_cfg = dict()
for dataset in chatml_datasets:
    if dataset['evaluator']['type'] == 'llm_evaluator':
        dataset['evaluator']['judge_cfg'] = judge_cfg
    if dataset['evaluator']['type'] == 'cascade_evaluator':
        dataset['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
eval = dict(
partitioner=dict(type=NaivePartitioner, n=8),
runner=dict(
type=LocalRunner, task=dict(type=OpenICLEvalTask), max_num_workers=32
),
)
work_dir = 'outputs/ChatML_Datasets'
================================================
FILE: examples/eval_chembench.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.ChemBench.ChemBench_gen import \
        chembench_datasets
    from opencompass.configs.models.mistral.hf_mistral_7b_instruct_v0_2 import \
        models
datasets = [*chembench_datasets]
models = [*models]
'''
dataset version metric mode mistral-7b-instruct-v0.2-hf
-------------------------------- --------- -------- ------ -----------------------------
ChemBench_Name_Conversion d4e6a1 accuracy gen 45.43
ChemBench_Property_Prediction d4e6a1 accuracy gen 47.11
ChemBench_Mol2caption d4e6a1 accuracy gen 64.21
ChemBench_Caption2mol d4e6a1 accuracy gen 35.38
ChemBench_Product_Prediction d4e6a1 accuracy gen 38.67
ChemBench_Retrosynthesis d4e6a1 accuracy gen 27
ChemBench_Yield_Prediction d4e6a1 accuracy gen 27
ChemBench_Temperature_Prediction d4e6a1 accuracy gen 26.73
ChemBench_Solvent_Prediction d4e6a1 accuracy gen 32.67
'''
================================================
FILE: examples/eval_chinese_simpleqa.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.chinese_simpleqa.chinese_simpleqa_gen import csimpleqa_datasets
from opencompass.models import HuggingFacewithChatTemplate
from opencompass.models.openai_api import OpenAI
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import DefaultSubjectiveSummarizer
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
# -------------Inference Stage ----------------------------------------
models = [
dict(
type=HuggingFacewithChatTemplate,
abbr='Qwen2.5-1.5B-Instruct',
path='Qwen/Qwen2.5-1.5B-Instruct',
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
generation_kwargs=dict(do_sample=True, ),
max_out_len=200,
max_seq_len=4096,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])
summarizer = dict(type=DefaultSubjectiveSummarizer)
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
api_meta_template = dict(round=[
dict(role='SYSTEM', api_role='SYSTEM'),
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
judge_models = [
dict(
# GPT4o
abbr='gpt-4o-0513-global',
type=OpenAI,
# gpt-4o
path='gpt-4o-0513-global',
key='xxx', # provide OPENAI_API_KEY
meta_template=api_meta_template,
query_per_second=16,
max_out_len=1000,
batch_size=8,
retry=3)
]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
work_dir = 'outputs/chinese_simpleqa/'
================================================
FILE: examples/eval_cibench.py
================================================
from copy import deepcopy
from lagent import ReAct
from lagent.agents.react import ReActProtocol
from mmengine.config import read_base
from opencompass.lagent.actions.ipython_interpreter import IPythonInterpreter
from opencompass.lagent.actions.python_interpreter import PythonInterpreter
from opencompass.lagent.agents.react import CIReAct
from opencompass.models import HuggingFaceCausalLM
from opencompass.models.lagent import CodeAgent, LagentAgent
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    # Note: HF models may run into CUDA out-of-memory (OOM) errors here
    from opencompass.configs.datasets.CIBench.CIBench_generation_gen_8ab0dc import \
        cibench_datasets as cibench_datasets_generation
    from opencompass.configs.datasets.CIBench.CIBench_template_gen_e6b12a import \
        cibench_datasets as cibench_datasets_template
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
        models as lmdeploy_llama3_8b_instruct_model
    from opencompass.configs.summarizers.cibench import summarizer
# Oracle mode for analysis
# from opencompass.configs.datasets.CIBench.CIBench_template_oracle_gen_fecda1 import cibench_datasets as cibench_datasets_template_oracle
# from opencompass.configs.datasets.CIBench.CIBench_generation_oracle_gen_c4a7c1 import cibench_datasets as cibench_datasets_generation_oracle
datasets = []
datasets += cibench_datasets_template
datasets += cibench_datasets_generation
# datasets += cibench_datasets_template_oracle
# datasets += cibench_datasets_generation_oracle
_origin_models = sum([v for k, v in locals().items() if k.endswith('_model')],
[])
FORCE_STOP_PROMPT_EN = """You should directly give results based on history information."""
FEWSHOT_INSTRUCTION = """\
You are an assistant who can utilize external tools.
{tool_description}
To use a tool, please response with the following format:
```
{thought} Think what you need to solve, do you need to use tools?
{action} The tool name, should be one of [{action_names}].
{action_input} The input to the tool that you want to use.
```
The tool will give you response after your response using the following format:
```
{response} the results after call the tool.
```
Therefore DO NOT generate tool response by yourself.
Also please follow the guidelines:
1. Always use code interpreter to solve the problem.
2. The generated codes should always in a markdown code block format.
3. The generated codes will be executed in an ipython manner and the results will be cached.
4. Your responded code should always be simple and only solves the problem in current step.
For example:
File url: `xxxx`
### Step 1. Load the dataset from the url into a pandas DataFrame named `df`.
{thought} We should use `pandas` to solve this step.
{action} IPythonInterpreter
{action_input} ```python
import pandas as pd
url = "xxxx"
data = pd.read_csv(url)
```
{response} The code is succeed without any outputs.
Let us begin from here!
"""
IPYTHON_INTERPRETER_DESCRIPTION = '''\
It can run Python code in a manner as jupyter notebook. The code must be a valid code that contains only python method.'''
actions = [
dict(type=IPythonInterpreter,
user_data_dir='./data/cibench_dataset/datasources',
description=IPYTHON_INTERPRETER_DESCRIPTION)
]
protocol = dict(
type=ReActProtocol,
call_protocol=FEWSHOT_INSTRUCTION,
force_stop=FORCE_STOP_PROMPT_EN,
finish=dict(role='FINISH', begin='Final Answer:', end='\n'),
)
work_dir = './outputs/cibench/'
_agent_models = []
for m in _origin_models:
    m = deepcopy(m)
    if 'meta_template' in m and 'round' in m['meta_template']:
        round = m['meta_template']['round']
        if all(r['role'].upper() != 'SYSTEM'
               for r in round):  # no system round
            if not any('api_role' in r for r in round):
                m['meta_template']['round'].append(
                    dict(role='system', begin='System response:', end='\n'))
            else:
                m['meta_template']['round'].append(
                    dict(role='system', api_role='SYSTEM'))
            print(
                f'WARNING: adding SYSTEM round in meta_template for {m.get("abbr", None)}'
            )
    _agent_models.append(m)
protocol = dict(
type=ReActProtocol,
call_protocol=FEWSHOT_INSTRUCTION,
force_stop=FORCE_STOP_PROMPT_EN,
finish=dict(role='FINISH', begin='Final Answer:', end='\n'),
)
models = []
for m in _agent_models:
    m = deepcopy(m)
    origin_abbr = m.pop('abbr')
    abbr = origin_abbr
    m.pop('batch_size', None)
    m.pop('max_out_len', None)
    m.pop('max_seq_len', None)
    run_cfg = m.pop('run_cfg', {})
    agent_model = dict(
        abbr=abbr,
        summarizer_abbr=origin_abbr,
        type=CodeAgent,
        agent_type=CIReAct,
        max_turn=3,
        llm=m,
        actions=[
            dict(type=IPythonInterpreter,
                 user_data_dir='./data/cibench_dataset/datasources',
                 description=IPYTHON_INTERPRETER_DESCRIPTION)
        ],
        protocol=protocol,
        batch_size=1,
        run_cfg=run_cfg,
    )
    models.append(agent_model)
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=4,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_cibench_api.py
================================================
from lagent.agents.react import ReActProtocol
from mmengine.config import read_base
from opencompass.lagent.actions.ipython_interpreter import IPythonInterpreter
from opencompass.lagent.agents.react import CIReAct
from opencompass.models import OpenAI
from opencompass.models.lagent import CodeAgent
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    from opencompass.configs.datasets.CIBench.CIBench_generation_gen_8ab0dc import \
        cibench_datasets as cibench_datasets_generation
    from opencompass.configs.datasets.CIBench.CIBench_template_gen_e6b12a import \
        cibench_datasets as cibench_datasets_template
    # Oracle mode for analysis
    # from opencompass.configs.datasets.CIBench.CIBench_template_oracle_gen_fecda1 import cibench_datasets as cibench_datasets_template_oracle
    # from opencompass.configs.datasets.CIBench.CIBench_generation_oracle_gen_c4a7c1 import cibench_datasets as cibench_datasets_generation_oracle
    from opencompass.configs.summarizers.cibench import summarizer
datasets = []
datasets += cibench_datasets_template
datasets += cibench_datasets_generation
# datasets += cibench_datasets_template_oracle
# datasets += cibench_datasets_generation_oracle
FORCE_STOP_PROMPT_EN = """You should directly give results based on history information."""
FEWSHOT_INSTRUCTION = """\
You are an assistant who can utilize external tools.
{tool_description}
To use a tool, please response with the following format:
```
{thought} Think what you need to solve, do you need to use tools?
{action} The tool name, should be one of [{action_names}].
{action_input} The input to the tool that you want to use.
```
The tool will give you response after your response using the following format:
```
{response} the results after call the tool.
```
Therefore DO NOT generate tool response by yourself.
Also please follow the guidelines:
1. Always use code interpreter to solve the problem.
2. The generated codes should always in a markdown code block format.
3. The generated codes will be executed in an ipython manner and the results will be cached.
4. Your responded code should always be simple and only solves the problem in current step.
For example:
File url: `xxxx`
### Step 1. Load the dataset from the url into a pandas DataFrame named `df`.
{thought} We should use `pandas` to solve this step.
{action} IPythonInterpreter
{action_input} ```python
import pandas as pd
url = "xxxx"
data = pd.read_csv(url)
```
{response} The code is succeed without any outputs.
Let us begin from here!
"""
IPYTHON_INTERPRETER_DESCRIPTION = '''\
It can run Python code in a manner as jupyter notebook. The code must be a valid code that contains only python method.'''
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
dict(role='SYSTEM', api_role='SYSTEM'),
], )
actions = [
dict(type=IPythonInterpreter,
user_data_dir='./data/cibench_dataset/datasources',
description=IPYTHON_INTERPRETER_DESCRIPTION)
]
protocol = dict(
type=ReActProtocol,
call_protocol=FEWSHOT_INSTRUCTION,
force_stop=FORCE_STOP_PROMPT_EN,
finish=dict(role='FINISH', begin='Final Answer:', end='\n'),
)
work_dir = 'outputs/cibench/'
models = [
dict(
abbr='gpt-4o',
type=CodeAgent,
agent_type=CIReAct,
max_turn=3,
llm=dict(
type=OpenAI,
path='gpt-4o',
rpm_verbose=True,
retry=99,
meta_template=api_meta_template,
query_per_second=1,
max_seq_len=2048,
temperature=0,
),
actions=actions,
protocol=protocol,
batch_size=1,
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=4,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_circular.py
================================================
from mmengine.config import read_base
from opencompass.datasets.circular import (
CircularARCDataset, CircularCEvalDataset, CircularCMMLUDataset,
CircularCSQADataset, CircularEvaluator, CircularHSWAGDataset,
CircularMMLUDataset, CircularOBQADataset, CircularRaceDataset)
from opencompass.summarizers import CircularSummarizer
with read_base():
    from opencompass.configs.datasets.ARC_c.ARC_c_gen_1e0de5 import \
        ARC_c_datasets
    from opencompass.configs.datasets.ARC_e.ARC_e_gen_1e0de5 import \
        ARC_e_datasets
    from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import \
        ceval_datasets
    from opencompass.configs.datasets.cmmlu.cmmlu_gen_c13365 import \
        cmmlu_datasets
    from opencompass.configs.datasets.commonsenseqa.commonsenseqa_gen_1da2d0 import \
        commonsenseqa_datasets
    from opencompass.configs.datasets.hellaswag.hellaswag_gen_6faab5 import \
        hellaswag_datasets
    from opencompass.configs.datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from opencompass.configs.datasets.obqa.obqa_gen_9069e4 import obqa_datasets
    from opencompass.configs.datasets.race.race_gen_69ee4f import race_datasets
    from opencompass.configs.models.hf_internlm.hf_internlm_chat_7b import \
        models as hf_internlm_chat_7b_model
    from opencompass.configs.models.hf_internlm.hf_internlm_chat_20b import \
        models as hf_internlm_chat_20b_model
    from opencompass.configs.models.qwen.hf_qwen_7b_chat import \
        models as hf_qwen_7b_chat_model
    from opencompass.configs.models.qwen.hf_qwen_14b_chat import \
        models as hf_qwen_14b_chat_model
    from opencompass.configs.summarizers.groups.ceval import \
        ceval_summary_groups
    from opencompass.configs.summarizers.groups.cmmlu import \
        cmmlu_summary_groups
    from opencompass.configs.summarizers.groups.mmlu import mmlu_summary_groups
for ds, t in [
    (ceval_datasets, CircularCEvalDataset),
    (mmlu_datasets, CircularMMLUDataset),
    (cmmlu_datasets, CircularCMMLUDataset),
    (hellaswag_datasets, CircularHSWAGDataset),
    (ARC_e_datasets, CircularARCDataset),
    (ARC_c_datasets, CircularARCDataset),
    (commonsenseqa_datasets, CircularCSQADataset),
    (obqa_datasets, CircularOBQADataset),
    (race_datasets, CircularRaceDataset),
]:
    for d in ds:
        d['type'] = t
        d['abbr'] = d['abbr'] + '-circular-4'
        d['eval_cfg']['evaluator'] = {
            'type': CircularEvaluator,
            'circular_pattern': 'circular'
        }
        d['circular_patterns'] = 'circular'
datasets = sum([
v
for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'
], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
# config summarizer
other_summary_groups = [
{
'name':
'average',
'subsets': [
'ceval', 'mmlu', 'cmmlu', 'hellaswag', 'ARC-e', 'ARC-c',
'commonsense_qa', 'openbookqa_fact', 'race-middle', 'race-high'
]
},
]
origin_summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], [])
new_summary_groups = []
for item in origin_summary_groups:
    new_summary_groups.append({
        'name':
        item['name'] + '-circular-4',
        'subsets': [i + '-circular-4' for i in item['subsets']],
    })
summarizer = dict(
type=CircularSummarizer,
metric_types=['acc_origin', 'perf_circular'],
dataset_abbrs=[
'average-circular-4',
'ceval-circular-4',
'mmlu-circular-4',
'cmmlu-circular-4',
'hellaswag-circular-4',
'ARC-e-circular-4',
'ARC-c-circular-4',
'commonsense_qa-circular-4',
'openbookqa_fact-circular-4',
'race-middle-circular-4',
'race-high-circular-4',
'ceval-humanities-circular-4',
'ceval-stem-circular-4',
'ceval-social-science-circular-4',
'ceval-other-circular-4',
'mmlu-humanities-circular-4',
'mmlu-stem-circular-4',
'mmlu-social-science-circular-4',
'mmlu-other-circular-4',
'cmmlu-humanities-circular-4',
'cmmlu-stem-circular-4',
'cmmlu-social-science-circular-4',
'cmmlu-other-circular-4',
'cmmlu-china-specific-circular-4',
],
summary_groups=new_summary_groups,
)
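
The config above switches each multiple-choice dataset to its circular variant and reports `acc_origin` and `perf_circular`. As a rough illustration of the idea, the sketch below rotates the options of one question and counts it as solved only if every rotation is answered correctly. The helper names `circular_variants` and `is_perfect` are hypothetical and are not part of OpenCompass; the actual `CircularEvaluator` may differ in its details.

```python
from typing import List, Sequence, Tuple


def circular_variants(options: Sequence[str],
                      answer_idx: int) -> List[Tuple[List[str], int]]:
    """Return every cyclic shift of the options with the remapped answer index."""
    n = len(options)
    variants = []
    for shift in range(n):
        # Move the first `shift` options to the end of the list.
        rotated = list(options[shift:]) + list(options[:shift])
        # The correct option moves `shift` positions to the left (mod n).
        variants.append((rotated, (answer_idx - shift) % n))
    return variants


def is_perfect(predictions: Sequence[int],
               variants: Sequence[Tuple[List[str], int]]) -> bool:
    """True only if the predicted index is correct for every rotation."""
    return all(pred == ans for pred, (_, ans) in zip(predictions, variants))


if __name__ == '__main__':
    variants = circular_variants(['w', 'x', 'y', 'z'], answer_idx=1)
    for rotated, idx in variants:
        print(rotated, '-> correct option:', rotated[idx])
```

With four options this yields four prompts per question, which is consistent with the `-circular-4` suffix used in the dataset abbreviations above.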
================================================
FILE: examples/eval_claude.py
================================================
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.collections.chat_medium import datasets
    from opencompass.configs.models.claude.claude import models
    # and output the results in a chosen format
    from opencompass.configs.summarizers.medium import summarizer
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_code_passk.py
================================================
# This config is used for pass@k evaluation with `num_return_sequences`.
# Use it when the model can generate multiple responses for a single input.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    from opencompass.configs.datasets.humaneval.humaneval_passk_gen_8e312c import \
        humaneval_datasets
    from opencompass.configs.datasets.mbpp.deprecated_mbpp_passk_gen_1e1056 import \
        mbpp_datasets
    from opencompass.configs.datasets.mbpp.deprecated_sanitized_mbpp_passk_gen_1e1056 import \
        sanitized_mbpp_datasets
datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets
datasets += sanitized_mbpp_datasets
models = [
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-7b-Python',
path='codellama/CodeLlama-7b-Python-hf',
tokenizer_path='codellama/CodeLlama-7b-Python-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
generation_kwargs=dict(
num_return_sequences=10,
do_sample=True,
top_p=0.95,
temperature=0.8,
),
run_cfg=dict(num_gpus=1, num_procs=1),
),
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=300),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
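
The config above samples `num_return_sequences=10` completions per problem. A common way to turn n samples with c passing ones into a pass@k score is the unbiased estimator 1 - C(n - c, k) / C(n, k). The sketch below shows that formula only; the function name `pass_at_k` is hypothetical, and this is not necessarily how the OpenCompass evaluators compute the metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the tests.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError('c must satisfy 0 <= c <= n')
    if not 1 <= k <= n:
        raise ValueError('k must satisfy 1 <= k <= n')
    if n - c < k:
        # Fewer than k failing samples: every subset of size k has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == '__main__':
    # 10 samples for one problem, 3 of them pass the unit tests.
    for k in (1, 5, 10):
        print(f'pass@{k} = {pass_at_k(10, 3, k):.4f}')
```

The per-problem values are then typically averaged over the whole dataset to obtain the reported score.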
================================================
FILE: examples/eval_code_passk_repeat_dataset.py
================================================
# This config is used for pass@k evaluation with dataset repetition.
# Use it when the model cannot generate multiple responses for a single input.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    from opencompass.configs.datasets.humaneval.humaneval_repeat10_gen_8e312c import \
        humaneval_datasets
    from opencompass.configs.datasets.mbpp.deprecated_mbpp_repeat10_gen_1e1056 import \
        mbpp_datasets
    from opencompass.configs.datasets.mbpp.deprecated_sanitized_mbpp_repeat10_gen_1e1056 import \
        sanitized_mbpp_datasets
datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets
datasets += sanitized_mbpp_datasets
_meta_template = dict(round=[
dict(role='HUMAN', begin='<|User|>:', end='\n'),
dict(role='BOT', begin='<|Bot|>:', end='\n', generate=True),
], )
models = [
dict(
abbr='internlm-chat-7b-hf-v11',
type=HuggingFaceCausalLM,
path='internlm/internlm-chat-7b-v1_1',
tokenizer_path='internlm/internlm-chat-7b-v1_1',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
use_fast=False,
trust_remote_code=True,
),
max_seq_len=2048,
meta_template=_meta_template,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
generation_kwargs=dict(
do_sample=True,
top_p=0.95,
temperature=0.8,
),
run_cfg=dict(num_gpus=1, num_procs=1),
batch_size=8,
)
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=600),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_codeagent.py
================================================
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM, OpenAI
from opencompass.models.lagent import CodeAgent
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
    from opencompass.configs.datasets.gsm8k.gsm8k_gen_57b0b1 import \
        gsm8k_datasets
    from opencompass.configs.datasets.math.math_gen_943d32 import math_datasets
datasets = []
datasets += gsm8k_datasets
datasets += math_datasets
models = [
dict(abbr='gpt-3.5-react',
type=CodeAgent,
llm=dict(
type=OpenAI,
path='gpt-3.5-turbo',
key='ENV',
query_per_second=1,
max_seq_len=4096,
),
batch_size=8),
dict(abbr='WizardCoder-Python-13B-V1.0-react',
type=CodeAgent,
llm=dict(
type=HuggingFaceCausalLM,
path='WizardLM/WizardCoder-Python-13B-V1.0',
tokenizer_path='WizardLM/WizardCoder-Python-13B-V1.0',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_seq_len=2048,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
),
batch_size=8,
run_cfg=dict(num_gpus=2, num_procs=1)),
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=40000),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_codebench_full.py
================================================
# This config is used to test all the code benchmarks
from mmengine.config import read_base
import os.path as osp
from opencompass.runners import LocalRunner, VOLCRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
with read_base():
    # Datasets Part
    # bigcodebench
    from opencompass.configs.datasets.bigcodebench.bigcodebench_full_instruct_gen import (
        bigcodebench_full_instruct_datasets
    )
    from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_instruct_gen import (
        bigcodebench_hard_instruct_datasets
    )
    # livecodebench code generation lite v5
    from opencompass.configs.datasets.livecodebench.livecodebench_time_split_gen_a4f90b import (
        LCB_datasets
    )
    # humaneval series
    from opencompass.configs.datasets.humaneval.humaneval_openai_sample_evals_gen_dcae0e import (
        humaneval_datasets
    )
    from opencompass.configs.datasets.humaneval_pro.humaneval_pro_gen import (
        humanevalpro_datasets
    )
    from opencompass.configs.datasets.humanevalx.humanevalx_gen_620cfa import (
        humanevalx_datasets
    )
    from opencompass.configs.datasets.humaneval_plus.humaneval_plus_gen import (
        humaneval_plus_datasets
    )
    # mbpp series
    from opencompass.configs.datasets.mbpp.mbpp_gen import (
        mbpp_datasets
    )
    from opencompass.configs.datasets.mbpp_pro.mbpp_pro_gen import (
        mbpppro_datasets
    )
    # multipl-e
    from opencompass.configs.datasets.multipl_e.multiple_gen import (
        multiple_datasets
    )
    # ds1000
    from opencompass.configs.datasets.ds1000.ds1000_service_eval_gen_cbc84f import (
        ds1000_datasets
    )
    # Models Part
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
        models as lmdeploy_qwen2_5_7b_instruct_model,
    )
    # Summary Groups
    from opencompass.configs.summarizers.groups.ds1000 import (
        ds1000_summary_groups,
    )
    from opencompass.configs.summarizers.groups.multipl_e import (
        multiple_summary_groups,
    )
    from opencompass.configs.summarizers.groups.humanevalx import (
        humanevalx_summary_groups,
    )
# models config
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for model in models:
    model['max_seq_len'] = 16384
    model['max_out_len'] = 8192
# datasets config
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
for item in humanevalx_datasets:
    item['eval_cfg']['evaluator'][
        'ip_address'
    ] = 'codeeval.opencompass.org.cn/humanevalx'
    item['eval_cfg']['evaluator']['port'] = ''
for item in ds1000_datasets:
    item['eval_cfg']['evaluator'][
        'ip_address'
    ] = 'codeeval.opencompass.org.cn/ds1000'
    item['eval_cfg']['evaluator']['port'] = ''
for dataset in datasets:
    dataset['infer_cfg']['inferencer']['max_out_len'] = 8192
# summary
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
)
summary_groups.append(
{'name': 'humanevalx',
'subsets': ['humanevalx-python', 'humanevalx-cpp', 'humanevalx-java', 'humanevalx-js']}
)
summarizer = dict(
dataset_abbrs=[
['bigcodebench_hard_instruct', 'pass@1'],
['bigcodebench_full_instruct', 'pass@1'],
['lcb_code_generation', 'pass@1'],
['openai_humaneval', 'humaneval_pass@1'],
['mbpp', 'score'],
['humaneval_pro', 'pass@1'],
['mbpp_pro', 'pass@1'],
['humaneval_plus', 'humaneval_plus_pass@1'],
['multiple', 'naive_average'],
['humanevalx', 'naive_average'],
['ds1000', 'naive_average'],
'',
'humanevalx-python',
'humanevalx-cpp',
'humanevalx-java',
'humanevalx-js',
'',
'ds1000_Pandas',
'ds1000_Numpy',
'ds1000_Tensorflow',
'ds1000_Scipy',
'ds1000_Sklearn',
'ds1000_Pytorch',
'ds1000_Matplotlib',
'',
'humaneval-multiple-cpp',
'humaneval-multiple-cs',
'humaneval-multiple-go',
'humaneval-multiple-java',
'humaneval-multiple-rb',
'humaneval-multiple-js',
'humaneval-multiple-php',
'humaneval-multiple-r',
'humaneval-multiple-rs',
'humaneval-multiple-sh',
'',
'mbpp-multiple-cpp',
'mbpp-multiple-cs',
'mbpp-multiple-go',
'mbpp-multiple-java',
'mbpp-multiple-rb',
'mbpp-multiple-js',
'mbpp-multiple-php',
'mbpp-multiple-r',
'mbpp-multiple-rs',
'mbpp-multiple-sh'
],
summary_groups=summary_groups,
)
work_dir = 'outputs/code'
================================================
FILE: examples/eval_codegeex2.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.humanevalx.humanevalx_gen import \
        humanevalx_datasets
    from opencompass.configs.models.codegeex2.hf_codegeex2_6b import models
datasets = humanevalx_datasets
================================================
FILE: examples/eval_compassarena_subjectivebench.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.subjective.compass_arena_subjective_bench.singleturn.pairwise_judge import compassarena_subjectivebench_singleturn_datasets
    from opencompass.configs.datasets.subjective.compass_arena_subjective_bench.multiturn.pairwise_judge import compassarena_subjectivebench_multiturn_datasets
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import models as lmdeploy_internlm2_5_7b_chat
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import models as lmdeploy_internlm2_5_20b_chat
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import models as lmdeploy_llama3_1_8b_instruct
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_70b_instruct import models as lmdeploy_llama3_1_70b_instruct
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_0_5b_instruct import models as lmdeploy_qwen2_5_0_5b_instruct
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import models as lmdeploy_qwen2_5_1_5b_instruct
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_3b_instruct import models as lmdeploy_qwen2_5_3b_instruct
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import models as lmdeploy_qwen2_5_7b_instruct
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import models as lmdeploy_qwen2_5_14b_instruct
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import models as lmdeploy_qwen2_5_32b_instruct
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import models as lmdeploy_qwen2_5_72b_instruct
    from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import models as lmdeploy_qwen2_7b_instruct
from opencompass.models import (HuggingFace, HuggingFaceCausalLM,
HuggingFaceChatGLM3, OpenAI,
TurboMindModelwithChatTemplate)
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_num_worker import \
SubjectiveNumWorkerPartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.summarizers import DefaultSubjectiveSummarizer
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
# -------------Inference Stage ----------------------------------------
# For subjective evaluation, we often enable sampling (do_sample) for models
# models = [
# dict(
# type=TurboMindModelwithChatTemplate,
# abbr='CompassJudger-1-7B-Instruct',
# path='opencompass/CompassJudger-1-7B-Instruct',
# engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
# gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
# max_seq_len=16384,
# max_out_len=2048,
# batch_size=16,
# run_cfg=dict(num_gpus=1),
# )
# ]
models = [
*lmdeploy_qwen2_5_14b_instruct, *lmdeploy_qwen2_5_32b_instruct,
*lmdeploy_qwen2_5_7b_instruct, *lmdeploy_qwen2_7b_instruct
]
datasets = [
*compassarena_subjectivebench_singleturn_datasets,
*compassarena_subjectivebench_multiturn_datasets
] # add datasets you want
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
judge_models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='CompassJudger-1-32B-Instruct',
path='opencompass/CompassJudger-1-32B-Instruct',
engine_config=dict(session_len=16384, max_batch_size=16, tp=4),
gen_config=dict(top_k=1,
temperature=1e-6,
top_p=0.9,
max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=4),
)
]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
summarizer = dict(type=DefaultSubjectiveSummarizer, )
work_dir = 'outputs/subjective/'
================================================
FILE: examples/eval_compassarena_subjectivebench_bradleyterry.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.subjective.compass_arena_subjective_bench.singleturn.pairwise_bt_judge import (
        compassarena_subjectivebench_bradleyterry_singleturn_datasets, )
    from opencompass.configs.datasets.subjective.compass_arena_subjective_bench.multiturn.pairwise_bt_judge import (
        compassarena_subjectivebench_bradleyterry_multiturn_datasets, )
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import (
        models as lmdeploy_internlm2_5_7b_chat, )
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import (
        models as lmdeploy_internlm2_5_20b_chat, )
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import (
        models as lmdeploy_llama3_1_8b_instruct, )
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_70b_instruct import (
        models as lmdeploy_llama3_1_70b_instruct, )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_0_5b_instruct import (
        models as lmdeploy_qwen2_5_0_5b_instruct, )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import (
        models as lmdeploy_qwen2_5_1_5b_instruct, )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_3b_instruct import (
        models as lmdeploy_qwen2_5_3b_instruct, )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
        models as lmdeploy_qwen2_5_7b_instruct, )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
        models as lmdeploy_qwen2_5_14b_instruct, )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import (
        models as lmdeploy_qwen2_5_32b_instruct, )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import (
        models as lmdeploy_qwen2_5_72b_instruct, )
    from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import (
        models as lmdeploy_qwen2_7b_instruct, )
from opencompass.models import (HuggingFace, HuggingFaceCausalLM,
HuggingFaceChatGLM3, OpenAI,
TurboMindModelwithChatTemplate)
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_num_worker import \
SubjectiveNumWorkerPartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.summarizers import CompassArenaBradleyTerrySummarizer
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
# -------------Inference Stage ----------------------------------------
models = [
*lmdeploy_qwen2_5_14b_instruct,
*lmdeploy_qwen2_5_32b_instruct,
*lmdeploy_qwen2_5_7b_instruct,
*lmdeploy_qwen2_7b_instruct,
]
datasets = [
*compassarena_subjectivebench_bradleyterry_singleturn_datasets,
*compassarena_subjectivebench_bradleyterry_multiturn_datasets,
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
judge_models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='CompassJudger-1-32B-Instruct',
path='opencompass/CompassJudger-1-32B-Instruct',
engine_config=dict(session_len=16384, max_batch_size=16, tp=4),
gen_config=dict(top_k=1,
temperature=1e-6,
top_p=0.9,
max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=4),
)
]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
## ------------- Summary Configuration
# This step fits a Bradley-Terry model (statistical model) with an option
# to include style features and control variables based on groups
# (group variables must be available in the input dataset for each observation).
summarizer = dict(
type=CompassArenaBradleyTerrySummarizer,
rating_system='bradleyterry',
report_pred_win_rates=True,
num_bootstrap=100,
num_cpu=None,
with_control_vars=True,
normalize_style_features=False,
odds_ratio=True,
groups=['difficulty', 'category'],
)
work_dir = 'outputs/compassarena_subjectivebench_bradleyterry/'
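
The summary step above fits a Bradley-Terry model to pairwise judgments. For intuition only, the sketch below fits the plain Bradley-Terry model (no style features, control variables, or bootstrapping) from a matrix of win counts using the classic iterative minorization-maximization update. The function name `fit_bradley_terry` and the input layout are assumptions made for this example; `CompassArenaBradleyTerrySummarizer` may be implemented quite differently.

```python
from typing import List, Sequence


def fit_bradley_terry(wins: Sequence[Sequence[float]],
                      num_iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths from a square win-count matrix.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Returns strengths p such that P(i beats j) ~= p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(num_iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = 0.0
            for j in range(n):
                games = wins[i][j] + wins[j][i]
                if j != i and games > 0:
                    denom += games / (p[i] + p[j])
            # Keep the old value if model i took part in no comparisons.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Rescale so that the strengths average to 1 (the scale is arbitrary).
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    return p


if __name__ == '__main__':
    # Hypothetical win counts for three models; rows beat columns.
    wins = [[0, 8, 9],
            [2, 0, 7],
            [1, 3, 0]]
    strengths = fit_bradley_terry(wins)
    print([round(s, 3) for s in strengths])
```

A predicted win rate between two models then follows directly from the fitted strengths as `p[i] / (p[i] + p[j])`.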
================================================
FILE: examples/eval_contamination.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.ARC_c.ARC_c_clean_ppl import \
        ARC_c_datasets
    from opencompass.configs.datasets.ceval.ceval_clean_ppl import \
        ceval_datasets
    from opencompass.configs.datasets.hellaswag.hellaswag_clean_ppl import \
        hellaswag_datasets
    from opencompass.configs.datasets.mmlu.mmlu_clean_ppl import mmlu_datasets
    from opencompass.configs.models.hf_llama.hf_llama2_7b import \
        models as hf_llama2_7b_model
    from opencompass.configs.models.qwen.hf_qwen_7b import \
        models as hf_qwen_7b_model
    from opencompass.configs.models.yi.hf_yi_6b import models as hf_yi_6b_model
    from opencompass.configs.summarizers.contamination import summarizer
datasets = [
*ceval_datasets, *mmlu_datasets, *hellaswag_datasets, *ARC_c_datasets
]
models = [*hf_yi_6b_model, *hf_qwen_7b_model, *hf_llama2_7b_model]
================================================
FILE: examples/eval_corebench_2409_base_objective.py
================================================
import os.path as osp
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
# Datasets Part
## Core Set
# ## Examination
# ## Reasoning
from opencompass.configs.datasets.bbh.bbh_gen_98fba6 import bbh_datasets
from opencompass.configs.datasets.cmmlu.cmmlu_ppl_041cbf import \
cmmlu_datasets
from opencompass.configs.datasets.drop.drop_gen_a2697c import drop_datasets
# ## Scientific
from opencompass.configs.datasets.gpqa.gpqa_few_shot_ppl_2c9cd6 import \
gpqa_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import \
gsm8k_datasets
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_ppl_59c85e import \
hellaswag_datasets
# ## Coding
from opencompass.configs.datasets.humaneval.deprecated_humaneval_gen_d2537e import \
humaneval_datasets
# ## Math
from opencompass.configs.datasets.math.math_4shot_base_gen_43d5b6 import \
math_datasets
from opencompass.configs.datasets.MathBench.mathbench_2024_few_shot_mixed_4a3fd4 import \
mathbench_datasets
from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_742f0c import \
sanitized_mbpp_datasets
from opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d import mmlu_datasets
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_few_shot_gen_bfaf90 import \
mmlu_pro_datasets
# Model List
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b import \
models as lmdeploy_qwen2_5_1_5b_model
from opencompass.configs.summarizers.groups.bbh import bbh_summary_groups
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups
from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
mathbench_2024_summary_groups
# TODO: Add LiveCodeBench
# ## Instruction Following
# from opencompass.configs.datasets.IFEval.IFEval_gen_3321a3 import ifeval_datasets
# Summarizer
from opencompass.configs.summarizers.groups.mmlu import mmlu_summary_groups
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups
# from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import models as lmdeploy_qwen2_1_5b_instruct_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import models as hf_internlm2_5_7b_chat_model
# from opencompass.configs.models.openbmb.hf_minicpm_2b_sft_bf16 import models as hf_minicpm_2b_sft_bf16_model
# from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import models as hf_yi_1_5_6b_chat_model
# from opencompass.configs.models.gemma.hf_gemma_2b_it import models as hf_gemma_2b_it_model
# from opencompass.configs.models.yi.hf_yi_1_5_34b_chat import models as hf_yi_1_5_34b_chat_model
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
# with read_base():
core_summary_groups = [
{
'name':
'core_average',
'subsets': [['mmlu', 'accuracy'], ['mmlu_pro', 'accuracy'],
['cmmlu', 'accuracy'], ['bbh', 'naive_average'],
['hellaswag', 'accuracy'], ['drop', 'accuracy'],
['math', 'accuracy'], ['gsm8k', 'accuracy'],
['mathbench-t (average)', 'naive_average'],
['GPQA_diamond', 'accuracy'],
['openai_humaneval', 'humaneval_pass@1'],
['IFEval', 'Prompt-level-strict-accuracy'],
['sanitized_mbpp', 'score']],
},
]
summarizer = dict(
dataset_abbrs=[
['mmlu', 'accuracy'],
['mmlu_pro', 'accuracy'],
['cmmlu', 'accuracy'],
['bbh', 'naive_average'],
['hellaswag', 'accuracy'],
['drop', 'accuracy'],
['math', 'accuracy'],
['gsm8k', 'accuracy'],
['mathbench-t (average)', 'naive_average'],
['GPQA_diamond', 'accuracy'],
['openai_humaneval', 'humaneval_pass@1'],
['IFEval', 'Prompt-level-strict-accuracy'],
['sanitized_mbpp', 'score'],
'mathbench-a (average)',
'mathbench-t (average)',
'',
['mmlu', 'accuracy'],
['mmlu-stem', 'accuracy'],
['mmlu-social-science', 'accuracy'],
['mmlu-humanities', 'accuracy'],
['mmlu-other', 'accuracy'],
'',
['mmlu_pro', 'accuracy'],
['mmlu_pro_math', 'accuracy'],
['mmlu_pro_physics', 'accuracy'],
['mmlu_pro_chemistry', 'accuracy'],
['mmlu_pro_law', 'accuracy'],
['mmlu_pro_engineering', 'accuracy'],
['mmlu_pro_other', 'accuracy'],
['mmlu_pro_economics', 'accuracy'],
['mmlu_pro_health', 'accuracy'],
['mmlu_pro_psychology', 'accuracy'],
['mmlu_pro_business', 'accuracy'],
['mmlu_pro_biology', 'accuracy'],
['mmlu_pro_philosophy', 'accuracy'],
['mmlu_pro_computer_science', 'accuracy'],
['mmlu_pro_history', 'accuracy'],
'',
['cmmlu', 'accuracy'],
['cmmlu-stem', 'accuracy'],
['cmmlu-social-science', 'accuracy'],
['cmmlu-humanities', 'accuracy'],
['cmmlu-other', 'accuracy'],
['cmmlu-china-specific', 'accuracy'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Local Runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask)),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
base_exp_dir = 'outputs/corebench_2409_objective/'
work_dir = osp.join(base_exp_dir, 'base_objective')
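# Editorial sketch, not part of the original config. The config above builds its
# `datasets`, `models` and `summary_groups` lists by scanning `locals()` for names
# with a given suffix and concatenating the matching lists with `sum(..., [])`.
# The snippet below reproduces that pattern on a plain dictionary so it can be run
# without OpenCompass installed. `collect_by_suffix` and the sample namespace are
# hypothetical names used only for illustration.

```python
def collect_by_suffix(namespace: dict, suffix: str) -> list:
    """Concatenate every list in `namespace` whose key ends with `suffix`."""
    merged = []
    for name, value in namespace.items():
        if name.endswith(suffix):
            merged.extend(value)
    return merged


# Stand-in for the variables that the `read_base()` imports would define.
namespace = {
    'foo_datasets': [{'abbr': 'foo'}],
    'bar_datasets': [{'abbr': 'bar-1'}, {'abbr': 'bar-2'}],
    'foo_model': [{'abbr': 'not-a-dataset'}],
}

collected = collect_by_suffix(namespace, '_datasets')

# The one-liner used in the config gives the same result.
one_liner = sum(
    (v for k, v in namespace.items() if k.endswith('_datasets')), [])
```

# Because the match is purely name-based, any variable ending in `_datasets`
# (including one created by mistake) ends up in the evaluation list.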
================================================
FILE: examples/eval_corebench_2409_chat_objective.py
================================================
import os.path as osp
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
# Datasets Part
## Core Set
# ## Examination
# ## Reasoning
from opencompass.configs.datasets.bbh.bbh_gen_4a31fa import bbh_datasets
from opencompass.configs.datasets.cmmlu.cmmlu_0shot_cot_gen_305931 import \
cmmlu_datasets
from opencompass.configs.datasets.drop.drop_openai_simple_evals_gen_3857b0 import \
drop_datasets
# ## Scientific
from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
gpqa_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_a58960 import \
gsm8k_datasets
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import \
hellaswag_datasets
# ## Coding
from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import \
humaneval_datasets
# TODO: Add LiveCodeBench
# ## Instruction Following
from opencompass.configs.datasets.IFEval.IFEval_gen_3321a3 import \
ifeval_datasets
# ## Math
from opencompass.configs.datasets.math.math_0shot_gen_393424 import \
math_datasets
from opencompass.configs.datasets.MathBench.mathbench_2024_gen_50a320 import \
mathbench_datasets
from opencompass.configs.datasets.mbpp.sanitized_mbpp_mdblock_gen_a447ff import \
sanitized_mbpp_datasets
from opencompass.configs.datasets.mmlu.mmlu_openai_simple_evals_gen_b618ea import \
mmlu_datasets
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
mmlu_pro_datasets
from opencompass.configs.summarizers.groups.bbh import bbh_summary_groups
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups
# Summarizer
from opencompass.configs.summarizers.groups.mmlu import mmlu_summary_groups
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups
# Model List
# from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import models as lmdeploy_qwen2_1_5b_instruct_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import models as hf_internlm2_5_7b_chat_model
# from opencompass.configs.models.openbmb.hf_minicpm_2b_sft_bf16 import models as hf_minicpm_2b_sft_bf16_model
# from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import models as hf_yi_1_5_6b_chat_model
# from opencompass.configs.models.gemma.hf_gemma_2b_it import models as hf_gemma_2b_it_model
# from opencompass.configs.models.yi.hf_yi_1_5_34b_chat import models as hf_yi_1_5_34b_chat_model
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
# with read_base():
core_summary_groups = [
{
'name':
'core_average',
'subsets': [['mmlu', 'accuracy'], ['mmlu_pro', 'accuracy'],
['cmmlu', 'accuracy'], ['bbh', 'score'],
['math', 'accuracy'],
['openai_humaneval', 'humaneval_pass@1'],
['GPQA_diamond', 'accuracy'],
['IFEval', 'Prompt-level-strict-accuracy'],
['drop', 'accuracy'], ['sanitized_mbpp', 'score'],
['gsm8k', 'accuracy'], ['hellaswag', 'accuracy'],
['mathbench-t (average)', 'naive_average']],
},
]
summarizer = dict(
dataset_abbrs=[
['core_average', 'naive_average'],
['mmlu', 'accuracy'],
['mmlu_pro', 'accuracy'],
['cmmlu', 'accuracy'],
['bbh', 'score'],
['math', 'accuracy'],
['openai_humaneval', 'humaneval_pass@1'],
['GPQA_diamond', 'accuracy'],
['IFEval', 'Prompt-level-strict-accuracy'],
['drop', 'accuracy'],
['sanitized_mbpp', 'score'],
['gsm8k', 'accuracy'],
['hellaswag', 'accuracy'],
'mathbench-a (average)',
'mathbench-t (average)',
'',
['mmlu', 'accuracy'],
['mmlu-stem', 'accuracy'],
['mmlu-social-science', 'accuracy'],
['mmlu-humanities', 'accuracy'],
['mmlu-other', 'accuracy'],
'',
['mmlu_pro', 'accuracy'],
['mmlu_pro_math', 'accuracy'],
['mmlu_pro_physics', 'accuracy'],
['mmlu_pro_chemistry', 'accuracy'],
['mmlu_pro_law', 'accuracy'],
['mmlu_pro_engineering', 'accuracy'],
['mmlu_pro_other', 'accuracy'],
['mmlu_pro_economics', 'accuracy'],
['mmlu_pro_health', 'accuracy'],
['mmlu_pro_psychology', 'accuracy'],
['mmlu_pro_business', 'accuracy'],
['mmlu_pro_biology', 'accuracy'],
['mmlu_pro_philosophy', 'accuracy'],
['mmlu_pro_computer_science', 'accuracy'],
['mmlu_pro_history', 'accuracy'],
'',
['cmmlu', 'accuracy'],
['cmmlu-stem', 'accuracy'],
['cmmlu-social-science', 'accuracy'],
['cmmlu-humanities', 'accuracy'],
['cmmlu-other', 'accuracy'],
['cmmlu-china-specific', 'accuracy'],
'',
['bbh', 'extract_rate'],
['math', 'extract_rate'],
# ['openai_humaneval', 'extract_rate'],
['GPQA_diamond', 'extract_rate'],
# ['IFEval', 'extract_rate'],
'',
['mmlu', 'extract_rate'],
['mmlu-stem', 'extract_rate'],
['mmlu-social-science', 'extract_rate'],
['mmlu-humanities', 'extract_rate'],
['mmlu-other', 'extract_rate'],
'',
['mmlu_pro', 'extract_rate'],
['mmlu_pro_math', 'extract_rate'],
['mmlu_pro_physics', 'extract_rate'],
['mmlu_pro_chemistry', 'extract_rate'],
['mmlu_pro_law', 'extract_rate'],
['mmlu_pro_engineering', 'extract_rate'],
['mmlu_pro_other', 'extract_rate'],
['mmlu_pro_economics', 'extract_rate'],
['mmlu_pro_health', 'extract_rate'],
['mmlu_pro_psychology', 'extract_rate'],
['mmlu_pro_business', 'extract_rate'],
['mmlu_pro_biology', 'extract_rate'],
['mmlu_pro_philosophy', 'extract_rate'],
['mmlu_pro_computer_science', 'extract_rate'],
['mmlu_pro_history', 'extract_rate'],
'',
['cmmlu', 'extract_rate'],
['cmmlu-stem', 'extract_rate'],
['cmmlu-social-science', 'extract_rate'],
['cmmlu-humanities', 'extract_rate'],
['cmmlu-other', 'extract_rate'],
['cmmlu-china-specific', 'extract_rate'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Local Runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask)),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
base_exp_dir = 'outputs/corebench_2409_objective/'
work_dir = osp.join(base_exp_dir, 'chat_objective')
================================================
FILE: examples/eval_corebench_2409_longcontext.py
================================================
import os.path as osp
from copy import deepcopy
from mmengine.config import read_base
from opencompass.models import (HuggingFacewithChatTemplate,
TurboMindModelwithChatTemplate)
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import DLCRunner, LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
from opencompass.configs.datasets.longbench.longbench import \
longbench_datasets
from opencompass.configs.datasets.needlebench.needlebench_8k.needlebench_8k import \
needlebench_datasets as needlebench_8k_datasets
from opencompass.configs.datasets.needlebench.needlebench_32k.needlebench_32k import \
needlebench_datasets as needlebench_32k_datasets
from opencompass.configs.datasets.needlebench.needlebench_128k.needlebench_128k import \
needlebench_datasets as needlebench_128k_datasets
from opencompass.configs.datasets.ruler.ruler_8k_gen import \
ruler_datasets as ruler_8k_datasets
from opencompass.configs.datasets.ruler.ruler_32k_gen import \
ruler_datasets as ruler_32k_datasets
from opencompass.configs.datasets.ruler.ruler_128k_gen import \
ruler_datasets as ruler_128k_datasets
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat_1m import \
models as lmdeploy_internlm2_5_7b_1m_chat_model
from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
models as llama3_1_8b_instruct_model
# Instruct models
from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import \
models as lmdeploy_qwen2_7b_instruct_model
# Summary Groups
from opencompass.configs.summarizers.groups.longbench import \
longbench_summary_groups
from opencompass.configs.summarizers.groups.ruler import \
ruler_summary_groups
from opencompass.configs.summarizers.needlebench import (
needlebench_8k_summarizer, needlebench_32k_summarizer,
needlebench_128k_summarizer)
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
needlebench_8k_summary_groups = needlebench_8k_summarizer['summary_groups']
needlebench_32k_summary_groups = needlebench_32k_summarizer['summary_groups']
needlebench_128k_summary_groups = needlebench_128k_summarizer['summary_groups']
# Instruct models summarizer
summarizer = dict(
dataset_abbrs=[
['ruler_8k', 'naive_average'],
['ruler_32k', 'naive_average'],
['ruler_128k', 'naive_average'],
['NeedleBench-Overall-Score-8K', 'weighted_average'],
['NeedleBench-Overall-Score-32K', 'weighted_average'],
['NeedleBench-Overall-Score-128K', 'weighted_average'],
['longbench', 'naive_average'],
['longbench_zh', 'naive_average'],
['longbench_en', 'naive_average'],
'',
'longbench_single-document-qa',
'longbench_multi-document-qa',
'longbench_summarization',
'longbench_few-shot-learning',
'longbench_synthetic-tasks',
'longbench_code-completion',
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
#######################################################################
# PART 3 Models List #
#######################################################################
lmdeploy_qwen2_7b_instruct_model[0]['max_seq_len'] = 1048576
lmdeploy_qwen2_7b_instruct_model[0]['engine_config']['session_len'] = 1048576
lmdeploy_qwen2_7b_instruct_model[0]['engine_config']['tp'] = 4
lmdeploy_qwen2_7b_instruct_model[0]['engine_config']['rope_scaling_factor'] = 4
lmdeploy_qwen2_7b_instruct_model[0]['run_cfg']['num_gpus'] = 4
llama3_1_8b_instruct_model[0]['max_seq_len'] = 1048576
llama3_1_8b_instruct_model[0]['engine_config']['session_len'] = 1048576
llama3_1_8b_instruct_model[0]['engine_config']['tp'] = 4
llama3_1_8b_instruct_model[0]['engine_config']['rope_scaling_factor'] = 4
llama3_1_8b_instruct_model[0]['run_cfg']['num_gpus'] = 4
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Local Runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask)),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
base_exp_dir = 'outputs/corebench/'
work_dir = osp.join(base_exp_dir, 'long_context')
================================================
FILE: examples/eval_corebench_2409_subjective.py
================================================
import os.path as osp
from copy import deepcopy
from mmengine.config import read_base
from opencompass.models import (HuggingFacewithChatTemplate,
TurboMindModelwithChatTemplate)
from opencompass.models.openai_api import OpenAI, OpenAISDK
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import DLCRunner, LocalRunner
from opencompass.summarizers import SubjectiveSummarizer
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
# Datasets Part
from opencompass.configs.datasets.subjective.alignbench.alignbench_v1_1_judgeby_critiquellm import \
alignbench_datasets
from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare import \
arenahard_datasets
from opencompass.configs.datasets.subjective.multiround.mtbench_single_judge_diff_temp import \
mtbench_datasets
# Summarizer
# Model List
# from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import models as lmdeploy_qwen2_1_5b_instruct_model
# from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import models as hf_internlm2_5_7b_chat_model
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
summarizer = dict(type=SubjectiveSummarizer, function='subjective')
#######################################################################
# PART 3 Models List #
#######################################################################
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='internlm2_5-7b-chat-turbomind',
path='internlm/internlm2_5-7b-chat',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=40,
temperature=1.0,
top_p=0.9,
max_new_tokens=4096),
max_seq_len=16384,
max_out_len=4096,
batch_size=16,
run_cfg=dict(num_gpus=1),
)
]
models = sum([v for k, v in locals().items() if k.endswith('_model')], models)
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# Local Runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
retry=0, # Modify if needed
task=dict(type=OpenICLInferTask)),
)
# JudgeLLM
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
judge_models = [
dict(
type=OpenAISDK,
abbr='gpt-4o-2024-08-06',
path='gpt-4o-2024-08-06',
# openai_api_base=
# 'http://10.140.1.86:10001/v1', # Change to your own url if needed.
key='YOUR_API_KEY',
retry=10,
meta_template=api_meta_template,
rpm_verbose=True,
query_per_second=1,
max_out_len=4096,
max_seq_len=16384,
batch_size=16,
temperature=0.01,
tokenizer_path='gpt-4o-2024-08-06')
]
# Evaluation with local runner
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
base_exp_dir = 'outputs/corebench/'
work_dir = osp.join(base_exp_dir, 'chat_subjective')
================================================
FILE: examples/eval_deepseek_r1.py
================================================
# Support AIME-2024 with Repeat8
# Support MATH-500
# Support OlympiadBench
# Support OmniMath
# Support LiveMathBench-202412-Hard
import os.path as osp
from itertools import product
from opencompass.models import OpenAISDK
from mmengine.config import read_base
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
from opencompass.runners import LocalRunner
from opencompass.models import (
TurboMindModelwithChatTemplate,
)
#######################################################################
# PART 1 Datasets List #
#######################################################################
with read_base():
# You can comment out the datasets you don't want to evaluate
# Datasets
# from opencompass.configs.datasets.math.math_prm800k_500_llmverify_gen_6ff468 import math_datasets # 1 Run
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets # 8 Run
# from opencompass.configs.datasets.OlympiadBench.OlympiadBench_0shot_llmverify_gen_be8b13 import olympiadbench_datasets
# from opencompass.configs.datasets.omni_math.omni_math_llmverify_gen_ccf9c0 import omnimath_datasets # 1 Run
# from opencompass.configs.datasets.livemathbench.livemathbench_hard_custom_llmverify_gen_85d0ef import livemathbench_datasets
# Summarizer
from opencompass.configs.summarizers.groups.OlympiadBench import OlympiadBenchMath_summary_groups
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
# Set LLM Verifier used for each dataset
verifier_cfg = dict(
abbr='qwen2-5-32B-Instruct',
type=OpenAISDK,
path='Qwen/Qwen2.5-32B-Instruct', # You need to set your own judge model path
key='sk-1234', # You need to set your own API key
openai_api_base=[
'http://172.30.56.1:4000/v1', # You need to set your own API base
],
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
),
query_per_second=16,
batch_size=1024,
temperature=0.001,
tokenizer_path='gpt-4o-2024-05-13',
verbose=True,
max_out_len=16384,
# max_seq_len=32768,
max_seq_len=49152,
)
for item in datasets:
# item['infer_cfg']['inferencer']['max_out_len'] = 32768 # Uncomment this line to avoid length cutoff
if 'judge_cfg' in item['eval_cfg']['evaluator']:
item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
#######################################################################
# PART 2 Model List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models += [
# You can comment out the models you don't want to evaluate
# All models use sampling mode
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-7b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768),
max_seq_len=32768,
max_out_len=32768,
batch_size=64,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
# dict(
# type=TurboMindModelwithChatTemplate,
# abbr='deepseek-r1-distill-qwen-14b-turbomind',
# path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
# engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
# gen_config=dict(
# do_sample=True,
# temperature=0.6,
# top_p=0.95,
# max_new_tokens=32768),
# max_seq_len=32768,
# max_out_len=32768,
# batch_size=128,
# run_cfg=dict(num_gpus=2),
# pred_postprocessor=dict(type=extract_non_reasoning_content)
# ),
# dict(
# type=TurboMindModelwithChatTemplate,
# abbr='deepseek-r1-distill-qwen-32b-turbomind',
# path='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
# engine_config=dict(session_len=32768, max_batch_size=128, tp=4),
# gen_config=dict(
# do_sample=True,
# temperature=0.6,
# top_p=0.95,
# max_new_tokens=16384),
# max_seq_len=32768,
# max_out_len=16384,
# batch_size=128,
# run_cfg=dict(num_gpus=4),
# pred_postprocessor=dict(type=extract_non_reasoning_content)
# ),
]
#######################################################################
# PART 3 Inference/Evaluation #
#######################################################################
# Inference configuration
infer = dict(
partitioner=dict(
type=NumWorkerPartitioner,
num_worker=1
# Similar to data parallelism: the number of workers to use.
# Each worker handles one part of the dataset, so
# total GPUs = num_worker * num_gpus_per_worker.
# For example, with 8 GPUs and a 7B model that uses 1 GPU per instance,
# set num_worker=8 to fully utilize the GPUs.
# With 8 GPUs and a 14B model that uses 2 GPUs per instance, set num_worker=4.
),
runner=dict(
type=LocalRunner,
task=dict(type=OpenICLInferTask)
),
)
# Evaluation configuration
eval = dict(
partitioner=dict(
type=NaivePartitioner, n=8
),
runner=dict(
type=LocalRunner,
task=dict(
type=OpenICLEvalTask)
),
)
#######################################################################
# PART 4 Summarizer #
#######################################################################
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
)
summary_groups.extend([
{
'name': 'AIME2024-Aveage8',
'subsets':[[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
},
{
'name': 'LiveMathBench-v202412-Hard-Aveage8',
'subsets':[[
f'livemathbench_hard_custom_{split}_run{run_idx}', 'accuracy']
for split, run_idx in product(['hard_cn', 'hard_en'], range(8))
]
}
])
# Summarizer
summarizer = dict(
dataset_abbrs=[
'MATH',
# ['LiveMathBench-k1-n1', 'pass@1'],
# ['LiveMathBench-v202412-greedy', 'G-Pass@1_0.0'],
# ['aime2024', 'accuracy'],
['math_prm800k_500-llmjudge', 'accuracy'],
['AIME2024-Aveage8', 'naive_average'],
['LiveMathBench-v202412-Hard-Aveage8', 'naive_average'],
['OlympiadBenchMath', 'accuracy'],
['OmniMath', 'accuracy'],
],
summary_groups=summary_groups,
)
#######################################################################
# PART 5 Utils #
#######################################################################
work_dir = 'outputs/deepseek_r1_reasoning'
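# Editorial sketch, not part of the original config. The comment on `num_worker`
# in the inference partitioner above states that total GPUs equal
# num_worker * num_gpus_per_worker. The helper below turns that rule of thumb into
# a small calculation. `suggest_num_worker` is a hypothetical name, not an
# OpenCompass API, and it assumes every model instance uses the same number of GPUs.

```python
def suggest_num_worker(total_gpus: int, gpus_per_instance: int) -> int:
    """Return how many model instances fit on the available GPUs."""
    if total_gpus < 1 or gpus_per_instance < 1:
        raise ValueError('GPU counts must be positive integers')
    if gpus_per_instance > total_gpus:
        raise ValueError('one instance needs more GPUs than are available')
    # Integer division: leftover GPUs that cannot host a full instance stay idle.
    return total_gpus // gpus_per_instance


# The two cases mentioned in the config comment.
workers_7b = suggest_num_worker(total_gpus=8, gpus_per_instance=1)
workers_14b = suggest_num_worker(total_gpus=8, gpus_per_instance=2)
```

# The result would be passed as `num_worker` to the partitioner, with
# `run_cfg=dict(num_gpus=...)` of the model set to the per-instance GPU count.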
================================================
FILE: examples/eval_dingo.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.dingo.dingo_gen import datasets
    from opencompass.configs.models.hf_internlm.hf_internlm_7b import models
work_dir = './outputs/eval_dingo'
================================================
FILE: examples/eval_ds1000_interpreter.py
================================================
from mmengine.config import read_base
from opencompass.lagent.actions.python_interpreter import PythonInterpreter
from opencompass.models import OpenAI
from opencompass.models.lagent import CodeAgent
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
PYTHON_INTERPRETER_DESCRIPTION = """\
It can run a Python code. The code must be a valid code that contains only python method.
"""
actions = [
    dict(
        type=PythonInterpreter,
        description=PYTHON_INTERPRETER_DESCRIPTION,
        answer_expr=None,
    )
]
with read_base():
    from opencompass.configs.datasets.ds1000.ds1000_gen_5c4bec import \
        ds1000_datasets as datasets
models = [
    dict(abbr='gpt-3.5-react',
         type=CodeAgent,
         llm=dict(
             type=OpenAI,
             path='gpt-3.5-turbo',
             key='ENV',
             query_per_second=1,
             max_seq_len=4096,
         ),
         actions=actions,
         batch_size=8),
]
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=40000),
    runner=dict(type=LocalRunner,
                max_num_workers=16,
                task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_edgellm_demo.py
================================================
from mmengine.config import read_base
with read_base():
    # datasets
    from opencompass.configs.datasets.bbh.bbh_gen import bbh_datasets
    from opencompass.configs.datasets.commonsenseqa.commonsenseqa_7shot_cot_gen_734a22 import \
        commonsenseqa_datasets
    from opencompass.configs.datasets.FewCLUE_chid.FewCLUE_chid_gen import \
        chid_datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from opencompass.configs.datasets.humaneval.humaneval_gen import \
        humaneval_datasets
    from opencompass.configs.datasets.longbench.longbench import \
        longbench_datasets
    from opencompass.configs.datasets.truthfulqa.truthfulqa_gen import \
        truthfulqa_datasets
    # models
    from opencompass.configs.models.hf_llama.hf_llama3_8b import \
        models as hf_llama3_8b_model
    from opencompass.configs.models.others.hf_phi_2 import \
        models as hf_phi_2_model
    from opencompass.configs.models.qwen.hf_qwen2_7b import \
        models as hf_qwen2_7b_model
datasets = sum([
    v
    for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'
], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
work_dir = './outputs/edgellm/'
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# dataset           version  metric            mode  phi-2_hf
# ----------------  -------  ----------------  ----  --------
# commonsense_qa    c946f2   accuracy          gen      65.19
# openai_humaneval  8e312c   humaneval_pass@1  gen      30.49
# truthful_qa       5ddc62   rouge_max         gen       0.08
# truthful_qa       5ddc62   rouge_diff        gen      -0.00
# truthful_qa       5ddc62   rouge_acc         gen       0.41
# gsm8k             1d7fe4   accuracy          gen      62.40
# chid-dev          211ee7   accuracy          gen      12.87
# chid-test         211ee7   accuracy          gen      14.34
# bbh               -        naive_average     gen      59.50
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# dataset           version  metric            mode  Meta-Llama-3-8B_hf
# ----------------  -------  ----------------  ----  ------------------
# commonsense_qa    c946f2   accuracy          gen                70.11
# openai_humaneval  8e312c   humaneval_pass@1  gen                26.22
# truthful_qa       5ddc62   rouge_max         gen                 0.07
# truthful_qa       5ddc62   rouge_diff        gen                -0.01
# truthful_qa       5ddc62   rouge_acc         gen                 0.41
# gsm8k             1d7fe4   accuracy          gen                55.80
# chid-dev          211ee7   accuracy          gen                40.59
# chid-test         211ee7   accuracy          gen                36.66
# bbh               -        naive_average     gen                61.62
# 20240816_060452
# tabulate format
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# dataset         version  metric      mode  qwen2-7b-hf
# --------------  -------  ----------  ----  -----------
# commonsense_qa  734a22   accuracy    gen         65.19
# truthful_qa     5ddc62   rouge_max   gen          0.08
# truthful_qa     5ddc62   rouge_diff  gen         -0.02
# truthful_qa     5ddc62   rouge_acc   gen          0.44
================================================
FILE: examples/eval_eese_api_judge.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.eese.eese_judge_gen import \
eese_datasets
# choose a model of interest
from opencompass.configs.models.openai.gpt_4o_2024_05_13 import \
models as gpt4
from opencompass.models import OpenAISDK
# configure the judge model
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
], )
judge_cfg = dict(
abbr='model-judge',
type=OpenAISDK,
path='model-name',
key='your-api-key',
openai_api_base=['openai-url'],
meta_template=api_meta_template,
query_per_second=16,
batch_size=1,
temperature=0.001,
tokenizer_path='gpt-4o',
verbose=True,
max_out_len=16384,
max_seq_len=49152,
)
datasets = eese_datasets
models = gpt4
# Add the judge_cfg settings to each dataset instead of overwriting what is there
for dataset in datasets:
    if 'eval_cfg' in dataset and 'evaluator' in dataset['eval_cfg']:
        # Get the existing judge_cfg, or create an empty dict if none exists
        existing_judge_cfg = dataset['eval_cfg']['evaluator'].get('judge_cfg', {})
        # Update it: keep the original settings and add the new ones
        existing_judge_cfg.update(judge_cfg)
        # Write the updated config back
        dataset['eval_cfg']['evaluator']['judge_cfg'] = existing_judge_cfg
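# --- Illustrative sketch (not part of the original example) ---
# The loop above merges judge_cfg into whatever judge settings a dataset
# already carries. The helper below shows the same merge on plain dicts so
# the precedence is easy to see: keys present in both configs take the new
# value, and keys found only in the existing config are kept. The name
# `_merge_judge_cfg` and the toy keys in the usage note are hypothetical.
def _merge_judge_cfg(dataset_cfgs, new_judge_cfg):
    """Merge new_judge_cfg into each dataset's evaluator judge_cfg."""
    for dataset in dataset_cfgs:
        evaluator = dataset.get('eval_cfg', {}).get('evaluator')
        if evaluator is None:
            continue  # this dataset has no evaluator to configure
        merged = dict(evaluator.get('judge_cfg', {}))  # copy the old settings
        merged.update(new_judge_cfg)  # new values win on key conflicts
        evaluator['judge_cfg'] = merged
    return dataset_cfgs
# Usage (toy data):
#   _merge_judge_cfg([{'eval_cfg': {'evaluator': {'judge_cfg': {'retry': 3}}}}],
#                    {'path': 'model-name'})
#   leaves judge_cfg == {'retry': 3, 'path': 'model-name'}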
================================================
FILE: examples/eval_gpt3.5.py
================================================
from mmengine.config import read_base
from opencompass.models import OpenAI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.collections.chat_medium import datasets
# and output the results in a chosen format
from opencompass.configs.summarizers.medium import summarizer
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
], )
models = [
dict(
abbr='GPT-3.5-turbo-0613',
type=OpenAI,
path='gpt-3.5-turbo-0613',
key='ENV',  # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
meta_template=api_meta_template,
query_per_second=1,
max_out_len=2048,
max_seq_len=4096,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask)),
)
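# --- Illustrative sketch (not part of the original example) ---
# The comment on `key` above says that 'ENV' makes the key come from
# $OPENAI_API_KEY. One minimal way to picture that lookup is shown below.
# The helper name `_resolve_api_key` is hypothetical, and this is an
# assumption about the behaviour, not the OpenAI wrapper's actual code.
import os


def _resolve_api_key(key, env_var='OPENAI_API_KEY'):
    """Return `key` as written, or read it from the environment if it is 'ENV'."""
    if key != 'ENV':
        return key  # a literal key was written in the config
    value = os.environ.get(env_var)
    if not value:
        raise ValueError(f'{env_var} is not set but key="ENV" was requested')
    return value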
================================================
FILE: examples/eval_gpt4.py
================================================
from mmengine.config import read_base
from opencompass.models import OpenAI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.datasets.collections.chat_medium import datasets
from opencompass.configs.summarizers.medium import summarizer
# GPT4 needs a special humaneval postprocessor
from opencompass.datasets.humaneval import humaneval_gpt_postprocess
for _dataset in datasets:
    if _dataset['path'] == 'openai_humaneval':
        _dataset['eval_cfg']['pred_postprocessor'][
            'type'] = humaneval_gpt_postprocess
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
], )
models = [
dict(
abbr='GPT4',
type=OpenAI,
path='gpt-4-0613',
key='ENV',  # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
meta_template=api_meta_template,
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=4,
task=dict(type=OpenICLInferTask)),
)
================================================
FILE: examples/eval_hellobench.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.subjective.hellobench.hellobench import hellobench_datasets
from opencompass.models import HuggingFacewithChatTemplate, OpenAI
from opencompass.partitioners import NaivePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import DefaultSubjectiveSummarizer
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
# -------------Inference Stage ----------------------------------------
# For subjective evaluation, we usually enable sampling (do_sample) for models.
# Make sure generation parameters are consistent across models: for example, if you set temperature=0.8 for one model, set it to 0.8 for all of them.
models = [
dict(
type=HuggingFacewithChatTemplate,
abbr='glm-4-9b-chat-hf',
path='THUDM/glm-4-9b-chat',
max_out_len=16384,
generation_kwargs=dict(
temperature=0.8,
do_sample=True,  # For subjective evaluation, we suggest enabling do_sample when running model inference
),
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
batch_size=1,
run_cfg=dict(num_gpus=2, num_procs=1),
stop_words=['<|endoftext|>', '<|user|>', '<|observation|>'],
)
]
datasets = [*hellobench_datasets] # add datasets you want
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
# ------------- JudgeLLM Configuration
# we recommend using gpt4o-mini as the judge model
# if you want to use open-source LLMs as judge models, you can uncomment the following code
# judge_models = [
# dict(
# type=HuggingFacewithChatTemplate,
# abbr='glm-4-9b-chat-hf',
# path='THUDM/glm-4-9b-chat',
# max_out_len=16384,
# generation_kwargs=dict(
# temperature=0.8,
# do_sample=True,  # For subjective evaluation, we suggest enabling do_sample when running model inference
# ),
# model_kwargs=dict(
# device_map='auto',
# trust_remote_code=True,
# ),
# batch_size=1,
# run_cfg=dict(num_gpus=2, num_procs=1),
# stop_words=['<|endoftext|>', '<|user|>', '<|observation|>'],
# )
# ]
judge_models = [
dict(
abbr='GPT4o',
type=OpenAI,
path='gpt-4o',
key='xxxx',  # Placeholder: replace with your API key, or use 'ENV' to read it from $OPENAI_API_KEY
meta_template=api_meta_template,
query_per_second=16,
max_out_len=4096,
batch_size=1,
temperature=0.8,
seed=42,
)
]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
summarizer = dict(type=DefaultSubjectiveSummarizer)
work_dir = 'outputs/hellobench/'
================================================
FILE: examples/eval_hf_llama2.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.agieval.agieval_mixed_713d14 import \
agieval_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_3309bd import \
gsm8k_datasets
from opencompass.configs.datasets.hellaswag.hellaswag_ppl_a6e128 import \
hellaswag_datasets
from opencompass.configs.datasets.humaneval.deprecated_humaneval_gen_a82cae import \
humaneval_datasets
from opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d import mmlu_datasets
from opencompass.configs.datasets.nq.nq_open_gen_e93f8a import nq_datasets
from opencompass.configs.datasets.obqa.obqa_ppl_6aac9e import obqa_datasets
from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_ppl_314797 import \
BoolQ_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_gen_d18bf4 import \
triviaqa_datasets
from opencompass.configs.datasets.winogrande.winogrande_ll_c5cf57 import \
winogrande_datasets
from opencompass.configs.models.hf_llama.hf_llama2_7b import models
from opencompass.configs.summarizers.example import summarizer
datasets = sum([
v
for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'
], [])
work_dir = './outputs/llama2/'
================================================
FILE: examples/eval_hf_llama_7b.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.collections.base_medium_llama import (
        piqa_datasets, siqa_datasets)
    from opencompass.configs.models.hf_llama.hf_llama_7b import models
datasets = [*piqa_datasets, *siqa_datasets]
================================================
FILE: examples/eval_inference_ppl.py
================================================
from mmengine.config import read_base
with read_base():
# Inference PPL datasets
from opencompass.configs.datasets.inference_ppl.inference_ppl import inference_ppl_datasets
# Model configs
from opencompass.configs.models.qwen.hf_qwen1_5_7b import models as qwen1_5_7b
from opencompass.configs.models.qwen.hf_qwen1_5_14b import models as qwen1_5_14b
from opencompass.configs.models.hf_llama.hf_llama2_7b import models as llama2_7b
from opencompass.configs.models.hf_llama.hf_llama2_13b import models as llama2_13b
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
# -------------Inference Stage ----------------------------------------
datasets = [*inference_ppl_datasets]
work_dir = 'outputs/inference_ppl'
models = [
*qwen1_5_7b,
*qwen1_5_14b,
*llama2_7b,
*llama2_13b,
]
# Set custom batch_size and num_gpus for faster loss calculation
# Smaller batch_size should give more precise results, at the cost of worse efficiency
model_cfg = dict(batch_size=8, run_cfg=dict(num_gpus=4, num_procs=1))
for mdl in models:
    mdl.update(model_cfg)
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
task=dict(type=OpenICLInferTask),
max_num_workers=256, # Maximum concurrent evaluation task count
),
)
# -------------Evaluation Stage ----------------------------------------
eval = dict(partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
task=dict(type=OpenICLEvalTask),
max_num_workers=256,
))
================================================
FILE: examples/eval_internLM.py
================================================
from mmengine.config import read_base
with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.collections.base_medium import datasets
    # choose a model of interest
    from opencompass.configs.models.internlm.internlm_7b import models
    # and output the results in a chosen format
    from opencompass.configs.summarizers.medium import summarizer
================================================
FILE: examples/eval_intern_s1_pro.py
================================================
# flake8: noqa
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
from copy import deepcopy
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.models import OpenAISDKStreaming
#######################################################################
# PART 0 Essential Configs #
#######################################################################
with read_base():
# Datasets
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_nocot_genericllmeval_gen_08c1de import (
mmlu_pro_datasets,
)
from opencompass.configs.datasets.gpqa.gpqa_cascade_eval_gen_772ea0 import (
gpqa_datasets,
)
from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import (
aime2025_datasets,
)
from opencompass.configs.chatml_datasets.IMO_Bench_AnswerBench.IMO_Bench_AnswerBench_gen import (
datasets as IMO_Bench_AnswerBench_chatml
)
from opencompass.configs.datasets.IFBench.IFBench_gen import (
ifbench_datasets,
)
from opencompass.configs.datasets.livecodebench.livecodebench_gen_a4f90b import (
LCBCodeGeneration_dataset,
)
from opencompass.configs.datasets.SmolInstruct.smolinstruct_0shot_instruct_gen import (
smolinstruct_datasets_0shot_instruct as smolinstruct_datasets,
)
from opencompass.configs.datasets.matbench.matbench_llm_judge_gen_0e9276 import (
matbench_datasets,
)
from opencompass.configs.datasets.biodata.biodata_task_gen import (
biodata_task_datasets
)
from opencompass.configs.datasets.MolInstructions_chem.mol_instructions_chem_gen import (
mol_gen_selfies_datasets
)
# Summary Groups
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups
from opencompass.configs.summarizers.groups.biodata import (
biodata_summary_groups,
)
LCBCodeGeneration_v6_datasets = deepcopy(LCBCodeGeneration_dataset)
LCBCodeGeneration_v6_datasets['abbr'] = 'lcb_code_generation_v6'
LCBCodeGeneration_v6_datasets['release_version'] = 'v6'
LCBCodeGeneration_v6_datasets['eval_cfg']['evaluator'][
'release_version'
] = 'v6'
LCBCodeGeneration_v6_datasets = [LCBCodeGeneration_v6_datasets]
#######################################################################
# PART 1 Datasets List #
#######################################################################
# datasets list for evaluation
repeated_info = [
(gpqa_datasets, 8),
(aime2025_datasets, 32),
]
for datasets_, num in repeated_info:
    for dataset_ in datasets_:
        dataset_['n'] = num
        dataset_['k'] = num
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
chatml_datasets = sum(
(v for k, v in locals().items() if k.endswith('_chatml')),
[],
)
# LLM judge config: using LLM to evaluate predictions
judge_cfg = dict(
abbr='YOUR_JUDGE_MODEL',
type=OpenAISDKStreaming,
path='YOUR_JUDGE_MODEL',
key='YOUR_JUDGE_KEY',
openai_api_base='YOUR_JUDGE_URL',
mode='mid',
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
),
query_per_second=16,
batch_size=64,
temperature=0.001,
max_out_len=8192,
max_seq_len=32768,
)
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
    if 'llm_evaluator' in item['eval_cfg']['evaluator'].keys() and 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
        item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
for item in chatml_datasets:
    if item['evaluator']['type'] == 'llm_evaluator':
        item['evaluator']['judge_cfg'] = judge_cfg
    if item['evaluator']['type'] == 'cascade_evaluator':
        item['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
)
summarizer = dict(
dataset_abbrs=[
['mmlu_pro', 'accuracy'],
['IFBench', 'score'],
['GPQA_diamond', 'accuracy (8 runs average)'],
['aime2025', 'accuracy (32 runs average)'],
['lcb_code_generation_v6', 'pass@1'],
['bio_data', 'naive_average'],
['IMO-Bench-AnswerBench', 'accuracy'],
'',
'Mol_Instruct',
['FS-selfies', 'score'],
['MC-selfies', 'score'],
['MG-selfies', 'score'],
['PP-selfies', 'score'],
['RP-selfies', 'score'],
['RS-selfies', 'score'],
'',
'SmolInstruct',
['NC-I2F-0shot-instruct', 'score'],
['NC-I2S-0shot-instruct', 'score'],
['NC-S2F-0shot-instruct', 'score'],
['NC-S2I-0shot-instruct', 'score'],
['PP-ESOL-0shot-instruct', 'score'],
['PP-Lipo-0shot-instruct', 'score'],
['PP-BBBP-0shot-instruct', 'accuracy'],
['PP-ClinTox-0shot-instruct', 'accuracy'],
['PP-HIV-0shot-instruct', 'accuracy'],
['PP-SIDER-0shot-instruct', 'accuracy'],
['MC-0shot-instruct', 'score'],
['MG-0shot-instruct', 'score'],
['FS-0shot-instruct', 'score'],
['RS-0shot-instruct', 'score'],
'',
['matbench_expt_gap', 'mae'],
['matbench_steels', 'mae'],
['matbench_expt_is_metal', 'accuracy'],
['matbench_glass', 'accuracy'],
],
summary_groups=summary_groups,
)
#######################################################################
# PART 3 Models #
#######################################################################
api_meta_template = dict(
round=[
dict(role='SYSTEM', api_role='SYSTEM'), # System prompt is only needed when evaluating Bio_data and Mol_instructions
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
models = [
dict(
abbr='intern-s1-pro',
type=OpenAISDKStreaming,
path='intern-s1-pro',
key='YOUR_API_KEY',
openai_api_base='YOUR_API_BASE',
meta_template=api_meta_template,
query_per_second=16,
batch_size=8,
temperature=0.8,
retry=10,
max_out_len=65536,
max_seq_len=65536,
extra_body={
'chat_template_kwargs': {'enable_thinking': True}  # Set to False to disable thinking when evaluating scientific benchmarks
},
pred_postprocessor=dict(
type=extract_non_reasoning_content,
),
),
]
#######################################################################
# PART 4 Inference/Evaluation Configuration #
#######################################################################
# infer with local runner
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask),
),
)
# eval with local runner
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(
type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)
),
)
#######################################################################
# PART 5 Utils Configuration #
#######################################################################
work_dir = './outputs/oc_intern_s1_pro_eval'
================================================
FILE: examples/eval_internlm2_chat_keyset.py
================================================
from copy import deepcopy
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.agieval.agieval_gen_64afd3 import \
agieval_datasets
from opencompass.configs.datasets.bbh.bbh_gen_5b92b0 import bbh_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
gsm8k_datasets
from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import \
humaneval_datasets
from opencompass.configs.datasets.math.math_evaluatorv2_gen_cecb31 import \
math_datasets
from opencompass.configs.datasets.mbpp.deprecated_sanitized_mbpp_gen_1e1056 import \
sanitized_mbpp_datasets
from opencompass.configs.datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
models as hf_internlm2_chat_7b_model
from opencompass.configs.models.hf_internlm.hf_internlm2_chat_20b import \
models as hf_internlm2_chat_20b_model
from opencompass.configs.summarizers.internlm2_keyset import summarizer
work_dir = './outputs/internlm2-chat-keyset/'
_origin_datasets = sum(
[v for k, v in locals().items() if k.endswith('_datasets')], [])
_origin_models = sum([v for k, v in locals().items() if k.endswith('_model')],
[])
_vanilla_datasets = [deepcopy(d) for d in _origin_datasets]
_vanilla_models = []
for m in _origin_models:
    m = deepcopy(m)
    if 'meta_template' in m and 'round' in m['meta_template']:
        round = m['meta_template']['round']
        if any(r['role'] == 'SYSTEM' for r in round):
            new_round = [r for r in round if r['role'] != 'SYSTEM']
            print(
                f'WARNING: remove SYSTEM round in meta_template for {m.get("abbr", None)}'
            )
            m['meta_template']['round'] = new_round
    _vanilla_models.append(m)
datasets = _vanilla_datasets
models = _vanilla_models
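# --- Illustrative sketch (not part of the original example) ---
# The loop above copies each model config and drops SYSTEM entries from its
# meta_template. The same step is shown below as a standalone helper on plain
# dicts. The name `_strip_system_round` is hypothetical; like the loop, it
# works on a deep copy so the imported config is left untouched.
from copy import deepcopy


def _strip_system_round(model_cfg):
    """Return a copy of model_cfg whose meta_template has no SYSTEM round."""
    cfg = deepcopy(model_cfg)
    rounds = cfg.get('meta_template', {}).get('round')
    if rounds is not None:
        cfg['meta_template']['round'] = [
            r for r in rounds if r['role'] != 'SYSTEM'
        ]
    return cfg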
================================================
FILE: examples/eval_internlm2_keyset.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.agieval.agieval_mixed_713d14 import \
agieval_datasets
from opencompass.configs.datasets.bbh.bbh_gen_5b92b0 import bbh_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
gsm8k_datasets
from opencompass.configs.datasets.humaneval.deprecated_humaneval_gen_a82cae import \
humaneval_datasets
from opencompass.configs.datasets.math.math_gen_265cce import math_datasets
from opencompass.configs.datasets.mbpp.deprecated_sanitized_mbpp_gen_1e1056 import \
sanitized_mbpp_datasets
from opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d import mmlu_datasets
from opencompass.configs.models.hf_internlm.hf_internlm2_7b import \
models as hf_internlm2_7b_model
from opencompass.configs.models.hf_internlm.hf_internlm2_20b import \
models as hf_internlm2_20b_model
from opencompass.configs.summarizers.internlm2_keyset import summarizer
work_dir = './outputs/internlm2-keyset/'
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
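# --- Illustrative sketch (not part of the original example) ---
# `datasets` and `models` above are built by scanning locals() for names with
# a given suffix and concatenating the matching lists with sum(..., []). The
# helper below does the same on an explicit namespace dict, which makes the
# selection rule easier to test. The name `_collect_by_suffix` is
# hypothetical; results follow the dict's insertion order.
def _collect_by_suffix(namespace, suffix):
    """Concatenate every list in `namespace` whose name ends with `suffix`."""
    collected = []
    for name, value in namespace.items():
        if name.endswith(suffix) and isinstance(value, list):
            collected.extend(value)
    return collected
# Usage (toy data):
#   _collect_by_suffix({'gsm8k_datasets': [1], 'mmlu_datasets': [2, 3]},
#                      '_datasets')
#   returns [1, 2, 3]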
================================================
FILE: examples/eval_internlm3_math500_thinking.py
================================================
# To run this example, you need to do the following steps:
# 1. Install latest opencompass
# 2. Start a local server with Qwen2.5-72B-Instruct as the LLMJudge server (e.g. using vLLM or LMDeploy)
# 3. Change the judge_cfg openai_api_base to your corresponding local server address
# 4. Start this evaluation by running 'opencompass eval_internlm3_math500_thinking.py'
from opencompass.models import VLLMwithChatTemplate, OpenAISDK
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.math.math_prm800k_500_0shot_nocot_genericllmeval_gen_63a000 import (
math_datasets,
)
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
)
judge_cfg = dict(
abbr='qwen2-5-72b-instruct',
type=OpenAISDK,
path='Qwen/Qwen2.5-72B-Instruct',
key='YOUR_API_KEY',
openai_api_base=[
'http://172.30.56.81:23333/v1/', ### Change to your own server
],
meta_template=api_meta_template,
query_per_second=16,
batch_size=16,
temperature=0.001,
max_seq_len=32768,
max_completion_tokens=32768,
)
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
# set max_out_len for inference
for item in datasets:
    item['infer_cfg']['inferencer']['max_out_len'] = 16384
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
reasoning_chat_template = """You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:
## Deep Understanding
Take time to fully comprehend the problem before attempting a solution. Consider:
- What is the real question being asked?
- What are the given conditions and what do they tell us?
- Are there any special restrictions or assumptions?
- Which information is crucial and which is supplementary?
## Multi-angle Analysis
Before solving, conduct thorough analysis:
- What mathematical concepts and properties are involved?
- Can you recall similar classic problems or solution methods?
- Would diagrams or tables help visualize the problem?
- Are there special cases that need separate consideration?
## Systematic Thinking
Plan your solution path:
- Propose multiple possible approaches
- Analyze the feasibility and merits of each method
- Choose the most appropriate method and explain why
- Break complex problems into smaller, manageable steps
## Rigorous Proof
During the solution process:
- Provide solid justification for each step
- Include detailed proofs for key conclusions
- Pay attention to logical connections
- Be vigilant about potential oversights
## Repeated Verification
After completing your solution:
- Verify your results satisfy all conditions
- Check for overlooked special cases
- Consider if the solution can be optimized or simplified
- Review your reasoning process
Remember:
1. Take time to think thoroughly rather than rushing to an answer
2. Rigorously prove each key conclusion
3. Keep an open mind and try different approaches
4. Summarize valuable problem-solving methods
5. Maintain healthy skepticism and verify multiple times
Your response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.
When you're ready, present your complete solution with:
- Clear problem understanding
- Detailed solution process
- Key insights
- Thorough verification
Focus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer.
"""
reasoning_meta_template = dict(
begin=dict(
role='SYSTEM', api_role='SYSTEM', prompt=reasoning_chat_template
),
round=[
dict(role='HUMAN', api_role='HUMAN'),
# XXX: all system roles are mapped to human on purpose
dict(role='BOT', api_role='BOT', generate=True),
],
)
models = [
dict(
type=VLLMwithChatTemplate,
abbr='internlm3-8b-instruct-vllm',
path='internlm/internlm3-8b-instruct',
model_kwargs=dict(tensor_parallel_size=1),
generation_kwargs=dict(do_sample=False), # greedy
max_seq_len=32768,
max_out_len=16384,
batch_size=16,
run_cfg=dict(num_gpus=1),
meta_template=reasoning_meta_template,
)
]
datasets = math_datasets
================================================
FILE: examples/eval_internlm_7b.py
================================================
from mmengine.config import read_base
with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.collections.base_medium import datasets
    # choose a model of interest
    from opencompass.configs.models.hf_internlm.hf_internlm_7b import models
    # and output the results in a chosen format
    from opencompass.configs.summarizers.medium import summarizer
================================================
FILE: examples/eval_internlm_chat_lmdeploy_apiserver.py
================================================
from mmengine.config import read_base
from opencompass.models.turbomind_api import TurboMindAPIModel
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import \
ceval_datasets
from opencompass.configs.datasets.crowspairs.crowspairs_gen_381af0 import \
crowspairs_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
gsm8k_datasets
from opencompass.configs.datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from opencompass.configs.datasets.race.race_gen_69ee4f import race_datasets
from opencompass.configs.datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import \
WiC_datasets
from opencompass.configs.datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import \
WSC_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_gen_2121ce import \
triviaqa_datasets
# and output the results in a chosen format
from opencompass.configs.summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
meta_template = dict(round=[
dict(role='HUMAN', begin='<|User|>:', end='\n'),
dict(role='BOT', begin='<|Bot|>:', end='\n', generate=True),
],
eos_token_id=103028)
internlm_chat_20b = dict(
type=TurboMindAPIModel,
abbr='internlm-chat-20b-turbomind',
api_addr='http://0.0.0.0:23333',
api_key='internlm-chat-20b', # api_key
max_out_len=100,
max_seq_len=2048,
batch_size=8,
meta_template=meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
end_str='',
)
internlm_chat_7b = dict(
type=TurboMindAPIModel,
abbr='internlm-chat-7b-turbomind',
api_addr='http://0.0.0.0:23333',
api_key='internlm-chat-7b', # api_key
max_out_len=100,
max_seq_len=2048,
batch_size=16,
meta_template=meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
end_str='',
)
models = [internlm_chat_20b]
================================================
FILE: examples/eval_internlm_chat_turbomind.py
================================================
from mmengine.config import read_base
from opencompass.models.turbomind import TurboMindModel
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import \
ceval_datasets
from opencompass.configs.datasets.crowspairs.crowspairs_gen_381af0 import \
crowspairs_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
gsm8k_datasets
from opencompass.configs.datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from opencompass.configs.datasets.race.race_gen_69ee4f import race_datasets
from opencompass.configs.datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import \
WiC_datasets
from opencompass.configs.datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import \
WSC_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_gen_2121ce import \
triviaqa_datasets
# and output the results in a chosen format
from opencompass.configs.summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
internlm_meta_template = dict(round=[
dict(role='HUMAN', begin='<|User|>:', end='\n'),
dict(role='BOT', begin='<|Bot|>:', end='\n', generate=True),
],
eos_token_id=103028)
internlm2_meta_template = dict(round=[
dict(role='HUMAN', begin='<|im_start|>user\n', end='<|im_end|>\n'),
dict(role='BOT',
begin='<|im_start|>assistant\n',
end='<|im_end|>\n',
generate=True),
],
eos_token_id=92542)
# config for internlm-chat-7b
internlm_chat_7b = dict(
type=TurboMindModel,
abbr='internlm-chat-7b-turbomind',
path='internlm/internlm-chat-7b',
engine_config=dict(session_len=2048,
max_batch_size=32,
rope_scaling_factor=1.0),
gen_config=dict(top_k=1, top_p=0.8, temperature=1.0, max_new_tokens=100),
max_out_len=100,
max_seq_len=2048,
batch_size=32,
concurrency=32,
meta_template=internlm_meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
end_str='',
)
# config for internlm2-chat-7b
internlm2_chat_7b = dict(type=TurboMindModel,
abbr='internlm2-chat-7b-turbomind',
path='internlm/internlm2-chat-7b',
engine_config=dict(session_len=2048,
max_batch_size=32,
rope_scaling_factor=1.0),
gen_config=dict(top_k=1,
top_p=0.8,
temperature=1.0,
max_new_tokens=100),
max_out_len=100,
max_seq_len=2048,
batch_size=32,
concurrency=32,
meta_template=internlm2_meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
end_str='<|im_end|>')
# config for internlm-chat-20b
internlm_chat_20b = dict(
type=TurboMindModel,
abbr='internlm-chat-20b-turbomind',
path='internlm/internlm-chat-20b',
engine_config=dict(session_len=2048,
max_batch_size=8,
rope_scaling_factor=1.0),
gen_config=dict(top_k=1, top_p=0.8, temperature=1.0, max_new_tokens=100),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
concurrency=8,
meta_template=internlm_meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
end_str='',
)
models = [internlm_chat_20b]
================================================
FILE: examples/eval_internlm_flames_chat.py
================================================
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
from opencompass.partitioners import NaivePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import FlamesSummarizer
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
# -------------Inference Stage ----------------------------------------
with read_base():
from opencompass.configs.datasets.flames.flames_gen import flames_datasets
from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
models
datasets = [*flames_datasets]
from opencompass.models import HuggingFaceCausalLM
_meta_template = dict(round=[
dict(role='HUMAN', begin='<|im_start|>user\n', end='<|im_end|>\n'),
dict(role='BOT',
begin='<|im_start|>assistant\n',
end='<|im_end|>\n',
generate=True),
], )
models = [
dict(
type=HuggingFaceCausalLM,
abbr='internlm2-chat-7b-hf',
path='internlm/internlm2-chat-7b',
tokenizer_path='internlm/internlm2-chat-7b',
model_kwargs=dict(
trust_remote_code=True,
device_map='auto',
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
use_fast=False,
trust_remote_code=True,
),
max_out_len=2048,
max_seq_len=2048,
batch_size=8,
meta_template=_meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
end_str='<|im_end|>',
generation_kwargs={
'eos_token_id': [2, 92542],
'do_sample': True
},
batch_padding=True,
)
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=256,
task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration---------------------------------
internlm1_chat_template = dict(round=[
dict(role='HUMAN', begin='<|User|>:', end='\n'),
dict(role='BOT', begin='<|Bot|>:', end='\n', generate=True),
], )
judge_models = [
dict(
type=HuggingFaceCausalLM,
abbr='flames-scorer',
path='CaasiHUANG/flames-scorer',
tokenizer_path='CaasiHUANG/flames-scorer',
model_kwargs=dict(
trust_remote_code=True,
device_map='auto',
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
use_fast=False,
trust_remote_code=True,
),
generation_kwargs={'do_sample': True},
max_out_len=512,
max_seq_len=4096,
batch_size=8,
meta_template=internlm1_chat_template,
run_cfg=dict(num_gpus=1, num_procs=1),
end_str='',
)
]
## ------------- Evaluation Configuration----------------
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
mode='singlescore',
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=256,
task=dict(type=SubjectiveEvalTask)),
)
summarizer = dict(type=FlamesSummarizer, judge_type='general')
work_dir = 'outputs/flames/'
================================================
FILE: examples/eval_internlm_lmdeploy_apiserver.py
================================================
from mmengine.config import read_base
from opencompass.models.turbomind_api import TurboMindAPIModel
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import \
ceval_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
gsm8k_datasets
from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import \
humaneval_datasets
from opencompass.configs.datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from opencompass.configs.datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import \
WiC_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_gen_2121ce import \
triviaqa_datasets
    # and output the results in a chosen format
from opencompass.configs.summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
internlm_chat_20b = dict(
type=TurboMindAPIModel,
abbr='internlm-chat-20b-turbomind',
api_addr='http://0.0.0.0:23333',
max_out_len=100,
max_seq_len=2048,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
internlm_chat_7b = dict(
type=TurboMindAPIModel,
abbr='internlm-chat-7b-turbomind',
api_addr='http://0.0.0.0:23333',
max_out_len=100,
max_seq_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1, num_procs=1),
)
models = [internlm_chat_20b]
================================================
FILE: examples/eval_internlm_math_chat.py
================================================
from mmengine.config import read_base
from opencompass.models.huggingface import HuggingFaceCausalLM
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets
from opencompass.configs.datasets.math.math_gen_736506 import math_datasets
from opencompass.configs.models.hf_internlm.hf_internlm2_chat_math_7b import \
models as internlm_math_chat_7b_models
from opencompass.configs.models.hf_internlm.hf_internlm2_chat_math_20b import \
models as internlm_math_chat_20b_models
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
# Evaluate MATH and GSM8K for both InternLM-Math-Chat-7B and InternLM-Math-Chat-20B
datasets = [*math_datasets, *gsm8k_datasets]
models = [*internlm_math_chat_7b_models, *internlm_math_chat_20b_models]
================================================
FILE: examples/eval_internlm_turbomind.py
================================================
from mmengine.config import read_base
from opencompass.models.turbomind import TurboMindModel
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import \
ceval_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import \
gsm8k_datasets
from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import \
humaneval_datasets
from opencompass.configs.datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from opencompass.configs.datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import \
WiC_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_gen_2121ce import \
triviaqa_datasets
    # and output the results in a chosen format
from opencompass.configs.summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
# config for internlm-7b model
internlm_7b = dict(
type=TurboMindModel,
abbr='internlm-7b-turbomind',
path='internlm/internlm-7b',
engine_config=dict(session_len=2048,
max_batch_size=32,
rope_scaling_factor=1.0),
gen_config=dict(top_k=1, top_p=0.8, temperature=1.0, max_new_tokens=100),
max_out_len=100,
max_seq_len=2048,
batch_size=32,
concurrency=32,
run_cfg=dict(num_gpus=1, num_procs=1),
)
# config for internlm-20b model
internlm_20b = dict(
type=TurboMindModel,
abbr='internlm-20b-turbomind',
path='internlm/internlm-20b',
engine_config=dict(session_len=2048,
max_batch_size=8,
rope_scaling_factor=1.0),
gen_config=dict(top_k=1, top_p=0.8, temperature=1.0, max_new_tokens=100),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
concurrency=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
models = [internlm_20b]
================================================
FILE: examples/eval_judge_dataset_all.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.judge.judgerbenchv2 import get_judgerbenchv2_dataset as get_judgerbenchv2_datasets
from opencompass.configs.datasets.judge.rmb import get_rmb_dataset as get_rmb_datasets
from opencompass.configs.datasets.judge.rewardbench import get_rewardbench_datasets
from opencompass.configs.datasets.judge.judgebench import get_judgebench_datasets
from opencompass.configs.summarizers.judgedataset_all import summarizer
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner, NumWorkerPartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner, DLCRunner, VOLCRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
from opencompass.models import TurboMindModelwithChatTemplate
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='qwen-7b-hf',
path='Qwen/Qwen-7B',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
),
]
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=72,
task=dict(type=OpenICLInferTask),
),
)
work_dir = './outputs/judge_dataset_all/'
================================================
FILE: examples/eval_judgebench.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.judge.judgebench import get_judgebench_datasets
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner, NumWorkerPartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner, DLCRunner, VOLCRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
datasets = [*get_judgebench_datasets]
from opencompass.models import TurboMindModelwithChatTemplate
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='qwen-7b-hf',
path='Qwen/Qwen-7B',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=72,
task=dict(type=OpenICLInferTask),
),
)
work_dir = './outputs/judgebench/'
================================================
FILE: examples/eval_judgerbench.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.subjective.judgerbench.judgerbench import judgerbench_datasets
from opencompass.models import (HuggingFace, HuggingFaceCausalLM,
HuggingFaceChatGLM3, OpenAI,
TurboMindModelwithChatTemplate)
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
# -------------Inference Stage ----------------------------------------
# For subjective evaluation, we often enable do_sample for models
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='CompassJudger-1-7B-Instruct',
path='opencompass/CompassJudger-1-7B-Instruct',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1,
temperature=1e-6,
top_p=0.9,
max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
)
]
datasets = judgerbench_datasets
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(
type=NaivePartitioner,
n=10,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)),
)
work_dir = 'outputs/judgerbench/'
================================================
FILE: examples/eval_judgerbenchv2.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.judge.judgerbenchv2 import get_judgerbenchv2_dataset
from opencompass.configs.summarizers.judgerbenchv2 import summarizer
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner, NumWorkerPartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner, DLCRunner, VOLCRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
datasets = [*get_judgerbenchv2_dataset]
from opencompass.models import TurboMindModelwithChatTemplate
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='qwen-7b-hf',
path='Qwen/Qwen-7B',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
),
]
infer = dict(
# partitioner=dict(type=NaivePartitioner),
partitioner=dict(type=NumWorkerPartitioner, num_worker=2),
runner=dict(
type=LocalRunner,
max_num_workers=72,
task=dict(type=OpenICLInferTask),
),
)
work_dir = './outputs/judgerbenchv2/'
================================================
FILE: examples/eval_korbench.py
================================================
from mmengine import read_base
with read_base():
from opencompass.configs.datasets.korbench.korbench_mixed_gen_d00bdd import \
korbench_mixed_datasets as mixed_datasets
from opencompass.configs.datasets.korbench.korbench_single_0_shot_gen import \
korbench_0shot_single_datasets as zero_shot_datasets
from opencompass.configs.datasets.korbench.korbench_single_3_shot_gen import \
korbench_3shot_single_datasets as three_shot_datasets
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b import \
models as hf_internlm2_5_7b
datasets = zero_shot_datasets + three_shot_datasets + mixed_datasets
models = hf_internlm2_5_7b
================================================
FILE: examples/eval_lightllm.py
================================================
from mmengine.config import read_base
from opencompass.models import LightllmAPI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.datasets.humaneval.deprecated_humaneval_gen_a82cae import \
humaneval_datasets
from opencompass.configs.summarizers.leaderboard import summarizer
datasets = [*humaneval_datasets]
'''
# Prompt template for InternLM2-Chat
# https://github.com/InternLM/InternLM/blob/main/chat/chat_format.md
_meta_template = dict(
begin='<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n',
round=[
dict(role='HUMAN', begin='<|im_start|>user\n', end='<|im_end|>\n'),
dict(role='BOT', begin='<|im_start|>assistant\n', end='<|im_end|>\n', generate=True),
]
)
'''
_meta_template = None
models = [
dict(
abbr='LightllmAPI',
type=LightllmAPI,
url='http://localhost:1030/generate',
meta_template=_meta_template,
batch_size=32,
max_workers_per_task=128,
rate_per_worker=1024,
retry=4,
generation_kwargs=dict(do_sample=False,
ignore_eos=False,
max_new_tokens=1024),
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=32,
task=dict(type=OpenICLInferTask),
),
)
================================================
FILE: examples/eval_livestembench.py
================================================
from mmengine.config import read_base
from opencompass.models import OpenAISDK
with read_base():
    # choose a list of datasets
from opencompass.configs.datasets.livestembench.livestembench_gen_3e3c50 import \
livestembench_datasets
    # choose a model of interest
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import \
models as qwen2_5_7b_instruct_lmdeploy_model
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import \
models as qwen2_5_72b_instruct_lmdeploy_model
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
models = [
*qwen2_5_7b_instruct_lmdeploy_model, *qwen2_5_72b_instruct_lmdeploy_model
]
# Judge model configuration
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
], )
judge_cfg = dict(
abbr='qwen2-5-72b-instruct',
type=OpenAISDK,
    path='YOUR_SERVER_MODEL_NAME',  # name of your deployed model
key='None',
openai_api_base=[
        'http://localhost:23333/v1',  # address of your model deployment
],
meta_template=api_meta_template,
query_per_second=16,
batch_size=16,
temperature=0.001,
max_completion_tokens=32768,
)
for dataset in datasets:
    dataset['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
# -------------Inference Stage ----------------------------------------
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask)),
)
eval = dict(
partitioner=dict(type=NaivePartitioner, n=8),
runner=dict(
type=LocalRunner,
max_num_workers=256,
task=dict(type=OpenICLEvalTask),
),
)
work_dir = './outputs/livestembench'
================================================
FILE: examples/eval_llama2_7b.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.collections.base_medium_llama import (
piqa_datasets, siqa_datasets)
from opencompass.configs.models.llama.llama2_7b import models
datasets = [*piqa_datasets, *siqa_datasets]
================================================
FILE: examples/eval_llama2_7b_lveval.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.lveval.lveval import \
LVEval_datasets as datasets
from opencompass.configs.models.hf_llama.hf_llama2_7b_chat import models
from opencompass.configs.summarizers.lveval import summarizer
models[0]['path'] = '/path/to/your/huggingface_models/Llama-2-7b-chat-hf'
models[0][
'tokenizer_path'] = '/path/to/your/huggingface_models/Llama-2-7b-chat-hf'
models[0]['max_seq_len'] = 4096
models[0]['generation_kwargs'] = dict(do_sample=False)
models[0]['mode'] = 'mid' # truncate in the middle
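# Illustrative sketch of what "truncate in the middle" means, assuming the
# prompt is a list of token ids and max_len is the token budget. The helper
# name and the even head/tail split are assumptions made for illustration,
# not OpenCompass's implementation of mode='mid'.
def _truncate_middle(token_ids, max_len):
    """Keep the first and last tokens, dropping the middle when too long."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    head = max_len // 2
    tail = max_len - head
    return list(token_ids[:head]) + list(token_ids[len(token_ids) - tail:])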
================================================
FILE: examples/eval_llama3_instruct.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.dataset_collections.chat_OC15 import datasets
from opencompass.configs.models.hf_llama.hf_llama3_8b_instruct import \
models as hf_llama3_8b_instruct_model
from opencompass.configs.summarizers.chat_OC15 import summarizer
work_dir = 'outputs/debug/llama3-instruct'
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
# dataset version metric mode llama-3-8b-instruct-hf
# -------------------- --------- ---------------------------- ------ ------------------------
# average - naive_average gen 55.64
# mmlu - naive_average gen 68.30
# cmmlu - naive_average gen 53.29
# ceval - naive_average gen 52.32
# GaokaoBench - weighted_average gen 45.91
# triviaqa_wiki_1shot eaf81e score gen 79.01
# nq_open_1shot 01cf41 score gen 30.25
# race-high 9a54b6 accuracy gen 81.22
# winogrande b36770 accuracy gen 66.46
# hellaswag e42710 accuracy gen 74.33
# bbh - naive_average gen 67.25
# gsm8k 1d7fe4 accuracy gen 79.08
# math 393424 accuracy gen 27.78
# TheoremQA 6f0af8 score gen 19.50
# openai_humaneval 8e312c humaneval_pass@1 gen 55.49
# sanitized_mbpp 830460 score gen 66.54
# GPQA_diamond 4baadb accuracy gen 25.76
# IFEval 3321a3 Prompt-level-strict-accuracy gen 67.84
# - - - -
# mmlu - naive_average gen 68.30
# mmlu-stem - naive_average gen 57.92
# mmlu-social-science - naive_average gen 77.83
# mmlu-humanities - naive_average gen 71.20
# mmlu-other - naive_average gen 71.79
# cmmlu - naive_average gen 53.29
# cmmlu-stem - naive_average gen 45.40
# cmmlu-social-science - naive_average gen 54.63
# cmmlu-humanities - naive_average gen 54.14
# cmmlu-other - naive_average gen 59.52
# cmmlu-china-specific - naive_average gen 49.33
# ceval - naive_average gen 52.32
# ceval-stem - naive_average gen 48.16
# ceval-social-science - naive_average gen 57.50
# ceval-humanities - naive_average gen 53.26
# ceval-other - naive_average gen 54.26
# ceval-hard - naive_average gen 35.59
================================================
FILE: examples/eval_llm_compression.py
================================================
from mmengine.config import read_base
with read_base():
# LLM compression datasets
from opencompass.configs.datasets.llm_compression.llm_compression import llm_compression_datasets
# Model configs
from opencompass.configs.models.qwen.hf_qwen1_5_7b import models as qwen1_5_7b
from opencompass.configs.models.qwen.hf_qwen1_5_14b import models as qwen1_5_14b
from opencompass.configs.models.hf_llama.hf_llama2_7b import models as llama2_7b
from opencompass.configs.models.hf_llama.hf_llama2_13b import models as llama2_13b
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import LLMCompressionSummarizer
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
# -------------Inference Stage ----------------------------------------
datasets = [*llm_compression_datasets]
work_dir = 'outputs/llm_compression'
models = [
*qwen1_5_7b,
*qwen1_5_14b,
*llama2_7b,
*llama2_13b,
]
# Set custom batch_size and num_gpus for faster loss calculation
# A smaller batch_size should give more precise results, at the cost of slower evaluation
model_cfg = dict(batch_size=8, run_cfg=dict(num_gpus=4, num_procs=1))
for mdl in models:
    mdl.update(model_cfg)
infer = dict(
# The OpenCompass implementation of BPC currently only supports NaivePartitioner, as the sliding window approach requires the dataset to be loaded sequentially. Using other partitioner types may produce incorrect results.
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
task=dict(type=OpenICLInferTask),
max_num_workers=256, # Maximum concurrent evaluation task count
),
)
# -------------Evaluation Stage ----------------------------------------
eval = dict(partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
task=dict(type=OpenICLEvalTask),
max_num_workers=256,
))
# -------------Summarization Stage ----------------------------------------
summarizer = dict(type=LLMCompressionSummarizer)
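# Illustrative sketch of the bits-per-character (BPC) metric mentioned above,
# assuming per-token negative log-likelihoods in nats are already available.
# The helper name and signature are hypothetical; this is not the code path
# used by LLMCompressionSummarizer.
import math


def _bits_per_character(token_nlls_nats, num_characters):
    """Convert summed token NLLs (nats) to bits, then average per character."""
    total_bits = sum(token_nlls_nats) / math.log(2)
    return total_bits / num_characters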
================================================
FILE: examples/eval_llm_judge.py
================================================
from mmengine.config import read_base
from opencompass.models.openai_api import OpenAISDK
# Import pre-configured models from OpenCompass
with read_base():
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
models as lmdeploy_qwen2_5_7b_instruct_model,
)
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
models as lmdeploy_qwen2_5_14b_instruct_model,
)
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import CustomDataset
# Dataset reader configuration
math_reader_cfg = dict(input_columns=['problem'], output_column='answer')
# Inference configuration
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{problem}\nRemember to put your final answer within \\boxed{}.',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Template for the LLM judge
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
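<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n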
Judging the correctness of candidates' answers:
""".strip()
# Evaluation configuration using LLM as judge
math_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='opencompass/math',
file_name='test_prm800k_500.jsonl',
reader_cfg=math_reader_cfg,
),
judge_cfg=lmdeploy_qwen2_5_14b_instruct_model[0],
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
# Dataset configuration
datasets = [
dict(
type=CustomDataset,
path='opencompass/math',
file_name='test_prm800k_500.jsonl',
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=math_eval_cfg,
)
]
# Model to be evaluated
models = lmdeploy_qwen2_5_7b_instruct_model
# Limiting test to first 8 examples for quick testing
math_reader_cfg['test_range'] = '[0:8]'
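# Illustrative sketch: a test_range string such as '[0:8]' reads like a Python
# slice over the test split. The hypothetical helper below parses that form
# under this assumption; it is not the loader OpenCompass itself uses.
def _apply_test_range(items, test_range):
    """Apply a '[start:stop]' slice string to a list of samples."""
    start, _, stop = test_range.strip()[1:-1].partition(':')
    return items[slice(int(start) if start else None,
                       int(stop) if stop else None)]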
# Output directory
work_dir = 'outputs/llm_judge'
================================================
FILE: examples/eval_lmdeploy_demo.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.demo.demo_gsm8k_chat_gen import \
gsm8k_datasets
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_1_8b_chat import \
models
datasets = gsm8k_datasets
models = models
================================================
FILE: examples/eval_longbenchv2.py
================================================
from mmengine.config import read_base
with read_base():
# Models
# Datasets
from opencompass.configs.datasets.longbenchv2.longbenchv2_gen import \
LongBenchv2_datasets as LongBenchv2_datasets
from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import \
models as lmdeploy_glm4_9b_chat_model
from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
models as lmdeploy_llama3_1_8b_instruct_model
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import \
models as lmdeploy_qwen2_5_7b_instruct_model
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for model in models:
    model['max_seq_len'] = 128 * 1024
    model['engine_config']['session_len'] = 128 * 1024
    model['engine_config']['tp'] = 2
    model['run_cfg']['num_gpus'] = 2
    # Drop middle tokens so the input fits within session_len; 128k is used to stay in sync with the original LongBench v2 code
    # Dropping middle tokens is currently supported only for LMDeploy models
    model['drop_middle'] = True
work_dir = './outputs/longbenchv2'
================================================
FILE: examples/eval_math_llm_judge.py
================================================
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
from mmengine.config import read_base
with read_base():
from opencompass.configs.models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model # noqa: F401, F403
from opencompass.configs.models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model # noqa: F401, F403
from opencompass.configs.datasets.math.math_llm_judge import math_datasets # noqa: F401, F403
from opencompass.datasets import math_judement_preprocess
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.summarizers import AllObjSummarizer
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
# -------------Prompt Settings ----------------------------------------
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
[Yes]
Expression 1: 3/2
Expression 2: 1.5
[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
[Yes]
Expression 1: 3245/5
Expression 2: 649
[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2:
[No]
(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
Expression 2: {prediction}
"""
# ------------- Inference Stage ----------------------------------------
# eval models
models = [*hf_llama3_8b_instruct_model]
# judge models
judge_models = hf_llama3_70b_instruct_model

eng_datasets = [*math_datasets]
chn_datasets = []
datasets = eng_datasets + chn_datasets
work_dir = 'outputs/obj_all/'

for d in eng_datasets:
    d['eval_cfg'] = dict(
        evaluator=dict(
            type=LMEvaluator,
            # If you need to preprocess the prediction before judging,
            # you can specify the pred_postprocessor function here
            pred_postprocessor=dict(type=math_judement_preprocess),
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[
                    dict(role='HUMAN', prompt=eng_obj_prompt),
                ]),
            ),
        ),
        pred_role='BOT',
    )

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=40000),
    runner=dict(type=LocalRunner,
                max_num_workers=256,
                task=dict(type=OpenICLInferTask)),
)

# ------------- Evaluation Configuration --------------------------------
eval = dict(
    partitioner=dict(
        type=SubjectiveSizePartitioner,
        max_task_size=80000,
        mode='singlescore',
        models=models,
        judge_models=judge_models,
    ),
    runner=dict(type=LocalRunner,
                max_num_workers=16,
                task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=AllObjSummarizer)
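# Illustrative sketch, not part of the original example: how a judge prompt
# with `{obj_gold}` and `{prediction}` placeholders can be filled in and how a
# "[Yes]" / "[No]" reply can be read back. `JUDGE_TEMPLATE`,
# `build_judge_prompt` and `parse_verdict` are hypothetical names, and the
# template is a shortened stand-in for `eng_obj_prompt`. OpenCompass performs
# these steps internally; this only shows the idea.

```python
import re

# Shortened stand-in for the full judge prompt defined above.
JUDGE_TEMPLATE = (
    'Respond with only "[Yes]" or "[No]" (without quotes).\n'
    'Expression 1: {obj_gold}\n'
    'Expression 2: {prediction}\n'
)


def build_judge_prompt(obj_gold, prediction):
    """Fill the two placeholders with the reference and the model answer."""
    return JUDGE_TEMPLATE.format(obj_gold=obj_gold, prediction=prediction)


def parse_verdict(reply):
    """Return True for [Yes], False for [No], None if neither is present."""
    match = re.search(r'\[(Yes|No)\]', reply)
    if match is None:
        return None
    return match.group(1) == 'Yes'


prompt = build_judge_prompt('3/2', '1.5')
verdict = parse_verdict('[Yes]')
```

# Returning None for an unparseable reply keeps "the judge said no" separate
# from "the judge did not follow the format".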
================================================
FILE: examples/eval_math_llm_judge_internal.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.math.math_0shot_llm_judge_v2_gen_31d777 import \
        math_datasets
    # Choose a model of interest
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import \
        models as qwen2_5_72b_instruct_model

eval_model_name = 'eval_model_name'
postprocessor_model_name = 'postprocessor_model_name'
eval_model_urls = ['http://0.0.0.0:23333/v1']
postprocessor_model_urls = ['http://0.0.0.0:23333/v1']

datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

for dataset in datasets:
    dataset['eval_cfg']['evaluator']['model_name'] = eval_model_name
    dataset['eval_cfg']['evaluator']['url'] = eval_model_urls
    dataset['eval_cfg']['evaluator']['post_url'] = postprocessor_model_urls
    dataset['eval_cfg']['evaluator'][
        'post_model_name'] = postprocessor_model_name

# ------------- Inference Stage ----------------------------------------
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner,
                max_num_workers=8,
                task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner, n=10),
    runner=dict(type=LocalRunner,
                max_num_workers=256,
                task=dict(type=OpenICLEvalTask)),
)
================================================
FILE: examples/eval_math_verify.py
================================================
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
with read_base():
    from opencompass.configs.datasets.math.math_500_gen import math_datasets

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-llama-8b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
        engine_config=dict(session_len=32768, max_batch_size=8, tp=1),
        gen_config=dict(
            top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=4096
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=32,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        engine_config=dict(session_len=32768, max_batch_size=8, tp=1),
        gen_config=dict(
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
            do_sample=True,
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=32,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-1_5b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
        engine_config=dict(session_len=32768, max_batch_size=16, tp=1),
        gen_config=dict(
            top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=4096
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=32,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-14b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
        engine_config=dict(session_len=32768, max_batch_size=16, tp=2),
        gen_config=dict(
            top_k=1,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
            do_sample=True,
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=16,
        run_cfg=dict(num_gpus=2),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]

datasets = [*math_datasets]

work_dir = './outputs/math_500'
================================================
FILE: examples/eval_mathbench.py
================================================
from mmengine.config import read_base
with read_base():
    # Import models
    # Import datasets
    from opencompass.configs.datasets.MathBench.mathbench_gen import \
        mathbench_datasets
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
        models as internlm2_chat_7b_model
    from opencompass.configs.models.hf_llama.hf_llama3_8b_instruct import \
        models as llama3_8b_instruct_model
    # Import summarizers for display results
    from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
        summarizer  # Grouped results for MathBench-A and MathBench-T separately

    # from opencompass.configs.summarizers.mathbench_v1 import summarizer # Detailed results for every sub-dataset
    # from opencompass.configs.summarizers.groups.mathbench_v1_2024_lang import summarizer # Grouped results for bilingual results

datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner,
                max_num_workers=256,
                task=dict(type=OpenICLEvalTask)),
)

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner,
                max_num_workers=256,
                task=dict(type=OpenICLInferTask)),
)

work_dir = './outputs/mathbench_results'
================================================
FILE: examples/eval_mmlu_cf.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.mmlu_cf.mmlu_cf_gen_040615 import \
        mmlu_cf_datasets
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
        models as lmdeploy_llama3_8b_instruct_model
    from opencompass.configs.models.qwen2_5.hf_qwen2_5_7b_instruct import \
        models as hf_qwen2_5_7b_instruct_model
    from opencompass.configs.summarizers.mmlu_cf import summarizer

datasets = sum([
    v
    for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'
], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner,
                max_num_workers=8,
                task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner, n=10),
    runner=dict(type=LocalRunner,
                max_num_workers=256,
                task=dict(type=OpenICLEvalTask)),
)

work_dir = 'outputs/debug/mmlu_cf'
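# Illustrative sketch, not part of the original example: the
# `sum(... locals().items() ...)` idiom used above collects every imported
# list whose variable name ends with a given suffix. The helper name
# `collect_by_suffix` and the demo namespace are hypothetical; a plain dict
# stands in for `locals()`.

```python
def collect_by_suffix(namespace, suffix):
    """Concatenate every list whose variable name ends with `suffix`."""
    # sum(..., []) starts from an empty list and appends each match in turn.
    return sum((v for k, v in namespace.items() if k.endswith(suffix)), [])


demo_namespace = {
    'gsm8k_datasets': [dict(abbr='gsm8k')],
    'math_datasets': [dict(abbr='math')],
    'llama_model': [dict(abbr='llama')],
}

demo_datasets = collect_by_suffix(demo_namespace, '_datasets')
demo_models = collect_by_suffix(demo_namespace, '_model')
```

# This is why the imports in these configs are aliased to names ending in
# `_datasets` or `_model`: the suffix is what the aggregation keys on.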
================================================
FILE: examples/eval_mmlu_pro.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.mmlu_pro.mmlu_pro_gen_cdbebf import \
        mmlu_pro_datasets
    from opencompass.configs.internal.clusters.local import eval
    from opencompass.configs.internal.clusters.local import \
        infer_num_worker as infer
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
        models as lmdeploy_llama3_8b_instruct_model
    from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import \
        models as lmdeploy_qwen2_7b_instruct_model
    from opencompass.configs.summarizers.mmlu_pro import summarizer

datasets = sum([
    v
    for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'
], [])
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

work_dir = 'outputs/debug/mmlu_pro'

# dataset                    version    metric         mode    qwen2-7b-instruct-turbomind    llama-3-8b-instruct-turbomind
# -------------------------  ---------  -------------  ------  -----------------------------  -------------------------------
# mmlu_pro                   -          naive_average  gen     46.18                          43.92
# mmlu_pro_biology           736233     accuracy       gen     63.74                          64.02
# mmlu_pro_business          736233     accuracy       gen     53.23                          46.01
# mmlu_pro_chemistry         736233     accuracy       gen     35.25                          32.42
# mmlu_pro_computer_science  736233     accuracy       gen     47.07                          44.88
# mmlu_pro_economics         736233     accuracy       gen     59.00                          53.79
# mmlu_pro_engineering       736233     accuracy       gen     26.73                          33.54
# mmlu_pro_health            736233     accuracy       gen     47.31                          51.34
# mmlu_pro_history           736233     accuracy       gen     42.78                          42.26
# mmlu_pro_law               736233     accuracy       gen     28.07                          26.98
# mmlu_pro_math              736233     accuracy       gen     53.59                          37.53
# mmlu_pro_philosophy        736233     accuracy       gen     42.28                          42.48
# mmlu_pro_physics           736233     accuracy       gen     39.11                          33.64
# mmlu_pro_psychology        736233     accuracy       gen     60.90                          59.65
# mmlu_pro_other             736233     accuracy       gen     47.40                          46.32
================================================
FILE: examples/eval_mmlu_with_zero_retriever_overwritten.py
================================================
from copy import deepcopy
from mmengine.config import read_base
from opencompass.openicl.icl_retriever import ZeroRetriever
with read_base():
    from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import \
        mmlu_datasets  # this is a dataset evaluated with 5-shot
    from opencompass.configs.models.qwen.hf_qwen_7b_chat import models

datasets = []
for d in mmlu_datasets:
    d = deepcopy(d)
    d['infer_cfg']['retriever'] = dict(type=ZeroRetriever)
    datasets.append(d)
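# Illustrative sketch, not part of the original example: why the loop above
# deep-copies each dataset before overwriting its retriever. Plain dicts and
# strings stand in for the real config objects and retriever classes, and the
# names `base_datasets` and `zero_shot_datasets` are hypothetical.

```python
from copy import deepcopy

# Stand-in for an imported few-shot dataset config.
base_datasets = [
    dict(abbr='demo_a',
         infer_cfg=dict(retriever=dict(type='FixKRetriever',
                                       fix_id_list=[0, 1, 2, 3, 4]))),
]

zero_shot_datasets = []
for d in base_datasets:
    d = deepcopy(d)  # copy nested dicts so the imported config is untouched
    d['infer_cfg']['retriever'] = dict(type='ZeroRetriever')
    zero_shot_datasets.append(d)
```

# Without the deep copy, the assignment would also change the nested
# `infer_cfg` of the imported config, because both names would refer to the
# same dict.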
================================================
FILE: examples/eval_model_rollout.py
================================================
# flake8: noqa
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
from opencompass.models import OpenAISDKRollout
#######################################################################
#                      PART 0  Essential Configs                      #
#######################################################################
with read_base():
    # Datasets
    from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets

#######################################################################
#                        PART 1  Datasets List                        #
#######################################################################
# datasets list for evaluation
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')),
               [])

num_repeat = 4
for item in datasets:
    item['abbr'] += f'_rollout_rep{num_repeat}'
    item['n'] = num_repeat

#######################################################################
#                         PART 2  Models List                         #
#######################################################################
api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)

models = [
    dict(
        abbr='YOUR_MODEL',
        key='YOUR_API_KEY',
        openai_api_base='YOUR_API_BASE',
        type=OpenAISDKRollout,
        path='YOUR_MODEL_PATH',
        temperature=1.0,
        meta_template=api_meta_template,
        query_per_second=1,
        batch_size=16,
        max_out_len=65536,
        max_seq_len=65536,
        retry=10,
        extra_body=dict(
            top_k=20,
        ),
        openai_extra_kwargs=dict(
            top_p=0.95,
        ),
    )
]

#######################################################################
#             PART 3  Inference/Evaluation Configuration              #
#######################################################################
# infer with local runner
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        retry=0,  # Modify if needed
        task=dict(type=OpenICLInferTask),
    ),
)

# eval with local runner
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=10),
    runner=dict(type=LocalRunner,
                max_num_workers=16,
                task=dict(type=OpenICLEvalTask)),
)

#######################################################################
#                     PART 4  Utils Configuration                     #
#######################################################################
work_dir = './outputs/oc_rollout_eval'
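# Illustrative sketch, not part of the original example: setting `n` on a
# dataset asks for several rollouts per question. One plausible way to report
# a single number afterwards is to average the per-run accuracy. The helper
# name `average_over_runs` and its input layout are assumptions made for this
# sketch; the source does not show how OpenCompass aggregates the runs.

```python
def average_over_runs(correct_per_run):
    """Average accuracy (in percent) over repeated runs.

    correct_per_run: one list per run, each holding one boolean per question.
    """
    run_accuracies = [100.0 * sum(run) / len(run) for run in correct_per_run]
    return sum(run_accuracies) / len(run_accuracies)


# Two runs over the same two questions: 50% and 100% accuracy.
mean_accuracy = average_over_runs([[True, False], [True, True]])
```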
================================================
FILE: examples/eval_modelscope_datasets.py
================================================
# Run `export DATASET_SOURCE='ModelScope'` before running this script.
from datasets import Dataset, DatasetDict
from mmengine.config import read_base
from tqdm import tqdm
with read_base():
    from opencompass.configs.datasets.agieval.agieval_gen import \
        agieval_datasets as agieval_v2_datasets  # ok
    from opencompass.configs.datasets.agieval.agieval_gen_a0c741 import \
        agieval_datasets as agieval_v1_datasets  # ok
    from opencompass.configs.datasets.ARC_c.ARC_c_clean_ppl import \
        ARC_c_datasets as ARC_c_clean_datasets  # ok
    from opencompass.configs.datasets.ARC_c.ARC_c_gen import \
        ARC_c_datasets  # ok
    from opencompass.configs.datasets.ARC_e.ARC_e_gen import \
        ARC_e_datasets  # ok
    from opencompass.configs.datasets.bbh.bbh_gen import bbh_datasets
    from opencompass.configs.datasets.ceval.ceval_clean_ppl import \
        ceval_datasets as ceval_clean_datasets  # ok
    from opencompass.configs.datasets.ceval.ceval_gen import \
        ceval_datasets  # ok
    from opencompass.configs.datasets.CLUE_afqmc.CLUE_afqmc_gen import \
        afqmc_datasets  # ok
    from opencompass.configs.datasets.CLUE_cmnli.CLUE_cmnli_gen import \
        cmnli_datasets  # ok
    from opencompass.configs.datasets.CLUE_cmnli.CLUE_cmnli_ppl import \
        cmnli_datasets as cmnli_ppl_datasets  # ok
    from opencompass.configs.datasets.CLUE_ocnli.CLUE_ocnli_gen import \
        ocnli_datasets  # ok
    from opencompass.configs.datasets.cmmlu.cmmlu_gen import \
        cmmlu_datasets  # ok
    from opencompass.configs.datasets.commonsenseqa.commonsenseqa_gen import \
        commonsenseqa_datasets  # extra handling for gpt
    from opencompass.configs.datasets.GaokaoBench.GaokaoBench_gen import \
        GaokaoBench_datasets  # ok
    from opencompass.configs.datasets.GaokaoBench.GaokaoBench_mixed import \
        GaokaoBench_datasets as GaokaoBench_mixed_datasets  # ok
    from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_4c31db import \
        GaokaoBench_datasets as GaokaoBench_no_subjective_datasets  # ok
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
        gsm8k_datasets  # ok
    from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import \
        hellaswag_datasets as hellaswag_ice_datasets  # ok
    from opencompass.configs.datasets.hellaswag.hellaswag_clean_ppl import \
        hellaswag_datasets as hellaswag_clean_datasets  # ok
    from opencompass.configs.datasets.hellaswag.hellaswag_gen import \
        hellaswag_datasets as hellaswag_v2_datasets  # ok
    from opencompass.configs.datasets.hellaswag.hellaswag_ppl_9dbb12 import \
        hellaswag_datasets as hellaswag_v1_datasets  # ok
    from opencompass.configs.datasets.hellaswag.hellaswag_ppl_a6e128 import \
        hellaswag_datasets as hellaswag_v3_datasets  # ok
    from opencompass.configs.datasets.humaneval.humaneval_gen import \
        humaneval_datasets  # ok
    from opencompass.configs.datasets.humaneval.humaneval_repeat10_gen_8e312c import \
        humaneval_datasets as humaneval_repeat10_datasets  # ok
    from opencompass.configs.datasets.lambada.lambada_gen import \
        lambada_datasets  # ok
    from opencompass.configs.datasets.lcsts.lcsts_gen import \
        lcsts_datasets  # ok
    from opencompass.configs.datasets.math.math_gen import math_datasets  # ok
    from opencompass.configs.datasets.mbpp.mbpp_gen import \
        mbpp_datasets as mbpp_v1_datasets  # ok
    from opencompass.configs.datasets.mbpp.mbpp_passk_gen_830460 import \
        mbpp_datasets as mbpp_v2_datasets  # ok
    from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_830460 import \
        sanitized_mbpp_datasets  # ok
    from opencompass.configs.datasets.mmlu.mmlu_clean_ppl import \
        mmlu_datasets as mmlu_clean_datasets  # ok
    from opencompass.configs.datasets.mmlu.mmlu_gen import mmlu_datasets  # ok
    from opencompass.configs.datasets.nq.nq_gen import nq_datasets  # ok
    from opencompass.configs.datasets.obqa.obqa_gen import obqa_datasets  # ok
    from opencompass.configs.datasets.obqa.obqa_ppl_6aac9e import \
        obqa_datasets as obqa_ppl_datasets  # ok
    from opencompass.configs.datasets.piqa.piqa_gen import \
        piqa_datasets as piqa_v2_datasets  # ok
    from opencompass.configs.datasets.piqa.piqa_ppl import \
        piqa_datasets as piqa_v1_datasets  # ok
    from opencompass.configs.datasets.piqa.piqa_ppl_0cfff2 import \
        piqa_datasets as piqa_v3_datasets  # ok
    from opencompass.configs.datasets.race.race_ppl import race_datasets  # ok
    from opencompass.configs.datasets.siqa.siqa_gen import \
        siqa_datasets as siqa_v2_datasets  # ok
    from opencompass.configs.datasets.siqa.siqa_gen_18632c import \
        siqa_datasets as siqa_v3_datasets  # ok
    from opencompass.configs.datasets.siqa.siqa_ppl_42bc6e import \
        siqa_datasets as siqa_ppl_datasets  # ok
    from opencompass.configs.datasets.storycloze.storycloze_gen import \
        storycloze_datasets  # ok
    from opencompass.configs.datasets.storycloze.storycloze_ppl import \
        storycloze_datasets as storycloze_ppl_datasets  # ok
    from opencompass.configs.datasets.strategyqa.strategyqa_gen import \
        strategyqa_datasets
    from opencompass.configs.datasets.summedits.summedits_gen import \
        summedits_datasets as summedits_v2_datasets  # ok
    from opencompass.configs.datasets.triviaqa.triviaqa_gen import \
        triviaqa_datasets  # ok
    from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_20a989 import \
        triviaqa_datasets as triviaqa_wiki_1shot_datasets  # ok
    from opencompass.configs.datasets.tydiqa.tydiqa_gen import \
        tydiqa_datasets  # ok
    from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
        winogrande_datasets as winogrande_5shot_ll_datasets  # ok
    from opencompass.configs.datasets.winogrande.winogrande_gen import \
        winogrande_datasets
    from opencompass.configs.datasets.winogrande.winogrande_ll import \
        winogrande_datasets as winogrande_ll_datasets  # ok
    from opencompass.configs.datasets.Xsum.Xsum_gen import Xsum_datasets
    from opencompass.configs.models.opt.hf_opt_125m import models

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
for d in datasets:
    d['reader_cfg'].update({'train_range': '[0:5]', 'test_range': '[0:5]'})
================================================
FILE: examples/eval_multi_prompt_demo.py
================================================
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
with read_base():
    from opencompass.configs.datasets.winogrande.winogrande_gen_a027b6 import \
        winogrande_datasets

datasets = [*winogrande_datasets]

_meta_template = dict(round=[
    dict(role='HUMAN', begin='<|User|>:', end='\n'),
    dict(role='BOT', begin='<|Bot|>:', end='\n', generate=True),
], )

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='internlm-chat-7b-hf',
        path='internlm/internlm-chat-7b',
        tokenizer_path='internlm/internlm-chat-7b',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        meta_template=_meta_template,
        model_kwargs=dict(
            trust_remote_code=True,
            device_map='auto',
        ),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]

_winogrande_all = [d['abbr'] for d in winogrande_datasets]

summarizer = dict(summary_groups=[
    {
        'name': 'winogrande',
        'subsets': _winogrande_all
    },
    {
        'name': 'winogrande_std',
        'subsets': _winogrande_all,
        'std': True
    },
])
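# Illustrative sketch, not part of the original example: the two summary
# groups above report the same subsets twice, once as an average and once
# with 'std': True. The helper name `summarize_group` is hypothetical, the
# scores are made up for the sketch, and the use of the population standard
# deviation is an assumption about what the summarizer computes.

```python
import statistics


def summarize_group(scores, with_std=False):
    """Reduce per-prompt scores to their mean, or to their spread."""
    values = list(scores.values())
    if with_std:
        # Population standard deviation across the prompt variants.
        return statistics.pstdev(values)
    return sum(values) / len(values)


# Made-up scores for two prompt variants of the same dataset.
demo_scores = {'winogrande_prompt_0': 60.0, 'winogrande_prompt_1': 70.0}
group_mean = summarize_group(demo_scores)
group_std = summarize_group(demo_scores, with_std=True)
```

# A small spread across prompt variants suggests the model's score does not
# depend much on how the question is phrased.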
================================================
FILE: examples/eval_musr.py
================================================
import os.path as osp
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.musr.musr_gen_3c6e15 import musr_datasets
    from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import \
        models as lmdeploy_glm4_9b_chat_model
    from opencompass.configs.models.gemma.lmdeploy_gemma_9b_it import \
        models as lmdeploy_gemma_9b_it_model
    from opencompass.configs.models.gemma.lmdeploy_gemma_27b_it import \
        models as lmdeploy_gemma_27b_it_model
    # from opencompass.configs.models.hf_internlm.hf_internlm2_5_1_8b_chat import models
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as lmdeploy_internlm2_5_7b_chat_model
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
        models as lmdeploy_llama3_1_8b_instruct_model
    from opencompass.configs.models.mistral.lmdeploy_ministral_8b_instruct_2410 import \
        models as lmdeploy_ministral_8b_instruct_2410_model
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import \
        models as lmdeploy_qwen2_5_7b_instruct_model
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import \
        models as lmdeploy_qwen2_5_14b_instruct_model
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import \
        models as lmdeploy_qwen2_5_32b_instruct_model
    from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b_chat import \
        models as lmdeploy_yi_1_5_9b_chat_model
    from opencompass.configs.summarizers.groups.musr_average import summarizer

datasets = [*musr_datasets]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

base_exp_dir = 'outputs/musr/'
work_dir = osp.join(base_exp_dir, 'musr_eval')
================================================
FILE: examples/eval_needlebench_v2.py
================================================
from mmengine.config import read_base
# we use mmengine.config to import other config files
with read_base():
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_32k. Adjust the configuration to use 4k, 32k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_32k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_32k_summarizer as summarizer

    # Only evaluate the original "needle in a haystack" test in needlebench_32k
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_zh_datasets, needlebench_en_datasets
    from opencompass.configs.summarizers.needlebench import needlebench_v2_32k_summarizer as summarizer

    # Evaluate the Ancestral Tracing Challenge (ATC)
    # from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_datasets
    # ATC uses the default summarizer, so no summarizer import is needed

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure InternLM2-7B model can receive the full long text; for other models, adjust according to their supported maximum sequence length.
    m['max_out_len'] = 4096

models = internlm2_chat_7b

work_dir = './outputs/needlebench'
================================================
FILE: examples/eval_qwen3.py
================================================
import os.path as osp
from opencompass.models import OpenAISDK
from mmengine.config import read_base
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.runners import LocalRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
with read_base():
    from opencompass.configs.datasets.aime2024.aime2024_cascade_eval_gen_5e9f4f import aime2024_datasets
    from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets
    from opencompass.configs.datasets.math.math_500_cascade_eval_gen_6ff468 import math_datasets

#######################################################################
#                          PART 0  Meta Info                          #
#######################################################################
api_meta_template = dict(round=[
    dict(role='HUMAN', api_role='HUMAN'),
    dict(role='BOT', api_role='BOT', generate=True),
],
)

judge_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',
    key='sk-1234',
    openai_api_base=[
        'http://x.x.x.x:4000/v1',
    ],
    meta_template=api_meta_template,
    query_per_second=8,
    batch_size=256,
    temperature=0.001,
    # max_completion_tokens=32768,
    tokenizer_path='gpt-4o-2024-05-13',
    # verbose=True,
    max_out_len=16384,
    max_seq_len=32768,
    # max_seq_len=49152,
    mode='mid',
    retry=10
)
#######################################################################
#                        PART 1  Datasets List                        #
#######################################################################
repeated_info = [
    (math_datasets, 4),
    (aime2024_datasets, 32),
    (aime2025_datasets, 32),
]

for datasets_, num in repeated_info:
    for dataset_ in datasets_:
        dataset_['n'] = num

datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

for item in datasets:
    item['infer_cfg']['inferencer']['max_out_len'] = 32768
    try:
        if 'judge_cfg' in item['eval_cfg']['evaluator']:
            item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
        elif 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
            item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
    except (KeyError, TypeError):
        # Datasets without an LLM judge have no judge_cfg slot; skip them.
        pass
#######################################################################
#                     PART 2  Dataset Summarizer                      #
#######################################################################
summarizer = dict(
    dataset_abbrs=[
        'MATH',
        ['math_prm800k_500', 'accuracy (4 runs average)'],
        ['aime2024', 'accuracy (32 runs average)'],
        ['aime2025', 'accuracy (32 runs average)'],
        ['livemathbench_hard', 'naive_average'],
        ['OlympiadBenchMath', 'accuracy'],
        ['olymmath', 'naive_average'],
    ],
    summary_groups=sum(
        [v for k, v in locals().items() if k.endswith('_summary_groups')], []
    ),
)

#######################################################################
#                         PART 3  Models List                         #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models += [
    dict(
        abbr='Qwen_Qwen3-235B-A22B',
        type=OpenAISDK,
        path='Qwen/Qwen3-235B-A22B',
        key='sk-admin',
        openai_api_base=[
            'http://106.15.231.215:40007/v1/',
        ],
        meta_template=dict(
            # begin=dict(role='SYSTEM', api_role='SYSTEM', prompt=''),
            round=[
                dict(role='HUMAN', api_role='HUMAN'),
                # XXX: all system roles are mapped to human on purpose
                dict(role='BOT', api_role='BOT', generate=True),
            ]
        ),
        query_per_second=16,
        batch_size=128,
        # batch_size=1,
        temperature=0.6,
        # max_completion_tokens=32768,
        tokenizer_path='gpt-4',
        # verbose=True,
        max_out_len=32768,
        max_seq_len=32768,
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    ),
]

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)

base_exp_dir = 'outputs/qwen3_reasoning'
work_dir = osp.join(base_exp_dir, 'chat_objective')
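# Illustrative sketch, not part of the original example: the judge override
# loop in PART 1 looks for a `judge_cfg` slot either directly on the
# evaluator or one level down under `llm_evaluator`. The same lookup can be
# written without try/except. The helper name `set_judge_cfg` is
# hypothetical, and plain dicts stand in for the real dataset configs.

```python
def set_judge_cfg(dataset, judge_cfg):
    """Point the dataset's evaluator at judge_cfg.

    Returns True if a judge_cfg slot was found and overwritten.
    """
    evaluator = dataset.get('eval_cfg', {}).get('evaluator', {})
    if 'judge_cfg' in evaluator:
        evaluator['judge_cfg'] = judge_cfg
        return True
    llm_evaluator = evaluator.get('llm_evaluator', {})
    if 'judge_cfg' in llm_evaluator:
        llm_evaluator['judge_cfg'] = judge_cfg
        return True
    return False


demo_judge = dict(abbr='demo-judge')
direct = dict(eval_cfg=dict(evaluator=dict(judge_cfg=dict())))
nested = dict(eval_cfg=dict(evaluator=dict(llm_evaluator=dict(judge_cfg=dict()))))
rule_based = dict(eval_cfg=dict(evaluator=dict(type='SomeRuleEvaluator')))

results = [set_judge_cfg(d, demo_judge) for d in (direct, nested, rule_based)]
```

# The boolean return value makes it easy to log which datasets were left
# without a judge, something the silent `pass` in the loop cannot show.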
================================================
FILE: examples/eval_qwen_7b.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.collections.leaderboard.qwen import \
        datasets
    from opencompass.configs.models.qwen.hf_qwen_7b import models
    from opencompass.configs.summarizers.leaderboard import summarizer
'''
dataset version metric mode qwen-7b-hf
-------------------------------------- --------- ---------------- ------ ------------
--------- 考试 Exam --------- - - - -
ceval - naive_average ppl 58.65
agieval - naive_average mixed 40.49
mmlu - naive_average ppl 57.78
cmmlu - naive_average ppl 58.57
GaokaoBench - weighted_average mixed 51.76
ARC-c 72cf91 accuracy gen 83.73
ARC-e 72cf91 accuracy gen 90.65
--------- 语言 Language --------- - - - -
WiC ce62e6 accuracy ppl 51.10
chid-dev 25f3d3 accuracy ppl 86.63
afqmc-dev cc328c accuracy ppl 69.00
WSC 678cb5 accuracy ppl 63.46
tydiqa-goldp - naive_average gen 19.98
flores_100 - naive_average gen 3.20
--------- 知识 Knowledge --------- - - - -
BoolQ 463fee accuracy ppl 83.00
commonsense_qa 0d8e25 accuracy ppl 67.49
triviaqa b6904f score gen 40.45
nq b6904f score gen 14.16
--------- 理解 Understanding --------- - - - -
C3 e6778d accuracy gen 75.29
race-middle 73bdec accuracy ppl 90.53
race-high 73bdec accuracy ppl 87.71
openbookqa_fact fa871c accuracy gen 92.20
csl_dev 3c4211 accuracy ppl 56.25
lcsts 0b3969 rouge1 gen 12.38
Xsum 207e69 rouge1 gen 36.00
eprstmt-dev 101429 accuracy gen 89.38
lambada de1af2 accuracy gen 67.88
--------- 推理 Reasoning --------- - - - -
cmnli 15e783 accuracy ppl 54.85
ocnli 1471e7 accuracy gen 42.34
AX_b 793c72 accuracy gen 58.61
AX_g c4c886 accuracy gen 69.10
RTE c4c886 accuracy gen 57.76
COPA 59f42c accuracy gen 88.00
ReCoRD 3e0689 score gen 27.78
hellaswag 06a1e2 accuracy gen 92.47
piqa 24369d accuracy gen 78.02
siqa ea30d1 accuracy ppl 75.03
math 2c0b9e accuracy gen 11.06
gsm8k 4c7f6e accuracy gen 50.87
drop 53a0a7 score gen 44.95
openai_humaneval dd0dff humaneval_pass@1 gen 23.78
mbpp 60ca11 score gen 31.20
bbh - naive_average gen 40.03
'''
================================================
FILE: examples/eval_qwen_7b_chat.py
================================================
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.collections.leaderboard.qwen_chat import \
        datasets
    from opencompass.configs.models.qwen.hf_qwen_7b_chat import models
    from opencompass.configs.summarizers.leaderboard import summarizer
'''
dataset version metric mode qwen-7b-chat-hf
-------------------------------------- --------- ---------------- ------ -----------------
--------- 考试 Exam --------- - - - -
ceval - naive_average gen 56.07
agieval - naive_average mixed 39.51
mmlu - naive_average gen 53.49
cmmlu - naive_average gen 55.29
GaokaoBench - weighted_average gen 48.01
ARC-c ca1e8e accuracy ppl 74.92
ARC-e ca1e8e accuracy ppl 85.71
--------- 语言 Language --------- - - - -
WiC efbd01 accuracy gen 51.41
chid-dev 25f3d3 accuracy ppl 77.72
afqmc-dev 4a1636 accuracy gen 69.00
WSC 678cb5 accuracy ppl 67.31
tydiqa-goldp - naive_average gen 15.32
flores_100 - naive_average gen 10.00
--------- 知识 Knowledge --------- - - - -
BoolQ 463fee accuracy ppl 83.18
commonsense_qa ddaabf accuracy gen 76.41
triviaqa b6904f score gen 43.25
nq 23dc1a score gen 16.26
--------- 理解 Understanding --------- - - - -
C3 e6778d accuracy gen 81.53
race-middle e0908b accuracy gen 83.01
race-high e0908b accuracy gen 77.79
openbookqa_fact 49689a accuracy ppl 86.40
csl_dev 3c4211 accuracy ppl 64.38
lcsts 0b3969 rouge1 gen 12.75
Xsum 207e69 rouge1 gen 20.21
eprstmt-dev ed0c5d accuracy ppl 85.00
lambada de1af2 accuracy gen 59.19
--------- 推理 Reasoning --------- - - - -
cmnli 15e783 accuracy ppl 48.08
ocnli 15e783 accuracy ppl 51.40
AX_b 689df1 accuracy ppl 65.67
AX_g 808a19 accuracy ppl 76.12
RTE 808a19 accuracy ppl 68.95
COPA 59f42c accuracy gen 92.00
ReCoRD 6f7cfc score gen 0.16
hellaswag 8d79e0 accuracy ppl 69.28
piqa 34eee7 accuracy ppl 72.20
siqa ea30d1 accuracy ppl 72.88
math 2c0b9e accuracy gen 7.84
gsm8k 4c7f6e accuracy gen 45.41
drop 53a0a7 score gen 39.62
openai_humaneval dd0dff humaneval_pass@1 gen 10.98
mbpp 60ca11 score gen 20.60
bbh - naive_average gen 42.61
'''
================================================
FILE: examples/eval_qwen_7b_chat_lawbench.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.lawbench.lawbench_one_shot_gen_002588 import \
lawbench_datasets as lawbench_one_shot_datasets
from opencompass.configs.datasets.lawbench.lawbench_zero_shot_gen_002588 import \
lawbench_datasets as lawbench_zero_shot_datasets
from opencompass.configs.models.qwen.hf_qwen_7b_chat import models
from opencompass.configs.summarizers.lawbench import summarizer
datasets = lawbench_zero_shot_datasets + lawbench_one_shot_datasets
for d in datasets:
    d['infer_cfg']['inferencer']['save_every'] = 1
================================================
FILE: examples/eval_rewardbench.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.judge.rewardbench import get_rewardbench_datasets
from opencompass.configs.summarizers.rewardbench import summarizer
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner, NumWorkerPartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner, DLCRunner, VOLCRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
datasets = [*get_rewardbench_datasets]
from opencompass.models import TurboMindModelwithChatTemplate
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='qwen-7b-hf',
path='Qwen/Qwen-7B',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=72,
task=dict(type=OpenICLInferTask),
),
)
work_dir = './outputs/rewardbench/'
================================================
FILE: examples/eval_rmb.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.judge.rmb import get_rmb_dataset
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner, NumWorkerPartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner, DLCRunner, VOLCRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
datasets = [*get_rmb_dataset]
from opencompass.models import TurboMindModelwithChatTemplate
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='qwen-7b-hf',
path='Qwen/Qwen-7B',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
),
]
infer = dict(
# partitioner=dict(type=NaivePartitioner),
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=72,
task=dict(type=OpenICLInferTask),
),
)
work_dir = './outputs/rmb/'
================================================
FILE: examples/eval_ruler.py
================================================
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
with read_base():
from opencompass.configs.datasets.ruler.ruler_cwe_gen import cwe_datasets # CWE
from opencompass.configs.datasets.ruler.ruler_fwe_gen import fwe_datasets # FWE
from opencompass.configs.datasets.ruler.ruler_niah_gen import niah_datasets # Niah
from opencompass.configs.datasets.ruler.ruler_qa_gen import qa_datasets # QA
from opencompass.configs.datasets.ruler.ruler_vt_gen import vt_datasets # VT
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat_1m import (
models as internlm2_5_7b_chat_1m,
)
from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import (
models as llama3_8b_instruct_model,
)
from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import (
models as qwen2_7b_instruct_model,
)
from opencompass.configs.summarizers.groups.ruler import ruler_summary_groups
import_datasets = sum(
[niah_datasets, vt_datasets, fwe_datasets, cwe_datasets, qa_datasets], [])
# Evaluation config
NUM_SAMPLES = 500
# Change the context lengths to be tested
max_seq_lens = [1024 * 4, 1024 * 8, 1024 * 16, 1024 * 32]
abbr_suffixs = ['4k', '8k', '16k', '32k']
work_dir = './outputs/ruler'
# Model Settings
qwen2_7b_instruct_model[0]['max_seq_len'] = 33792
qwen2_7b_instruct_model[0]['engine_config']['session_len'] = 33792
qwen2_7b_instruct_model[0]['engine_config']['tp'] = 2
qwen2_7b_instruct_model[0]['run_cfg']['num_gpus'] = 2
llama3_8b_instruct_model[0]['max_seq_len'] = 33792
llama3_8b_instruct_model[0]['engine_config']['session_len'] = 33792
llama3_8b_instruct_model[0]['engine_config']['tp'] = 2
llama3_8b_instruct_model[0]['run_cfg']['num_gpus'] = 2
model_settings = [
[qwen2_7b_instruct_model[0], 'Qwen/Qwen2-7B-Instruct'],
[llama3_8b_instruct_model[0], 'meta-llama/Meta-Llama-3-8B-Instruct'],
[internlm2_5_7b_chat_1m[0], 'internlm/internlm2_5-7b-chat-1m'],
]
# Dataset Model Combination
datasets = []
models = []
model_dataset_combinations = []
# Different seq length
for max_seq_len, abbr_suffix in zip(max_seq_lens, abbr_suffixs):
    for model, model_path in model_settings:
        _tmp_datasets = []
        for dataset in import_datasets:
            tmp_dataset = dataset.deepcopy()
            tmp_dataset['tokenizer_model'] = model_path
            tmp_dataset['abbr'] = tmp_dataset['abbr'] + '_' + abbr_suffix
            tmp_dataset['num_samples'] = NUM_SAMPLES
            tmp_dataset['max_seq_length'] = max_seq_len
            _tmp_datasets.append(tmp_dataset)
        model_dataset_combinations.append(
            dict(models=[model], datasets=_tmp_datasets))
        models.append(model)
        datasets.extend(_tmp_datasets)
infer = dict(
partitioner=dict(type=NumWorkerPartitioner),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask),
retry=5),
)
eval = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=32,
task=dict(type=OpenICLEvalTask)),
)
summarizer = dict(
dataset_abbrs=abbr_suffixs,
summary_groups=sum([ruler_summary_groups], []),
)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# dataset version metric mode qwen2-7b-instruct-turbomind llama-3-8b-instruct-turbomind internlm2_5-7b-chat-1m-turbomind
# --------- --------- ------------- ------ ----------------------------- ------------------------------- ----------------------------------
# 4k - naive_average gen 93.66 93.48 91.20
# 8k - naive_average gen 88.38 89.95 89.07
# 16k - naive_average gen 84.27 0.14 87.61
# 32k - naive_average gen 81.36 0.00 84.59
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
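# A minimal, self-contained sketch of the expansion performed by the nested
# loops in this config, using plain dicts in place of the real model and
# dataset configs. The names `expand_combinations`, `toy_models` and
# `toy_datasets` are hypothetical and exist only for illustration.

```python
from copy import deepcopy


def expand_combinations(model_settings, base_datasets, max_seq_lens,
                        suffixes, num_samples):
    """Build one dataset copy per (context length, model, dataset) triple."""
    models, datasets, combinations = [], [], []
    for max_seq_len, suffix in zip(max_seq_lens, suffixes):
        for model, model_path in model_settings:
            per_model = []
            for dataset in base_datasets:
                item = deepcopy(dataset)  # never mutate the shared base config
                item['tokenizer_model'] = model_path
                item['abbr'] = f"{item['abbr']}_{suffix}"
                item['num_samples'] = num_samples
                item['max_seq_length'] = max_seq_len
                per_model.append(item)
            # Pair this model only with the datasets tokenized for it.
            combinations.append(dict(models=[model], datasets=per_model))
            models.append(model)
            datasets.extend(per_model)
    return models, datasets, combinations


# Toy stand-ins (assumptions, not real OpenCompass configs).
toy_models = [
    [{'abbr': 'model-a'}, 'org/model-a'],
    [{'abbr': 'model-b'}, 'org/model-b'],
]
toy_datasets = [{'abbr': 'ruler_niah'}, {'abbr': 'ruler_vt'}]

expanded_models, expanded_datasets, combos = expand_combinations(
    toy_models, toy_datasets, [4096, 8192], ['4k', '8k'], num_samples=500)
print(len(expanded_models), len(expanded_datasets), len(combos))
```

# As in the loops above, each model is appended once per context length, and
# every dataset copy carries the tokenizer path of the model it is paired with.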
================================================
FILE: examples/eval_ruler_fix_tokenizer.py
================================================
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
with read_base():
from opencompass.configs.datasets.ruler.ruler_combined_gen import \
ruler_combined_datasets
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat_1m import \
models as internlm2_5_7b_chat_1m
from opencompass.configs.summarizers.groups.ruler import \
ruler_summary_groups
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
models = internlm2_5_7b_chat_1m
work_dir = './outputs/ruler'
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=2),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask),
retry=5),
)
eval = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=32,
task=dict(type=OpenICLEvalTask)),
)
summarizer = dict(
dataset_abbrs=['ruler_4k', 'ruler_8k', 'ruler_16k', 'ruler_32k'],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)
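# The `sum(... locals().items() ... endswith(...))` idiom used in this config
# gathers every imported list whose variable name ends with a given suffix.
# The sketch below shows the same idea on an explicit dict; the helper
# `collect_by_suffix` and the `toy_namespace` contents are hypothetical.

```python
def collect_by_suffix(namespace, suffix):
    """Concatenate every list in `namespace` whose name ends with `suffix`."""
    collected = []
    for name, value in namespace.items():
        if name.endswith(suffix):
            collected.extend(value)
    return collected


# Hypothetical stand-in for the config module's local namespace.
toy_namespace = {
    'ruler_combined_datasets': [{'abbr': 'ruler_4k'}, {'abbr': 'ruler_8k'}],
    'extra_datasets': [{'abbr': 'extra'}],
    'ruler_summary_groups': [{'name': 'ruler_4k'}],
    'work_dir': './outputs/ruler',
}

collected_datasets = collect_by_suffix(toy_namespace, '_datasets')
collected_groups = collect_by_suffix(toy_namespace, '_summary_groups')
print([d['abbr'] for d in collected_datasets])
```

# Because matching is purely name-based, any variable that happens to end in
# `_datasets` is picked up, so imported names should be chosen with care.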
================================================
FILE: examples/eval_rwkv5_3b.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.collections.base_medium_llama import \
datasets
from opencompass.configs.models.rwkv.rwkv5_3b import models
from opencompass.configs.summarizers.leaderboard import summarizer
================================================
FILE: examples/eval_scireasoner.py
================================================
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
with read_base():
# If you want to evaluate the full scireasoner dataset (more than one million samples)
from opencompass.configs.datasets.SciReasoner.scireasoner_gen import scireasoner_full_datasets
# If you only want to evaluate the miniset
from opencompass.configs.datasets.SciReasoner.scireasoner_gen import scireasoner_mini_datasets
from opencompass.configs.summarizers.scireasoner import SciReasonerSummarizer
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
summarizer = dict(
type=SciReasonerSummarizer,
mini_set=False, # Set to True when evaluating the miniset
show_details=False # Whether to show the detailed results for each subset
)
system_prompt = [
dict(
role='SYSTEM',
prompt='You are a professional science expert, able to reason across science fields. You answer scientific questions by integrating theory, empirical evidence, and quantitative reasoning. Provide responses that are accurate, well-justified, and as concise as possible, and clearly distinguish established facts from assumptions, approximations, and remaining uncertainties.',
),
]
judge_cfg = () # Configure your judge model here.
for item in datasets:
    item['infer_cfg']['prompt_template']['template']['round'] = system_prompt + item['infer_cfg']['prompt_template']['template']['round']
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
    elif 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
        item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(
type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask),
),
)
eval = dict(
partitioner=dict(type=NaivePartitioner, n=10),
runner=dict(
type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLEvalTask)
),
)
work_dir = './outputs/eval_scireasoner'
================================================
FILE: examples/eval_simpleqa.py
================================================
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import DefaultSubjectiveSummarizer
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.datasets.SimpleQA.simpleqa_gen import \
simpleqa_datasets
from opencompass.configs.models.openai.gpt_4o_2024_05_13 import \
models as gpt_4o_2024_05_13_model
models = gpt_4o_2024_05_13_model # model for generation
judge_models = gpt_4o_2024_05_13_model # model for evaluation
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
summarizer = dict(type=DefaultSubjectiveSummarizer)
# -------------Inference Stage ----------------------------------------
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask)),
)
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=256,
task=dict(type=SubjectiveEvalTask)),
)
================================================
FILE: examples/eval_subjective.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import alpacav2_datasets
from opencompass.configs.datasets.subjective.compassarena.compassarena_compare import compassarena_datasets
from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare import arenahard_datasets
from opencompass.configs.datasets.subjective.compassbench.compassbench_compare import compassbench_datasets
from opencompass.configs.datasets.subjective.fofo.fofo_judge import fofo_datasets
from opencompass.configs.datasets.subjective.wildbench.wildbench_pair_judge import wildbench_datasets
from opencompass.configs.datasets.subjective.multiround.mtbench_single_judge_diff_temp import mtbench_datasets
from opencompass.configs.datasets.subjective.multiround.mtbench101_judge import mtbench101_datasets
from opencompass.models import (HuggingFace, HuggingFaceCausalLM,
HuggingFaceChatGLM3, OpenAI)
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_num_worker import \
SubjectiveNumWorkerPartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.summarizers import SubjectiveSummarizer
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
# -------------Inference Stage ----------------------------------------
# For subjective evaluation, sampling (`do_sample`) is usually enabled for the models
models = [
dict(
type=HuggingFaceChatGLM3,
abbr='chatglm3-6b-hf',
path='THUDM/chatglm3-6b',
tokenizer_path='THUDM/chatglm3-6b',
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
generation_kwargs=dict(
do_sample=True, # For subjective evaluation, we suggest enabling do_sample when running model inference.
),
meta_template=api_meta_template,
max_out_len=2048,
max_seq_len=4096,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]
datasets = [
*alignbench_datasets, *alpacav2_datasets, *arenahard_datasets,
*compassarena_datasets, *compassbench_datasets, *fofo_datasets,
*mtbench_datasets, *mtbench101_datasets, *wildbench_datasets
] # add datasets you want
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
judge_models = [
dict(
abbr='GPT4-Turbo',
type=OpenAI,
path='gpt-4-1106-preview',
key=
'xxxx', # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
meta_template=api_meta_template,
query_per_second=16,
max_out_len=2048,
max_seq_len=2048,
batch_size=8,
temperature=0,
)
]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'
================================================
FILE: examples/eval_subjective_alpacaeval_official.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import subjective_datasets as alpacav2
from opencompass.models import (HuggingFace, HuggingFaceCausalLM,
HuggingFaceChatGLM3)
from opencompass.models.openai_api import OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.summarizers import AlpacaSummarizer
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.outer_eval.alpacaeval import AlpacaEvalTask
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
# To run this config, make sure `alpaca-eval==0.6` and `scikit-learn==1.5` are installed
# -------------Inference Stage ----------------------------------------
# For subjective evaluation, sampling (`do_sample`) is usually enabled for the models
models = [
dict(
type=HuggingFaceChatGLM3,
abbr='chatglm3-6b',
path='THUDM/chatglm3-6b',
tokenizer_path='THUDM/chatglm3-6b',
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
generation_kwargs=dict(do_sample=True, ),
meta_template=api_meta_template,
max_out_len=2048,
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]
datasets = [*alpacav2]
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
gpt4_judge = dict(
abbr='GPT4-Turbo',
path='gpt-4-1106-preview',
key=
'', # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
config='weighted_alpaca_eval_gpt4_turbo')
## ------------- Evaluation Configuration
eval = dict(partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=256,
task=dict(type=AlpacaEvalTask, judge_cfg=gpt4_judge),
))
work_dir = 'outputs/alpaca/'
================================================
FILE: examples/eval_subjective_bradleyterry.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4_bradleyterry import (
alpacav2_datasets, )
from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare_bradleyterry import (
arenahard_datasets, )
from opencompass.configs.datasets.subjective.compassarena.compassarena_compare_bradleyterry import (
compassarena_datasets, )
from opencompass.configs.datasets.subjective.wildbench.wildbench_pair_judge_bradleyterry import (
wildbench_datasets, )
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import (
models as lmdeploy_internlm2_5_7b_chat, )
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import (
models as lmdeploy_internlm2_5_20b_chat, )
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
models as lmdeploy_qwen2_5_7b_instruct, )
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
models as lmdeploy_qwen2_5_14b_instruct, )
from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import (
models as lmdeploy_qwen2_7b_instruct, )
from opencompass.models import (HuggingFace, HuggingFaceCausalLM,
HuggingFaceChatGLM3, OpenAI,
TurboMindModelwithChatTemplate)
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_num_worker import \
SubjectiveNumWorkerPartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner, SlurmSequentialRunner
from opencompass.summarizers import (CompassArenaBradleyTerrySummarizer,
SubjectiveSummarizer)
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
])
# -------------Inference Stage ----------------------------------------
# For subjective evaluation, sampling (`do_sample`) is usually enabled for the models
models = [
*lmdeploy_internlm2_5_7b_chat,
*lmdeploy_internlm2_5_20b_chat,
*lmdeploy_qwen2_5_14b_instruct,
*lmdeploy_qwen2_5_7b_instruct,
*lmdeploy_qwen2_7b_instruct,
]
datasets = [
*alpacav2_datasets,
*arenahard_datasets,
*compassarena_datasets,
*wildbench_datasets,
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
judge_models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='CompassJudger-1-32B-Instruct',
path='opencompass/CompassJudger-1-32B-Instruct',
engine_config=dict(session_len=16384, max_batch_size=16, tp=4),
gen_config=dict(top_k=1,
temperature=1e-6,
top_p=0.9,
max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=4),
)
]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
## ------------- Summary Configuration
# This step fits a Bradley-Terry model (statistical model) with an option
# to include style features and control variables based on groups
# (group variables must be available in the input dataset for each observation).
summarizer = dict(
type=CompassArenaBradleyTerrySummarizer,
rating_system='bradleyterry',
report_pred_win_rates=True,
num_bootstrap=100,
num_cpu=None,
with_control_vars=True,
normalize_style_features=False,
odds_ratio=True,
)
work_dir = 'outputs/subjective/bradleyterry'
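# For readers unfamiliar with the Bradley-Terry model mentioned in the summary
# step, the sketch below fits plain Bradley-Terry strengths from pairwise win
# counts with the classic minorization-maximization (MM) update. It is an
# illustration only: it is NOT the `CompassArenaBradleyTerrySummarizer`
# implementation and omits bootstrapping, style features and control
# variables. All names (`fit_bradley_terry`, `toy_wins`) are hypothetical.

```python
def fit_bradley_terry(wins, num_iters=200):
    """Estimate Bradley-Terry strengths with the MM update.

    `wins[(a, b)]` is the number of times `a` beat `b`.
    """
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(num_iters):
        updated = {}
        for i in players:
            total_wins = sum(n for (winner, _), n in wins.items()
                             if winner == i)
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += games / (strength[i] + strength[j])
            # Keep the old value if this player never played a game.
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


def predicted_win_rate(strength, a, b):
    """Model probability that `a` beats `b`."""
    return strength[a] / (strength[a] + strength[b])


# Toy data (assumption): model-a wins 3 of 4 comparisons against model-b.
toy_wins = {('model-a', 'model-b'): 3, ('model-b', 'model-a'): 1}
toy_strength = fit_bradley_terry(toy_wins)
print(round(predicted_win_rate(toy_strength, 'model-a', 'model-b'), 3))
```

# With only two models the fitted win rate simply matches the observed win
# fraction; the model becomes informative once several models are compared
# through overlapping pairings.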
================================================
FILE: examples/eval_teval.py
================================================
from copy import deepcopy
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.teval.teval_en_gen_1ac254 import \
teval_datasets as teval_en_datasets
from opencompass.configs.datasets.teval.teval_zh_gen_1ac254 import \
teval_datasets as teval_zh_datasets
from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
models as hf_internlm2_chat_7b_model
from opencompass.configs.models.hf_llama.hf_llama2_7b_chat import \
models as hf_llama2_7b_chat_model
from opencompass.configs.models.qwen.hf_qwen_7b_chat import \
models as hf_qwen_7b_chat_model
from opencompass.configs.summarizers.teval import summarizer
meta_template_system_patches = {
'internlm2-chat-7b-hf':
dict(role='SYSTEM', begin='<|im_start|>system\n', end='<|im_end|>\n'),
'internlm2-chat-20b-hf':
dict(role='SYSTEM', begin='<|im_start|>system\n', end='<|im_end|>\n'),
}
_origin_models = sum([v for k, v in locals().items() if k.endswith('_model')],
[])
models = []
for m in _origin_models:
    m = deepcopy(m)
    if 'meta_template' in m and 'round' in m['meta_template']:
        round = m['meta_template']['round']
        if all(r['role'].upper() != 'SYSTEM'
               for r in round):  # no system round
            if m['abbr'] in meta_template_system_patches:
                system_round = meta_template_system_patches[m['abbr']]
            else:
                system_round = [
                    r for r in round if r['role'].upper() == 'HUMAN'
                ][0]
                system_round = deepcopy(system_round)
                system_round['role'] = 'SYSTEM'
            m['meta_template']['round'].append(system_round)
    else:
        raise ValueError(f'no meta_template.round in {m.get("abbr", None)}')
    print(
        f'model {m["abbr"]} is using the following meta_template: {m["meta_template"]}'
    )
    models.append(m)
datasets = teval_en_datasets + teval_zh_datasets
work_dir = './outputs/teval'
"""
dataset version metric mode qwen-7b-chat-hf internlm2-chat-7b-hf llama-2-7b-chat-hf
------------------------------------------- --------- -------------- ------- ----------------- ---------------------- --------------------
teval - naive_average unknown 57.69 78.18 36.63
teval-instruct_v1 10482d string_metric unknown 28.83 98.08 50.27
teval-instruct_v1 10482d json_metric unknown 94.32 97.08 0.15
teval-plan_str_v1 10482d f1_score unknown 66.24 84.12 45.72
teval-plan_json_v1 10482d f1_score unknown 63.62 77.71 19.95
teval-reason_str_v1 10482d thought unknown 54.14 63.58 44.92
teval-reason_retrieve_understand_json_v1 10482d thought unknown 33.77 54.72 21.49
teval-retrieve_str_v1 10482d name unknown 73.89 85.28 60.6
teval-reason_retrieve_understand_json_v1 10482d name unknown 31.15 68.97 15.34
teval-understand_str_v1 10482d args unknown 77.76 93.03 65.61
teval-reason_retrieve_understand_json_v1 10482d args unknown 44.16 72.23 26.84
teval-review_str_v1 10482d review_quality unknown 62.22 71.66 44.35
teval_zh - naive_average unknown 61.31 75.01 32.33
teval-instruct_v1_zh 10482d string_metric unknown 88.69 98.19 23.64
teval-instruct_v1_zh 10482d json_metric unknown 75.77 96.62 0.89
teval-plan_str_v1_zh 10482d f1_score unknown 62.43 70.69 47.82
teval-plan_json_v1_zh 10482d f1_score unknown 61.46 68.95 15.87
teval-reason_str_v1_zh 10482d thought unknown 59.43 68.14 46.96
teval-reason_retrieve_understand_json_v1_zh 10482d thought unknown 39.19 60.37 23.91
teval-retrieve_str_v1_zh 10482d name unknown 69.41 84.22 54.44
teval-reason_retrieve_understand_json_v1_zh 10482d name unknown 32.87 70.46 14.16
teval-understand_str_v1_zh 10482d args unknown 84.39 88.62 77.29
teval-reason_retrieve_understand_json_v1_zh 10482d args unknown 48.71 72.71 28.83
teval-review_str_v1_zh 10482d review_quality unknown 56.67 60.57 27.1
"""
================================================
FILE: examples/eval_with_model_dataset_combinations.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import \
ceval_datasets as chat_ceval_datasets
from opencompass.configs.datasets.ceval.ceval_ppl_578f8d import \
ceval_datasets as base_ceval_datasets
from opencompass.configs.internal.clusters.slurm import eval, infer
from opencompass.configs.models.qwen.hf_qwen_7b import \
models as hf_qwen_7b_base_models
from opencompass.configs.models.qwen.hf_qwen_7b_chat import \
models as hf_qwen_7b_chat_models
# from opencompass.configs.internal.clusters.slurm import infer_split as infer, eval
# from opencompass.configs.internal.clusters.slurm import infer_size as infer, eval
# from opencompass.configs.internal.clusters.slurm import infer_size_split as infer, eval
base_ceval_datasets = base_ceval_datasets[:1]
chat_ceval_datasets = chat_ceval_datasets[-1:]
# If you do not want to run all the combinations of models and datasets, you
# can specify the combinations you want to run here. This is useful when you
# deliberately want to skip some subset of the combinations.
# Models and datasets in different combinations are recommended to be disjoint
# (different `abbr` in model & dataset configs), as the overlapping case has
# not been tested thoroughly.
model_dataset_combinations = [
dict(models=hf_qwen_7b_base_models, datasets=base_ceval_datasets),
dict(models=hf_qwen_7b_chat_models, datasets=chat_ceval_datasets),
# dict(models=[model_cfg1, ...], datasets=[dataset_cfg1, ...]),
]
# The union of models and datasets in model_dataset_combinations should be
# stored in the `models` and `datasets` variables below. Otherwise, modules
# such as the summarizer will miss some information.
models = [*hf_qwen_7b_base_models, *hf_qwen_7b_chat_models]
datasets = [*base_ceval_datasets, *chat_ceval_datasets]
work_dir = './outputs/default/mdcomb/'
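# The union described above is written out by hand in this config. As a
# hedged sketch, it could also be derived from the combinations themselves,
# keeping the first config seen for each `abbr`. The helper `union_by_abbr`
# and the toy configs are hypothetical and not part of OpenCompass.

```python
def union_by_abbr(combinations, key):
    """Collect configs under `key`, keeping the first config per `abbr`."""
    seen = {}
    for combo in combinations:
        for cfg in combo[key]:
            seen.setdefault(cfg['abbr'], cfg)
    return list(seen.values())


# Toy stand-ins (assumptions) for the real model and dataset configs.
toy_combinations = [
    dict(models=[{'abbr': 'base-model'}], datasets=[{'abbr': 'ceval-a'}]),
    dict(models=[{'abbr': 'chat-model'}], datasets=[{'abbr': 'ceval-b'}]),
    dict(models=[{'abbr': 'chat-model'}], datasets=[{'abbr': 'ceval-c'}]),
]

all_models = union_by_abbr(toy_combinations, 'models')
all_datasets = union_by_abbr(toy_combinations, 'datasets')
print([m['abbr'] for m in all_models], [d['abbr'] for d in all_datasets])
```

# Deduplicating on `abbr` relies on the recommendation above that distinct
# configs use distinct `abbr` values.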
"""
dataset version metric mode qwen-7b-hf qwen-7b-chat-hf
---------------------- --------- -------- ------ ------------ -----------------
ceval-computer_network 9b9417 accuracy ppl 52.63 -
ceval-physician 6e277d accuracy gen - 59.18
"""
================================================
FILE: opencompass/__init__.py
================================================
__version__ = '0.5.2'
================================================
FILE: opencompass/cli/__init__.py
================================================
================================================
FILE: opencompass/cli/main.py
================================================
# flake8: noqa
# yapf: disable
import argparse
import copy
import getpass
import os
import os.path as osp
import threading
from datetime import datetime
from mmengine.config import Config, DictAction
from opencompass.registry import PARTITIONERS, RUNNERS, build_from_cfg
from opencompass.runners import SlurmRunner
from opencompass.summarizers import DefaultSummarizer
from opencompass.utils import (HeartBeatManager, LarkReporter, get_logger,
pretty_print_config, read_from_station,
save_to_station)
from opencompass.utils.run import (fill_eval_cfg, fill_infer_cfg,
get_config_from_arg)
def _run_eval_tasks(runner, tasks):
if isinstance(tasks, list) and len(tasks) != 0 and isinstance(tasks[0],
list):
for task_part in tasks:
runner(task_part)
else:
runner(tasks)
def _is_eval_daemon(task_type) -> bool:
if isinstance(task_type, str):
return task_type.endswith('OpenICLEvalWatchTask')
return getattr(task_type, '__name__', '') == 'OpenICLEvalWatchTask'
def parse_args():
parser = argparse.ArgumentParser(description='Run an evaluation task')
parser.add_argument('config', nargs='?', help='Evaluation config file path')
# add mutually exclusive args `--slurm` `--dlc`, defaults to local runner
# if "infer" or "eval" not specified
launch_method = parser.add_mutually_exclusive_group()
launch_method.add_argument('--slurm',
action='store_true',
default=False,
help='Whether to force tasks to run with srun. '
'If True, `--partition(-p)` must be set. '
'Defaults to False')
launch_method.add_argument('--dlc',
action='store_true',
default=False,
help='Whether to force tasks to run on dlc. If '
'True, `--aliyun-cfg` must be set. Defaults'
' to False')
# Add shortcut parameters (models, datasets and summarizer)
parser.add_argument('--models', nargs='+', help='', default=None)
parser.add_argument('--datasets', nargs='+', help='', default=None)
parser.add_argument('--summarizer', help='', default=None)
# add general args
parser.add_argument('--debug',
help='Debug mode, in which the scheduler will run tasks '
'in a single process, and output will not be '
'redirected to files',
action='store_true',
default=False)
parser.add_argument('--dry-run',
help='Dry run mode, in which the scheduler will not '
'actually run the tasks, but only print the commands '
'to run',
action='store_true',
default=False)
parser.add_argument(
'-a', '--accelerator',
help='Inference accelerator; vllm and lmdeploy are supported now.',
choices=['vllm', 'lmdeploy', None],
default=None,
type=str)
parser.add_argument('-m',
'--mode',
help='Running mode. You can choose "infer" if you '
'only want the inference results, or "eval" if you '
'already have the results and want to evaluate them, '
'or "viz" if you want to visualize the results.',
choices=['all', 'infer', 'eval', 'viz'],
default='all',
type=str)
parser.add_argument('-r',
'--reuse',
nargs='?',
type=str,
const='latest',
help='Reuse previous outputs & results, and run any '
'missing jobs present in the config. If its '
'argument is not specified, the latest results in '
'the work_dir will be reused. The argument can '
'also be a specific timestamp, e.g. 20230516_144254')
parser.add_argument('-w',
'--work-dir',
help='Work path, all the outputs will be '
'saved in this path, including the slurm logs, '
'the evaluation results, the summary results, etc. '
'If not specified, the work_dir will be set to '
'outputs/default.',
default=None,
type=str)
parser.add_argument(
'--config-dir',
default='configs',
help='Use the custom config directory instead of configs/ to '
'search the configs for datasets, models and summarizers',
type=str)
parser.add_argument(
'--config-verbose',
default=False,
action='store_true',
help='Whether to print the config in verbose mode.')
parser.add_argument('-l',
'--lark',
help='Report the running status to lark bot',
action='store_true',
default=False)
parser.add_argument('--max-num-workers',
help='Max number of workers to run in parallel. '
'Will be overridden by the "max_num_workers" argument '
'in the config.',
type=int,
default=1)
parser.add_argument('--max-workers-per-gpu',
help='Max task to run in parallel on one GPU. '
'It will only be used in the local runner.',
type=int,
default=1)
parser.add_argument(
'--retry',
help='Number of retries if the job failed when using slurm or dlc. '
'Will be overridden by the "retry" argument in the config.',
type=int,
default=2)
parser.add_argument(
'--dump-eval-details',
help='Whether to dump the evaluation details, including the '
'correctness of each sample, bpb, etc. Defaults to True.',
nargs='?',
const=True,
default=True,
type=lambda x: False if x and x.lower() == 'false' else True
)
parser.add_argument('--dump-res-length',
help='dump the length of model responses',
action='store_true',
default=False)
parser.add_argument(
'--dump-extract-rate',
help='Whether to calculate and dump the answer extraction '
'rate during evaluation.',
action='store_true',
)
# for the results persistence
parser.add_argument('-sp',
'--station-path',
help='Path to your results station.',
type=str,
default=None,
)
parser.add_argument('--station-overwrite',
help='Whether to overwrite the results at station.',
action='store_true',
)
parser.add_argument(
'--read-from-station',
help='Whether to read the evaluation results from the '
'data station.',
action='store_true',
)
# for evaluation with multiple runs
parser.add_argument('--dataset-num-runs',
help='How many runs for one dataset',
type=int,
default=1,
)
parser.add_argument(
'--dump-only-message-path',
help='Where to dump message only',
type=str,
default=None,
)
# set srun args
slurm_parser = parser.add_argument_group('slurm_args')
parse_slurm_args(slurm_parser)
# set dlc args
dlc_parser = parser.add_argument_group('dlc_args')
parse_dlc_args(dlc_parser)
# set hf args
hf_parser = parser.add_argument_group('hf_args')
parse_hf_args(hf_parser)
# set custom dataset args
custom_dataset_parser = parser.add_argument_group('custom_dataset_args')
parse_custom_dataset_args(custom_dataset_parser)
args = parser.parse_args()
if args.slurm:
assert args.partition is not None, (
'--partition(-p) must be set if you want to use slurm')
if args.dlc:
assert os.path.exists(args.aliyun_cfg), (
'When launching tasks using dlc, it needs to be configured '
'in "~/.aliyun.cfg", or use "--aliyun-cfg $ALiYun-CFG_Path"'
' to specify a new path.')
return args
def parse_slurm_args(slurm_parser):
"""These args are all for slurm launch."""
slurm_parser.add_argument('-p',
'--partition',
help='Slurm partition name',
default=None,
type=str)
slurm_parser.add_argument('-q',
'--quotatype',
help='Slurm quota type',
default=None,
type=str)
slurm_parser.add_argument('--qos',
help='Slurm quality of service',
default=None,
type=str)
def parse_dlc_args(dlc_parser):
"""These args are all for dlc launch."""
dlc_parser.add_argument('--aliyun-cfg',
help='The config path for aliyun config',
default='~/.aliyun.cfg',
type=str)
def parse_hf_args(hf_parser):
"""These args are all for the quick construction of HuggingFace models."""
hf_parser.add_argument('--hf-type', type=str, choices=['base', 'chat'], default='chat', help='The type of the HuggingFace model, base or chat')
hf_parser.add_argument('--hf-path', type=str, help='The path to the HuggingFace model, e.g. "facebook/opt-125m", required')
hf_parser.add_argument('--model-kwargs', nargs='+', action=DictAction, default={}, help='The kwargs for the HuggingFace model')
hf_parser.add_argument('--tokenizer-path', type=str, help='The path to the HuggingFace tokenizer, same as --hf-path if not specified')
hf_parser.add_argument('--tokenizer-kwargs', nargs='+', action=DictAction, default={}, help='The kwargs for the tokenizer')
hf_parser.add_argument('--peft-path', type=str, help='The path to the PEFT model')
hf_parser.add_argument('--peft-kwargs', nargs='+', action=DictAction, default={}, help='The kwargs for the PEFT model')
hf_parser.add_argument('--generation-kwargs', nargs='+', action=DictAction, default={}, help='The kwargs for the generation')
hf_parser.add_argument('--max-seq-len', type=int, help='The max sequence length for the HuggingFace model')
hf_parser.add_argument('--max-out-len', type=int, default=256, help='The max output length for the HuggingFace model')
hf_parser.add_argument('--min-out-len', type=int, default=1, help='The min output length for the HuggingFace model')
hf_parser.add_argument('--batch-size', type=int, default=8, help='The batch size for the HuggingFace model')
hf_parser.add_argument('--num-gpus', type=int, default=None, help='Deprecated, please use --hf-num-gpus instead')
hf_parser.add_argument('--hf-num-gpus', type=int, default=1, help='The number of GPUs for the HuggingFace model passed via cli')
hf_parser.add_argument('--pad-token-id', type=int, help='The pad token id for the HuggingFace model')
hf_parser.add_argument('--stop-words', nargs='+', default=[], help='The stop words for the HuggingFace model')
def parse_custom_dataset_args(custom_dataset_parser):
"""These args are all for the quick construction of custom datasets."""
custom_dataset_parser.add_argument('--custom-dataset-path', type=str)
custom_dataset_parser.add_argument('--custom-dataset-meta-path', type=str)
custom_dataset_parser.add_argument('--custom-dataset-data-type',
type=str,
choices=['mcq', 'qa'])
custom_dataset_parser.add_argument('--custom-dataset-infer-method',
type=str,
choices=['gen', 'ppl'])
def main():
args = parse_args()
if args.num_gpus is not None:
raise ValueError('The `--num-gpus` argument is deprecated, please use '
'`--hf-num-gpus` to describe number of gpus used for '
'the HuggingFace model instead.')
if args.dry_run:
args.debug = True
# initialize logger
logger = get_logger(log_level='DEBUG' if args.debug else 'INFO')
cfg = get_config_from_arg(args)
if args.work_dir is not None:
cfg['work_dir'] = args.work_dir
else:
cfg.setdefault('work_dir', os.path.join('outputs', 'default'))
# cfg_time_str defaults to the current time
cfg_time_str = dir_time_str = datetime.now().strftime('%Y%m%d_%H%M%S')
if args.reuse:
if args.reuse == 'latest':
if not os.path.exists(cfg.work_dir) or not os.listdir(
cfg.work_dir):
logger.warning('No previous results to reuse!')
else:
dirs = os.listdir(cfg.work_dir)
dir_time_str = sorted(dirs)[-1]
else:
dir_time_str = args.reuse
logger.info(f'Reusing experiments from {dir_time_str}')
elif args.mode in ['eval', 'viz'] and not args.read_from_station:
raise ValueError(
'You must specify -r or --reuse, or you have to specify '
'--read-from-station and --station-path when running in eval '
'or viz mode!')
# update "actual" work_dir
cfg['work_dir'] = osp.join(cfg.work_dir, dir_time_str)
current_workdir = cfg['work_dir']
logger.info(f'Current exp folder: {current_workdir}')
os.makedirs(osp.join(cfg.work_dir, 'configs'), exist_ok=True)
# dump config
output_config_path = osp.join(cfg.work_dir, 'configs',
f'{cfg_time_str}_{os.getpid()}.py')
cfg.dump(output_config_path)
# Config is intentionally reloaded here, because initialized
# types cannot be serialized
cfg = Config.fromfile(output_config_path, format_python_code=False)
# get existing results from station
if args.read_from_station:
existing_results_list = read_from_station(cfg, args)
rs_exist_results = [comb['combination'] for comb in existing_results_list]
cfg['rs_exist_results'] = rs_exist_results
# report to lark bot if --lark is specified
if not args.lark:
cfg['lark_bot_url'] = None
elif cfg.get('lark_bot_url', None):
content = f'{getpass.getuser()}\'s task has been launched!'
LarkReporter(cfg['lark_bot_url']).post(content)
# print config if --config-verbose is specified
if args.config_verbose:
pretty_print_config(cfg)
infer_tasks = None
infer_runner = None
eval_tasks = None
eval_runner = None
eval_daemon = False
# ========================
# Setup Configuration
# ========================
if args.mode in ['all', 'infer']:
# When the user has specified --slurm or --dlc, or has not set
# "infer" in the config, we provide a default configuration
# for infer
if (args.dlc or args.slurm) and cfg.get('infer', None):
logger.warning('You have set "infer" in the config, but '
'also specified --slurm or --dlc. '
'The "infer" configuration will be overridden by '
'your runtime arguments.')
if args.dlc or args.slurm or cfg.get('infer', None) is None:
fill_infer_cfg(cfg, args)
if args.partition is not None:
if RUNNERS.get(cfg.infer.runner.type) == SlurmRunner:
cfg.infer.runner.partition = args.partition
cfg.infer.runner.quotatype = args.quotatype
else:
logger.warning('SlurmRunner is not used, so the partition '
'argument is ignored.')
if args.debug:
cfg.infer.runner.debug = True
if args.lark:
cfg.infer.runner.lark_bot_url = cfg['lark_bot_url']
cfg.infer.partitioner['out_dir'] = osp.join(cfg['work_dir'],
'predictions/')
partitioner = PARTITIONERS.build(cfg.infer.partitioner)
tasks = partitioner(cfg)
if args.dry_run:
return
runner = RUNNERS.build(cfg.infer.runner)
# Add extra attack config if exists
if hasattr(cfg, 'attack'):
for task in tasks:
cfg.attack.dataset = task.datasets[0][0].abbr
task.attack = cfg.attack
if args.dump_res_length:
for task in tasks:
task.dump_res_length = True
if args.dump_only_message_path:
for task in tasks:
task.dump_only_message_path = args.dump_only_message_path
infer_tasks = tasks
infer_runner = runner
# evaluate
if args.mode in ['all', 'eval']:
# When the user has specified --slurm or --dlc, or has not set
# "eval" in the config, we provide a default configuration
# for eval
if (args.dlc or args.slurm) and cfg.get('eval', None):
logger.warning('You have set "eval" in the config, but '
'also specified --slurm or --dlc. '
'The "eval" configuration will be overridden by '
'your runtime arguments.')
if args.dlc or args.slurm or cfg.get('eval', None) is None:
fill_eval_cfg(cfg, args)
if args.dump_eval_details:
logger.warning('Default to dump eval details, it might take extra '
'space to save all the evaluation details. '
'Set --dump-eval-details False to skip the details dump')
cfg.eval.runner.task.dump_details = True
if args.dump_extract_rate:
cfg.eval.runner.task.cal_extract_rate = True
if args.partition is not None:
if RUNNERS.get(cfg.eval.runner.type) == SlurmRunner:
cfg.eval.runner.partition = args.partition
cfg.eval.runner.quotatype = args.quotatype
else:
logger.warning('SlurmRunner is not used, so the partition '
'argument is ignored.')
if args.debug:
cfg.eval.runner.debug = True
if args.lark:
cfg.eval.runner.lark_bot_url = cfg['lark_bot_url']
cfg.eval.partitioner['out_dir'] = osp.join(cfg['work_dir'], 'results/')
partitioner = PARTITIONERS.build(cfg.eval.partitioner)
tasks = partitioner(cfg)
if args.dry_run:
return
runner = RUNNERS.build(cfg.eval.runner)
task_type = getattr(cfg.eval.runner, 'task', {}).get('type', '')
eval_daemon = _is_eval_daemon(task_type)
eval_tasks = tasks
eval_runner = runner
# =================
# Startup Runner
# =================
if infer_runner and eval_runner and eval_daemon:
heartbeat = HeartBeatManager(cfg['work_dir'])
stop_event, hb_thread = heartbeat.start_heartbeat()
eval_thread = threading.Thread(target=_run_eval_tasks,
args=(eval_runner, eval_tasks),
daemon=True)
eval_thread.start()
infer_runner(infer_tasks)
stop_event.set()
hb_thread.join()
logger.info('All infer tasks finished, stop heartbeat.')
eval_thread.join()
else:
if infer_runner is not None:
infer_runner(infer_tasks)
if eval_runner is not None:
_run_eval_tasks(eval_runner, eval_tasks)
# save to station
if args.station_path is not None or cfg.get('station_path') is not None:
save_to_station(cfg, args)
# visualize
if args.mode in ['all', 'eval', 'viz']:
summarizer_cfg = cfg.get('summarizer', {})
# For subjective summarizer
if summarizer_cfg.get('function', None):
main_summarizer_cfg = copy.deepcopy(summarizer_cfg)
grouped_datasets = {}
for dataset in cfg.datasets:
prefix = dataset['abbr'].split('_')[0]
if prefix not in grouped_datasets:
grouped_datasets[prefix] = []
grouped_datasets[prefix].append(dataset)
all_grouped_lists = []
for prefix in grouped_datasets:
all_grouped_lists.append(grouped_datasets[prefix])
dataset_score_container = []
for dataset in all_grouped_lists:
temp_cfg = copy.deepcopy(cfg)
temp_cfg.datasets = dataset
summarizer_cfg = dict(type=dataset[0]['summarizer']['type'], config=temp_cfg)
summarizer = build_from_cfg(summarizer_cfg)
dataset_score = summarizer.summarize(time_str=cfg_time_str)
if dataset_score:
dataset_score_container.append(dataset_score)
main_summarizer_cfg['config'] = cfg
main_summarizer = build_from_cfg(main_summarizer_cfg)
main_summarizer.summarize(time_str=cfg_time_str, subjective_scores=dataset_score_container)
else:
if not summarizer_cfg or summarizer_cfg.get('type', None) is None:
summarizer_cfg['type'] = DefaultSummarizer
summarizer_cfg['config'] = cfg
summarizer = build_from_cfg(summarizer_cfg)
summarizer.summarize(time_str=cfg_time_str)
if __name__ == '__main__':
main()
================================================
FILE: opencompass/configs/chatml_datasets/AMO_Bench/AMO_Bench_gen.py
================================================
datasets = [
dict(
abbr='AMO-Bench',
path='./data/amo-bench.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
n=1,
),
]
================================================
FILE: opencompass/configs/chatml_datasets/CPsyExam/CPsyExam_gen.py
================================================
datasets = [
dict(
abbr='CPsyExam',
path='./data/CPsyExam/merged_train_dev.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
n=1,
),
]
================================================
FILE: opencompass/configs/chatml_datasets/CS_Bench/CS_Bench_gen.py
================================================
subset_list = [
'test',
'valid',
]
language_list = [
'CN',
'EN',
]
datasets = []
for subset in subset_list:
for language in language_list:
datasets.append(
dict(
abbr=f'CS-Bench_{language}_{subset}',
path=f'./data/csbench/CSBench-{language}/{subset}.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
)
)
================================================
FILE: opencompass/configs/chatml_datasets/C_MHChem/C_MHChem_gen.py
================================================
datasets = [
dict(
abbr='C-MHChem',
path='./data/C-MHChem2.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
n=1,
),
]
================================================
FILE: opencompass/configs/chatml_datasets/HMMT2025/HMMT2025_gen.py
================================================
datasets = [
dict(
abbr='HMMT2025',
path='./data/hmmt2025.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
n=1,
),
]
================================================
FILE: opencompass/configs/chatml_datasets/IMO_Bench_AnswerBench/IMO_Bench_AnswerBench_gen.py
================================================
datasets = [
dict(
abbr='IMO-Bench-AnswerBench',
path='opencompass/IMO-Answer-Bench',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
n=1,
),
]
================================================
FILE: opencompass/configs/chatml_datasets/MaScQA/MaScQA_gen.py
================================================
datasets = [
dict(
abbr='MaScQA',
path='./data/MaScQA/MaScQA.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
n=1,
),
]
================================================
FILE: opencompass/configs/chatml_datasets/UGD_hard/UGD_hard_gen.py
================================================
EVAL_PROMPT = (
"You are a helpful assistant who evaluates the correctness and quality of models' outputs.\nPlease as a grading "
'expert, judge whether the final answers given by the candidates below are consistent with the standard answers, '
'that is, whether the candidates answered correctly. \n \n Here are some evaluation criteria:\n '
"1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because "
"the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the "
"standard answer according to the form of the question. Don't try to answer the original question. You can assume "
"that the standard answer is definitely correct.\n 2. Because the candidate's answer may be different from the "
'standard answer in the form of expression, before making a judgment, please understand the question and the '
"standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to "
'answer the original question.\n 3. Some answers may contain multiple items, such as multiple-choice questions, '
'multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard '
'answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate '
'needs to answer all the corresponding options or blanks correctly to be considered correct.\n 4. Some answers '
'may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a '
'textual description, as long as the meaning expressed is the same. And some formulas are expressed in different '
'ways, but they are equivalent and correct.\n 5. If the prediction is given with \\boxed{{}}, please ignore '
"the \\boxed{{}} and only judge whether the candidate's answer is consistent with the standard answer.\n\n "
'Please judge whether the following answers are consistent with the standard answer based on the above criteria. '
'Grade the predicted answer of this new question as one of:\n A: CORRECT \n B: INCORRECT\n Just return '
"the letters \"A\" or \"B\", with no text around it.\n\n Here is your task. Simply reply with either CORRECT, "
"INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer."
'\n\n\n : \n\n{question}\n\n\n\n\n '
': \n{answer}\n\n\n\n : \n{prediction}\n'
"\n\n\n \n Judging the correctness of candidates' answers:\"\n"
)
datasets = [
dict(
abbr='UGD_hard',
path='./data/UGD_hard_oc.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
prompt=EVAL_PROMPT,
),
n=1,
),
]
================================================
FILE: opencompass/configs/chatml_datasets/UGPhysics/UGPhysics_gen.py
================================================
subset_list = [
'AtomicPhysics',
'ClassicalElectromagnetism',
'ClassicalMechanics',
'Electrodynamics',
'GeometricalOptics',
'QuantumMechanics',
'Relativity',
'Solid-StatePhysics',
'StatisticalMechanics',
'SemiconductorPhysics',
'Thermodynamics',
'TheoreticalMechanics',
'WaveOptics',
]
language_list = [
'zh',
'en',
]
datasets = []
for subset in subset_list:
for language in language_list:
datasets.append(
dict(
abbr=f'UGPhysics_{subset}_{language}',
path=f'./data/ugphysics/{subset}/{language}.jsonl',
evaluator=dict(
type='llm_evaluator',
judge_cfg=dict(),
),
)
)
================================================
FILE: opencompass/configs/dataset_collections/chat_OC15.py
================================================
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import mmlu_datasets
from opencompass.configs.datasets.cmmlu.cmmlu_gen_c13365 import cmmlu_datasets
from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_4c31db import GaokaoBench_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_bc5f21 import triviaqa_datasets
from opencompass.configs.datasets.nq.nq_open_1shot_gen_2e45e5 import nq_datasets
from opencompass.configs.datasets.race.race_gen_69ee4f import race_datasets
from opencompass.configs.datasets.winogrande.winogrande_5shot_gen_b36770 import winogrande_datasets
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import hellaswag_datasets
from opencompass.configs.datasets.bbh.bbh_gen_2879b0 import bbh_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from opencompass.configs.datasets.math.math_0shot_gen_393424 import math_datasets
from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import TheoremQA_datasets
from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_830460 import sanitized_mbpp_datasets
from opencompass.configs.datasets.gpqa.gpqa_gen_4baadb import gpqa_datasets
from opencompass.configs.datasets.IFEval.IFEval_gen_3321a3 import ifeval_datasets
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
================================================
FILE: opencompass/configs/datasets/ARC_Prize_Public_Evaluation/README.md
================================================
# ARC Prize Public Evaluation
#### Overview
The spirit of ARC Prize is to open source progress towards AGI. To win prize money, you will be required to publish reproducible code/methods into the public domain.
ARC Prize measures AGI progress using the [ARC-AGI private evaluation set](https://arcprize.org/guide#private). [The leaderboard is here](https://arcprize.org/leaderboard), and the Grand Prize is unlocked once the first team reaches [at least 85%](https://arcprize.org/guide#grand-prize-goal).
Note: the private evaluation set imposes limitations on solutions (e.g. no internet access, so no GPT-4/Claude/etc.). There is a [secondary leaderboard](https://arcprize.org/leaderboard) called ARC-AGI-Pub; it measures the [public evaluation set](https://arcprize.org/guide#public-tasks) and imposes no limits, but it is not part of ARC Prize 2024 at this time.
#### Tasks
ARC-AGI tasks are a series of three to five input and output tasks followed by a final task with only the input listed. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.

Tasks are represented as JSON lists of integers. These JSON objects can also be represented visually as a grid of colors using an ARC-AGI task viewer.
A successful submission is a pixel-perfect description (color and position) of the final task's output.
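The pixel-perfect criterion can be sketched in Python as below. This is only an illustration of the idea, not the scoring code used by ARC Prize or by the evaluator in this repository; the helper name `grids_match` is hypothetical.

```python
from typing import List

Grid = List[List[int]]


def grids_match(predicted: Grid, expected: Grid) -> bool:
    """Return True only if both grids have the same shape and identical cells."""
    if len(predicted) != len(expected):
        return False
    for pred_row, exp_row in zip(predicted, expected):
        if len(pred_row) != len(exp_row):
            return False
        if any(p != e for p, e in zip(pred_row, exp_row)):
            return False
    return True


expected = [[8, 8], [8, 8]]
exact = grids_match([[8, 8], [8, 8]], expected)         # identical grid
one_cell_off = grids_match([[8, 8], [8, 0]], expected)  # one wrong color
wrong_shape = grids_match([[8, 8]], expected)           # missing a row
print(exact, one_cell_off, wrong_shape)  # prints: True False False
```

There is no partial credit in this sketch: a single wrong cell or a wrong grid shape makes the whole prediction count as incorrect.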
#### Format
As mentioned above, tasks are stored in JSON format. Each JSON file consists of two key-value pairs.
`train`: a list of two to ten input/output pairs (typically three). These are used for your algorithm to infer a rule.
`test`: a list of one to three input/output pairs (typically one). Your model should apply the inferred rule from the train set and construct an output solution. You will have access to the output test solution on the public data. The output solution on the private evaluation set will not be revealed.
Here is an example of a simple ARC-AGI task that has three training pairs along with a single test pair. Each pair is shown as a 2x2 grid. There are four colors represented by the integers 1, 4, 6, and 8. Which actual color (red/green/blue/black) is applied to each integer is arbitrary and up to you.
```json
{
"train": [
{"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
{"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
{"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
],
"test": [
{"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
]
}
```
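To make the format concrete, here is a minimal Python sketch that parses the example above, checks a hand-written rule against the `train` pairs, and then applies it to the `test` input. The function `fill_with_color` is a hypothetical rule written for this toy task only; real ARC-AGI tasks each need a different rule, which the model has to infer.

```python
import json

# The example task from above, embedded as a JSON string.
task = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
    {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
  ]
}
""")


def fill_with_color(grid):
    """Toy rule (assumption): flood the grid with its single non-zero color."""
    colors = {cell for row in grid for cell in row if cell != 0}
    if len(colors) != 1:
        raise ValueError("expected exactly one non-zero color")
    color = colors.pop()
    return [[color for _ in row] for row in grid]


# The rule must reproduce every training output before it is trusted.
train_ok = all(fill_with_color(p["input"]) == p["output"] for p in task["train"])

# Apply the rule to each test input; the public set also exposes the output.
predictions = [fill_with_color(p["input"]) for p in task["test"]]
test_ok = all(pred == p["output"] for pred, p in zip(predictions, task["test"]))
print(train_ok, predictions, test_ok)
```

The `train` pairs serve as the only specification of the rule, and the `test` output is used solely to check the final prediction.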
#### Performance
| Qwen2.5-72B-Instruct | LLaMA3.1-70B-Instruct | gemma-2-27b-it |
| ----- | ----- | ----- |
| 0.09 | 0.06 | 0.05 |
================================================
FILE: opencompass/configs/datasets/ARC_Prize_Public_Evaluation/arc_agi_2_public_evaluation_gen.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.arc_prize_public_evaluation import ARCPrizeDataset, ARCPrizeEvaluator
# The system_prompt defines the initial instructions for the model,
# setting the context for solving ARC tasks.
system_prompt = '''You are a puzzle solving wizard. You are given a puzzle from the abstraction and reasoning corpus developed by Francois Chollet.'''
# User message template is a template for creating user prompts. It includes placeholders for training data and test input data,
# guiding the model to learn the rule and apply it to solve the given puzzle.
user_message_template = '''Here are the example input and output pairs from which you should learn the underlying rule to later predict the output for the given test input:
----------------------------------------
{training_data}
----------------------------------------
Now, solve the following puzzle based on its input grid by applying the rules you have learned from the training data.:
----------------------------------------
[{{'input': {input_test_data}, 'output': [[]]}}]
----------------------------------------
What is the output grid? Only provide the output grid in the form as in the example input and output pairs. Do not provide any additional information:'''
arc_agi_2_public_evaluation_reader_cfg = dict(
input_columns=['training_data', 'input_test_data'],
output_column='output_test_data'
)
arc_agi_2_public_evaluation_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(role='SYSTEM', fallback_role='HUMAN', prompt=system_prompt),
],
round=[
dict(role='HUMAN', prompt=user_message_template),
],
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer)
)
arc_agi_2_public_evaluation_eval_cfg = dict(
evaluator=dict(type=ARCPrizeEvaluator)
)
arc_agi_2_public_evaluation_datasets = [
dict(
abbr='ARC_AGI_2_Public_Evaluation',
type=ARCPrizeDataset,
version='arc_agi_2',
path='opencompass/arc_agi_2_public_evaluation',
reader_cfg=arc_agi_2_public_evaluation_reader_cfg,
infer_cfg=arc_agi_2_public_evaluation_infer_cfg,
eval_cfg=arc_agi_2_public_evaluation_eval_cfg,
n=2,
k=2,
)
]
================================================
FILE: opencompass/configs/datasets/ARC_Prize_Public_Evaluation/arc_prize_public_evaluation_gen.py
================================================
from mmengine.config import read_base
with read_base():
from .arc_prize_public_evaluation_gen_872059 import arc_prize_public_evaluation_datasets # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ARC_Prize_Public_Evaluation/arc_prize_public_evaluation_gen_872059.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.arc_prize_public_evaluation import ARCPrizeDataset, ARCPrizeEvaluator
# The system_prompt defines the initial instructions for the model,
# setting the context for solving ARC tasks.
system_prompt = '''You are a puzzle solving wizard. You are given a puzzle from the abstraction and reasoning corpus developed by Francois Chollet.'''
# User message template is a template for creating user prompts. It includes placeholders for training data and test input data,
# guiding the model to learn the rule and apply it to solve the given puzzle.
user_message_template = '''Here are the example input and output pairs from which you should learn the underlying rule to later predict the output for the given test input:
----------------------------------------
{training_data}
----------------------------------------
Now, solve the following puzzle based on its input grid by applying the rules you have learned from the training data.:
----------------------------------------
[{{'input': {input_test_data}, 'output': [[]]}}]
----------------------------------------
What is the output grid? Only provide the output grid in the form as in the example input and output pairs. Do not provide any additional information:'''
arc_prize_public_evaluation_reader_cfg = dict(
input_columns=['training_data', 'input_test_data'],
output_column='output_test_data'
)
arc_prize_public_evaluation_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='SYSTEM', prompt=system_prompt),
dict(role='HUMAN', prompt=user_message_template),
],
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=2048)
)
arc_prize_public_evaluation_eval_cfg = dict(
evaluator=dict(type=ARCPrizeEvaluator)
)
arc_prize_public_evaluation_datasets = [
dict(
abbr='ARC_Prize_Public_Evaluation',
type=ARCPrizeDataset,
version='arc_agi_1',
path='opencompass/arc_prize_public_evaluation',
reader_cfg=arc_prize_public_evaluation_reader_cfg,
infer_cfg=arc_prize_public_evaluation_infer_cfg,
eval_cfg=arc_prize_public_evaluation_eval_cfg
)
]
================================================
FILE: opencompass/configs/datasets/ARC_Prize_Public_Evaluation/arc_prize_public_evaluation_gen_fedd04.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.arc_prize_public_evaluation import ARCPrizeDataset, ARCPrizeEvaluator
# The system_prompt defines the initial instructions for the model,
# setting the context for solving ARC tasks.
system_prompt = '''You are a puzzle solving wizard. You are given a puzzle from the abstraction and reasoning corpus developed by Francois Chollet.'''
# User message template is a template for creating user prompts. It includes placeholders for training data and test input data,
# guiding the model to learn the rule and apply it to solve the given puzzle.
user_message_template = '''Here are the example input and output pairs from which you should learn the underlying rule to later predict the output for the given test input:
----------------------------------------
{training_data}
----------------------------------------
Now, solve the following puzzle based on its input grid by applying the rules you have learned from the training data.:
----------------------------------------
[{{'input': {input_test_data}, 'output': [[]]}}]
----------------------------------------
What is the output grid? Only provide the output grid in the form as in the example input and output pairs. Do not provide any additional information:'''
arc_prize_public_evaluation_reader_cfg = dict(
input_columns=['training_data', 'input_test_data'],
output_column='output_test_data'
)
arc_prize_public_evaluation_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='SYSTEM',fallback_role='HUMAN', prompt=system_prompt),
dict(role='HUMAN', prompt=user_message_template),
],
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer)
)
arc_prize_public_evaluation_eval_cfg = dict(
evaluator=dict(type=ARCPrizeEvaluator)
)
arc_prize_public_evaluation_datasets = [
dict(
abbr='ARC_Prize_Public_Evaluation',
type=ARCPrizeDataset,
version='arc_agi_1',
path='opencompass/arc_prize_public_evaluation',
reader_cfg=arc_prize_public_evaluation_reader_cfg,
infer_cfg=arc_prize_public_evaluation_infer_cfg,
eval_cfg=arc_prize_public_evaluation_eval_cfg
)
]
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_clean_ppl.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccContaminationEvaluator
from opencompass.datasets import ARCDatasetClean as ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}')
], ),
'B':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}')
], ),
'C':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}')
], ),
'D':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}')
], ),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccContaminationEvaluator),
analyze_contamination=True)
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c-test',
path='opencompass/ai2_arc-test',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_cot_gen_926652.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_option_postprocess, match_answer_pattern
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.
{question}
A. {textA}
B. {textB}
C. {textC}
D. {textD}
""".strip()
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=QUERY_TEMPLATE)
], ),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ARC_c_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_option_postprocess, options='ABCD'),
)
ARC_c_datasets = [
dict(
abbr='ARC-c',
type=ARCDataset,
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]
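
Note on the config above: `QUERY_TEMPLATE` asks the model to end its reply with a line of the form `ANSWER: $LETTER`, and the eval config then post-processes the reply into a single option letter. The sketch below only illustrates that extraction idea. `extract_answer_letter` is a hypothetical helper written for this note; it is not the implementation of `first_option_postprocess`, whose matching rules may differ.

```python
import re


def extract_answer_letter(response, options='ABCD'):
    """Return the letter from the last 'ANSWER: X' match, or '' if absent.

    Hypothetical helper for illustration only (not OpenCompass code).
    """
    matches = re.findall(rf'ANSWER:\s*([{options}])\b', response)
    return matches[-1] if matches else ''


# A made-up chain-of-thought reply that follows the requested format.
reply = ('Plants need light for photosynthesis.\n'
         'So the best choice is B.\n'
         'ANSWER: B')
print(extract_answer_letter(reply))
```

Taking the last match rather than the first is a design choice of this sketch: a step-by-step reply may mention candidate answers before committing to a final one.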
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_few_shot_gen_e9b043.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever, FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_capital_postprocess
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey',
)
ARC_c_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template=dict(
begin='',
round=[
dict(
role='HUMAN',
prompt='Question: {question}\nA. {textA}\nB. {textB}\nC. {textC}\nD. {textD}\nAnswer:',
),
dict(role='BOT', prompt='{answerKey}'),
],
),
ice_token='',
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 2, 4, 6, 8]),
inferencer=dict(type=GenInferencer, max_out_len=50),
)
ARC_c_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_capital_postprocess),
)
ARC_c_datasets = [
dict(
abbr='ARC-c',
type=ARCDataset,
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]
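
Note on the config above: `FixKRetriever` with `fix_id_list=[0, 2, 4, 6, 8]` selects the same five in-context examples for every test item. The sketch below shows the general idea of rendering fixed examples ahead of the test question. `build_few_shot_prompt`, `TEMPLATE`, and the toy `pool` are assumptions made for illustration; how OpenCompass joins examples and which split the indices refer to are not shown in this file.

```python
# Hypothetical helper, not OpenCompass code: prepend a fixed set of
# in-context examples (picked by index) to the unanswered test question.
TEMPLATE = ('Question: {question}\nA. {textA}\nB. {textB}\n'
            'C. {textC}\nD. {textD}\nAnswer:')


def build_few_shot_prompt(pool, fix_id_list, test_item):
    """Render the examples at fix_id_list, then the test item without answer."""
    parts = []
    for idx in fix_id_list:
        example = pool[idx]
        # Each in-context example shows the question followed by its answer key.
        parts.append(TEMPLATE.format(**example) + ' ' + example['answerKey'])
    parts.append(TEMPLATE.format(**test_item))
    return '\n\n'.join(parts)


# Toy pool of ten made-up items so that indices 0, 2, 4, 6, 8 exist.
pool = [
    dict(question=f'Toy question {i}?', textA='a', textB='b', textC='c',
         textD='d', answerKey='ABCD'[i % 4])
    for i in range(10)
]
test_item = dict(question='Which option is first?', textA='a', textB='b',
                 textC='c', textD='d')

prompt = build_few_shot_prompt(pool, [0, 2, 4, 6, 8], test_item)
print(prompt)
```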
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_few_shot_ppl.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever, FixKRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey',
)
ARC_c_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template={
'A': dict(
begin='',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}'),
],
),
'B': dict(
begin='',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}'),
],
),
'C': dict(
begin='',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}'),
],
),
'D': dict(
begin='',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}'),
],
),
},
ice_token='',
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 2, 4, 6, 8]),
inferencer=dict(type=PPLInferencer),
)
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .ARC_c_gen_1e0de5 import ARC_c_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_gen_1e0de5.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_option_postprocess
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=
'Question: {question}\nA. {textA}\nB. {textB}\nC. {textC}\nD. {textD}\nAnswer:'
)
], ),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ARC_c_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_option_postprocess, options='ABCD'),
)
ARC_c_datasets = [
dict(
abbr='ARC-c',
type=ARCDataset,
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_ppl.py
================================================
from mmengine.config import read_base
with read_base():
    from .ARC_c_ppl_a450bd import ARC_c_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_ppl_2ef631.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
opt: dict(
round=[
dict(role='HUMAN', prompt=f'{{question}}\nA. {{textA}}\nB. {{textB}}\nC. {{textC}}\nD. {{textD}}'),
dict(role='BOT', prompt=f'Answer: {opt}'),
]
) for opt in ['A', 'B', 'C', 'D']
},
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]
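
Note on the config above: the `template` dict is built with a comprehension over the option labels, mixing escaped and unescaped braces inside f-strings. The standalone sketch below uses plain Python only (no OpenCompass imports) to show what that comprehension evaluates to. The variable name `templates` is illustrative.

```python
# Standalone illustration of the comprehension used in the config above.
# Inside an f-string, '{{' and '}}' are literal braces, so the dataset
# placeholders survive as '{question}', '{textA}', ... while '{opt}' is
# substituted immediately with the current option label.
templates = {
    opt: dict(
        round=[
            dict(role='HUMAN',
                 prompt=f'{{question}}\nA. {{textA}}\nB. {{textB}}\nC. {{textC}}\nD. {{textD}}'),
            dict(role='BOT', prompt=f'Answer: {opt}'),
        ]
    ) for opt in ['A', 'B', 'C', 'D']
}

for label, template in templates.items():
    print(label, repr(template['round'][1]['prompt']))
```

The result is four templates that share the same HUMAN turn and differ only in the BOT turn, which is what a per-option scoring setup needs.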
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_ppl_a450bd.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}')
], ),
'B':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}')
], ),
'C':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}')
], ),
'D':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}')
], ),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]
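
Note on the config above: each key of `template` ('A' to 'D') renders the question with one candidate answer, and a perplexity-based inferencer is expected to prefer the candidate the model finds most likely. The sketch below illustrates that selection rule with made-up numbers. `perplexity`, `pick_label`, and `toy_scores` are hypothetical names, and the exact scoring done by `PPLInferencer` (for example, which tokens are counted) is an assumption not confirmed by this file.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_label(logprobs_by_label):
    """Return the label whose rendered prompt has the lowest perplexity."""
    return min(logprobs_by_label,
               key=lambda label: perplexity(logprobs_by_label[label]))


# Made-up per-token log-probabilities for the four rendered prompts.
toy_scores = {
    'A': [-2.1, -1.9, -2.4],
    'B': [-0.4, -0.6, -0.5],
    'C': [-1.2, -1.8, -1.1],
    'D': [-3.0, -2.2, -2.7],
}
prediction = pick_label(toy_scores)
print(prediction)
```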
================================================
FILE: opencompass/configs/datasets/ARC_c/ARC_c_ppl_d52a21.py
================================================
from mmengine.config import read_base
# with read_base():
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A': 'Question: {question}\nAnswer: {textA}',
'B': 'Question: {question}\nAnswer: {textB}',
'C': 'Question: {question}\nAnswer: {textC}',
'D': 'Question: {question}\nAnswer: {textD}'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/ARC_e/ARC_e_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .ARC_e_gen_1e0de5 import ARC_e_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ARC_e/ARC_e_gen_1e0de5.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_option_postprocess
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=
'Question: {question}\nA. {textA}\nB. {textB}\nC. {textC}\nD. {textD}\nAnswer:'
)
], ),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ARC_e_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_option_postprocess, options='ABCD'),
)
ARC_e_datasets = [
dict(
abbr='ARC-e',
type=ARCDataset,
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/ARC_e/ARC_e_ppl.py
================================================
from mmengine.config import read_base
with read_base():
    from .ARC_e_ppl_a450bd import ARC_e_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ARC_e/ARC_e_ppl_2ef631.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
opt: dict(
round=[
dict(role='HUMAN', prompt=f'{{question}}\nA. {{textA}}\nB. {{textB}}\nC. {{textC}}\nD. {{textD}}'),
dict(role='BOT', prompt=f'Answer: {opt}'),
]
) for opt in ['A', 'B', 'C', 'D']
},
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_e_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_e_datasets = [
dict(
type=ARCDataset,
abbr='ARC-e',
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/ARC_e/ARC_e_ppl_a450bd.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}')
], ),
'B':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}')
], ),
'C':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}')
], ),
'D':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}')
], ),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_e_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_e_datasets = [
dict(
type=ARCDataset,
abbr='ARC-e',
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/ARC_e/ARC_e_ppl_d52a21.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A': 'Question: {question}\nAnswer: {textA}',
'B': 'Question: {question}\nAnswer: {textB}',
'C': 'Question: {question}\nAnswer: {textC}',
'D': 'Question: {question}\nAnswer: {textD}'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_e_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_e_datasets = [
dict(
type=ARCDataset,
abbr='ARC-e',
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/BeyondAIME/beyondaime_cascade_eval_gen_5e9f4f.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import BeyondAIMEDataset
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator, MATHVerifyEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
beyondaime_reader_cfg = dict(input_columns=['question'], output_column='answer')
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
: \n{question}\n\n\n
: \n{answer}\n\n\n
: \n{prediction}\n\n\n
Judging the correctness of candidates' answers:
""".strip()
beyondaime_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
beyondaime_cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=BeyondAIMEDataset,
path='ByteDance-Seed/BeyondAIME',
reader_cfg=beyondaime_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
)
beyondaime_eval_cfg = dict(
evaluator=beyondaime_cascade_evaluator,
)
beyondaime_datasets = [
dict(
type=BeyondAIMEDataset,
abbr='beyondaime',
path='ByteDance-Seed/BeyondAIME',
reader_cfg=beyondaime_reader_cfg,
infer_cfg=beyondaime_infer_cfg,
eval_cfg=beyondaime_eval_cfg,
)
]
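
Note on the config above: `CascadeEvaluator` combines a rule-based evaluator with an LLM judge. The sketch below illustrates one plausible reading of `parallel=False`, namely that the cheap rule check runs first and only the samples it rejects are sent to the judge. This reading is an assumption, and `cascade_evaluate`, `exact_match`, and `fake_judge` are hypothetical stand-ins, not OpenCompass APIs.

```python
def cascade_evaluate(samples, rule_check, llm_judge):
    """Run the cheap rule check first; ask the judge only about rule failures."""
    results = []
    for sample in samples:
        if rule_check(sample['prediction'], sample['answer']):
            results.append(dict(correct=True, decided_by='rule'))
        else:
            verdict = llm_judge(sample)
            results.append(dict(correct=verdict, decided_by='llm'))
    accuracy = 100 * sum(r['correct'] for r in results) / len(results)
    return accuracy, results


# Toy stand-ins (hypothetical): exact string match as the rule, and a fake
# judge that accepts a numerically equal answer written differently.
def exact_match(prediction, answer):
    return prediction.strip() == answer.strip()


def fake_judge(sample):
    try:
        return float(sample['prediction']) == float(sample['answer'])
    except ValueError:
        return False


samples = [
    dict(prediction='42', answer='42'),
    dict(prediction='42.0', answer='42'),
    dict(prediction='seven', answer='7'),
]
accuracy, results = cascade_evaluate(samples, exact_match, fake_judge)
print(accuracy, [r['decided_by'] for r in results])
```

The appeal of this ordering is cost: the judge model is only called for answers the rule could not confirm.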
================================================
FILE: opencompass/configs/datasets/BeyondAIME/beyondaime_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .beyondaime_cascade_eval_gen_5e9f4f import beyondaime_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CARDBiomedBench/CARDBiomedBench_llmjudge_gen_99a231.py
================================================
from opencompass.datasets import CARDBiomedBenchDataset
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.evaluator import GenericLLMEvaluator
ZERO_SHOT_PROMPT = 'You are an expert in {expert}.\n{question}\n'
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
: Q: You are an expert in {expert}.\n{question}\n\n\n
: \n{answer}\n\n\n
: \n{prediction}\n\n\n
Judging the correctness of candidates' answers:
""".strip()
# Reader configuration
reader_cfg = dict(
input_columns=[
'question',
'answer',
'Bio_Category',
'SQL_Category',
'uuid',
'template uuid',
'expert',
],
output_column='answer',
)
# Inference configuration
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=ZERO_SHOT_PROMPT, # prompt mode: zero-shot
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CARDBiomedBenchDataset,
path='NIH-CARD/CARDBiomedBench',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
cardbiomedbench_dataset = dict(
type=CARDBiomedBenchDataset,
abbr='cardbiomedbench',
path='NIH-CARD/CARDBiomedBench',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
cardbiomedbench_datasets = [cardbiomedbench_dataset]
================================================
FILE: opencompass/configs/datasets/CARDBiomedBench/CARDBiomedBench_llmjudge_rawprompt_gen_b4d90c.py
================================================
from opencompass.datasets import CARDBiomedBenchDataset
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_raw_prompt_template import RawPromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.evaluator import GenericLLMEvaluator
ZERO_SHOT_PROMPT = 'You are an expert in {expert}.\n{question}\n'
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
: Q: You are an expert in {expert}.\n{question}\n\n\n
: \n{answer}\n\n\n
: \n{prediction}\n\n\n
Judging the correctness of candidates' answers:
""".strip()
# Reader configuration
reader_cfg = dict(
input_columns=[
'question',
'answer',
'Bio_Category',
'SQL_Category',
'uuid',
'template uuid',
'expert',
],
output_column='answer',
)
# Inference configuration
infer_cfg = dict(
prompt_template=dict(
type=RawPromptTemplate,
messages=[
{'role': 'user', 'content': ZERO_SHOT_PROMPT},
],
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=RawPromptTemplate,
messages=[
{'role': 'system', 'content': "You are a helpful assistant who evaluates the correctness and quality of models' outputs."},
{'role': 'user', 'content': GRADER_TEMPLATE},
],
),
dataset_cfg=dict(
type=CARDBiomedBenchDataset,
path='NIH-CARD/CARDBiomedBench',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
cardbiomedbench_dataset = dict(
type=CARDBiomedBenchDataset,
abbr='cardbiomedbench',
path='NIH-CARD/CARDBiomedBench',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
cardbiomedbench_datasets = [cardbiomedbench_dataset]
================================================
FILE: opencompass/configs/datasets/CHARM/README.md
================================================
# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
[arXiv:2403.14112](https://arxiv.org/abs/2403.14112)
[License](./LICENSE)
## Dataset Description
**CHARM** is the first benchmark for comprehensive, in-depth evaluation of the commonsense reasoning ability of large language models (LLMs) in Chinese. It covers both globally known and Chinese-specific commonsense. In addition, CHARM can evaluate LLMs' memorization-independent reasoning abilities and analyze their typical errors.
## Comparison of commonsense reasoning benchmarks
"CN-Lang" indicates that the benchmark is presented in Chinese. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates that the benchmark includes elements unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates that the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in a similar style and format. "Rea-Mem" indicates that the benchmark includes closely interconnected reasoning and memorization tasks.
## 🛠️ How to Use
Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation.
### 1. Download CHARM
```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```
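As a quick sanity check after creating the symlink, the sketch below verifies that the data directory contains the sub-directories that the CHARM dataset configs in this directory read from (`memorization`, `reasoning`, and `reasoning_Translate-EN`). The helper name `missing_charm_subdirs` is hypothetical and not part of OpenCompass or the CHARM repo; it only illustrates the expected layout.

```python
from pathlib import Path

# Sub-directories referenced by the CHARM dataset configs
# (data/CHARM/memorization, data/CHARM/reasoning, data/CHARM/reasoning_Translate-EN).
EXPECTED_SUBDIRS = ("memorization", "reasoning", "reasoning_Translate-EN")


def missing_charm_subdirs(data_root="data/CHARM"):
    """Return the expected sub-directories that are absent under data_root.

    Hypothetical helper for illustration; symlinked directories are followed.
    """
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_charm_subdirs()
    if missing:
        print("Missing CHARM sub-directories:", ", ".join(missing))
    else:
        print("CHARM data layout looks complete.")
```

Run it from `${path_to_opencompass}` so that the relative path `data/CHARM` resolves to the symlink created above.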
### 2. Run Inference and Evaluation
```bash
cd ${path_to_opencompass}
# modify config file `examples/eval_charm_rea.py`: uncomment or add models you want to evaluate
python run.py examples/eval_charm_rea.py -r --dump-eval-details
# modify config file `examples/eval_charm_mem.py`: uncomment or add models you want to evaluate
python run.py examples/eval_charm_mem.py -r --dump-eval-details
```
The inference and evaluation results are written to `${path_to_opencompass}/outputs`, laid out like this:
```bash
outputs
├── CHARM_mem
│   └── chat
│       └── 20240605_151442
│           ├── predictions
│           │   ├── internlm2-chat-1.8b-turbomind
│           │   ├── llama-3-8b-instruct-lmdeploy
│           │   └── qwen1.5-1.8b-chat-hf
│           ├── results
│           │   ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│           │   ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│           │   └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│           └── summary
│               └── 20240605_205020                # MEMORY_SUMMARY_DIR
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│                   └── judged-by--GPT-3.5-turbo-0125.csv  # MEMORY_SUMMARY_CSV
└── CHARM_rea
    └── chat
        └── 20240605_152359
            ├── predictions
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            ├── results                            # REASON_RESULTS_DIR
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            └── summary
                ├── summary_20240605_205328.csv    # REASON_SUMMARY_CSV
                └── summary_20240605_205328.txt
```
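The paths marked `REASON_RESULTS_DIR`, `REASON_SUMMARY_CSV`, `MEMORY_SUMMARY_DIR`, and `MEMORY_SUMMARY_CSV` in the tree are needed in the next step. The following is a minimal sketch for locating them, assuming the layout matches the example above (a `chat` sub-directory and timestamped run directories whose names sort chronologically). The function names are hypothetical and not part of OpenCompass; adjust the directory names if your run differs.

```python
from pathlib import Path


def latest_run_dir(outputs_root, experiment):
    """Return the newest timestamped run directory under <outputs_root>/<experiment>/chat.

    Assumes run directories are named like 20240605_152359, so that
    lexicographic order equals chronological order.
    """
    chat_dir = Path(outputs_root) / experiment / "chat"
    runs = sorted(p for p in chat_dir.iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no run directories under {chat_dir}")
    return runs[-1]


def reason_paths(outputs_root="outputs"):
    """Locate REASON_RESULTS_DIR and REASON_SUMMARY_CSV for the latest CHARM_rea run."""
    run = latest_run_dir(outputs_root, "CHARM_rea")
    csvs = sorted((run / "summary").glob("summary_*.csv"))
    return {
        "REASON_RESULTS_DIR": run / "results",
        "REASON_SUMMARY_CSV": csvs[-1] if csvs else None,
    }


def memory_paths(outputs_root="outputs"):
    """Locate MEMORY_SUMMARY_DIR and MEMORY_SUMMARY_CSV for the latest CHARM_mem run."""
    run = latest_run_dir(outputs_root, "CHARM_mem")
    summary_dirs = sorted(p for p in (run / "summary").iterdir() if p.is_dir())
    summary_dir = summary_dirs[-1] if summary_dirs else None
    csvs = sorted(summary_dir.glob("judged-by--*.csv")) if summary_dir else []
    return {
        "MEMORY_SUMMARY_DIR": summary_dir,
        "MEMORY_SUMMARY_CSV": csvs[-1] if csvs else None,
    }
```

Calling `reason_paths()` and `memory_paths()` from `${path_to_opencompass}` returns `pathlib.Path` objects that can be exported as the shell variables used by the analysis scripts.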
### 3. Generate Analysis Results
```bash
cd ${path_to_CHARM_repo}
# generate Table5, Table6, Table9 and Table10 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}
# generate Figure3 and Figure9 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}
# generate Table7, Table12, Table13 and Figure11 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
```
## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
year={2024},
eprint={2403.14112},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
================================================
FILE: opencompass/configs/datasets/CHARM/README_ZH.md
================================================
# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
[arXiv:2403.14112](https://arxiv.org/abs/2403.14112)
[License](./LICENSE)
## 🛠️ How to Use
Below are the steps for quickly downloading CHARM and evaluating it with OpenCompass.
### 1. Download CHARM
```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```
### 2. Run Inference and Evaluation
```bash
cd ${path_to_opencompass}
# modify config file `examples/eval_charm_rea.py`: uncomment the existing models or add the models you want to evaluate
python run.py examples/eval_charm_rea.py -r --dump-eval-details
# modify config file `examples/eval_charm_mem.py`: uncomment the existing models or add the models you want to evaluate
python run.py examples/eval_charm_mem.py -r --dump-eval-details
```
The inference and evaluation results are written to `${path_to_opencompass}/outputs`, laid out like this:
```bash
outputs
├── CHARM_mem
│   └── chat
│       └── 20240605_151442
│           ├── predictions
│           │   ├── internlm2-chat-1.8b-turbomind
│           │   ├── llama-3-8b-instruct-lmdeploy
│           │   └── qwen1.5-1.8b-chat-hf
│           ├── results
│           │   ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│           │   ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│           │   └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│           └── summary
│               └── 20240605_205020                # MEMORY_SUMMARY_DIR
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│                   └── judged-by--GPT-3.5-turbo-0125.csv  # MEMORY_SUMMARY_CSV
└── CHARM_rea
    └── chat
        └── 20240605_152359
            ├── predictions
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            ├── results                            # REASON_RESULTS_DIR
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            └── summary
                ├── summary_20240605_205328.csv    # REASON_SUMMARY_CSV
                └── summary_20240605_205328.txt
```
### 3. Generate Analysis Results
```bash
cd ${path_to_CHARM_repo}
# generate Table5, Table6, Table9 and Table10 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}
# generate Figure3 and Figure9 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}
# generate Table7, Table12, Table13 and Figure11 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
```
## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
year={2024},
eprint={2403.14112},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
================================================
FILE: opencompass/configs/datasets/CHARM/charm_memory_gen_bbbd53.py
================================================
import os
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CharmDataset, CharmMemoryEvaluator, LMEvaluator
with read_base():
    from .charm_memory_settings import charm_memory_tasks, judge_system_prompts, dataset_path
charm_memory_datasets = []
for _task in charm_memory_tasks:
    charm_memory_reader_cfg = dict(input_columns=['input'],
                                   output_column='target')
    charm_memory_infer_cfg = dict(
        prompt_template=dict(
            type=PromptTemplate,
            template=dict(round=[
                dict(role='HUMAN', prompt='请尽可能简短地回答下述问题。\n问题:{input}\n答:')
            ]),
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=GenInferencer, max_out_len=512),
    )
    # This task has a dedicated evaluator; every other task is graded by an LLM judge.
    if _task == 'Chinese_Movie_and_Music_Recommendation':
        charm_memory_eval_cfg = dict(
            evaluator=dict(type=CharmMemoryEvaluator),
            pred_role='BOT',
        )
    else:
        judge_system_prompt = judge_system_prompts[_task]
        charm_memory_eval_cfg = dict(
            evaluator=dict(
                type=LMEvaluator,
                prompt_template=dict(
                    type=PromptTemplate,
                    template=dict(round=[
                        dict(
                            role='HUMAN',
                            prompt=judge_system_prompt +
                            "\n\n[Question]\n{input}\n[The Start of Reference Answer]\n{target}\n[The End of Reference Answer]\n\n[The Start of Assistant's Answer]\n{prediction}\n[The End of Assistant's Answer]"  # noqa
                        ),
                    ]),
                ),
            ),
            pred_role='BOT',
        )
    charm_memory_datasets.append(
        dict(
            type=CharmDataset,
            path=dataset_path,
            name=_task,
            abbr='charm-memory-' + _task,
            reader_cfg=charm_memory_reader_cfg,
            infer_cfg=charm_memory_infer_cfg.copy(),
            eval_cfg=charm_memory_eval_cfg.copy(),
        ))
================================================
FILE: opencompass/configs/datasets/CHARM/charm_memory_settings.py
================================================
import os
charm_memory_tasks = [
'Chinese_Anachronisms_Judgment',
'Chinese_Movie_and_Music_Recommendation',
'Chinese_Sport_Understanding',
'Chinese_Time_Understanding',
]
dataset_path = 'data/CHARM/memorization'
system_prompt_template = """Please act as an impartial judge, comparing the responses of the AI assistants to the reference answer and determining if the answers are correct.
You will receive the reference answer provided by a human and the responses of the AI assistants.
Your task is to judge whether the AI assistant's answers is correct.
{task_specific_prompt}
After providing your explanation, strictly output your final judgment in the following format: “[正确]” if the AI assistant's response is correct, “[错误]” if the AI assistant's response is incorrect.
"""
task_specific_prompts = {
'Chinese_Anachronisms_Judgment':
"If the provided reference answer is a list, the model's prediction is considered correct if it matches any item in the list.",
'Chinese_Time_Understanding':
"When evaluating the AI assistant's response regarding Chinese solar terms, as long as the AI assistant's response falls within the time frame provided in the reference answer, consider it correct.",
'Chinese_Sport_Understanding':
"If the provided reference answer is a list, the model's prediction is considered correct if it matches any item in the list."
}
judge_system_prompts = {
k: system_prompt_template.format(task_specific_prompt=v)
for k, v in task_specific_prompts.items()
}
================================================
FILE: opencompass/configs/datasets/CHARM/charm_reason_cot_only_gen_f7b7d3.py
================================================
import os
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CharmDataset, charm_reason_postprocess, CharmReasonEvaluator
with read_base():
    from .charm_reason_settings import charm_tasks, settings
# Keep only the chain-of-thought settings.
settings = [s for s in settings if s[0] in ['ZH-CoT', 'EN-CoT']]
charm_reason_datasets = []
for _cot, _cot_prefix, dataset_path, fewshot_example_path, prompt_template in settings:
    for _task in charm_tasks:
        _fewshot_example_file = os.path.join(fewshot_example_path, f'{_task}_{_cot}.txt')
        # The few-shot files contain Chinese text, so read them explicitly as UTF-8.
        with open(_fewshot_example_file, 'r', encoding='utf-8') as f:
            _hint = f.read()
        charm_reason_reader_cfg = dict(input_columns=['input'], output_column='target')
        charm_reason_infer_cfg = dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[dict(role='HUMAN', prompt=prompt_template.format(_hint=_hint) + _cot_prefix)]),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        )
        charm_reason_eval_cfg = dict(
            evaluator=dict(type=CharmReasonEvaluator),
            pred_role='BOT',
            pred_postprocessor=dict(type=charm_reason_postprocess),
            dataset_postprocessor=dict(type=charm_reason_postprocess),
        )
        charm_reason_datasets.append(
            dict(
                type=CharmDataset,
                path=dataset_path,
                name=_task,
                abbr='charm-reason-' + _task + '_' + _cot,
                reader_cfg=charm_reason_reader_cfg,
                infer_cfg=charm_reason_infer_cfg.copy(),
                eval_cfg=charm_reason_eval_cfg.copy(),
            )
        )
================================================
FILE: opencompass/configs/datasets/CHARM/charm_reason_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .charm_reason_gen_f8fca2 import charm_reason_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CHARM/charm_reason_gen_f8fca2.py
================================================
import os
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CharmDataset, charm_reason_postprocess, CharmReasonEvaluator
with read_base():
    from .charm_reason_settings import charm_tasks, settings
charm_reason_datasets = []
for _cot, _cot_prefix, dataset_path, fewshot_example_path, prompt_template in settings:
    for _task in charm_tasks:
        _fewshot_example_file = os.path.join(fewshot_example_path, f'{_task}_{_cot}.txt')
        # The few-shot files contain Chinese text, so read them explicitly as UTF-8.
        with open(_fewshot_example_file, 'r', encoding='utf-8') as f:
            _hint = f.read()
        charm_reason_reader_cfg = dict(input_columns=['input'], output_column='target')
        charm_reason_infer_cfg = dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[dict(role='HUMAN', prompt=prompt_template.format(_hint=_hint) + _cot_prefix)]),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        )
        charm_reason_eval_cfg = dict(
            evaluator=dict(type=CharmReasonEvaluator),
            pred_role='BOT',
            pred_postprocessor=dict(type=charm_reason_postprocess),
            dataset_postprocessor=dict(type=charm_reason_postprocess),
        )
        charm_reason_datasets.append(
            dict(
                type=CharmDataset,
                path=dataset_path,
                name=_task,
                abbr='charm-reason-' + _task + '_' + _cot,
                reader_cfg=charm_reason_reader_cfg,
                infer_cfg=charm_reason_infer_cfg.copy(),
                eval_cfg=charm_reason_eval_cfg.copy(),
            )
        )
================================================
FILE: opencompass/configs/datasets/CHARM/charm_reason_ppl_3da4de.py
================================================
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.datasets import CharmDataset
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
charm_tasks = [
['Chinese_Anachronisms_Judgment', 'AB'],
['Chinese_Movie_and_Music_Recommendation', 'ABCD'],
['Chinese_Natural_Language_Inference', 'ABC'],
['Chinese_Reading_Comprehension', 'ABCD'],
['Chinese_Sequence_Understanding', 'ABCD'],
['Chinese_Sport_Understanding', 'AB'],
['Chinese_Time_Understanding', 'ABCD'],
['Global_Anachronisms_Judgment', 'AB'],
['Global_Movie_and_Music_Recommendation', 'ABCD'],
['Global_Natural_Language_Inference', 'ABC'],
['Global_Reading_Comprehension', 'ABCD'],
['Global_Sequence_Understanding', 'ABCD'],
['Global_Sport_Understanding', 'AB'],
['Global_Time_Understanding', 'ABCDEF'],
]
charm_reason_datasets = []
for task_name, options in charm_tasks:
    # The few-shot files contain Chinese text, so read them explicitly as UTF-8.
    with open(os.path.join(os.path.dirname(__file__), 'few-shot-examples', f'{task_name}_Direct.txt'), 'r', encoding='utf-8') as f:
        few_shot_example = f.read()
    charm_reason_reader_cfg = dict(input_columns=['input'], output_column='target')
    charm_reason_infer_cfg = dict(
        prompt_template=dict(
            type=PromptTemplate,
            # One candidate prompt per answer option; PPLInferencer scores each of them.
            template={
                f'({opt})': f'{few_shot_example}\n{{input}}\nA: {opt}' for opt in options
            },
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=PPLInferencer),
    )
    charm_reason_eval_cfg = dict(evaluator=dict(type=AccwithDetailsEvaluator))
    charm_reason_datasets.append(
        dict(
            type=CharmDataset,
            abbr=f'charm-reason-{task_name}_Direct',
            path='data/CHARM/reasoning',
            name=task_name,
            reader_cfg=charm_reason_reader_cfg,
            infer_cfg=charm_reason_infer_cfg,
            eval_cfg=charm_reason_eval_cfg,
        )
    )
================================================
FILE: opencompass/configs/datasets/CHARM/charm_reason_settings.py
================================================
import os
charm_tasks = [
'Chinese_Anachronisms_Judgment',
'Chinese_Movie_and_Music_Recommendation',
'Chinese_Natural_Language_Inference',
'Chinese_Reading_Comprehension',
'Chinese_Sequence_Understanding',
'Chinese_Sport_Understanding',
'Chinese_Time_Understanding',
'Global_Anachronisms_Judgment',
'Global_Movie_and_Music_Recommendation',
'Global_Natural_Language_Inference',
'Global_Reading_Comprehension',
'Global_Sequence_Understanding',
'Global_Sport_Understanding',
'Global_Time_Understanding',
]
XLT_template = 'Follow the given examples and answer the question.\n{_hint}\n\n I want you to act as an commonsense reasoning expert for Chinese. \n Request: {{input}}\n'
Translate_EN_template = 'Follow the given examples and answer the question.\n{_hint}\n\nQ: {{input}}\nA: '
Other_template = '请按照给定的例子回答问题。\n{_hint}\n\nQ:{{input}}\nA:'
data_dir = 'data/CHARM'
dataset_path_ZH = f'{data_dir}/reasoning'
dataset_path_TransEn = f'{data_dir}/reasoning_Translate-EN'
fewshot_example_path_ZH = os.path.join(os.path.dirname(__file__), 'few-shot-examples')
fewshot_example_path_TransEn = os.path.join(os.path.dirname(__file__), 'few-shot-examples_Translate-EN')
settings = [
('Direct', '', dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('ZH-CoT', '让我们一步一步来思考。', dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('EN-CoT', "Let's think step by step.", dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('XLT', """You should retell the request in English.\nYou should do the answer step by step to choose the right answer.\nYou should step-by-step answer the request.\nYou should tell me the answer in this format 'So the answer is'.""", dataset_path_ZH, fewshot_example_path_ZH, XLT_template),
('Translate-EN', "Let's think step by step.", dataset_path_TransEn, fewshot_example_path_TransEn, Translate_EN_template),
]
================================================
FILE: opencompass/configs/datasets/CIBench/CIBench_generation_gen_8ab0dc.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import AgentInferencer
from opencompass.datasets import CIBenchDataset, CIBenchEvaluator
cibench_reader_cfg = dict(
input_columns=['questions'],
output_column='references',
train_split='test',
test_split='test')
cibench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every'),
)
libs = ['matplotlib', 'opencv', 'pandas', 'pytorch', 'scipy', 'seaborn']
cibench_eval_cfg = dict(evaluator=dict(type=CIBenchEvaluator), pred_role='BOT')
cibench_datasets = [
dict(
abbr=f'cibench_generation/{lib}',
type=CIBenchDataset,
path=f'./data/cibench_dataset/cibench_generation/{lib}',
internet_check=False,
reader_cfg=cibench_reader_cfg,
infer_cfg=cibench_infer_cfg,
eval_cfg=cibench_eval_cfg,
) for lib in libs
]
================================================
FILE: opencompass/configs/datasets/CIBench/CIBench_generation_oracle_gen_c4a7c1.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import AgentInferencer
from opencompass.datasets import CIBenchDataset, CIBenchEvaluator
cibench_reader_cfg = dict(
input_columns=['questions'],
output_column='references',
train_split='test',
test_split='test')
cibench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)
libs = ['matplotlib', 'opencv', 'pandas', 'pytorch', 'scipy', 'seaborn']
cibench_eval_cfg = dict(evaluator=dict(type=CIBenchEvaluator), pred_role='BOT')
cibench_datasets = [
dict(
abbr=f'cibench_generation_oracle/{lib}',
type=CIBenchDataset,
path=f'./data/cibench_dataset/cibench_generation/{lib}',
internet_check=False,
reader_cfg=cibench_reader_cfg,
infer_cfg=cibench_infer_cfg,
eval_cfg=cibench_eval_cfg,
) for lib in libs
]
================================================
FILE: opencompass/configs/datasets/CIBench/CIBench_template_gen_e6b12a.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import AgentInferencer
from opencompass.datasets import CIBenchDataset, CIBenchEvaluator
cibench_reader_cfg = dict(
input_columns=['questions'],
output_column='references',
train_split='test',
test_split='test')
cibench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every'),
)
# no tensorboard
libs = ['/lightgbm', '/matplotlib', '/nltk', '/opencv', '/pandas', '/pytorch',
'/scipy', '/seaborn', '/sklearn', '/tensorflow',
'_chinese/lightgbm', '_chinese/matplotlib', '_chinese/nltk',
'_chinese/opencv', '_chinese/pandas', '_chinese/pytorch',
'_chinese/scipy', '_chinese/seaborn', '_chinese/sklearn', '_chinese/tensorflow']
cibench_eval_cfg = dict(evaluator=dict(type=CIBenchEvaluator), pred_role='BOT')
cibench_datasets = [
dict(
abbr=f'cibench_template{lib}',
type=CIBenchDataset,
path=f'./data/cibench_dataset/cibench_template{lib}',
internet_check=False,
reader_cfg=cibench_reader_cfg,
infer_cfg=cibench_infer_cfg,
eval_cfg=cibench_eval_cfg,
) for lib in libs
]
================================================
FILE: opencompass/configs/datasets/CIBench/CIBench_template_oracle_gen_fecda1.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import AgentInferencer
from opencompass.datasets import CIBenchDataset, CIBenchEvaluator
cibench_reader_cfg = dict(
input_columns=['questions'],
output_column='references',
train_split='test',
test_split='test')
cibench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="""{questions}""",
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=AgentInferencer, infer_mode='every_with_gt'),
)
# no tensorboard
libs = ['/lightgbm', '/matplotlib', '/nltk', '/opencv', '/pandas', '/pytorch',
'/scipy', '/seaborn', '/sklearn', '/tensorflow',
'_chinese/lightgbm', '_chinese/matplotlib', '_chinese/nltk',
'_chinese/opencv', '_chinese/pandas', '_chinese/pytorch',
'_chinese/scipy', '_chinese/seaborn', '_chinese/sklearn', '_chinese/tensorflow']
cibench_eval_cfg = dict(evaluator=dict(type=CIBenchEvaluator), pred_role='BOT')
cibench_datasets = [
dict(
abbr=f'cibench_template_oracle{lib}',
type=CIBenchDataset,
path=f'./data/cibench_dataset/cibench_template{lib}',
internet_check=False,
reader_cfg=cibench_reader_cfg,
infer_cfg=cibench_infer_cfg,
eval_cfg=cibench_eval_cfg,
) for lib in libs
]
================================================
FILE: opencompass/configs/datasets/CLUE_C3/CLUE_C3_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_C3_gen_8c358f import C3_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_C3/CLUE_C3_gen_8c358f.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import C3Dataset_V2
from opencompass.utils.text_postprocessors import first_capital_postprocess
C3_reader_cfg = dict(
input_columns=[
'question',
'content',
'choice0',
'choice1',
'choice2',
'choice3',
'choices',
],
output_column='label',
)
C3_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
'{content}\n问:{question}\nA. {choice0}\nB. {choice1}\nC. {choice2}\nD. {choice3}\n请从“A”,“B”,“C”,“D”中进行选择。\n答:',
),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
C3_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_capital_postprocess),
)
C3_datasets = [
dict(
abbr='C3',
type=C3Dataset_V2,
path='./data/CLUE/C3/dev_0.json',
reader_cfg=C3_reader_cfg,
infer_cfg=C3_infer_cfg,
eval_cfg=C3_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/CLUE_C3/CLUE_C3_ppl.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_C3_ppl_e24a31 import C3_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_C3/CLUE_C3_ppl_56b537.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import C3Dataset
C3_reader_cfg = dict(
input_columns=[
'question', 'content', 'choice0', 'choice1', 'choice2', 'choice3',
'choices'
],
output_column='label')
C3_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
0: '文章:{content}\n问题:{question}\n答案:{choice0}',
1: '文章:{content}\n问题:{question}\n答案:{choice1}',
2: '文章:{content}\n问题:{question}\n答案:{choice2}',
3: '文章:{content}\n问题:{question}\n答案:{choice3}'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
C3_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
C3_datasets = [
dict(
type=C3Dataset,
abbr='C3',
path='./data/CLUE/C3/dev_0.json',
reader_cfg=C3_reader_cfg,
infer_cfg=C3_infer_cfg,
eval_cfg=C3_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CLUE_C3/CLUE_C3_ppl_e24a31.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import C3Dataset
C3_reader_cfg = dict(
input_columns=[
'question', 'content', 'choice0', 'choice1', 'choice2', 'choice3',
'choices'
],
output_column='label')
C3_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
i: dict(round=[
dict(role='HUMAN', prompt='文章:{content}\n问题:{question}'),
dict(role='BOT', prompt=f'答案:{{choice{i}}}')
])
for i in range(4)
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
C3_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
C3_datasets = [
dict(
type=C3Dataset,
abbr='C3',
path='./data/CLUE/C3/dev_0.json',
reader_cfg=C3_reader_cfg,
infer_cfg=C3_infer_cfg,
eval_cfg=C3_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CLUE_CMRC/CLUE_CMRC_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_CMRC_gen_1bd3c8 import CMRC_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_CMRC/CLUE_CMRC_gen_1bd3c8.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import CMRCDataset, cmrc_postprocess
CMRC_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
CMRC_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt='根据文章回答问题。你的答案应该尽可能简练,请以 ‘答案是’ 开头的句式作答。\n文章:{context}\n问:{question}\n答:'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
CMRC_eval_cfg = dict(
evaluator=dict(type=EMEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=cmrc_postprocess),
)
CMRC_datasets = [
dict(
type=CMRCDataset,
abbr='CMRC_dev',
path='opencompass/cmrc_dev',
reader_cfg=CMRC_reader_cfg,
infer_cfg=CMRC_infer_cfg,
eval_cfg=CMRC_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_CMRC/CLUE_CMRC_gen_3749cd.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import CMRCDataset
CMRC_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
CMRC_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(role='HUMAN', prompt='文章:{context}\n根据上文,回答如下问题:{question}'),
dict(role='BOT', prompt='答:'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
CMRC_eval_cfg = dict(
evaluator=dict(type=EMEvaluator),
pred_role='BOT',
)
CMRC_datasets = [
dict(
type=CMRCDataset,
abbr='CMRC_dev',
path='opencompass/cmrc_dev',
reader_cfg=CMRC_reader_cfg,
infer_cfg=CMRC_infer_cfg,
eval_cfg=CMRC_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_CMRC/CLUE_CMRC_gen_8484b9.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import CMRCDataset
CMRC_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
CMRC_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template='文章:{context}\n根据上文,回答如下问题: {question}\n答:'),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
CMRC_eval_cfg = dict(evaluator=dict(type=EMEvaluator), )
CMRC_datasets = [
dict(
type=CMRCDataset,
abbr='CMRC_dev',
path='opencompass/cmrc_dev',
reader_cfg=CMRC_reader_cfg,
infer_cfg=CMRC_infer_cfg,
eval_cfg=CMRC_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_CMRC/CLUE_CMRC_gen_941108.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import CMRCDataset
CMRC_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
CMRC_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt='文章:{context}\n根据上文,回答如下问题:\n{question}\n答:'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
CMRC_eval_cfg = dict(
evaluator=dict(type=EMEvaluator),
pred_role='BOT',
)
CMRC_datasets = [
dict(
type=CMRCDataset,
abbr='CMRC_dev',
path='opencompass/cmrc_dev',
reader_cfg=CMRC_reader_cfg,
infer_cfg=CMRC_infer_cfg,
eval_cfg=CMRC_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_DRCD/CLUE_DRCD_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_DRCD_gen_1bd3c8 import DRCD_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_DRCD/CLUE_DRCD_gen_1bd3c8.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import DRCDDataset, drcd_postprocess
DRCD_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
DRCD_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
# English gloss: "Answer the question based on the passage. Keep your answer as concise as possible and begin it with '答案是' ('The answer is').\nPassage: {context}\nQ: {question}\nA:"
prompt='根据文章回答问题。你的答案应该尽可能简练,请以 ‘答案是’ 开头的句式作答。\n文章:{context}\n问:{question}\n答:'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
DRCD_eval_cfg = dict(
evaluator=dict(type=EMEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=drcd_postprocess),
)
DRCD_datasets = [
dict(
type=DRCDDataset,
abbr='DRCD_dev',
path='opencompass/drcd_dev',
reader_cfg=DRCD_reader_cfg,
infer_cfg=DRCD_infer_cfg,
eval_cfg=DRCD_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_DRCD/CLUE_DRCD_gen_3749cd.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import DRCDDataset
DRCD_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
DRCD_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(role='HUMAN', prompt='文章:{context}\n根据上文,回答如下问题:{question}'),
dict(role='BOT', prompt='答:'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
DRCD_eval_cfg = dict(
evaluator=dict(type=EMEvaluator),
pred_role='BOT',
)
DRCD_datasets = [
dict(
type=DRCDDataset,
abbr='DRCD_dev',
path='opencompass/drcd_dev',
reader_cfg=DRCD_reader_cfg,
infer_cfg=DRCD_infer_cfg,
eval_cfg=DRCD_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_DRCD/CLUE_DRCD_gen_8484b9.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import DRCDDataset
DRCD_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
DRCD_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template='文章:{context}\n根据上文,回答如下问题: {question}\n答:'),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
DRCD_eval_cfg = dict(evaluator=dict(type=EMEvaluator), )
DRCD_datasets = [
dict(
type=DRCDDataset,
abbr='DRCD_dev',
path='opencompass/drcd_dev',
reader_cfg=DRCD_reader_cfg,
infer_cfg=DRCD_infer_cfg,
eval_cfg=DRCD_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_DRCD/CLUE_DRCD_gen_941108.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import EMEvaluator
from opencompass.datasets import DRCDDataset
DRCD_reader_cfg = dict(
input_columns=['question', 'context'], output_column='answers')
DRCD_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt='文章:{context}\n根据上文,回答如下问题:\n{question}\n答:'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
DRCD_eval_cfg = dict(
evaluator=dict(type=EMEvaluator),
pred_role='BOT',
)
DRCD_datasets = [
dict(
type=DRCDDataset,
abbr='DRCD_dev',
path='opencompass/drcd_dev',
reader_cfg=DRCD_reader_cfg,
infer_cfg=DRCD_infer_cfg,
eval_cfg=DRCD_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_afqmc/CLUE_afqmc_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_afqmc_gen_901306 import afqmc_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_afqmc/CLUE_afqmc_gen_901306.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import AFQMCDatasetV2
from opencompass.utils.text_postprocessors import first_capital_postprocess
afqmc_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
afqmc_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
# English gloss: "Sentence 1: {sentence1}\nSentence 2: {sentence2}\nBoth sentences are questions about Ant Financial products; do they ask exactly the same thing?\nA. Not exactly the same\nB. Exactly the same\nChoose from 'A', 'B'.\nAnswer:"
prompt=
'语句一:“{sentence1}”\n语句二:“{sentence2}”\n语句一与语句二是关于蚂蚁金融产品的疑问,两者所询问的内容是否完全一致?\nA. 不完全一致\nB. 完全一致\n请从“A”,“B”中进行选择。\n答:',
),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
afqmc_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
# Keep only the first capital letter of the model output as the chosen option.
pred_postprocessor=dict(type=first_capital_postprocess),
)
afqmc_datasets = [
dict(
abbr='afqmc-dev',
type=AFQMCDatasetV2,
path='opencompass/afqmc-dev',
reader_cfg=afqmc_reader_cfg,
infer_cfg=afqmc_infer_cfg,
eval_cfg=afqmc_eval_cfg,
),
]
================================================
FILE: opencompass/configs/datasets/CLUE_afqmc/CLUE_afqmc_ppl.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_afqmc_ppl_6507d7 import afqmc_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_afqmc/CLUE_afqmc_ppl_378c5b.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset
afqmc_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
afqmc_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
# One candidate prompt per label (0 = different, 1 = similar); PPLInferencer
# predicts the label whose filled-in prompt has the lowest perplexity.
template={
0:
dict(round=[
dict(
role='HUMAN', prompt='“{sentence1}”与“{sentence2}”不同还是相似?'),
dict(role='BOT', prompt='不同。')
]),
1:
dict(round=[
dict(
role='HUMAN', prompt='“{sentence1}”与“{sentence2}”不同还是相似?'),
dict(role='BOT', prompt='相似')
]),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
afqmc_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
afqmc_datasets = [
dict(
type=HFDataset,
abbr='afqmc-dev',
path='json',
data_files='./data/CLUE/AFQMC/dev.json',
split='train',
reader_cfg=afqmc_reader_cfg,
infer_cfg=afqmc_infer_cfg,
eval_cfg=afqmc_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_afqmc/CLUE_afqmc_ppl_6507d7.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset
afqmc_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
afqmc_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
0:
dict(round=[
dict(
role='HUMAN',
prompt=
'语句一:“{sentence1}”\n语句二:“{sentence2}”\n语句一与语句二是关于蚂蚁金融产品的疑问,两者所询问的内容是否完全一致?'
),
dict(role='BOT', prompt='不完全一致')
]),
1:
dict(round=[
dict(
role='HUMAN',
prompt=
'语句一:“{sentence1}”\n语句二:“{sentence2}”\n语句一与语句二是关于蚂蚁金融产品的疑问,两者所询问的内容是否完全一致?'
),
dict(role='BOT', prompt='完全一致')
]),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
afqmc_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
afqmc_datasets = [
dict(
type=HFDataset,
abbr='afqmc-dev',
path='json',
data_files='./data/CLUE/AFQMC/dev.json',
split='train',
reader_cfg=afqmc_reader_cfg,
infer_cfg=afqmc_infer_cfg,
eval_cfg=afqmc_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_afqmc/CLUE_afqmc_ppl_7b0c1e.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset
afqmc_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
afqmc_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
0: '{sentence1},{sentence2}不同。',
1: '{sentence1},{sentence2}相似。'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
afqmc_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
afqmc_datasets = [
dict(
type=HFDataset,
abbr='afqmc-dev',
path='json',
data_files='./data/CLUE/AFQMC/dev.json',
split='train',
reader_cfg=afqmc_reader_cfg,
infer_cfg=afqmc_infer_cfg,
eval_cfg=afqmc_eval_cfg),
]
================================================
FILE: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_cmnli_gen_1abf97 import cmnli_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_gen_1abf97.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMNLIDatasetV2
from opencompass.utils.text_postprocessors import first_capital_postprocess
cmnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
cmnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
# English gloss: "Sentence 1: {sentence1}\nSentence 2: {sentence2}\nWhat is the relationship between these two sentences?\nA. Entailment\nB. Contradiction\nC. Neutral\nChoose from 'A', 'B', 'C'.\nAnswer:"
prompt=
'语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?\nA. 蕴含\nB. 矛盾\nC. 无关\n请从“A”,“B”,“C”中进行选择。\n答:'
),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
cmnli_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_capital_postprocess),
)
cmnli_datasets = [
dict(
abbr='cmnli',
type=CMNLIDatasetV2,
path='opencompass/cmnli-dev',
reader_cfg=cmnli_reader_cfg,
infer_cfg=cmnli_infer_cfg,
eval_cfg=cmnli_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_gen_51e956.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMNLIDatasetV2
from opencompass.utils.text_postprocessors import first_capital_postprocess
cmnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
cmnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
'阅读文章:{sentence1}\n根据上文,回答如下问题:{sentence2}\nA. 对\nB. 错\nC. 可能\n请从“A”,“B”,“C”中进行选择。\n答:'
),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
cmnli_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_capital_postprocess),
)
cmnli_datasets = [
dict(
abbr='cmnli',
type=CMNLIDatasetV2,
path='opencompass/cmnli-dev',
reader_cfg=cmnli_reader_cfg,
infer_cfg=cmnli_infer_cfg,
eval_cfg=cmnli_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_ppl.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_cmnli_ppl_fdc6de import cmnli_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_ppl_98dd6e.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMNLIDataset
cmnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
cmnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'contradiction':
'阅读文章:{sentence1}\n根据上文,回答如下问题: {sentence2}?\n答:错',
'entailment': '阅读文章:{sentence1}\n根据上文,回答如下问题: {sentence2}?\n答:对',
'neutral': '如果{sentence1}为真,那么{sentence2}也为真吗?可能'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
cmnli_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
cmnli_datasets = [
dict(
abbr='cmnli',
type=CMNLIDataset,
path='opencompass/cmnli-dev',
reader_cfg=cmnli_reader_cfg,
infer_cfg=cmnli_infer_cfg,
eval_cfg=cmnli_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_ppl_ef69e7.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMNLIDataset
cmnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
cmnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'contradiction':
dict(round=[
dict(
role='HUMAN',
prompt='阅读文章:{sentence1}\n根据上文,回答如下问题:{sentence2}?'),
dict(role='BOT', prompt='错')
]),
'entailment':
dict(round=[
dict(
role='HUMAN',
prompt='阅读文章:{sentence1}\n根据上文,回答如下问题:{sentence2}?'),
dict(role='BOT', prompt='对')
]),
'neutral':
dict(round=[
dict(
role='HUMAN', prompt='如果{sentence1}为真,那么{sentence2}也为真吗?'),
dict(role='BOT', prompt='可能')
]),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
cmnli_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
cmnli_datasets = [
dict(
abbr='cmnli',
type=CMNLIDataset,
path='opencompass/cmnli-dev',
reader_cfg=cmnli_reader_cfg,
infer_cfg=cmnli_infer_cfg,
eval_cfg=cmnli_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CLUE_cmnli/CLUE_cmnli_ppl_fdc6de.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMNLIDataset
cmnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
test_split='train')
cmnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'contradiction':
dict(round=[
dict(
role='HUMAN',
prompt='语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?'
),
dict(role='BOT', prompt='矛盾')
]),
'entailment':
dict(round=[
dict(
role='HUMAN',
prompt='语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?'
),
dict(role='BOT', prompt='蕴含')
]),
'neutral':
dict(round=[
dict(
role='HUMAN',
prompt='语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?'
),
dict(role='BOT', prompt='无关')
]),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
cmnli_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
cmnli_datasets = [
dict(
abbr='cmnli',
type=CMNLIDataset,
path='opencompass/cmnli-dev',
reader_cfg=cmnli_reader_cfg,
infer_cfg=cmnli_infer_cfg,
eval_cfg=cmnli_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_ocnli_gen_c4cb6c import ocnli_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_gen_51e956.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMNLIDatasetV2
from opencompass.utils.text_postprocessors import first_capital_postprocess
ocnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
)
# TODO: two prompt templates for ocnli
ocnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
'阅读文章:{sentence1}\n根据上文,回答如下问题:{sentence2}\nA. 对\nB. 错\nC. 可能\n请从“A”,“B”,“C”中进行选择。\n答:'
),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ocnli_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_capital_postprocess),
)
ocnli_datasets = [
dict(
abbr='ocnli',
type=CMNLIDatasetV2,  # OCNLI shares the same format as CMNLI
path='opencompass/OCNLI-dev',
reader_cfg=ocnli_reader_cfg,
infer_cfg=ocnli_infer_cfg,
eval_cfg=ocnli_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_gen_c4cb6c.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMNLIDatasetV2
from opencompass.utils.text_postprocessors import first_capital_postprocess
ocnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'],
output_column='label',
)
# TODO: two prompt templates for ocnli
ocnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
'语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?\nA. 蕴含\n B. 矛盾\n C. 无关\n请从“A”,“B”,“C”中进行选择。\n答:'
),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ocnli_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_capital_postprocess),
)
ocnli_datasets = [
dict(
abbr='ocnli',
type=CMNLIDatasetV2,  # OCNLI shares the same format as CMNLI
path='opencompass/OCNLI-dev',
reader_cfg=ocnli_reader_cfg,
infer_cfg=ocnli_infer_cfg,
eval_cfg=ocnli_eval_cfg,
)
]
================================================
FILE: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_ppl.py
================================================
from mmengine.config import read_base
with read_base():
    from .CLUE_ocnli_ppl_fdc6de import ocnli_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_ppl_98dd6e.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset
ocnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'], output_column='label')
# TODO: two prompt templates for ocnli
ocnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'contradiction':
'阅读文章:{sentence1}\n根据上文,回答如下问题: {sentence2}?\n答:错',
'entailment': '阅读文章:{sentence1}\n根据上文,回答如下问题: {sentence2}?\n答:对',
'neutral': '如果{sentence1}为真,那么{sentence2}也为真吗?可能'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ocnli_eval_cfg = dict(evaluator=dict(type=AccEvaluator), )
ocnli_datasets = [
dict(
type=HFDataset,
abbr='ocnli',
path='json',
split='train',
data_files='./data/CLUE/OCNLI/dev.json',
reader_cfg=ocnli_reader_cfg,
infer_cfg=ocnli_infer_cfg,
eval_cfg=ocnli_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_ppl_ef69e7.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset
ocnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'], output_column='label')
# TODO: two prompt templates for ocnli
ocnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'contradiction':
dict(round=[
dict(
role='HUMAN',
prompt='阅读文章:{sentence1}\n根据上文,回答如下问题:{sentence2}?'),
dict(role='BOT', prompt='错')
]),
'entailment':
dict(round=[
dict(
role='HUMAN',
prompt='阅读文章:{sentence1}\n根据上文,回答如下问题:{sentence2}?'),
dict(role='BOT', prompt='对')
]),
'neutral':
dict(round=[
dict(
role='HUMAN', prompt='如果{sentence1}为真,那么{sentence2}也为真吗?'),
dict(role='BOT', prompt='可能')
]),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ocnli_eval_cfg = dict(evaluator=dict(type=AccEvaluator), )
ocnli_datasets = [
dict(
type=HFDataset,
abbr='ocnli',
path='json',
split='train',
data_files='./data/CLUE/OCNLI/dev.json',
reader_cfg=ocnli_reader_cfg,
infer_cfg=ocnli_infer_cfg,
eval_cfg=ocnli_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CLUE_ocnli/CLUE_ocnli_ppl_fdc6de.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset
ocnli_reader_cfg = dict(
input_columns=['sentence1', 'sentence2'], output_column='label')
# TODO: two prompt templates for ocnli
ocnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'contradiction':
dict(round=[
dict(
role='HUMAN',
prompt='语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?'
),
dict(role='BOT', prompt='矛盾')
]),
'entailment':
dict(round=[
dict(
role='HUMAN',
prompt='语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?'
),
dict(role='BOT', prompt='蕴含')
]),
'neutral':
dict(round=[
dict(
role='HUMAN',
prompt='语句一:“{sentence1}”\n语句二:“{sentence2}”\n请问这两句话是什么关系?'
),
dict(role='BOT', prompt='无关')
]),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ocnli_eval_cfg = dict(evaluator=dict(type=AccEvaluator), )
ocnli_datasets = [
dict(
type=HFDataset,
abbr='ocnli',
path='json',
split='train',
data_files='./data/CLUE/OCNLI/dev.json',
reader_cfg=ocnli_reader_cfg,
infer_cfg=ocnli_infer_cfg,
eval_cfg=ocnli_eval_cfg)
]
================================================
FILE: opencompass/configs/datasets/CMPhysBench/cmphysbench_gen.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.cmphysbench import CMPhysBenchDataset
from opencompass.datasets.cmphysbench import CMPhysBenchEvaluator
cmphysbench_reader_cfg = dict(
input_columns=['prompt'],
output_column='ground_truth'
)
cmphysbench_datasets = []
cmphysbench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(role='HUMAN', prompt='You are a condensed matter physics expert. Please read the following question and provide a step-by-step solution using only the given symbols. Do not introduce any new symbols that are not provided in the problem statement. Your final answer must be presented as a readable LaTeX formula, enclosed in a \\boxed{{}} environment.\n{prompt}'),
dict(role='BOT', prompt='{ground_truth}\n')
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
cmphysbench_eval_cfg = dict(
evaluator=dict(type=CMPhysBenchEvaluator),
)
cmphysbench_datasets.append(
dict(
abbr='CMPhysBench-fix_prompt',
type=CMPhysBenchDataset,
path='weidawang/CMPhysBench',
reader_cfg=cmphysbench_reader_cfg,
infer_cfg=cmphysbench_infer_cfg,
eval_cfg=cmphysbench_eval_cfg,
)
)
================================================
FILE: opencompass/configs/datasets/CMPhysBench/cmphysbench_rawprompt_gen.py
================================================
from opencompass.openicl.icl_raw_prompt_template import RawPromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.cmphysbench import CMPhysBenchDataset
from opencompass.datasets.cmphysbench import CMPhysBenchEvaluator
cmphysbench_reader_cfg = dict(
input_columns=['prompt'],
output_column='ground_truth'
)
cmphysbench_datasets = []
cmphysbench_infer_cfg = dict(
prompt_template=dict(
type=RawPromptTemplate,
messages=[{'role': 'user', 'content': 'You are a condensed matter physics expert. Please read the following question and provide a step-by-step solution using only the given symbols. Do not introduce any new symbols that are not provided in the problem statement. Your final answer must be presented as a readable LaTeX formula, enclosed in a \\boxed{{}} environment.\n{prompt}'}],
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
cmphysbench_eval_cfg = dict(
evaluator=dict(type=CMPhysBenchEvaluator),
)
cmphysbench_datasets.append(
dict(
abbr='CMPhysBench-fix_prompt',
type=CMPhysBenchDataset,
path='weidawang/CMPhysBench',
reader_cfg=cmphysbench_reader_cfg,
infer_cfg=cmphysbench_infer_cfg,
eval_cfg=cmphysbench_eval_cfg,
)
)
================================================
FILE: opencompass/configs/datasets/ChemBench/ChemBench_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .ChemBench_gen_a9f753 import chembench_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ChemBench/ChemBench_gen_a9f753.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ChemBenchDataset
from opencompass.utils.text_postprocessors import first_capital_postprocess
chembench_reader_cfg = dict(
input_columns=['input', 'A', 'B', 'C', 'D'],
output_column='target',
train_split='dev')
chembench_all_sets = [
'Name_Conversion',
'Property_Prediction',
'Mol2caption',
'Caption2mol',
'Product_Prediction',
'Retrosynthesis',
'Yield_Prediction',
'Temperature_Prediction',
'Solvent_Prediction'
]
chembench_datasets = []
for _name in chembench_all_sets:
    # _hint = f'There is a single choice question about {_name.replace("_", " ")}. Answer the question by replying A, B, C or D.'
    _hint = 'There is a single choice question about chemistry. Answer the question by replying A, B, C or D.'
    chembench_infer_cfg = dict(
        ice_template=dict(
            type=PromptTemplate,
            template=dict(round=[
                dict(
                    role='HUMAN',
                    prompt=
                    f'{_hint}\nQuestion: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer: '
                ),
                dict(role='BOT', prompt='{target}\n')
            ]),
        ),
        prompt_template=dict(
            type=PromptTemplate,
            template=dict(
                begin='</E>',
                round=[
                    dict(
                        role='HUMAN',
                        prompt=
                        f'{_hint}\nQuestion: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer: '
                    ),
                ],
            ),
            ice_token='</E>',
        ),
        retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
        inferencer=dict(type=GenInferencer),
    )
    chembench_eval_cfg = dict(
        evaluator=dict(type=AccEvaluator),
        pred_postprocessor=dict(type=first_capital_postprocess))
    chembench_datasets.append(
        dict(
            abbr=f'ChemBench_{_name}',
            type=ChemBenchDataset,
            path='opencompass/ChemBench4K',
            name=_name,
            reader_cfg=chembench_reader_cfg,
            infer_cfg=chembench_infer_cfg,
            eval_cfg=chembench_eval_cfg,
        ))
del _name, _hint
================================================
FILE: opencompass/configs/datasets/ChemBench/ChemBench_llmjudge_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .ChemBench_llmjudge_gen_c584cf import chembench_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ChemBench/ChemBench_llmjudge_gen_c584cf.py
================================================
from opencompass.datasets.math import MATHDataset
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import ChemBenchDataset
chembench_reader_cfg = dict(
input_columns=['input', 'A', 'B', 'C', 'D'],
output_column='target',
train_split='dev')
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
chembench_all_sets = [
    'Name_Conversion',
    'Property_Prediction',
    'Mol2caption',
    'Caption2mol',
    'Product_Prediction',
    'Retrosynthesis',
    'Yield_Prediction',
    'Temperature_Prediction',
    'Solvent_Prediction'
]
_hint = 'There is a single choice question about chemistry. Answer the question by replying A, B, C or D.'
chembench_datasets = []
for _name in chembench_all_sets:
    # Inference configuration: zero-shot, single-turn multiple-choice prompt
    chembench_infer_cfg = dict(
        prompt_template=dict(
            type=PromptTemplate,
            template=dict(round=[
                dict(role='HUMAN', prompt=f'{_hint}\nQuestion: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer: ')
            ])),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=GenInferencer)
    )
    # Evaluation configuration
    chembench_eval_cfg = dict(
        evaluator=dict(
            type=GenericLLMEvaluator,
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
                    ],
                    round=[
                        dict(
                            role='HUMAN',
                            prompt=GRADER_TEMPLATE
                        ),
                    ]),
            ),
            dataset_cfg=dict(
                type=ChemBenchDataset,
                path='opencompass/ChemBench4K',
                name=_name,
                reader_cfg=chembench_reader_cfg,
            ),
            judge_cfg=dict(),
            dict_postprocessor=dict(type=generic_llmjudge_postprocess),
        ),
        pred_role='BOT',
    )
    chembench_datasets.append(
        dict(
            abbr=f'ChemBench_{_name}',
            type=ChemBenchDataset,
            path='opencompass/ChemBench4K',
            name=_name,
            reader_cfg=chembench_reader_cfg,
            infer_cfg=chembench_infer_cfg,
            eval_cfg=chembench_eval_cfg,
        ))
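The doubled braces in the inference prompt of this config are easy to misread. The following is a minimal sketch, assuming the framework later fills `{input}`-style placeholders by plain string substitution; the `fill_prompt` helper and the sample `row` are hypothetical and not part of OpenCompass. It shows that the f-string interpolates `{_hint}` immediately and collapses `{{input}}` to the literal `{input}`, which is then filled per dataset row.

```python
# Illustration only: `fill_prompt` and `row` are hypothetical stand-ins.
_hint = ('There is a single choice question about chemistry. '
         'Answer the question by replying A, B, C or D.')

# f-string: {_hint} is interpolated now; {{input}} becomes the literal {input}.
template = (f'{_hint}\nQuestion: {{input}}\nA. {{A}}\nB. {{B}}\n'
            f'C. {{C}}\nD. {{D}}\nAnswer: ')


def fill_prompt(template: str, row: dict) -> str:
    """Replace each {column} placeholder with the matching value from row."""
    for column, value in row.items():
        template = template.replace('{' + column + '}', str(value))
    return template


row = {
    'input': 'Which element has the symbol Na?',
    'A': 'Nitrogen',
    'B': 'Sodium',
    'C': 'Neon',
    'D': 'Nickel',
}
prompt = fill_prompt(template, row)
print(prompt)
```

The same escaping rule explains why the placeholders in a plain (non-f) string such as `GRADER_TEMPLATE` are not collapsed by Python itself.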
================================================
FILE: opencompass/configs/datasets/ChemBench/ChemBench_llmjudge_rawprompt_gen_fa3fc4.py
================================================
from opencompass.datasets.math import MATHDataset
from opencompass.openicl.icl_raw_prompt_template import RawPromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import ChemBenchDataset
chembench_reader_cfg = dict(
    input_columns=['input', 'A', 'B', 'C', 'D'],
    output_column='target',
    train_split='dev')
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
chembench_all_sets = [
    'Name_Conversion',
    'Property_Prediction',
    'Mol2caption',
    'Caption2mol',
    'Product_Prediction',
    'Retrosynthesis',
    'Yield_Prediction',
    'Temperature_Prediction',
    'Solvent_Prediction'
]
_hint = 'There is a single choice question about chemistry. Answer the question by replying A, B, C or D.'
chembench_datasets = []
for _name in chembench_all_sets:
    # Inference configuration: zero-shot, single user message
    chembench_infer_cfg = dict(
        prompt_template=dict(
            type=RawPromptTemplate,
            messages=[
                {'role': 'user', 'content': f'{_hint}\nQuestion: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer: '}
            ]),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=GenInferencer)
    )
    # Evaluation configuration
    chembench_eval_cfg = dict(
        evaluator=dict(
            type=GenericLLMEvaluator,
            prompt_template=dict(
                type=RawPromptTemplate,
                messages=[
                    {'role': 'system', 'content': "You are a helpful assistant who evaluates the correctness and quality of models' outputs."},
                    {'role': 'user', 'content': GRADER_TEMPLATE},
                ],
            ),
            dataset_cfg=dict(
                type=ChemBenchDataset,
                path='opencompass/ChemBench4K',
                name=_name,
                reader_cfg=chembench_reader_cfg,
            ),
            judge_cfg=dict(),
            dict_postprocessor=dict(type=generic_llmjudge_postprocess),
        ),
        pred_role='BOT',
    )
    chembench_datasets.append(
        dict(
            abbr=f'ChemBench_{_name}',
            type=ChemBenchDataset,
            path='opencompass/ChemBench4K',
            name=_name,
            reader_cfg=chembench_reader_cfg,
            infer_cfg=chembench_infer_cfg,
            eval_cfg=chembench_eval_cfg,
        ))
================================================
FILE: opencompass/configs/datasets/ClimaQA/ClimaQA_Gold_llm_judge_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .ClimaQA_Gold_llm_judge_gen_f15343 import climaqa_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ClimaQA/ClimaQA_Gold_llm_judge_gen_f15343.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import ClimaQADataset, generic_llmjudge_postprocess
from opencompass.evaluator import GenericLLMEvaluator
climaqa_gold_sets = [
    'mcq',
    'cloze',
    'ffq'
]
GRADER_TEMPLATE_mcq = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. The answer may be one of the four options: a, b, c, or d. Only when the options given by prediction are strictly consistent with the answer, the prediction can be considered correct.
3. If the prediction is given with 'The answer is:', please ignore the 'The answer is:', and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
GRADER_TEMPLATE_cloze = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. The form of the answer is a word or a phrase. Please strictly compare the prediction and the answer. Only when the prediction and the answer are exactly the same, will the prediction be considered correct; otherwise, it will be considered incorrect.
3. If the prediction is given with 'The answer is:', please ignore the 'The answer is:' and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
GRADER_TEMPLATE_ffq = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. The type of question is open-ended Q&A. Please compare whether the prediction is close enough to the meaning of the answer and whether the prediction covers each key point in the answer. If the prediction meets the above requirements, it can be considered very close to the answer.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with 'The answer is:', please ignore the 'The answer is:' and only judge whether the candidate's answer is very close to the standard answer.
Please judge whether the following answers are close to the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: very close to the answer
B: not very close to the answer
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
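<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n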
Judging the correctness of candidates' answers:
""".strip()
climaqa_reader_cfg = dict(input_columns=['input'], output_column='target')
climaqa_datasets = []
for _task in climaqa_gold_sets:
    # Select the grader template and the inference prompt for this task type
    if _task == 'mcq':
        GRADER_TEMPLATE = GRADER_TEMPLATE_mcq
        infer_prompt = f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\" without any modification. The question is multiple choice with a single correct answer; the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: a\"\n\nQ: {{input}}\nA: "
    elif _task == 'ffq':
        GRADER_TEMPLATE = GRADER_TEMPLATE_ffq
        infer_prompt = f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\".\n\nQ: {{input}}\nA: "
    elif _task == 'cloze':
        GRADER_TEMPLATE = GRADER_TEMPLATE_cloze
        infer_prompt = f"Fill the <MASK> in the sentence. Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\" without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\".\n\nQ: {{input}}\nA: "
    climaqa_infer_cfg = dict(
        prompt_template=dict(
            type=PromptTemplate,
            template=dict(
                round=[
                    dict(
                        role='HUMAN',
                        prompt=infer_prompt,
                    )
                ]
            ),
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=GenInferencer),
    )
    climaqa_eval_cfg = dict(
        evaluator=dict(
            type=GenericLLMEvaluator,
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
                        )
                    ],
                    round=[
                        dict(role='HUMAN', prompt=GRADER_TEMPLATE),
                    ],
                ),
            ),
            dataset_cfg=dict(
                type=ClimaQADataset,
                path='opencompass/ClimaQA-Gold',
                task=_task,
                abbr='ClimaQA_Gold_' + _task,
                reader_cfg=climaqa_reader_cfg,
            ),
            judge_cfg=dict(),
            dict_postprocessor=dict(type=generic_llmjudge_postprocess),
        ),
        pred_role='BOT',
    )
    climaqa_datasets.append(
        dict(
            abbr='ClimaQA_Gold_' + _task,
            type=ClimaQADataset,
            path='opencompass/ClimaQA-Gold',
            task=_task,
            reader_cfg=climaqa_reader_cfg,
            infer_cfg=climaqa_infer_cfg,
            eval_cfg=climaqa_eval_cfg,
        )
    )
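Each grader template in this config asks the judge to return a bare `A` or `B`. The following is a minimal sketch of how such replies could be turned into an accuracy score. `parse_verdict` and `judge_accuracy` are hypothetical helpers written for illustration, not the actual `generic_llmjudge_postprocess` implementation, and counting unparseable replies as incorrect is an assumption.

```python
import re
from typing import Iterable, Optional


def parse_verdict(reply: str) -> Optional[str]:
    """Return 'A' or 'B' if the reply contains one as a standalone letter."""
    match = re.search(r'\b([AB])\b', reply)
    return match.group(1) if match else None


def judge_accuracy(replies: Iterable[str]) -> dict:
    """Score judge replies: 'A' counts as correct, anything else as incorrect."""
    verdicts = [parse_verdict(reply) for reply in replies]
    total = len(verdicts)
    correct = sum(verdict == 'A' for verdict in verdicts)
    unparsed = sum(verdict is None for verdict in verdicts)
    accuracy = 100.0 * correct / total if total else 0.0
    return {'accuracy': accuracy, 'unparsed': unparsed}


replies = ['A', 'B', 'Answer: A', 'I cannot tell']
print(judge_accuracy(replies))
```

Reporting the number of unparsed replies alongside the score makes it easier to spot a judge model that ignores the requested output format.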
================================================
FILE: opencompass/configs/datasets/ClimaQA/ClimaQA_Gold_llm_judge_rawprompt_gen_b3080f.py
================================================
from opencompass.openicl.icl_raw_prompt_template import RawPromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import ClimaQADataset, generic_llmjudge_postprocess
from opencompass.evaluator import GenericLLMEvaluator
climaqa_gold_sets = [
    'mcq',
    'cloze',
    'ffq'
]
GRADER_TEMPLATE_mcq = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. The answer may be one of the four options: a, b, c, or d. Only when the options given by prediction are strictly consistent with the answer, the prediction can be considered correct.
3. If the prediction is given with 'The answer is:', please ignore the 'The answer is:', and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
GRADER_TEMPLATE_cloze = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. The form of the answer is a word or a phrase. Please strictly compare the prediction and the answer. Only when the prediction and the answer are exactly the same, will the prediction be considered correct; otherwise, it will be considered incorrect.
3. If the prediction is given with 'The answer is:', please ignore the 'The answer is:' and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
GRADER_TEMPLATE_ffq = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. The type of question is open-ended Q&A. Please compare whether the prediction is close enough to the meaning of the answer and whether the prediction covers each key point in the answer. If the prediction meets the above requirements, it can be considered very close to the answer.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with 'The answer is:', please ignore the 'The answer is:' and only judge whether the candidate's answer is very close to the standard answer.
Please judge whether the following answers are close to the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: very close to the answer
B: not very close to the answer
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
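<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n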
Judging the correctness of candidates' answers:
""".strip()
climaqa_reader_cfg = dict(input_columns=['input'], output_column='target')
climaqa_datasets = []
for _task in climaqa_gold_sets:
    # Select the grader template and the inference prompt for this task type
    if _task == 'mcq':
        GRADER_TEMPLATE = GRADER_TEMPLATE_mcq
        infer_prompt = f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\" without any modification. The question is multiple choice with a single correct answer; the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: a\"\n\nQ: {{input}}\nA: "
    elif _task == 'ffq':
        GRADER_TEMPLATE = GRADER_TEMPLATE_ffq
        infer_prompt = f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\".\n\nQ: {{input}}\nA: "
    elif _task == 'cloze':
        GRADER_TEMPLATE = GRADER_TEMPLATE_cloze
        infer_prompt = f"Fill the <MASK> in the sentence. Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\" without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\".\n\nQ: {{input}}\nA: "
    climaqa_infer_cfg = dict(
        prompt_template=dict(
            type=RawPromptTemplate,
            messages=[
                {'role': 'user', 'content': infer_prompt},
            ]
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=GenInferencer),
    )
    climaqa_eval_cfg = dict(
        evaluator=dict(
            type=GenericLLMEvaluator,
            prompt_template=dict(
                type=RawPromptTemplate,
                messages=[
                    {'role': 'system', 'content': "You are a helpful assistant who evaluates the correctness and quality of models' outputs."},
                    {'role': 'user', 'content': GRADER_TEMPLATE},
                ],
            ),
            dataset_cfg=dict(
                type=ClimaQADataset,
                path='opencompass/ClimaQA-Gold',
                task=_task,
                abbr='ClimaQA_Gold_' + _task,
                reader_cfg=climaqa_reader_cfg,
            ),
            judge_cfg=dict(),
            dict_postprocessor=dict(type=generic_llmjudge_postprocess),
        ),
        pred_role='BOT',
    )
    climaqa_datasets.append(
        dict(
            abbr='ClimaQA_Gold_' + _task,
            type=ClimaQADataset,
            path='opencompass/ClimaQA-Gold',
            task=_task,
            reader_cfg=climaqa_reader_cfg,
            infer_cfg=climaqa_infer_cfg,
            eval_cfg=climaqa_eval_cfg,
        )
    )
================================================
FILE: opencompass/configs/datasets/ClimaQA/ClimaQA_Silver_llm_judge_gen.py
================================================
from mmengine.config import read_base
with read_base():
    from .ClimaQA_Silver_llm_judge_gen_f15343 import climaqa_datasets  # noqa: F401, F403
================================================
FILE: opencompass/configs/datasets/ClimaQA/ClimaQA_Silver_llm_judge_gen_f15343.py
================================================
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import ClimaQADataset, generic_llmjudge_postprocess
from opencompass.evaluator import GenericLLMEvaluator
climaqa_silver_sets = [
    'mcq',
    'cloze',
    'ffq'
]
GRADER_TEMPLATE_mcq = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. The answer may be one of the four options: a, b, c, or d. Only when the options given by prediction are strictly consistent with the answer, the prediction can be considered correct.
3. If the prediction is given with 'The answer is:', please ignore the 'The answer is:', and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
GRADER_TEMPLATE_cloze = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. The form of the answer is a word or a phrase. Please strictly compare the prediction and the answer. Only when the prediction and the answer are exactly the same, will the prediction be considered correct; otherwise, it will be considered incorrect.
3. If the prediction is given with 'The answer is:', please ignore the 'The answer is:' and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either A or B. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
: \n{target}\n