Full Code of huggingface/open-r1 for AI

main 0e06249d1caa cached

77 files

351.9 KB

89.1k tokens

221 symbols

1 requests

Download .txt

Showing preview only (374K chars total). Download the full file or copy to clipboard to get everything.

Repository: huggingface/open-r1
Branch: main
Commit: 0e06249d1caa
Files: 77
Total size: 351.9 KB

Directory structure:
gitextract_ohr6vcnk/

├── .github/
│   ├── dependabot.yml
│   └── workflows/
│       └── tests.yml
├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── recipes/
│   ├── DeepSeek-R1-Distill-Qwen-1.5B/
│   │   └── grpo/
│   │       └── config_demo.yaml
│   ├── OlympicCoder-32B/
│   │   └── sft/
│   │       └── config_v00.00.yaml
│   ├── OlympicCoder-7B/
│   │   └── sft/
│   │       └── config_v00.00.yaml
│   ├── OpenR1-Distill-7B/
│   │   └── sft/
│   │       └── config_distill.yaml
│   ├── Qwen2.5-1.5B-Instruct/
│   │   └── grpo/
│   │       ├── config_demo.yaml
│   │       ├── config_demo_code.yaml
│   │       └── config_demo_code_ioi.yaml
│   ├── Qwen2.5-Coder-7B-Instruct/
│   │   └── grpo/
│   │       └── config_codeforces.yaml
│   ├── README.md
│   ├── accelerate_configs/
│   │   ├── ddp.yaml
│   │   ├── fsdp.yaml
│   │   ├── zero2.yaml
│   │   └── zero3.yaml
│   └── dataset_filtering/
│       ├── config_demo.yaml
│       ├── filter_dapo.yaml
│       └── filter_python.yaml
├── scripts/
│   ├── benchmark_e2b.py
│   ├── decontaminate.py
│   ├── e2b_router.py
│   ├── generate_reasoning.py
│   ├── get_tensor_parallel_size.py
│   ├── morph_router.py
│   ├── pass_rate_filtering/
│   │   ├── README.md
│   │   ├── compute_pass_rate.py
│   │   └── launch_filtering.sh
│   ├── run_benchmarks.py
│   └── upload_details.py
├── setup.cfg
├── setup.py
├── slurm/
│   ├── README.md
│   ├── compute_pass_rate.slurm
│   ├── e2b_router.slurm
│   ├── evaluate.slurm
│   ├── experimental/
│   │   └── serve_r1_vllm.slurm
│   ├── generate.slurm
│   ├── morph_router.slurm
│   ├── piston/
│   │   ├── README.md
│   │   ├── launch_piston_workers.sh
│   │   └── launch_single_piston.sh
│   ├── serve_r1.slurm
│   ├── serve_router.slurm
│   └── train.slurm
├── src/
│   └── open_r1/
│       ├── __init__.py
│       ├── configs.py
│       ├── generate.py
│       ├── grpo.py
│       ├── rewards.py
│       ├── sft.py
│       └── utils/
│           ├── __init__.py
│           ├── callbacks.py
│           ├── code_providers.py
│           ├── competitive_programming/
│           │   ├── __init__.py
│           │   ├── cf_scoring.py
│           │   ├── code_patcher.py
│           │   ├── ioi_scoring.py
│           │   ├── ioi_utils.py
│           │   ├── morph_client.py
│           │   ├── piston_client.py
│           │   └── utils.py
│           ├── data.py
│           ├── evaluation.py
│           ├── hub.py
│           ├── import_utils.py
│           ├── model_utils.py
│           ├── routed_morph.py
│           ├── routed_sandbox.py
│           └── wandb_logging.py
└── tests/
    ├── __init__.py
    ├── slow/
    │   └── test_code_reward.py
    ├── test_rewards.py
    └── utils/
        └── test_data.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"


================================================
FILE: .github/workflows/tests.yml
================================================
name: Tests

on:
  push:
    branches:
      - main
      - v*-release
  pull_request:
    branches:
      - main

jobs:

  tests:
    name: Run tests and quality checks
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup Python environment
        uses: actions/setup-python@v5
        with:
          python-version: 3.10.10
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install ".[quality,tests]"
      - name: Code quality
        run: |
          make quality
      - name: Run tests
        run: |
          make test



================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# UV
#   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#uv.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#   in version control.
#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# PyPI configuration file
.pypirc

# Temp folders
data/
wandb/
logs/
eval_results/
results/

.vscode/
.python-version

================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: Makefile
================================================
.PHONY: style quality

# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
export PYTHONPATH = src

check_dirs := src tests


# dev dependencies
install:
	uv venv openr1 --python 3.11
	. openr1/bin/activate && uv pip install --upgrade pip && \
	uv pip install vllm==0.8.5.post1 && \
	uv pip install setuptools && \
	uv pip install flash-attn --no-build-isolation && \
	GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"

style:
	ruff format --line-length 119 --target-version py310 $(check_dirs) setup.py
	isort $(check_dirs) setup.py

quality:
	ruff check --line-length 119 --target-version py310 $(check_dirs) setup.py
	isort --check-only $(check_dirs) setup.py
	flake8 --max-line-length 119 $(check_dirs) setup.py

test:
	pytest -sv --ignore=tests/slow/ tests/

slow_test:
	pytest -sv -vv tests/slow/

# Evaluation

evaluate:
	$(eval PARALLEL_ARGS := $(if $(PARALLEL),$(shell \
		if [ "$(PARALLEL)" = "data" ]; then \
			echo "data_parallel_size=$(NUM_GPUS)"; \
		elif [ "$(PARALLEL)" = "tensor" ]; then \
			echo "tensor_parallel_size=$(NUM_GPUS)"; \
		fi \
	),))
	$(if $(filter tensor,$(PARALLEL)),export VLLM_WORKER_MULTIPROC_METHOD=spawn &&,) \
	MODEL_ARGS="pretrained=$(MODEL),dtype=bfloat16,$(PARALLEL_ARGS),max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}" && \
	if [ "$(TASK)" = "lcb" ]; then \
		lighteval vllm $$MODEL_ARGS "extended|lcb:codegeneration|0|0" \
			--use-chat-template \
			--output-dir data/evals/$(MODEL); \
	else \
		lighteval vllm $$MODEL_ARGS "lighteval|$(TASK)|0|0" \
			--use-chat-template \
			--output-dir data/evals/$(MODEL); \
	fi


================================================
FILE: README.md
================================================
# Open R1

*A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!*

**Table of Contents**  
1. [Overview](#overview)  
2. [Plan of attack](#plan-of-attack)  
3. [Installation](#installation)  
4. [Training models](#training-models)  
   - [SFT](#sft)  
   - [GRPO](#grpo)  
5. [Evaluating models](#evaluating-models)  
6. [Reproducing Deepseek's evaluation results](#reproducing-deepseeks-evaluation-results)  
7. [Data generation](#data-generation)  
   - [Generate data from a smol distilled R1 model](#generate-data-from-a-smol-distilled-r1-model)  
   - [Generate data from DeepSeek-R1](#generate-data-from-deepseek-r1)  
8. [Contributing](#contributing)

## Overview

The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:


- `src/open_r1`: contains the scripts to train models as well as generate synthetic data:
    - `grpo.py`: trains a model with GRPO on a given dataset.
    - `sft.py`: performs a simple SFT of a model on a dataset.
    - `generate.py`: generates synthetic data from a model using [Distilabel](https://github.com/argilla-io/distilabel).
- `Makefile`: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.

### Plan of attack

We will use the DeepSeek-R1 [tech report](https://github.com/deepseek-ai/DeepSeek-R1) as a guide, which can roughly be broken down into three main steps:

* Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
* Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
* Step 3: show we can go from base model to RL-tuned via multi-stage training.

<center>
    <img src="assets/plan-of-attack.png" width="500">
</center>

## News 🗞️

* **🧑‍🍳 [2025/05/26] (Step 1 completed!)** We release [**Mixture-of-Thoughts**](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts)--a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train [OpenR1-Distill-7B](https://huggingface.co/open-r1/OpenR1-Distill-7B), which replicates the reasoning capabilities of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) and marks the completion of step 1 in the Open R1 project.
* **⚡️ [2025/03/11] [(update #3)](https://huggingface.co/blog/open-r1/update-3):** We release the [**CodeForces-CoTs**](https://huggingface.co/datasets/open-r1/codeforces-cots) dataset of 10k competitive programming problems and 100k solutions distilled from R1. We also release IOI24: a new benchmark of _very_ hard problems from international olympiads. A 7B Qwen model trained on CodeForces-CoTs can outperform Claude 3.7 Sonnet on IOI24, while a 32B model can outperform R1 itself.
* **∞ [2025/02/10] [(update #2)](https://huggingface.co/blog/open-r1/update-2):** We release the [**OpenR1-Math-220k**](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset of 220k traces distilled from R1 on a new version of NuminaMath. Models trained on this dataset match the performance of DeepSeek's distilled ones.
* **🔥 [2025/02/02] [(update #1)](https://huggingface.co/blog/open-r1/update-1):** We implement the first parts of the [training](https://github.com/huggingface/open-r1?tab=readme-ov-file#training-models), [inference](https://github.com/huggingface/open-r1?tab=readme-ov-file#data-generation), and [evaluation](https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results) pipelines. Let's go!  

## Installation

> [!CAUTION]
> Libraries rely on CUDA 12.4. If you see errors related to segmentation faults, double check the version your system is running with `nvcc --version`.

To run the code in this project, first, create a Python virtual environment using e.g. `uv`.
To install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/getting-started/installation/).


> [!NOTE]
> As a shortcut, run `make install` to setup development libraries (spelled out below). Afterwards, if everything is setup correctly you can try out the Open-R1 models.


```shell
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
```

> [!TIP]
> For Hugging Face cluster users, add `export UV_LINK_MODE=copy` to your `.bashrc` to suppress cache warnings from `uv`

Next, install vLLM and FlashAttention:

```shell
uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
```

This will also install PyTorch `v2.6.0` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:

```shell
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
```

Next, log into your Hugging Face and Weights and Biases accounts as follows:

```shell
huggingface-cli login
wandb login
```

Finally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:

```shell
git-lfs --version
```

If it isn't installed, run:

```shell
sudo apt-get install git-lfs
```

## Training models

> [!NOTE]
> The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.

We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to perform SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts), run:

```shell
# Train via command line
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --eos_token '<|im_end|>' \
    --learning_rate 4.0e-5 \
    --num_train_epochs 5 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 2 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/OpenR1-Distill-7B

# Train via YAML config
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
```

Currently, the following tasks are supported:

* Supervised Fine-Tuning `sft`
* Group Relative Policy Optimization `grpo`

> [!TIP]
> If you scale up/down the number of GPUs, we recommend also scaling up the per-device batch size or number of gradient accumulation steps to keep the global batch size constant.

By default, these scripts will push each model to your Hugging Face Hub username, i.e. `{username}/{model_name}-{task}`. You can override the parameters in each YAML config by appending them to the command as follows: 

```shell
# Change the base model to a smaller variant
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
    --model_name_or_path Qwen/Qwen3-0.6B-Base \
    --hub_model_id OpenR1-Distill-0.6B \
    --output_dir data/OpenR1-Distill-0.6B
```

If you also wish to override the Weights and Biases default settings, you can do so as follows:

```shell
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
    --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO
```

**🚨 WARNING 🚨**

Most base models like `meta-llama/Llama-3.2-1B` do not have a chat template, so we set ChatML as the default during training. However, for Qwen base models like `Qwen/Qwen2.5-1.5B`, a chat template is pre-defined in the tokenizer, so the EOS token must be set accordingly, e.g.

```diff
# Align EOS token with chat template for Qwen base models
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-1.5B \
+   --eos_token '<|im_end|>'
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
```

If you wish to use a custom chat template (e.g. Llama or Gemma), then the chat template and associated EOS token must be provided:

```diff
# Align EOS token with custom chat template
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path meta-llama/Llama-3.2-1B \
+   --chat_template "$(cat llama_chat_template.jinja)" \
+   --eos_token '<|eot_id|>' \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Llama-3.2-1B-Open-R1-Distill
```

### SFT distillation

We provide a recipe to reproduce the reasoning capabilities of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), starting from the same base model. To do so, run:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml
```

The result will be a model like [open-r1/OpenR1-Distill-7B](https://huggingface.co/open-r1/OpenR1-Distill-7B), with the following downstream performance:

| Model                       | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |
|-----------------------------|-----------|----------|--------------|------------------|
| OpenR1-Distill-7B           | 52.7      | 89.0     | 52.8         | 39.4             |
| DeepSeek-R1-Distill-Qwen-7B | 51.3      | 93.5     | 52.4         | 37.4             |

You can adjust the YAML config to train on a different base model or dataset.

### GRPO

We use TRL's [vLLM backend](https://huggingface.co/docs/trl/speeding_up_training?vllm+examples=GRPO#vllm-for-fast-generation-in-online-methods) to scale training to large models across multiple nodes. For single-node training of smol models across 8 GPUs, use `vllm_mode="colocate"` to run vLLM in the same process as the training script:

```shell
ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
    --vllm_mode colocate
```

> [!WARNING]
> The chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the `<think>` and `</think>` tags. It also prefills the assistant response with `<think>` which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g.  [recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml](./recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml).

For multi-node training on N+1 nodes, with 1 node running the vLLM server and N nodes running training, we provide an example Slurm script. For example, to run the above example on 1+1 nodes with data parallelism, run:

```shell
sbatch --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 8 --tp 1
```

See the [Launching jobs on a Slurm cluster](#launching-jobs-on-a-slurm-cluster) section for more details.

#### GRPO dataset filtering

We provide support to filter datasets by generating and computing pass rate on veriable tasks, see this [README](scripts/pass_rate_filtering/README.md)

#### 👨‍💻 Training with a code interpreter

We provide a `code` reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like [Codeforces](https://codeforces.com), where solutions are executed against a set of test cases and the overall success rate is returned as the final reward. To ensure safe execution, we support multiple sandbox providers:

1. [E2B](https://e2b.dev) - Fast, cloud-based sandboxes with focus on Python execution
2. [Morph](https://cloud.morph.so/web/) - Cloud-based sandboxes with broader language support - Python/JS/C++/Rust

To use the code reward function, first install the necessary dependencies:

```shell
uv pip install -e '.[code]'
```

##### E2B Provider

To use E2B sandboxes, create a `.env` file and add your E2B API token:

```
E2B_API_KEY="e2b_xxx"
```

##### Morph Provider

To use Morph, first install the morphcloud package:

```shell
pip install morphcloud
```

Then add your Morph API token to the `.env` file:

```
MORPH_API_KEY="YOUR_MORPH_API_KEY"
```

To specify which provider to use, add the `provider_type` parameter in your configuration:

```yaml
# For E2B
provider_type: e2b

# For Morph
provider_type: morph
```

##### Dataset Requirements

Make sure your dataset contains a `verification_info` column with the following schema (adopted from PrimeIntellect's excellent [datasets](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37) of verifiable problems):

```python
{
    "language": "python",  # Morph supports more languages including C++, Java, etc.
    "test_cases": [
        {
            "input": "4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n",
            "output": "1\n3 \n-1\n0\n\n2\n1 2 \n",
            "type": "stdin_stdout",
        }
    ],
}
```

For example, to train a smol model on Python problems, start the vLLM server:

```shell
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-1.5B-Instruct
```

Then run training with:

```shell
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes=7 \
    src/open_r1/grpo.py --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml
```

##### Using Router Services

It is possible to be rate limited when too many scripts are executed on sandbox services. For both providers, we offer router scripts that can be launched on a CPU node:

For E2B:
```shell
sbatch slurm/e2b_router.slurm
```

For Morph:
```shell
sbatch slurm/morph_router.slurm
```

Then add the router URL in your training YAML config:
```yaml
# For E2B
e2b_router_url: 1.2.3.4:8000

# For Morph
morph_router_url: 1.2.3.4:8000
```

The port should match the one used when launching the router.
All training jobs can share the same router IP which will ensure parallel executions are properly managed.

#### Competitive Programming problems: IOI & CodeForces

We provide `ioi_code_reward` and `cf_code_reward` reward functions for executing problems from [IOI](https://hf.co/datasets/open-r1/ioi) and [CodeForces](https://huggingface.co/datasets/open-r1/codeforces), respectively. You can use either [piston](https://github.com/engineer-man/piston) or Morph (currently IOI only) as your execution provider.

##### Piston 

To use Piston:
1. Get piston workers running, see [slurm/piston/README.md](./slurm/piston/README.md)
2. Set your environment variable `PISTON_ENDPOINTS` to `slurm` or to a list of piston worker endpoints

For IOI:

3. In your configuration, use `ioi_provider: "piston"`

For CodeForces:

3. Download the generated (hard) test cases:
```
# change PATH_TO_SAVE_TESTCASES. Increase --max-workers according to your machine's capacity
huggingface-cli download open-r1/codeforces --repo-type=dataset --include='generated_tests/*.parquet' --max-workers=8 --local-dir PATH_TO_SAVE_TESTCASES 
```
4. Save the path in .env:
```
CF_TESTS_FOLDER=PATH_TO_SAVE_TESTCASES
```

##### Morph 

Morph is a cloud-based solution that provides sandboxed environments for running code. To use it:
1. Install the Morph client: `pip install morphcloud`
2. Add your Morph API key to the `.env` file: `MORPH_API_KEY="your_key_here"`
3. In your configuration, use `ioi_provider: "morph"`

##### Example recipes
For IOI:

See the [example recipe](./recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml) for how to use the IOI reward function:

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=7 src/open_r1/grpo.py \
    --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml
```

For CodeForces:

```shell
sbatch --job-name=cf-grpo --nodes=2 slurm/train.slurm --model Qwen2.5-Coder-7B-Instruct --task grpo --config codeforces --accelerator zero3 --dp 8 --tp 1
```

### Launching jobs on a Slurm cluster

If you have access to a Slurm cluster, we provide a `slurm/train.slurm` script that will automatically queue training jobs for you. Here's how you can use it:

```shell
sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model {model_name} --task {task} --config {config_suffix} --accelerator {accelerator}
```

Here `{model_name}` and `{task}` are defined as above, while `{config_suffix}` refers to the specific config and `{accelerator}` refers to the choice of 🤗 Accelerate config in `recipes/accelerate_configs`. If you wish to override the default config parameters, you can provide them by appending a space-separated string like `'--arg1=value1 --arg2=value2'`. Here's a concrete example to run SFT on 1 node of 8 GPUs:

```shell
sbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model OpenR1-Distill-7B --task sft --config distill --accelerator zero3
```

You can scale the number of nodes by increasing the `--nodes` flag.

For GRPO, we use 1 node for the vLLM server and N nodes for training. For example, to run GRPO on 1+1 nodes with mixed data and tensor parallelism, run:

```shell
sbatch --job-name=open_r1 --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 4 --tp 2
```

> [!NOTE]
> The configuration in `slurm/train.slurm` is optimised for the Hugging Face Compute Cluster and may require tweaking to be adapted to your own compute nodes.

### Customising the dataset mixture

To combine multiple datasets as a single training mixture, you can specify the `dataset_mixture` parameter in the YAML config file. Here's a template for how to do this:

```yaml
dataset_mixture:
  datasets:                     # List of datasets to include in the mixture
    - id: dataset_1             # Hub dataset ID
      config: config_name_1     # Name of the dataset config
      split: split_1            # Split to use from the dataset
      columns:                  # Columns to keep
        - column_1              
        - column_2    
      weight: 0.25              # Fraction of dataset to use
    - id: dataset_2
      config: config_name_2
      split: split_2
      columns:                  
        - column_1              
        - column_2   
      weight: 0.5
  seed: 42                      # Seed for shuffling the combined dataset
  test_split_size: 0.1          # Fraction of mixture to use for a test split
```

## Evaluating models

We use `lighteval` to evaluate models. For models which fit on a single GPU, run:

```shell
export VLLM_WORKER_MULTIPROC_METHOD=spawn # Required for vLLM
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

# AIME 2024
TASK=aime24
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# MATH-500
TASK=math_500
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# GPQA Diamond
TASK=gpqa:diamond
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

# LiveCodeBench
lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR 
```

To increase throughput across multiple GPUs, use _data parallel_ as follows:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

For large models which require sharding across GPUs, use _tensor parallel_ and run:

```shell
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.

To evaluate on a single GPU:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
```

To use Data Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
```

To use Tensor Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
```

## Reproducing Deepseek's evaluation results

The DeepSeek-R1 paper uses sampling with 4-64 responses per query to estimate `pass@1` accuracy, but does not specify the specific number of responses per benchmark. In the tables below, we estimate `pass@1` accuracy with the following number of responses per query:

|   Benchmark   | Number of responses per query |
|:-------------:|:-----------------------------:|
|   AIME 2024   |              64               |
|   MATH-500    |               4               |
| GPQA Diamond  |               8               |
| LiveCodeBench |              16               |


Note that for benchmarks like AIME24, it is important to sample many responses as there are only 30 problems and this can introduce high variance across repeated runs. The choice of how many responses to sample per prompt likely explains the small differences between our evaluation results and those reported by DeepSeek.

### AIME 2024

We are able to reproduce Deepseek's reported results on the AIME 2024 benchmark within ~1-3 standard deviations:

| Model                         | AIME 2024 (🤗 LightEval) | AIME 2024 (DeepSeek Reported) |
|:------------------------------|:------------------------:|:-----------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B |           30.7           |             28.9              |
| DeepSeek-R1-Distill-Qwen-7B   |           50.8           |             55.5              |
| DeepSeek-R1-Distill-Qwen-14B  |           65.9           |             69.7              |
| DeepSeek-R1-Distill-Qwen-32B  |           69.7           |             72.6              |
| DeepSeek-R1-Distill-Llama-8B  |           43.9           |             41.7              |
| DeepSeek-R1-Distill-Llama-70B |           63.0           |             70.0              |

To reproduce these results use the following command:

```shell
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

Alternatively, you can launch Slurm jobs as follows:

```shell
python scripts/run_benchmarks.py --model-id {model_id}  --benchmarks aime24
```

### MATH-500

We are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:

| Model                         | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
|:------------------------------|:-----------------------:|:----------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B |          83.1           |             83.9             |
| DeepSeek-R1-Distill-Qwen-7B   |          94.5           |             92.8             |
| DeepSeek-R1-Distill-Qwen-14B  |          94.1           |             93.9             |
| DeepSeek-R1-Distill-Qwen-32B  |          95.6           |             94.3             |
| DeepSeek-R1-Distill-Llama-8B  |          88.6           |             89.1             |
| DeepSeek-R1-Distill-Llama-70B |          95.1           |             94.5             |

To reproduce these results use the following command:

```shell
export VLLM_WORKER_MULTIPROC_METHOD=spawn
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|math_500|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

Alternatively, you can launch Slurm jobs as follows:

```shell
python scripts/run_benchmarks.py --model-id {model_id}  --benchmarks math_500
```

### GPQA Diamond

We are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:

| Model                         | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
|:------------------------------|:---------------------------:|:--------------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B |            35.8             |               33.8               |
| DeepSeek-R1-Distill-Qwen-7B   |            50.5             |               49.1               |
| DeepSeek-R1-Distill-Qwen-14B  |            61.5             |               59.1               |
| DeepSeek-R1-Distill-Qwen-32B  |            63.1             |               62.1               |
| DeepSeek-R1-Distill-Llama-8B  |            46.7             |               49.0               |
| DeepSeek-R1-Distill-Llama-70B |            67.4             |               65.2               |

To reproduce these results use the following command:

```shell
export VLLM_WORKER_MULTIPROC_METHOD=spawn
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "lighteval|gpqa:diamond|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

```shell
python scripts/run_benchmarks.py --model-id {model_id}  --benchmarks gpqa
```

### LiveCodeBench

We are able to reproduce Deepseek's reported results on the LiveCodeBench code generation benchmark within ~1-3 standard deviations:

| Model                         | LiveCodeBench (🤗 LightEval) | LiveCodeBench (DeepSeek Reported) |
|:------------------------------|:----------------------------:|:---------------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B |             16.1             |               16.9                |
| DeepSeek-R1-Distill-Qwen-7B   |             37.4             |               37.6                |
| DeepSeek-R1-Distill-Qwen-14B  |             51.3             |               53.1                |
| DeepSeek-R1-Distill-Qwen-32B  |             56.0             |               57.2                |
| DeepSeek-R1-Distill-Llama-8B  |             37.4             |               39.6                |
| DeepSeek-R1-Distill-Llama-70B |             55.9             |               57.5                |

To reproduce these results use the following command:

```shell
NUM_GPUS=1 # Set to 8 for 32B and 70B models, or data_parallel_size=8 with the smaller models for speed
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

```shell
python scripts/run_benchmarks.py --model-id {model_id}  --benchmarks lcb
```

## Data generation

### Generate data from a smol distilled R1 model

The following example can be run in 1xH100. 
First install the following dependencies:

```shell
uv pip install "distilabel[vllm]>=1.5.2"
```

Now save the following snippet into a file named `pipeline.py` and run it with `python pipeline.py`. It will generate 4 outputs for each of the 10 examples (change the username for the repository to your org/user name):

```python
from datasets import load_dataset
from distilabel.models import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration


prompt_template = """\
You will be given a problem. Please reason step by step, and put your final answer within \boxed{}:
{{ instruction }}"""

dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train").select(range(10))

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # Exchange with another smol distilled r1

with Pipeline(
    name="distill-qwen-7b-r1",
    description="A pipeline to generate data from a distilled r1 model",
) as pipeline:

    llm = vLLM(
        model=model_id,
        tokenizer=model_id,
        extra_kwargs={
            "tensor_parallel_size": 1,
            "max_model_len": 8192,
        },
        generation_kwargs={
            "temperature": 0.6,
            "max_new_tokens": 8192,
        },
    )
    prompt_column = "problem"
    text_generation = TextGeneration(
        llm=llm, 
        template=prompt_template,
        num_generations=4,
        input_mappings={"instruction": prompt_column} if prompt_column is not None else {}
    )


if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="username/numina-deepseek-r1-qwen-7b")
```

Take a look at the sample dataset at [HuggingFaceH4/numina-deepseek-r1-qwen-7b](https://huggingface.co/datasets/HuggingFaceH4/numina-deepseek-r1-qwen-7b).


### Generate data from DeepSeek-R1

To run the bigger DeepSeek-R1, we used 2 nodes, each with 8×H100 GPUs using the slurm file present in this repo at `slurm/generate.slurm`. First, install the dependencies:

(for now we need to install the vllm dev wheel that [fixes the R1 cuda graph capture](https://github.com/vllm-project/vllm/commits/221d388cc5a836fa189305785ed7e887cea8b510/csrc/moe/moe_align_sum_kernels.cu))
```shell
pip install https://wheels.vllm.ai/221d388cc5a836fa189305785ed7e887cea8b510/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu121

uv pip install "distilabel[vllm,ray,openai]>=1.5.2"
```

And then run the following command:

```shell
sbatch slurm/generate.slurm \
    --hf-dataset AI-MO/NuminaMath-TIR \
    --temperature 0.6 \
    --prompt-column problem \
    --model deepseek-ai/DeepSeek-R1 \
    --hf-output-dataset username/r1-dataset
```

> [!NOTE]  
> While the job is running, you can setup an SSH tunnel through the cluster login node to access the Ray dashboard from your computer running `ssh -L 8265:ray_ip_head_node:8265 <login_node>`, then browsing `http://localhost:8265`


### Data decontamination

Following [s1: Simple test-time scaling](https://huggingface.co/papers/2501.19393) the data can be decontaminated using the script at: [scripts/decontaminate.py](./scripts/decontaminate.py), which decontaminates a dataset using 8-grams and deduplicate the data. Sample run:

```shell
python scripts/decontaminate.py \
    --dataset "open-r1/verifiable-coding-problems-python" \
    --problem_column problem \
    --cleanup
```

It will decontaminate against the benchmark datasets, and remove the contaminated samples afterwards. If no argument `--new_dataset_name` is provided, the same dataset will be reused, adding a `_decontaminated`. It runs against the prompt, which for this dataset is the column `problem`, but a different one can be provided.

Arguments for the script:

```shell
usage: decontaminate.py [-h] --dataset DATASET [--split SPLIT] [--ngram_size NGRAM_SIZE] [--problem_column PROBLEM_COLUMN] [--cleanup] [--new_dataset_name NEW_DATASET_NAME]

options:
  -h, --help            show this help message and exit
  --dataset DATASET     Name of the dataset to check for contamination.
  --split SPLIT         Split to check for contamination, defaults to `train`.
  --ngram_size NGRAM_SIZE
                        Size of n-grams to build, defaults to 8.
  --problem_column PROBLEM_COLUMN
                        Name of the column containing the problem (prompt).
  --cleanup           Whether to remove the contaminated rows before pushing the dataset.
  --new_dataset_name NEW_DATASET_NAME
                        New name for the dataset. If not provided, will reuse the name and add a `_decontaminated` to the name.
```

## Contributing

Contributions are welcome. Please refer to https://github.com/huggingface/open-r1/issues/23.

## Acknowledgements

This project is built with the collective efforts of many groups and individuals in the open AI community. We are especially grateful to the vLLM and SGLang teams for creating high-performance tooling to scale the rollouts of GRPO. We also thank the teams at [OpenThoughts](https://www.open-thoughts.ai), [Prime Intellect](https://www.primeintellect.ai), and [General Reasoning](https://gr.inc) for creating and sharing high-quality datasets for reasoning.

## Citation

If you find this project is useful in your own work, please consider citing as follows:

```
@misc{openr1,
    title = {Open R1: A fully open reproduction of DeepSeek-R1},
    url = {https://github.com/huggingface/open-r1},
    author = {{Hugging Face}},
    month = {January},
    year = {2025}
}
```


================================================
FILE: recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
================================================
# Model arguments
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
# We edit the DeepSeek chat template to ensure (a) the reasoning block within <think> and </think> is included in the completion and (b) the <think> tag is not part of the prefill so that the format reward works
chat_template: "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<｜tool▁outputs▁end｜>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<｜Assistant｜>'}}{% endif %}"
dataset_name: open-r1/OpenR1-Math-220k
dataset_prompt_column: problem
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"

# GRPO trainer config
bf16: true
use_vllm: true
do_eval: false
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: DeepSeek-R1-Distill-Qwen-1.5B-GRPO
hub_strategy: every_save
learning_rate: 1.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
max_prompt_length: 512
max_completion_length: 2048
max_steps: -1
num_generations: 16
num_train_epochs: 1
output_dir: data/DeepSeek-R1-Distill-Qwen-1.5B-GRPO
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: true
report_to:
- wandb
reward_funcs:
- accuracy
- format
- tag_count
reward_weights:
- 1.0
- 1.0
- 1.0
save_strategy: "epoch"
save_total_limit: 1
seed: 42
temperature: 0.7
use_liger_kernel: true
warmup_ratio: 0.1


================================================
FILE: recipes/OlympicCoder-32B/sft/config_v00.00.yaml
================================================
# Config for 16 nodes of 8 H100s with FSDP1
# Model arguments
model_name_or_path: Qwen/Qwen2.5-Coder-32B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: open-r1/codeforces-cots
dataset_config: solutions_decontaminated
dataset_num_proc: 12

# SFT trainer config
bf16: true
do_eval: false
eval_strategy: 'no'
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_always_push: true
hub_model_id: OlympicCoder-32B
hub_strategy: every_save
learning_rate: 4.0e-05
log_level: info
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
packing: false
max_grad_norm: 0.2
max_length: 22528 # we were unable to train at 32k due to OOM. See https://github.com/huggingface/transformers/issues/35983 for context parallelism support.
max_steps: -1
num_train_epochs: 10
optim: paged_adamw_8bit
output_dir: data/OlympicCoder-32B
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
push_to_hub: true
report_to:
- wandb
save_only_model: true # needed to bypass FSDP errors with saving paged optimizers
save_strategy: epoch
save_total_limit: 1
seed: 42
use_liger_kernel: false # fails on multi-node
warmup_ratio: 0.03

================================================
FILE: recipes/OlympicCoder-7B/sft/config_v00.00.yaml
================================================
# Config for 1 node of 8 H100s with DeepSpeed ZeRO-3
# Model arguments
model_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: open-r1/codeforces-cots
dataset_config: solutions_decontaminated
dataset_num_proc: 48

# SFT trainer config
bf16: true
do_eval: false
eval_strategy: 'no'
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: open-r1/OlympicCoder-7B
hub_strategy: every_save
learning_rate: 1.0e-05
log_level: info
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
packing: false
max_grad_norm: 0.2
max_length: 32768
max_steps: -1
num_train_epochs: 10
output_dir: data/OlympicCoder-7B
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 2
push_to_hub: true
report_to:
- wandb
save_strategy: epoch
save_total_limit: 1
seed: 42
use_liger_kernel: true
warmup_ratio: 0.03

================================================
FILE: recipes/OpenR1-Distill-7B/sft/config_distill.yaml
================================================
# Config for 1 node of 8 x H100s (80GB)
# Model arguments
model_name_or_path: open-r1/Qwen2.5-Math-7B-RoPE-300k
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
chat_template: "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- messages[0]['content'] }}\n    {%- else %}\n        {{- 'You are Open-R1, a language model trained by Hugging Face to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines.' }}\n    {%- endif %}\n    {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n    {%- else %}\n        {{- '<|im_start|>system\\nYou are Open-R1, a language model trained by Hugging Face to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines.<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n        {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role }}\n        {%- if message.content %}\n            {{- '\\n' + message.content }}\n        {%- endif %}\n        {%- for tool_call in message.tool_calls %}\n            {%- if tool_call.function is defined %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {{- '\\n<tool_call>\\n{\"name\": \"' }}\n            {{- tool_call.name }}\n            {{- '\", \"arguments\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- '}\\n</tool_call>' }}\n        {%- endfor %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- message.content }}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n"
dataset_name: open-r1/Mixture-of-Thoughts
dataset_config: all
dataset_num_proc: 12
eos_token: <|im_end|>

# SFT trainer config
bf16: true
do_eval: false
eval_strategy: 'no'
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: OpenR1-Distill-7B
hub_strategy: every_save
learning_rate: 4.0e-05
log_level: info
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
packing: false
max_grad_norm: 0.2
max_length: 32768
max_steps: -1
num_train_epochs: 5
output_dir: data/OpenR1-Distill-7B
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 2
push_to_hub: true
report_to:
- wandb
save_strategy: epoch
save_total_limit: 1
seed: 42
use_liger_kernel: true
warmup_ratio: 0.03

================================================
FILE: recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml
================================================
# Model arguments
model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: open-r1/OpenR1-Math-220k
dataset_prompt_column: problem
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"

# GRPO trainer config
bf16: true
use_vllm: true
do_eval: false
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen2.5-1.5B-Open-R1-GRPO
hub_strategy: every_save
learning_rate: 2.0e-05
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 1024
max_steps: -1
num_generations: 16
num_train_epochs: 1
output_dir: data/Qwen2.5-1.5B-Open-R1-GRPO
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: true
report_to:
- wandb
reward_funcs:
- accuracy
- format
- tag_count
reward_weights:
- 1.0
- 1.0
- 1.0
save_strategy: "epoch"
save_total_limit: 1
seed: 42
warmup_ratio: 0.1


================================================
FILE: recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml
================================================
# Model arguments
model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: open-r1/verifiable-coding-problems-python
dataset_prompt_column: problem_statement
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"

# GRPO trainer config
beta: 0.01
bf16: true
use_vllm: true
do_eval: false
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen2.5-1.5B-Open-R1-Code-GRPO
hub_strategy: every_save
learning_rate: 5.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
max_prompt_length: 1024
max_completion_length: 2048
max_steps: 500
num_generations: 14
num_train_epochs: 1
output_dir: data/Qwen2.5-1.5B-Open-R1-Code-GRPO
overwrite_output_dir: true
per_device_train_batch_size: 16
push_to_hub: true
report_to:
- wandb
reward_funcs:
- code
- format
reward_weights:
- 1.0
- 0.1
save_strategy: "steps"
save_steps: 50
save_total_limit: 1
seed: 42
temperature: 1.0
warmup_ratio: 0.03

================================================
FILE: recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml
================================================
# Model arguments
model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: open-r1/ioi
dataset_prompt_column: problem
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"

# GRPO trainer config
beta: 0.01
bf16: true
use_vllm: true
do_eval: false
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen2.5-1.5B-Open-R1-Code-GRPO
hub_strategy: every_save
learning_rate: 5.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
max_prompt_length: 1024
max_completion_length: 2048
max_steps: 500
num_generations: 14
num_train_epochs: 1
output_dir: data/Qwen2.5-1.5B-Open-R1-Code-GRPO
overwrite_output_dir: true
per_device_train_batch_size: 16
push_to_hub: true
report_to:
- wandb
save_strategy: "steps"
save_steps: 50
save_total_limit: 1
seed: 42
temperature: 1.0
warmup_ratio: 0.03
# ioi specific config
code_language: cpp
reward_funcs:
- ioi_code
- code_format
- format
reward_weights:
- 1.0
- 0.1
- 0.1
# for each generation, evaluate these many test cases in parallel, then check if any of them failed (0 score): if so stop evaluating
# otherwise continue with the next batch of test cases. Useful to avoid overloading the eval server + save time on wrong solutions
code_eval_test_batch_size: 3

================================================
FILE: recipes/Qwen2.5-Coder-7B-Instruct/grpo/config_codeforces.yaml
================================================
# Model arguments
model_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Data training arguments
dataset_name: open-r1/codeforces
dataset_prompt_column: prompt
dataset_config: verifiable-prompts
dataset_test_split: test
dataset_train_split: train

system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"

# GRPO trainer config
callbacks:
- push_to_hub_revision
benchmarks:
- lcb_v4
beta: 0.0
loss_type: dr_grpo
scale_rewards: false
bf16: true
do_eval: false
eval_strategy: "no"
use_vllm: true
vllm_device: auto
vllm_gpu_memory_utilization: 0.7
gradient_accumulation_steps: 32
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: open-r1/Qwen2.5-Coder-7B-Instruct-Codeforces-GRPO
hub_model_revision: v01.00
hub_strategy: every_save
learning_rate: 1.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: constant_with_warmup
max_grad_norm: 0.2
max_prompt_length: 2000
max_completion_length: 8192
max_steps: -1
num_generations: 16
# aiming for 1k optimization steps
# total_samples_per_batch = num_gpus * grad_accumulation_steps * per_device_batch_size = 8 * 32 * 4 = 1024
# unique_prompts_per_batch = total_samples_per_batch / num_generations = 1024 / 16 = 64
# #dataset ~= 16k (8k * 2, for python and cpp)
# global_steps_per_epoch = #dataset / unique_prompts_per_batch = 16k / 64 ~= 250
# epochs_for_1k_steps = 1000/250 = 4 epochs
num_train_epochs: 4
output_dir: data/Qwen2.5-Coder-7B-Instruct-Codeforces-GRPO_v01.00
overwrite_output_dir: true
per_device_train_batch_size: 4
push_to_hub: true
report_to:
- wandb
reward_funcs:
- cf_code
- code_format
reward_weights:
- 1.0
- 0.1
save_strategy: "steps"
save_steps: 0.05
save_total_limit: 1
seed: 42
temperature: 0.7
wandb_entity: huggingface
wandb_project: open-r1
warmup_ratio: 0.1

mask_truncated_completions: true
# for each generation, evaluate these many test cases in parallel, then check if any of them failed (0 score): if so stop evaluating
# otherwise continue with the next batch of test cases. Useful to avoid overloading the eval server + save time on wrong solutions
code_eval_test_batch_size: -1
code_eval_scoring_mode: weighted_sum

================================================
FILE: recipes/README.md
================================================
# Post-training recipes

## OpenR1 Distill 7B

To train the OpenR1 Distill 7B model, run:

```
sbatch --nodes=1 slurm/train.slurm --model OpenR1-Distill-7B --task sft --config distill --accelerator zero3
```

## OlympicCoder

To train the OlympicCoder models, run:

```
# 7B
sbatch --nodes=1 slurm/train.slurm --model OlympicCoder-7B --task sft --config v00.00 --accelerator zero3

# 32B
sbatch --nodes=16 slurm/train.slurm --model OlympicCoder-32B --task sft --config v00.00 --accelerator fsdp
```

Note that we found it necessary to switch to FSDP1 and paged AdamW 8-bit for the 32B model in order to fit the largest possible context size.

================================================
FILE: recipes/accelerate_configs/ddp.yaml
================================================
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


================================================
FILE: recipes/accelerate_configs/fsdp.yaml
================================================
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false # Need fix from: https://github.com/huggingface/transformers/pull/36610
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

================================================
FILE: recipes/accelerate_configs/zero2.yaml
================================================
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

================================================
FILE: recipes/accelerate_configs/zero3.yaml
================================================
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


================================================
FILE: recipes/dataset_filtering/config_demo.yaml
================================================
# Model arguments
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
# We edit the DeepSeek chat template to ensure (a) the reasoning block within <think> and </think> is included in the completion and (b) the <think> tag is not part of the prefill so that the format reward works
chat_template: "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<｜tool▁outputs▁end｜>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<｜Assistant｜>'}}{% endif %}"
dataset_name: open-r1/OpenR1-Math-220k
dataset_prompt_column: problem
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"

# Generation arguments
max_completion_length: 2048
num_generations: 8
temperature: 0.7
top_p: 0.95

# Reward func arguments
reward_funcs:
- accuracy
reward_weights:
- 1.0

# Filtering arguments. Samples with a pass rate outside the interval `pass_rate_min < x < pass_rate_max` will be filtered.  
pass_rate_min: 0.2
pass_rate_max: 0.8


================================================
FILE: recipes/dataset_filtering/filter_dapo.yaml
================================================
# Model arguments
model_name_or_path: open-r1/R1-Distill-Qwen-Math-7B
model_revision: v03.00-step-000008190
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
# We edit the DeepSeek chat template to ensure (a) the reasoning block within <think> and </think> is included in the completion and (b) the <think> tag is not part of the prefill so that the format reward works
dataset_name: open-r1/DAPO-Math-17k-Processed
dataset_config: all
dataset_split: train

# Generation arguments
max_completion_length: 32000
num_generations: 8
temperature: 1.0

# Reward func arguments
reward_funcs:
- accuracy
reward_weights:
- 1.0

# Filtering arguments. Samples with mean reward outside of low / high will be filtered
pass_rate_min: 0.1
pass_rate_max: 0.6

output_dataset_name: open-r1/DAPO-Math-17k-Processed-R1-Distill-Qwen-Math-7B-v03.00-step-000008190-filter


================================================
FILE: recipes/dataset_filtering/filter_python.yaml
================================================
# Model arguments
model_name_or_path: open-r1/R1-Distill-Qwen-Math-7B-Merges
model_revision: v00.00-step-000003660_v01.00-step-000002600_weights-0.50-0.50
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
# We edit the DeepSeek chat template to ensure (a) the reasoning block within <think> and </think> is included in the completion and (b) the <think> tag is not part of the prefill so that the format reward works
dataset_name: open-r1/verifiable-coding-problems-python_decontaminated-tested-shuffled
dataset_prompt_column: problem

# Generation arguments
max_completion_length: 16000
num_generations: 8
temperature: 0.7

# Reward func arguments
reward_funcs:
- binary_code
reward_weights:
- 1.0
e2b_router_url: ip-10-53-85-92:8000

# Filtering arguments. Samples with mean reward outside of low / high will be filtered
pass_rate_min: 0.1
pass_rate_max: 0.6


================================================
FILE: scripts/benchmark_e2b.py
================================================
# coding=utf-8
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Benchmark script for the code_reward function with E2B.

This script measures the performance of the code_reward function with varying numbers
of samples and parallelization levels.

Each sample is a CodeForces problem with a gold standard solution that is executed against a set of public test cases.
"""

from datasets import load_dataset
import time
from tqdm.auto import tqdm

from dotenv import load_dotenv
load_dotenv()

from open_r1.rewards import code_reward

def benchmark_code_reward(example):
    start_time = time.time()
    test_completions = [[{"content": example["gold_standard_solution"]}]]
    reward_kwargs = {"verification_info": [example["verification_info"]]}
    rewards = code_reward(test_completions, **reward_kwargs)
    end_time = time.time()
    example["test_reward"] = rewards[0]
    example["reward_time"] = end_time - start_time
    return example

if __name__ == "__main__":
    parallel_dict = {
        16:[1,4,16],
        64:[4,16, 64],
        256:[16, 64, 96], # cap at 96 as PRO account is limited to 100
    }
    # Store results for table formatting
    results = []
    
    for num_samples in tqdm([16, 64,256], desc="Benchmarking samples"):
        for num_parallel in parallel_dict[num_samples]:
            code_dataset = load_dataset("open-r1/verifiable-coding-problems-python_decontaminated")
            code_dataset = code_dataset["train"].shuffle(seed=42).select(range(num_samples))

            test_completions = [[{"content": example["gold_standard_solution"]}] for example in code_dataset]
            reward_kwargs = {"verification_info": [example["verification_info"] for example in code_dataset]}

            start_time = time.time()
            rewards = code_reward(test_completions, num_parallel=num_parallel, **reward_kwargs)
            execution_time = time.time() - start_time
            
            # Calculate some statistics about rewards
            mean_reward = sum(rewards) / len(rewards)
            min_reward = min(rewards)
            max_reward = max(rewards)
            
            # Store results
            results.append({
                "num_samples": num_samples,
                "num_parallel": num_parallel,
                "execution_time": execution_time,
                "mean_reward": mean_reward,
                "min_reward": min_reward,
                "max_reward": max_reward
            })
    
    print("\n## Benchmark Results\n")
    print("| Sample Size | Parallelization | Execution Time (s) | Mean Reward | Min Reward | Max Reward |")
    print("|:-----------:|:---------------:|------------------:|:-----------:|:-----------:|:-----------:|")
    
    for result in results:
        print(f"| {result['num_samples']:^11} | {result['num_parallel']:^15} | {result['execution_time']:17.2f} | {result['mean_reward']:^11.4f} | {result['min_reward']:^11.4f} | {result['max_reward']:^11.4f} |")
    


================================================
FILE: scripts/decontaminate.py
================================================
#!/usr/bin/env python
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This script is used to decontaminate a dataset by checking for n-gram overlap with other datasets.
It uses the same approach presented in https://huggingface.co/papers/2501.19393,
as found in: https://github.com/simplescaling/s1/blob/main/data/decontaminate_util.py

Usage:

python scripts/decontaminate.py \
    --dataset open-r1/verifiable-coding-problems-python \
    --split train \
    --ngram_size 8 \
    --problem_column problem \
    --cleanup
"""

import collections

from tqdm import tqdm


def normalize_string(text: str) -> str:
    """Basic string normalization."""
    # Convert to lowercase and normalize whitespace
    text = text.lower().strip()
    # Replace multiple spaces with single space
    text = " ".join(text.split())
    return text


def word_ngrams(text: str, n: int) -> list:
    """Generate word-level n-grams from text."""
    words = text.split()
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


def build_ngram_lookup(documents: list[str], ngram_size: int = 8) -> dict[str, set[int]]:
    """Build ngram lookup for documents."""
    lookup = collections.defaultdict(set)

    for doc_id, document in enumerate(tqdm(documents)):
        normalized_text = normalize_string(document)
        ngrams = word_ngrams(normalized_text, ngram_size)
        for ngram in ngrams:
            lookup[ngram].add(doc_id)

    return lookup


def build_ngram_single(document: str, ngram_size: int = 8) -> set[str]:
    normalized_text = normalize_string(document)
    ngrams = word_ngrams(normalized_text, ngram_size)

    return set(ngrams)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, required=True, help="Name of the dataset to check for contamination.")
    parser.add_argument("--config", type=str, default=None, help="Name of the dataset config to load.")
    parser.add_argument("--split", type=str, default="train", help="Split to check for contamination, defaults to `train`.")
    parser.add_argument("--ngram_size", type=int, default=8, help="Size of n-grams to build, defaults to 8.")
    parser.add_argument(
        "--problem_column", type=str, default="problem", help="Name of the column containing the problem (prompt)."
    )
    parser.add_argument(
        "--cleanup",
        action="store_true",
        help="Whether to remove the contaminated rows before pushing the dataset.",
    )
    parser.add_argument(
        "--new_dataset_name",
        type=str,
        default=None,
        help="New name for the dataset. If not provided, will reuse the name and add a `_decontaminated` to the name."
    )
    args = parser.parse_args()

    from datasets import load_dataset, Dataset

    # Load the dataset to check for contamination
    ds = load_dataset(args.dataset, name=args.config, split=args.split)

    eval_datasets = {
        "aime_2024": (load_dataset("HuggingFaceH4/aime_2024", split="train"), "problem"),
        "aime_2025": (load_dataset("yentinglin/aime_2025", split="train"), "problem"),
        "math_500": (load_dataset("HuggingFaceH4/MATH-500", split="test"), "problem"),
        "gpqa": (load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train", trust_remote_code=True), "Question"),
        "lcb": (
            load_dataset(
                "livecodebench/code_generation_lite", split="test", version_tag="v4_v5", trust_remote_code=True
            ),
            "question_content",
        ),
    }
    ngram_lookups = {}
    for ds_name, (eval_dataset, problem_col) in eval_datasets.items():
        ngram_lookups[ds_name] = build_ngram_lookup(eval_dataset[problem_col], ngram_size=args.ngram_size)

    for eval_name, ngram_lookup in ngram_lookups.items():
        # Update the ngram_lookup variable for each dataset
        def find_contaminated(row):
            # For each example we have to build the ngrams and check for all of them on each row
            ngrams = build_ngram_single(row[args.problem_column], ngram_size=args.ngram_size)
            row[f"contaminated_{eval_name}"] = any(set(ngram in ngram_lookup for ngram in ngrams))
            return row

        ds = ds.map(find_contaminated, num_proc=8)

    # Allow cleaning up via CLI args (removing the contaminated examples and dropping the columns)
    def cleanup(dataset: Dataset) -> Dataset:
        initial_size = len(dataset)
        contamination_cols = [col for col in dataset.column_names if col.startswith("contaminated_")]
        for col in contamination_cols:
            if col.startswith("contaminated_"):
                size_prior = len(dataset)
                dataset = dataset.filter(lambda x: not x[col], num_proc=8)
                if len(dataset) < size_prior:
                    print(f"Removed {size_prior - len(dataset)} samples from '{col.replace('contaminated_', '')}'")
        dataset = dataset.remove_columns(contamination_cols)
        print(f"Initial size: {initial_size}, Final size: {len(dataset)}")
        return dataset

    if args.cleanup:
        ds = cleanup(ds)

    new_ds_name = args.new_dataset_name or f"{args.dataset}_decontaminated"
    config_name = args.config if args.config is not None else "default"
    url = ds.push_to_hub(new_ds_name, config_name=config_name, split="train")
    print(f"Decontaminated dataset: {url}")


================================================
FILE: scripts/e2b_router.py
================================================
# coding=utf-8
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel, ConfigDict
from typing import  Optional
from fastapi import FastAPI, Request
import argparse
import asyncio
from fastapi import FastAPI
import uvicorn
from e2b_code_interpreter.models import Execution
from dotenv import load_dotenv
from e2b_code_interpreter import AsyncSandbox

load_dotenv()

class BatchRequest(BaseModel):
    """
    BatchRequest is a data model representing a batch processing request.

    Attributes:
        scripts (list[str]): A list of script names or paths to be executed.
        languages (list[str]): The programming languages for each script in the list.
        timeout (int): The maximum allowed execution time for each script in seconds.
        request_timeout (int): The maximum allowed time for the entire batch request in seconds.
    """
    scripts: list[str]
    languages: list[str]
    timeout: int
    request_timeout: int

class ScriptResult(BaseModel):
    """
    ScriptResult is a Pydantic model that represents the result of a script execution.
    Attributes:
        execution (Optional[Execution]): An optional instance of the `Execution` class 
            that contains details about the script's execution, such as status, output, 
            or any other relevant metadata.
        exception_str (Optional[str]): An optional string that captures the exception 
            message or details if an error occurred during the script's execution.
        model_config (ConfigDict): A configuration dictionary that allows arbitrary 
            types to be used within the Pydantic model. This is necessary to support 
            custom types like `Execution` within the model.
    """
    execution: Optional[Execution]
    exception_str: Optional[str]
    
    # required to allow arbitrary types in pydantic models such as Execution
    model_config = ConfigDict(arbitrary_types_allowed=True)
    
def create_app(args):
    """
    Creates and configures a FastAPI application instance.
    Args:
        args: An object containing configuration parameters for the application.
              - num_sandboxes (int): The maximum number of concurrent sandboxes allowed.
    Returns:
        FastAPI: A configured FastAPI application instance.
    The application includes the following endpoints:
        1. GET /health:
            - Returns the health status of the application.
            - Response: {"status": "ok"}
        2. POST /execute_batch:
            - Executes a batch of scripts in an isolated sandbox environment.
            - Request Body: BatchRequest object containing:
                - languages (list[str]): The programming languages of the scripts (python or javascript).
                - timeout (int): The maximum execution time for each script.
                - request_timeout (int): The timeout for the request itself.
                - scripts (List[str]): A list of scripts to execute.
            - Response: A list of ScriptResult objects for each script, containing:
                - execution: The result of the script execution.
                - exception_str: Any exception encountered during execution.
    Notes:
        - A semaphore is used to limit the number of concurrent sandboxes.
        - Each script execution is wrapped in a timeout to prevent hanging.
        - Sandboxes are cleaned up after execution, even in case of errors.
    """
    app = FastAPI()

    # Instantiate semaphore and attach it to app state
    app.state.sandbox_semaphore = asyncio.Semaphore(args.max_num_sandboxes)

    @app.get("/health")
    async def health():
        return {"status": "ok"}

    @app.post("/execute_batch")
    async def execute_batch(batch: BatchRequest, request: Request):
        semaphore = request.app.state.sandbox_semaphore
        languages = batch.languages
        timeout = batch.timeout
        request_timeout = batch.request_timeout
        asyncio_timeout = batch.timeout + 1
        
        async def run_script(script: str, language: str) -> ScriptResult:

            async with semaphore:
                try:
                    sandbox = await AsyncSandbox.create(
                        timeout=timeout,
                        request_timeout=request_timeout,
                    )
                    execution = await asyncio.wait_for(
                        sandbox.run_code(script, language=language),
                        timeout=asyncio_timeout,
                    )
                    return ScriptResult(execution=execution, exception_str=None)

                except Exception as e:
                    return ScriptResult(execution=None, exception_str=str(e))
                
                finally:
                    try:
                        await sandbox.kill()
                    except Exception:
                        pass

        tasks = [run_script(script, lang) for script, lang in zip(batch.scripts, batch.languages)]
        return await asyncio.gather(*tasks)

    return app


def parse_args():
    """
    Parse command-line arguments for the e2b_router script.

    Arguments:
        --host (str): The hostname or IP address to bind the server to. Defaults to "0.0.0.0" (binds to all interfaces).
        --port (int): The port number on which the server will listen. Defaults to 8000.
        --max_num_sandboxes (int): The maximum number of sandboxes that can be created or managed simultaneously. Defaults to 20.

    Returns:
        argparse.Namespace: Parsed command-line arguments as an object.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--max_num_sandboxes", type=int, default=20)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    app = create_app(args)

    uvicorn.run(app, host=args.host, port=args.port)

================================================
FILE: scripts/generate_reasoning.py
================================================
import argparse
import asyncio
import hashlib
import json
import os
import random
from asyncio import Lock
from typing import Set

from datasets import load_dataset
from tqdm.asyncio import tqdm

import aiofiles
import aiohttp
import uvloop


file_lock = Lock()


async def generate_completion(session, prompt, args):
    retry_budget = 10
    while retry_budget > 0:
        try:
            await asyncio.sleep(random.uniform(0.0, 0.1))
            async with session.post(
                f"http://{args.api_addr}/v1/chat/completions",
                json={
                    "model": "default",
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": args.max_tokens,
                    "temperature": args.temperature,
                    "top_p": args.top_p,
                },
                headers={"Authorization": "Bearer EMPTY"},
            ) as response:
                return await response.json(content_type=None)
        except Exception as e:
            print(f"API error (will retry): {e}")
            retry_budget -= 1
            await asyncio.sleep(10)
    return None


async def process_example(example, session, args, output_file, pbar):
    prompt = args.prompt_template.format(prompt=example[args.prompt_column])

    try:
        tasks = [generate_completion(session, prompt, args) for _ in range(args.num_generations)]

        completions = await asyncio.gather(*tasks)

        if any(completion is None for completion in completions):
            print(f"Error processing example")
            pbar.update(1)
            return None

        generations = []
        finish_reasons = []
        api_metadata = []

        for completion in completions:
            generations.append(completion["choices"][0]["message"]["content"])
            finish_reasons.append(completion["choices"][0]["finish_reason"])
            api_metadata.append(completion["usage"])

        # Combine original dataset fields with generations
        result = {
            **example,  # Preserve all original dataset fields
            "generations": generations,
            "finish_reasons": finish_reasons,
            "api_metadata": api_metadata,
        }

        # Write to file with lock
        async with file_lock:
            async with aiofiles.open(output_file, mode="a") as f:
                await f.write(json.dumps(result) + "\n")
                await f.flush()

        pbar.set_postfix(active=len(pbar.active_tasks), refresh=False)
        pbar.update(1)

        return result
    except Exception as e:
        print(f"Error processing example: {e}")
        pbar.update(1)
        return None


async def load_processed_uuids(output_file, uuid_column):
    processed_uuids = set()
    if os.path.exists(output_file):
        async with aiofiles.open(output_file, mode="r") as f:
            async for line in f:
                try:
                    data = json.loads(line)
                    processed_uuids.add(hashlib.md5(str(data[uuid_column]).encode()).hexdigest())
                except json.JSONDecodeError:
                    continue
    return processed_uuids


async def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-name", type=str, required=True)
    parser.add_argument("--output-file", type=str, required=True)
    parser.add_argument("--prompt-column", type=str, required=True)
    parser.add_argument("--uuid-column", type=str, required=True)
    parser.add_argument("--api-addr", type=str, default="localhost:39876")
    parser.add_argument("--num-generations", type=int, default=4)
    parser.add_argument(
        "--prompt-template",
        type=str,
        default="You will be given a problem. Please reason step by step, and put your final answer within \\boxed{{}}:\n{prompt}",
    )
    parser.add_argument("--temperature", type=float, default=0.6)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--max-tokens", type=int, default=16384)
    parser.add_argument("--max-concurrent", type=int, default=1000)
    args = parser.parse_args()

    dataset = load_dataset(args.dataset_name, split="train").shuffle()
    processed_uuids = await load_processed_uuids(args.output_file, args.uuid_column)
    if processed_uuids:
        print(f"Found {len(processed_uuids)} already processed examples, resuming from there...")

    if not os.path.exists(args.output_file):
        async with aiofiles.open(args.output_file, mode="w") as f:
            await f.write("")

    active_tasks: Set[asyncio.Task] = set()

    pbar = tqdm(
        total=len(dataset) - len(processed_uuids),
        desc="Generating responses",
        unit="row",
        mininterval=2,
        smoothing=0.0001,
    )
    pbar.active_tasks = active_tasks

    async with aiohttp.ClientSession(
        timeout=aiohttp.ClientTimeout(total=60 * 60),
        connector=aiohttp.TCPConnector(limit=args.max_concurrent, ttl_dns_cache=300, keepalive_timeout=60 * 60),
    ) as session:
        for example in dataset:
            uuid = hashlib.md5(str(example[args.uuid_column]).encode()).hexdigest()
            if uuid not in processed_uuids:
                # Wait if we've hit the concurrency limit
                while len(active_tasks) >= args.max_concurrent:
                    done, active_tasks = await asyncio.wait(active_tasks, return_when=asyncio.FIRST_COMPLETED)
                    for task in done:
                        try:
                            await task
                        except Exception as e:
                            print(f"Task failed: {e}")

                task = asyncio.create_task(process_example(example, session, args, args.output_file, pbar))
                active_tasks.add(task)
                task.add_done_callback(active_tasks.discard)

                pbar.set_postfix(active=len(active_tasks), refresh=True)

        # Wait for remaining tasks
        if active_tasks:
            await asyncio.gather(*active_tasks, return_exceptions=True)

    pbar.close()


if __name__ == "__main__":
    uvloop.install()
    asyncio.run(main())


================================================
FILE: scripts/get_tensor_parallel_size.py
================================================
import argparse
from transformers import AutoConfig
from math import gcd

def get_tensor_parallel_size(model_name: str, revision: str = None, default_tp: int = 8) -> int:
    try:
        config = AutoConfig.from_pretrained(model_name, revision=revision, trust_remote_code=True)
        num_heads = getattr(config, 'num_attention_heads', None)

        if num_heads is not None and num_heads % default_tp != 0:
            tp = gcd(num_heads, default_tp)
            return max(tp, 1)
        else:
            return default_tp
    except Exception as e:
        print(f"Warning: Failed to fetch config for {model_name}@{revision}: {e}")
        return default_tp

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, required=True, help="Hugging Face model name or path")
    parser.add_argument("--revision", type=str, default=None, help="Model revision if applicable")
    parser.add_argument("--default_tp", type=int, default=8, help="Default TP size (usually GPUs per node)")

    args = parser.parse_args()

    tp = get_tensor_parallel_size(args.model_name, args.revision, args.default_tp)
    print(tp)


================================================
FILE: scripts/morph_router.py
================================================
# coding=utf-8
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel, ConfigDict
from typing import Optional, List
from fastapi import FastAPI, Request
import uvicorn
from dotenv import load_dotenv
import os

load_dotenv()

class BatchRequest(BaseModel):
    """
    BatchRequest is a data model representing a batch processing request.

    Attributes:
        scripts (list[str]): A list of script names or paths to be executed.
        languages (List[str]): The programming languages for each script in the list.
        timeout (int): The maximum allowed execution time for each script in seconds.
        request_timeout (int): The maximum allowed time for the entire batch request in seconds.
    """
    scripts: List[str]
    languages: List[str]
    timeout: int
    request_timeout: int

class ScriptResult(BaseModel):
    """
    ScriptResult is a Pydantic model that represents the result of a script execution.
    Attributes:
        text (Optional[str]): The output text from the script execution.
        exception_str (Optional[str]): An optional string that captures the exception 
            message or details if an error occurred during the script's execution.
        model_config (ConfigDict): A configuration dictionary that allows arbitrary 
            types to be used within the Pydantic model.
    """
    text: Optional[str]
    exception_str: Optional[str]
    
    
    model_config = ConfigDict(arbitrary_types_allowed=True)
    
def create_app(args):
    """
    Creates and configures a FastAPI application instance for the MorphCloud router.
    
    Args:
        args: An object containing configuration parameters for the application.
              - max_num_sandboxes (int): The maximum number of concurrent sandboxes allowed.
              - api_key (str): The MorphCloud API key to use.
    
    Returns:
        FastAPI: A configured FastAPI application instance.
    """
    app = FastAPI()
    
    from morphcloud.api import MorphCloudClient
    from morphcloud.sandbox import Sandbox
    
    app.state.client = MorphCloudClient(api_key=args.api_key)
    app.state.Sandbox = Sandbox

    app.state.sandbox_semaphore = asyncio.Semaphore(args.max_num_sandboxes)

    @app.get("/health")
    async def health():
        return {"status": "ok"}

    @app.post("/execute_batch")
    async def execute_batch(batch: BatchRequest, request: Request):
        semaphore = request.app.state.sandbox_semaphore
        client = request.app.state.client
        Sandbox = request.app.state.Sandbox
        
        languages = batch.languages
        timeout = batch.timeout
        request_timeout = batch.request_timeout
        asyncio_timeout = batch.timeout + 1
        
        async def run_script(script: str, language: str) -> ScriptResult:
            sandbox = None
            sandbox_id = "unknown"

            async with semaphore:
                try:
                    sandbox = await asyncio.to_thread(
                        Sandbox.new,
                        client=client,
                        ttl_seconds=timeout
                    )
                    
                    sandbox_id = getattr(sandbox, 'id', None) or getattr(sandbox._instance, 'id', 'unknown')
                    
                    execution = await asyncio.wait_for(
                        asyncio.to_thread(
                            sandbox.run_code,
                            script,
                            language=language,
                            timeout=timeout * 1000  
                        ),
                        timeout=asyncio_timeout,
                    )
                    
                    if hasattr(execution, 'text') and execution.text:
                        return ScriptResult(text=execution.text, exception_str=None)
                    elif hasattr(execution, 'stdout') and execution.stdout:
                        return ScriptResult(text=execution.stdout, exception_str=None)
                    else:
                        return ScriptResult(text="", exception_str="No output from execution")

                except Exception as e:
                    return ScriptResult(text=None, exception_str=str(e))
                
                finally:
                    if sandbox:
                        try:
                            await asyncio.to_thread(sandbox.close)
                            await asyncio.to_thread(sandbox.shutdown)
                        except Exception:
                            pass

        tasks = [run_script(script, lang) for script, lang in zip(batch.scripts, batch.languages)]
        return await asyncio.gather(*tasks)

    return app

def parse_args():
    """
    Parse command-line arguments for the morph_router script.

    Arguments:
        --host (str): The hostname or IP address to bind the server to. Defaults to "0.0.0.0".
        --port (int): The port number on which the server will listen. Defaults to 8001.
        --max_num_sandboxes (int): The maximum number of sandboxes that can be created simultaneously. Defaults to 20.
        --api_key (str): The MorphCloud API key. If not provided, it will be read from the MORPH_API_KEY environment variable.

    Returns:
        argparse.Namespace: Parsed command-line arguments as an object.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8001)
    parser.add_argument("--max_num_sandboxes", type=int, default=20)
    parser.add_argument("--api_key", default=os.getenv("MORPH_API_KEY"))
    args = parser.parse_args()
    
    if not args.api_key:
        raise ValueError("MorphCloud API key not provided. Please set MORPH_API_KEY environment variable or use --api_key.")
    
    return args

if __name__ == "__main__":
    args = parse_args()
    app = create_app(args)
    
    print(f"Starting MorphCloud Router on {args.host}:{args.port}")
    uvicorn.run(app, host=args.host, port=args.port)

================================================
FILE: scripts/pass_rate_filtering/README.md
================================================
# Pass rate filtering

We provide support to filter datasets by generating and computing pass rate on veriable tasks

See `scripts/pass_rate_filtering/compute_pass_rate.py` and `scripts/pass_rate_filtering/launch_filtering.sh` (hardcoded for DAPO at the moment)

By default the script chunks the dataset, merge can be run using the following snippet (example for DAPO) :

from datasets import load_dataset, concatenate_datasets

name = "open-r1/DAPO-Math-17k-Processed-R1-Distill-Qwen-Math-7B-Merges-v00.02-v01.02-0.3-0.7-filter"

```python
gen_datasets = []
filt_datasets = []
for start in range(0,17400,200):
    end = start + 200
    if start == 17200:
        end = 17398
    gen_config_name = f"gen-{start}-{end}"
    gen_dataset = load_dataset(name, gen_config_name, revision="gen",  split="train")
    gen_datasets.append(gen_dataset)
    
    filt_config_name = f"filt-0.1-0.6-{start}-{end}"
    filt_dataset = load_dataset(name, filt_config_name, revision="pass_rate",  split="train")
    filt_datasets.append(filt_dataset)
    
gen_dataset = concatenate_datasets(gen_datasets)
gen_dataset.push_to_hub(name, config_name="gen", split="train")
print(gen_dataset)

filt_dataset = concatenate_datasets(filt_datasets)
filt_dataset.push_to_hub(name, config_name="default", split="train")

print(filt_dataset)
```

================================================
FILE: scripts/pass_rate_filtering/compute_pass_rate.py
================================================
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# example usage python scripts/filter_dataset.py --config recipes/dataset_filtering/config_demo.yaml

import logging
from dataclasses import dataclass
from git import Optional
import torch
import sys

import datasets
import transformers
from datasets import load_dataset
from transformers import set_seed

from open_r1.configs import GRPOConfig, GRPOScriptArguments
from open_r1.rewards import get_reward_funcs
from open_r1.utils import get_tokenizer
from trl import ModelConfig, TrlParser
from trl.data_utils import apply_chat_template
from vllm import LLM, SamplingParams

logger = logging.getLogger(__name__)

@dataclass
class PassRateScriptArguments(GRPOScriptArguments):
    # we can be lazy and just use the same script args as GRPO
    output_dataset_name: Optional[str] = None
    pass_rate_min: float = 0.1
    pass_rate_max: float = 0.9
    dataset_start_index: Optional[int] = None
    dataset_end_index: Optional[int] = None
    dataset_split: str = "train"


def main(script_args, training_args, model_args):
    # Set seed for reproducibility
    set_seed(training_args.seed)

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    logger.info(f"Model parameters {model_args}")
    logger.info(f"Script parameters {script_args}")
    logger.info(f"Training parameters {training_args}")

    # Load the dataset
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config, split=script_args.dataset_split)
    if script_args.dataset_start_index is not None and script_args.dataset_end_index is not None:
        dataset = dataset.select(range(script_args.dataset_start_index, script_args.dataset_end_index))

    # Get reward functions from the registry
    reward_funcs = get_reward_funcs(script_args)

    # Format into conversation
    def make_conversation(example, prompt_column: str = script_args.dataset_prompt_column):
        example["prompt_backup"] = example[prompt_column]
        
        prompt = []

        if training_args.system_prompt is not None:
            prompt.append({"role": "system", "content": training_args.system_prompt})

        if prompt_column not in example:
            raise ValueError(f"Dataset Question Field Error: {prompt_column} is not supported.")

        prompt.append({"role": "user", "content": example[prompt_column]})
        return {"prompt": prompt}

    dataset = dataset.map(make_conversation)
    tokenizer = get_tokenizer(model_args, training_args)
    
    if "messages" in dataset.column_names:
        dataset = dataset.remove_columns("messages")
    
    dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
    llm = LLM(
        model=model_args.model_name_or_path,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
    )

    sampling_params=SamplingParams(
        temperature=training_args.temperature,
        top_p=training_args.top_p,
        top_k=training_args.top_k,
        n=training_args.num_generations,
        max_tokens=training_args.max_completion_length,
    )
    
    def batch_score(examples):
        prompts = examples["prompt"]
        
        outputs = llm.generate(
            prompts,
            sampling_params=sampling_params,
            use_tqdm=False,
        )
        repeated_prompts = []
        reward_completions = []
        grouped_completions = []
        for output in outputs:
            prompt = output.prompt
            group = []
            for completion in output.outputs:
                text = completion.text
                group.append(text)
                message = [{"role": "assistant", "content": text}]
                repeated_prompts.append(prompt)
                reward_completions.append(message)
            grouped_completions.append(group)
        
        def repeat_each_element_k_times(list_to_repeat: list, k: int) -> list:
            return [element for item in list_to_repeat for element in [item] * k]
        
        rewards_per_func = torch.zeros(len(repeated_prompts), len(reward_funcs))
        for i, reward_func in enumerate(reward_funcs):
            keys = [key for key in examples.data.keys() if key not in ["prompt", "completion"]]
            reward_kwargs = {key: repeat_each_element_k_times(examples[key], training_args.num_generations) for key in keys}
            output_reward_func = reward_func(prompts=repeated_prompts, completions=reward_completions, **reward_kwargs)
            # Convert None values to NaN
            output_reward_func = [reward if reward is not None else torch.nan for reward in output_reward_func]

            rewards_per_func[:, i] = torch.tensor(output_reward_func, dtype=torch.float32)
            
        reshaped_rewards = rewards_per_func.view(-1, training_args.num_generations)
        
        examples["pass_rate_generations"] = grouped_completions
        examples["pass_rate_rewards"] = reshaped_rewards.tolist()

            
        return examples
    
    dataset = dataset.map(batch_score, batched=True, batch_size=64)
    
    # we need to restore the prompt for the final dataset
    def restore_prompt(example):
        example["prompt"] = example["prompt_backup"]
        return example
    
    dataset = dataset.map(restore_prompt)
    dataset = dataset.remove_columns("prompt_backup")
    
    if script_args.output_dataset_name is not None:
        output_dataset_name = script_args.output_dataset_name
    else:
        model_name = model_args.model_name_or_path
        if "/" in model_name:
            model_name = model_name.split("/")[-1]
        model_revision = model_args.model_revision
    
        output_dataset_name = f"{script_args.dataset_name}-{model_name}-{model_revision}-gen"
    
    config_name="default"
    filtered_config_name = f"filt-{script_args.pass_rate_min}-{script_args.pass_rate_max}"
    
    if script_args.dataset_start_index is not None and script_args.dataset_end_index is not None:
        config_name = f"gen-{script_args.dataset_start_index}-{script_args.dataset_end_index}"
        filtered_config_name = f"{filtered_config_name}-{script_args.dataset_start_index}-{script_args.dataset_end_index}"
        
    dataset.push_to_hub(output_dataset_name, config_name=config_name, revision="gen")
    
    def filter_func(example):
        rewards = example["pass_rate_rewards"]
        # get the mean of the rewards that are not None
        mean_reward = torch.nanmean(torch.tensor(rewards, dtype=torch.float32))
        
        return script_args.pass_rate_min < mean_reward < script_args.pass_rate_max
    
    logger.info(f"Filtering dataset with low reward threshold {script_args.pass_rate_min} and high reward threshold {script_args.pass_rate_max}")
    logger.info(f"Dataset size before filtering: {dataset}")
    dataset = dataset.filter(filter_func)
    logger.info(f"Dataset size after filtering: {dataset}")
    dataset.push_to_hub(output_dataset_name, config_name=filtered_config_name, revision="pass_rate")
    
    

if __name__ == "__main__":
    parser = TrlParser((PassRateScriptArguments, GRPOConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_and_config()
    main(script_args, training_args, model_args)


================================================
FILE: scripts/pass_rate_filtering/launch_filtering.sh
================================================


# a bash foor loop from 0 to 17,400 in chunks of 200

for i in {0..17000..200}
do
  START=$i
  END=$((i + 200))
  echo "Processing chunk from $START to $END"
  
  # Submit the job to SLURM
  sbatch slurm/compute_pass_rate.slurm recipes/dataset_filtering/filter_dapo.yaml $START $END
done

sbatch slurm/compute_pass_rate.slurm recipes/dataset_filtering/filter_dapo.yaml 17200 17398


================================================
FILE: scripts/run_benchmarks.py
================================================
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import List, Optional

from open_r1.utils.evaluation import SUPPORTED_BENCHMARKS, run_benchmark_jobs
from open_r1.configs import SFTConfig
from trl import ModelConfig, TrlParser


@dataclass
class ScriptArguments:
    model_id: str = field(
        default="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        metadata={"help": "The Hub model id to push the model to."},
    )
    model_revision: str = field(default="main", metadata={"help": "The Hub model branch to push the model to."})
    trust_remote_code: bool = field(default=False, metadata={"help": "Trust the remote code."})
    benchmarks: List[str] = field(
        default_factory=lambda: [], metadata={"help": "The benchmarks to run after training."}
    )
    list_benchmarks: bool = field(default=False, metadata={"help": "List all supported benchmarks."})
    system_prompt: Optional[str] = field(
        default=None, metadata={"help": "The system prompt to use for the benchmark."}
    )


def main():
    parser = TrlParser(ScriptArguments)
    args = parser.parse_args_and_config()[0]
    if args.list_benchmarks:
        print("Supported benchmarks:")
        for benchmark in SUPPORTED_BENCHMARKS:
            print(f"  - {benchmark}")
        return
    benchmark_args = SFTConfig(
        output_dir="",
        hub_model_id=args.model_id,
        hub_model_revision=args.model_revision,
        benchmarks=args.benchmarks,
        system_prompt=args.system_prompt,
    )
    run_benchmark_jobs(
        benchmark_args,
        ModelConfig(model_name_or_path="", model_revision="", trust_remote_code=args.trust_remote_code),
    )


if __name__ == "__main__":
    main()


================================================
FILE: scripts/upload_details.py
================================================
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Push the details from a LightEval run to the Hub.

Usage:

python src/open_r1/utils/upload_details.py \
    --data_files {path_to_parquet_file} \
    --hub_repo_id {hub_repo_id} \
    --config_name {config_name}
"""

from dataclasses import dataclass, field
from typing import List

from datasets import load_dataset
from transformers import HfArgumentParser


@dataclass
class ScriptArguments:
    data_files: List[str] = field(default_factory=list)
    hub_repo_id: str = None
    config_name: str = None


def main():
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    if all(file.endswith(".json") for file in args.data_files):
        ds = load_dataset("json", data_files=args.data_files)
    elif all(file.endswith(".jsonl") for file in args.data_files):
        ds = load_dataset("json", data_files=args.data_files)
    else:
        ds = load_dataset("parquet", data_files=args.data_files)
    url = ds.push_to_hub(args.hub_repo_id, config_name=args.config_name, private=True)
    print(f"Dataset available at: {url}")


if __name__ == "__main__":
    main()


================================================
FILE: setup.cfg
================================================
[isort]
default_section = FIRSTPARTY
ensure_newline_before_comments = True
force_grid_wrap = 0
include_trailing_comma = True
known_first_party = open_r1
known_third_party =
    transformers
    datasets
    fugashi
    git
    h5py
    matplotlib
    nltk
    numpy
    packaging
    pandas
    psutil
    pytest
    rouge_score
    sacrebleu
    seqeval
    sklearn
    streamlit
    torch
    tqdm

line_length = 119
lines_after_imports = 2
multi_line_output = 3
use_parentheses = True

[flake8]
ignore = E203, E501, E741, W503, W605
max-line-length = 119
per-file-ignores =
    # imported but unused
    __init__.py: F401

[tool:pytest]
doctest_optionflags=NUMBER NORMALIZE_WHITESPACE ELLIPSIS

================================================
FILE: setup.py
================================================
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Adapted from huggingface/transformers: https://github.com/huggingface/transformers/blob/21a2d900eceeded7be9edc445b56877b95eda4ca/setup.py


import re
import shutil
from pathlib import Path

from setuptools import find_packages, setup


# Remove stale open_r1.egg-info directory to avoid https://github.com/pypa/pip/issues/5466
stale_egg_info = Path(__file__).parent / "open_r1.egg-info"
if stale_egg_info.exists():
    print(
        (
            "Warning: {} exists.\n\n"
            "If you recently updated open_r1, this is expected,\n"
            "but it may prevent open_r1 from installing in editable mode.\n\n"
            "This directory is automatically generated by Python's packaging tools.\n"
            "I will remove it now.\n\n"
            "See https://github.com/pypa/pip/issues/5466 for details.\n"
        ).format(stale_egg_info)
    )
    shutil.rmtree(stale_egg_info)


# IMPORTANT: all dependencies should be listed here with their version requirements, if any.
#   * If a dependency is fast-moving (e.g. trl), pin to the exact version
_deps = [
    "accelerate==1.4.0",
    "bitsandbytes>=0.43.0",
    "datasets>=3.2.0",
    "deepspeed==0.16.8",
    "distilabel[vllm,ray,openai]>=1.5.2",
    "e2b-code-interpreter>=1.0.5",
    "einops>=0.8.0",
    "flake8>=6.0.0",
    "hf_transfer>=0.1.4",
    "huggingface-hub[cli,hf_xet]>=0.30.2,<1.0",
    "isort>=5.12.0",
    "jieba",  # Needed for Chinese language support
    "langdetect",  # Needed for LightEval's extended tasks
    "latex2sympy2_extended>=1.0.6",
    "liger-kernel>=0.5.10",
    "lighteval @ git+https://github.com/huggingface/lighteval.git@d3da6b9bbf38104c8b5e1acc86f83541f9a502d1",  # Critical bug fix for tokenizer revisions: https://github.com/huggingface/lighteval/pull/721
    "math-verify==0.5.2",  # Used for math verification in grpo
    "morphcloud==0.1.67",
    "packaging>=23.0",
    "parameterized>=0.9.0",
    "peft>=0.14.0",
    "pytest",
    "python-dotenv",
    "ruff>=0.9.0",
    "safetensors>=0.3.3",
    "sentencepiece>=0.1.99",
    "torch==2.6.0",
    "transformers==4.52.3",
    "trl[vllm]==0.18.0",
    "wandb>=0.19.1",
    "async-lru>=2.0.5",
    "aiofiles>=24.1.0",
    "pandas>=2.2.3",
]

# this is a lookup table with items like:
#
# tokenizers: "tokenizers==0.9.4"
# packaging: "packaging"
#
# some of the values are versioned whereas others aren't.
deps = {b: a for a, b in (re.findall(r"^(([^!=<>~ \[\]]+)(?:\[[^\]]+\])?(?:[!=<>~ ].*)?$)", x)[0] for x in _deps)}


def deps_list(*pkgs):
    return [deps[pkg] for pkg in pkgs]


extras = {}
extras["tests"] = deps_list("pytest", "parameterized", "math-verify", "jieba")
extras["torch"] = deps_list("torch")
extras["quality"] = deps_list("ruff", "isort", "flake8")
extras["code"] = deps_list("e2b-code-interpreter", "python-dotenv", "morphcloud", "jieba", "pandas", "aiofiles")
extras["eval"] = deps_list("lighteval", "math-verify")
extras["dev"] = extras["quality"] + extras["tests"] + extras["eval"] + extras["code"]

# core dependencies shared across the whole project - keep this to a bare minimum :)
install_requires = [
    deps["accelerate"],
    deps["bitsandbytes"],
    deps["einops"],
    deps["datasets"],
    deps["deepspeed"],
    deps["hf_transfer"],
    deps["huggingface-hub"],
    deps["langdetect"],
    deps["latex2sympy2_extended"],
    deps["math-verify"],
    deps["liger-kernel"],
    deps["packaging"],  # utilities from PyPA to e.g., compare versions
    deps["safetensors"],
    deps["sentencepiece"],
    deps["transformers"],
    deps["trl"],
    deps["wandb"],
    deps["async-lru"],
]

setup(
    name="open-r1",
    version="0.1.0.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
    author="The Hugging Face team (past and future)",
    author_email="lewis@huggingface.co",
    description="Open R1",
    long_description=open("README.md", "r", encoding="utf-8").read(),
    long_description_content_type="text/markdown",
    keywords="llm inference-time compute reasoning",
    license="Apache",
    url="https://github.com/huggingface/open-r1",
    package_dir={"": "src"},
    packages=find_packages("src"),
    zip_safe=False,
    extras_require=extras,
    python_requires=">=3.10.9",
    install_requires=install_requires,
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Intended Audience :: Education",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.10",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
    ],
)


================================================
FILE: slurm/README.md
================================================
## Serving DeepSeek-R1 on 2x8 H100 SLURM nodes with SGLang 

1. Set up the environment (adjust for your cuda version):
```bash
conda create -n sglang124 python=3.11
conda activate sglang124

pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
```

2. Run the server and wait for the model to load:
```bash
sbatch slurm/serve_r1.slurm -m "/fsx/deepseek-r1-checkpoint" -e "sglang124"
```

3. Run the data generation script:
```bash
python scripts/generate_reasoning.py \
    --dataset-name "AI-MO/NuminaMath-1.5" \
    --output-file "numinamath_r1_generations.jsonl" \
    --prompt-column "problem" \
    --uuid-column "problem" \
    --api-addr "<SGLANG_SERVER_ADDRESS>:39877" \
    --num-generations 2 \
    --max-tokens 16384 \
    --max-concurrent 200
```

================================================
FILE: slurm/compute_pass_rate.slurm
================================================
#!/bin/bash

#SBATCH --job-name=open-r1-compute-pass-rate
#SBATCH --partition=hopper-prod
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --output=./logs/%x-%j.out
#SBATCH --error=./logs/%x-%j.err
#SBATCH --time=01-00:00:00
#SBATCH --requeue

# example usage: sbatch slurm/dataset_filter.slurm recipes/dataset_filtering/filter_dapo.yaml 0 500

set -x -e

source ~/.bashrc
source openr1/bin/activate

python scripts/pass_rate_filtering/compute_pass_rate.py --config $1 --dataset_start_index $2 --dataset_end_index $3

================================================
FILE: slurm/e2b_router.slurm
================================================
#!/bin/bash

#SBATCH --partition=hopper-cpu
#SBATCH --mem=16g
#SBATCH --cpus-per-task=16
#SBATCH --output=/fsx/open-r1/logs/e2b_router/%x-%j.out
#SBATCH --error=/fsx/open-r1/logs/e2b_router/%x-%j.err
#SBATCH --requeue
#SBATCH --time=7-00:00:00

echo "Starting job"
set -x -e

source ~/.bashrc
source openr1/bin/activate

srun python scripts/e2b_router.py

================================================
FILE: slurm/evaluate.slurm
================================================
#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --partition=hopper-prod
#SBATCH --output=./logs/%x-%j.out
#SBATCH --error=./logs/%x-%j.err
#SBATCH --requeue
#SBATCH --time=1-00:00:00


# Specific configuration optimized for the Hugging Face Compute Cluster
# Be ye warned this may not work on other clusters!
module load cuda/12.4

# Refresh Weka on h4 cache
echo "Refreshing Weka filesystem..."
find -L /fsx/h4/ -type f | xargs -d '\n' -r -n512 -P64 weka fs tier fetch

# Needed for vLLM
export VLLM_WORKER_MULTIPROC_METHOD=spawn

set -x -e

source ~/.bashrc
source openr1/bin/activate

TASK_NAME=$1
TASKS=$2
MODEL_ID=$3
MODEL_REVISION=$4
# Optional args
[ -z "$5"] && TENSOR_PARALLEL=False || TENSOR_PARALLEL=$5
[ -z "$6"] && TRUST_REMOTE_CODE=False || TRUST_REMOTE_CODE=$6
# $7 is reserved for system_prompt, see line 51
NUM_GPUS=$(nvidia-smi -L | wc -l)

# Use TP to shard model across GPUs
if [ "$TENSOR_PARALLEL" = "True" ]; then
    MODEL_ARGS="model_name=$MODEL_ID,revision=$MODEL_REVISION,trust_remote_code=$TRUST_REMOTE_CODE,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
else
    MODEL_ARGS="model_name=$MODEL_ID,revision=$MODEL_REVISION,trust_remote_code=$TRUST_REMOTE_CODE,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
fi

LM_EVAL_REPO_ID="open-r1/open-r1-eval-leaderboard"
MODEL_NAME=$(echo $MODEL_ID | sed 's/\//_/g') # replaces / with _
DETAILS_REPO_ID="open-r1/details-$MODEL_NAME"
OUTPUT_DIR="eval_results/$MODEL_ID/$MODEL_REVISION/$TASK_NAME"
# We need this flag since we run this script from training jobs that use DeepSpeed and the env vars get progated which causes errors during evaluation
ACCELERATE_USE_DEEPSPEED=false

echo "Running lighteval script ..."
echo "Eval results will be saved to $OUTPUT_DIR"
lighteval vllm "$MODEL_ARGS" $TASKS \
    --use-chat-template \
    --output-dir $OUTPUT_DIR \
    --save-details \
    ${7:+--system-prompt "$(echo "$7" | base64 --decode)"}

OUTPUT_FILEPATHS=$(find $OUTPUT_DIR/results/ -type f \( -name "*.json" \))
for filepath in $OUTPUT_FILEPATHS; do
    echo "Uploading $filepath to Hugging Face Hub..."
    filename=$(basename -- "$filepath")
    for attempt in {1..20}; do
        if huggingface-cli upload --repo-type space --private $LM_EVAL_REPO_ID $filepath $OUTPUT_DIR/$filename; then
            echo "Upload succeeded for $filepath"
            break
        else
            echo "Upload failed for $filepath. Attempt $attempt of 20. Retrying in 5 seconds..."
            sleep 5
        fi
    done
done

echo "Uploading details to Hugging Face Hub..."
DETAILS_FILEPATHS=$(find $OUTPUT_DIR/details/ -type f \( -name "*.parquet" \))
echo "DETAILS_FILEPATHS: $DETAILS_FILEPATHS"
TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S")
python scripts/upload_details.py --data_files $DETAILS_FILEPATHS --hub_repo_id $DETAILS_REPO_ID --config_name $MODEL_REVISION.$TASK_NAME.$TIMESTAMP
    
echo "Cleaning up ..."
rm -rf $OUTPUT_DIR

echo "Done!"


================================================
FILE: slurm/experimental/serve_r1_vllm.slurm
================================================
#!/bin/bash
#SBATCH --job-name=r1-vllm
#SBATCH --partition=hopper-prod
#SBATCH --qos=normal
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --output=./logs/%x_%j_%n.out
#SBATCH --error=./logs/%x_%j_%n.err
#SBATCH --time=7-00:00:00
#SBATCH --ntasks-per-node=1

set -exuo pipefail

MODEL_PATH="deepseek-ai/DeepSeek-R1"
CONDA_ENV="vllm7"
SERVER_PORT=8000
RAY_PORT=6379
RAY_DASHBOARD_PORT=8265

while getopts "m:e:h" opt; do
    case $opt in
        m) MODEL_PATH="$OPTARG" ;;
        e) CONDA_ENV="$OPTARG" ;;
        h|?) echo "Usage: sbatch $0 [-m MODEL_PATH] [-e CONDA_ENV]"; exit 1 ;;
    esac
done

# Environment setup
module load cuda/12.1
source ~/.bashrc
source "$CONDA_PREFIX/etc/profile.d/conda.sh"
conda activate "$CONDA_ENV" || { echo "Failed to activate conda env $CONDA_ENV"; exit 1; }

# Get nodes information
NODES=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
HEAD_NODE="${NODES[0]}"
HEAD_NODE_IP=$(srun --nodes=1 --ntasks=1 -w "$HEAD_NODE" hostname --ip-address)

echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "Head node: $HEAD_NODE ($HEAD_NODE_IP)"

# Start Ray head node
echo "Starting Ray head node at $HEAD_NODE"
srun --nodes=1 --ntasks=1 -w "$HEAD_NODE" \
    ray start --head \
    --node-ip-address="$HEAD_NODE_IP" \
    --port=$RAY_PORT \
    --dashboard-host=0.0.0.0 \
    --dashboard-port=$RAY_DASHBOARD_PORT \
    --block &

sleep 10

# Start Ray worker nodes
WORKER_COUNT=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= WORKER_COUNT; i++)); do
    WORKER_NODE="${NODES[$i]}"
    echo "Starting Ray worker $i at $WORKER_NODE"
    srun --nodes=1 --ntasks=1 -w "$WORKER_NODE" \
        ray start --address "$HEAD_NODE_IP:$RAY_PORT" \
        --block &
    sleep 5
done

echo "Waiting for Ray cluster to initialize..."
sleep 60

# Start vLLM server
echo "Starting vLLM server..."
RAY_ADDRESS="http://$HEAD_NODE_IP:$RAY_DASHBOARD_PORT" ray job submit \
    --working-dir src/open_r1 \
    --no-wait \
    --job-id vllm-server \
    -- vllm serve "$MODEL_PATH" \
        --tensor-parallel-size 8 \
        --pipeline-parallel-size 4 \
        --gpu-memory-utilization 0.90 \
        --max-model-len 32768 \
        --max-num-batched-tokens 262144 \
        --max-num-seqs 128 \
        --max-seq-len-to-capture 32768 \
        --enable-chunked-prefill true \
        --preemption-mode recompute \
        --swap-space 128 \
        --trust-remote-code \
        --distributed-executor-backend ray

# Wait for server with timeout
TIMEOUT=3600  # 1h
START_TIME=$(date +%s)
echo "Waiting for vLLM server (http://$HEAD_NODE_IP:$SERVER_PORT)..."

while true; do
    if curl -s -o /dev/null -w "%{http_code}" "http://$HEAD_NODE_IP:$SERVER_PORT/health" >/dev/null 2>&1; then
        echo "Server is ready at http://$HEAD_NODE_IP:$SERVER_PORT"
        break
    fi

    CURRENT_TIME=$(date +%s)
    if [ $((CURRENT_TIME - START_TIME)) -gt $TIMEOUT ]; then
        echo "Error: Server failed to start within $TIMEOUT seconds"
        exit 1
    fi

    echo "Still waiting... ($(($CURRENT_TIME - $START_TIME)) seconds elapsed)"
    sleep 60
done

echo "Checking available models..."
curl "http://$HEAD_NODE_IP:$SERVER_PORT/v1/models"
sleep 10

echo "Executing sanity check..."
curl "http://$HEAD_NODE_IP:$SERVER_PORT/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"default\",
        \"prompt\": \"<｜begin▁of▁sentence｜><｜User｜>hi, how are you?<｜Assistant｜>\",
        \"max_tokens\": 2048,
        \"temperature\": 0.6
    }"

# Keep the job running with health checks
while true; do
    if ! curl -s -o /dev/null "http://$HEAD_NODE_IP:$SERVER_PORT/health"; then
        echo "Error: Server health check failed"
        exit 1
    fi
    sleep 300
done

================================================
FILE: slurm/generate.slurm
================================================
#!/bin/bash
#SBATCH --job-name=deepseek-r1-generation
#SBATCH --partition=hopper-prod
#SBATCH --qos=normal
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --output=./logs/%x-%j.out
#SBATCH --error=./logs/%x-%j.err
#SBATCH --time=04-00:00:00

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --hf-dataset)
            HF_DATASET="$2"
            shift 2
            ;;
        --hf-dataset-config)
            HF_DATASET_CONFIG="$2"
            shift 2
            ;;
        --hf-dataset-split)
            HF_DATASET_SPLIT="$2"
            shift 2
            ;;
        --prompt-column)
            PROMPT_COLUMN="$2"
            shift 2
            ;;
        --prompt-template)
            PROMPT_TEMPLATE="$2"
            shift 2
            ;;
        --model)
            MODEL="$2"
            shift 2
            ;;
        --temperature)
            TEMPERATURE="$2"
            shift 2
            ;;
        --top-p)
            TOP_P="$2"
            shift 2
            ;;
        --max-new-tokens)
            MAX_NEW_TOKENS="$2"
            shift 2
            ;;
        --num-generations)
            NUM_GENERATIONS="$2"
            shift 2
            ;;
        --input-batch-size)
            INPUT_BATCH_SIZE="$2"
            shift 2
            ;;
        --client-replicas)
            CLIENT_REPLICAS="$2"
            shift 2
            ;;
        --timeout)
            TIMEOUT="$2"
            shift 2
            ;;
        --retries)
            RETRIES="$2"
            shift 2
            ;;
        --hf-output-dataset)
            HF_OUTPUT_DATASET="$2"
            shift 2
            ;;
        --private)
            PRIVATE="true"
            shift
            ;;
        *)
            echo "Unknown parameter: $1"
            exit 1
            ;;
    esac
done

if [ -z "$MODEL" ] || [ -z "$HF_DATASET" ]; then
    echo "Error: --model and --hf-dataset are required parameters"
    exit 1
fi

# Set default values for optional parameters
HF_DATASET_SPLIT=${HF_DATASET_SPLIT:-"train"}
PROMPT_COLUMN=${PROMPT_COLUMN:-"prompt"}
PROMPT_TEMPLATE=${PROMPT_TEMPLATE:-"{{ instruction }}"}
MAX_NEW_TOKENS=${MAX_NEW_TOKENS:-8192}
NUM_GENERATIONS=${NUM_GENERATIONS:-1}
INPUT_BATCH_SIZE=${INPUT_BATCH_SIZE:-64}
CLIENT_REPLICAS=${CLIENT_REPLICAS:-1}
TIMEOUT=${TIMEOUT:-900}
RETRIES=${RETRIES:-0}
PRIVATE=${PRIVATE:-"false"}

# Print all input arguments
echo "Input arguments:"
echo "MODEL: $MODEL"
echo "HF_DATASET: $HF_DATASET"
echo "HF_DATASET_CONFIG: $HF_DATASET_CONFIG"
echo "HF_DATASET_SPLIT: $HF_DATASET_SPLIT"
echo "PROMPT_COLUMN: $PROMPT_COLUMN"
echo "PROMPT_TEMPLATE: $PROMPT_TEMPLATE"
echo "TEMPERATURE: $TEMPERATURE"
echo "TOP_P: $TOP_P"
echo "MAX_NEW_TOKENS: $MAX_NEW_TOKENS"
echo "NUM_GENERATIONS: $NUM_GENERATIONS"
echo "INPUT_BATCH_SIZE: $INPUT_BATCH_SIZE"
echo "CLIENT_REPLICAS: $CLIENT_REPLICAS"
echo "TIMEOUT: $TIMEOUT"
echo "RETRIES: $RETRIES"
echo "HF_OUTPUT_DATASET: $HF_OUTPUT_DATASET"
echo "PRIVATE: $PRIVATE"
echo "-------------------"

set -ex

module load cuda/12.4

export LD_LIBRARY_PATH=.venv/lib/python3.11/site-packages/nvidia/nvjitlink/lib

echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"

source openr1/bin/activate

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

# Get the IP address of the head node
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Start Ray head node
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port \
    --dashboard-host=0.0.0.0 \
    --dashboard-port=8265 \
    --block &

# Give some time to head node to start...
sleep 10

# Start Ray worker nodes
worker_num=$((SLURM_JOB_NUM_NODES - 1))

# Start from 1 (0 is head node)
for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --block &
    sleep 5
done

# Give some time to the Ray cluster to gather info
echo "Waiting a bit for Ray cluster to gather node info..."
sleep 60

# Run vllm
RAY_ADDRESS="http://$head_node_ip:8265" ray job submit \
    --working-dir src/open_r1 \
    --no-wait \
    --job-id vllm-server \
    -- vllm serve $MODEL \
    --tensor-parallel-size $SLURM_GPUS_PER_NODE \
    --pipeline-parallel-size $SLURM_JOB_NUM_NODES \
    --gpu-memory-utilization=0.85 \
    --max-model-len 16384 \
    --enable-chunked-prefill \
    --trust-remote-code \
    --distributed-executor-backend ray

# wait for vllm to load the model
echo "Waiting for vLLM (http://$head_node_ip:8000) server to be up..."

# wait for vllm to load and serve the model
while true; do
    if curl -s -o /dev/null -w "%{http_code}" http://$head_node_ip:8000 >/dev/null 2>&1; then
        echo "Received response from http://$head_node_ip:8000"
        break
    else
        echo "Still waiting... (Press Ctrl+C to cancel)"
        sleep 60
    fi
done

echo "Checking available models..."
curl http://$head_node_ip:8000/v1/models

echo "Executing sanity check..."
curl http://$head_node_ip:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"$MODEL\",
        \"prompt\": \"<｜begin▁of▁sentence｜><｜User｜>hi, how are you?<｜Assistant｜>\",
        \"max_tokens\": 2048,
        \"temperature\": 0.6
    }"

# Finally submit the job to the cluster
echo "Submitting job to ray cluster..."
RAY_ADDRESS="http://$head_node_ip:8265" ray job submit \
    --working-dir src/open_r1 \
    --job-id generate \
    -- python -u generate.py \
    --model "$MODEL" \
    --hf-dataset "$HF_DATASET" \
    ${HF_DATASET_CONFIG:+--hf-dataset-config "$HF_DATASET_CONFIG"} \
    --hf-dataset-split "$HF_DATASET_SPLIT" \
    --prompt-column "$PROMPT_COLUMN" \
    --prompt-template "$PROMPT_TEMPLATE" \
    ${TEMPERATURE:+--temperature "$TEMPERATURE"} \
    ${TOP_P:+--top-p "$TOP_P"} \
    --max-new-tokens "$MAX_NEW_TOKENS" \
    --num-generations "$NUM_GENERATIONS" \
    --input-batch-size "$INPUT_BATCH_SIZE" \
    --client-replicas "$CLIENT_REPLICAS" \
    --timeout "$TIMEOUT" \
    --retries "$RETRIES" \
    ${HF_OUTPUT_DATASET:+--hf-output-dataset "$HF_OUTPUT_DATASET"} \
    ${PRIVATE:+--private} \
    --vllm-server-url "http://$head_node_ip:8000/v1"

mkdir -p ray_logs

echo "Downloading Ray job logs..."
RAY_ADDRESS="http://$head_node_ip:8265" ray job logs --job-id vllm-server > ray_logs/vllm-server-${SLURM_JOB_ID}.log
RAY_ADDRESS="http://$head_node_ip:8265" ray job logs --job-id generate > ray_logs/generate-${SLURM_JOB_ID}.log

================================================
FILE: slurm/morph_router.slurm
================================================
#!/bin/bash

#SBATCH --partition=hopper-cpu
#SBATCH --mem=16g
#SBATCH --cpus-per-task=16
#SBATCH --output=/fsx/open-r1/logs/morph_router/%x-%j.out
#SBATCH --err=/fsx/open-r1/logs/morph_router/%x-%j.err
#SBATCH --requeue
#SBATCH --time=7-00:00:00


echo "Starting job"
set -x -e

source ~/.bashrc
source openr1/bin/activate

srun python scripts/morph_router.py --port 8001 --max_num_sandboxes 20


================================================
FILE: slurm/piston/README.md
================================================
# Piston workers (slurm)

We have built a [piston](https://github.com/engineer-man/piston) package to run IOI problems.

To launch a fleet of piston workers on a slurm cluster, you can adapt the paths in `launch_piston_workers.sh` and `launch_single_piston.sh` and run:
```bash
slurm/piston/launch_piston_workers.sh (number of workers to launch)
```

This command will launch a slurm job for each worker, which will be called `piston-worker-<port>`, where `<port>` is the port where the worker will be listening.

## First time setup
You will need to install the [IOI package](https://github.com/guipenedo/piston/tree/master/packages/cms_ioi/1.0.0) in the workers.
1. Launch a single worker:
```bash
slurm/piston/launch_piston_workers.sh 1
```

2. Assuming it's running on `ip-10-53-86-146:1234`, send the package install request:

For IOI:
```bash
curl -X POST http://ip-10-53-86-146:1234/api/v2/packages -H "Content-Type: application/json" -d '{"language": "cms_ioi", "version": "1.0.0"}'
```

For CodeForces:
```bash
curl -X POST http://ip-10-53-86-146:1234/api/v2/packages -H "Content-Type: application/json" -d '{"language": "codeforces", "version": "1.0.0"}'
```

3. You can now launch more workers and due to the shared mounted packages directory, they should already have the package installed.

To have the main script find the workers automatically, you can export the following environment variable:
```bash
export PISTON_ENDPOINTS=slurm
```
Alternatively your can add `PISTON_ENDPOINTS=slurm` to your .env file.

You can also change `PISTON_MAX_REQUESTS_PER_ENDPOINT`, which tries to limit how many simultaneous requests each worker will handle (1 by default). Keep in mind that this is a local limit and in distributed setups, as there is no global limit, workers might sometimes be overwhelmed when some processes hit the same worker.

If you would like to adapt the code to run without piston, please see the [ioi repo](https://github.com/huggingface/ioi).
For CodeForces, you should implement the [`run`](https://github.com/guipenedo/piston/blob/master/packages/codeforces/1.0.0/run) and [`compile`](https://github.com/guipenedo/piston/blob/master/packages/codeforces/1.0.0/compile) scripts.

# Piston workers (local docker)
This will launch a single worker in a docker container. Consider launching multiple workers for better scalability. Replace 2000 with the port you want to use.
Make sure to change `/path/to/local/packages` to the path you want to persist for package installs.

```bash
docker run -d \
  --name piston_worker \
  -v /path/to/local/packages:/piston/packages \
  -e PORT=2000 \
  -e PISTON_COMPILE_TIMEOUT=60000 \
  -e PISTON_RUN_TIMEOUT=60000 \
  -e PISTON_OUTPUT_MAX_SIZE=1000000000 \
  -e PISTON_MAX_FILE_SIZE=1000000000 \
  -e PISTON_DISABLE_NETWORKING=true \
  -e PISTON_REPO_URL=https://github.com/guipenedo/piston/releases/download/pkgs/index \
  -p 2000:2000 \
  --entrypoint /bin/bash \
  ghcr.io/engineer-man/piston@sha256:63b5654156a89c5a2ad281aface21416615d62ec056d88efe8fcd307ce73575a \
  -c "sed -i '/app.use(body_parser.urlencoded/c\    app.use(body_parser.urlencoded({ extended: true, limit: \"512mb\" }));' src/index.js && \
      sed -i '/app.use(body_parser.json/c\    app.use(body_parser.json({ limit: \"512mb\" }));' src/index.js && \
      node src"
```

Install the package:
For IOI:
```bash
curl -X POST http://localhost:2000/api/v2/packages -H "Content-Type: application/json" -d '{"language": "cms_ioi", "version": "1.0.0"}'
```

For CodeForces:
```bash
curl -X POST http://localhost:2000/api/v2/packages -H "Content-Type: application/json" -d '{"language": "codeforces", "version": "1.0.0"}'
```

Remember to set `PISTON_ENDPOINTS`:
```bash
export PISTON_ENDPOINTS=http://localhost:2000/api/v2,http://localhost:2001/api/v2,http://localhost:2002/api/v2
```


================================================
FILE: slurm/piston/launch_piston_workers.sh
================================================
#!/bin/bash

# this simple script will launch a bunch of piston workers on the HF science cluster

N_INSTANCES=${1:-5}  # Default to 5 instances

for i in $(seq 1 $N_INSTANCES); do
    # Find random (hopefully) available port
    PORT=$(comm -23 <(seq 2000 10000 | sort) <(ss -tan | awk '{print $4}' | cut -d':' -f2 | sort -u) | shuf | head -n1)
    
    # the job name format is important for the code to then be able to get a list of workers. `piston-worker-<port>`
    sbatch \
        --job-name="piston-worker-$PORT" \
        --export=ALL,PORT=$PORT \
        slurm/piston/launch_single_piston.sh
done

================================================
FILE: slurm/piston/launch_single_piston.sh
================================================
#!/bin/bash
#SBATCH --job-name=piston_worker
#SBATCH --output=/fsx/open-r1/logs/piston/worker-logs/%x-%j.out
#SBATCH --error=/fsx/open-r1/logs/piston/worker-logs/%x-%j.out  # Redirect error logs to .out
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1950M
#SBATCH --partition=hopper-cpu
#SBATCH --time=48:00:00

# sometimes if a bunch of workers start at the same time pyxis dies
sleep $(( RANDOM % 20 ))

# mounting the packages folder lets us not have to manually install the package on each instance
# we use 63b5654156a89c5a2ad281aface21416615d62ec056d88efe8fcd307ce73575a as the latest image requires isolate, which does not work on the HF science cluster (cgroups incompatibility)
# feel free try with the latest image
# the code you see below increases the very constrained piston default limits, and sets the repo url to the one hosting our IOI package
srun --container-mounts=/fsx/guilherme/ioi2024/piston_files/packages:/piston/packages --container-image "ghcr.io#engineer-man/piston:sha256:63b5654156a89c5a2ad281aface21416615d62ec056d88efe8fcd307ce73575a" \
    bash -c "
    export PISTON_COMPILE_TIMEOUT=60000
    export PISTON_RUN_TIMEOUT=60000
    export PISTON_OUTPUT_MAX_SIZE=1000000000
    export PISTON_MAX_FILE_SIZE=1000000000
    export PISTON_DISABLE_NETWORKING=true
    export PISTON_REPO_URL=https://github.com/guipenedo/piston/releases/download/pkgs/index

    sed -i '/app.use(body_parser.urlencoded/c\    app.use(body_parser.urlencoded({ extended: true, limit: \"512mb\" }));' src/index.js
    sed -i '/app.use(body_parser.json/c\    app.use(body_parser.json({ limit: \"512mb\" }));' src/index.js

    # Start server in background
    node src
    "


================================================
FILE: slurm/serve_r1.slurm
================================================
#!/bin/bash
#SBATCH --job-name=r1-server
#SBATCH --partition=hopper-prod
#SBATCH --qos=normal
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --output=./logs/%x_%j_%n.out
#SBATCH --error=./logs/%x_%j_%n.err
#SBATCH --time=7-00:00:00
#SBATCH --ntasks-per-node=1

set -exuo pipefail

MODEL_PATH="deepseek-ai/DeepSeek-R1"
CONDA_ENV="sglang124"
ROUTER_ADDRESS=""
SERVER_PORT=39877
DIST_PORT=45000

# TODO: Adjust these variables to your cluster configuration
export OUTLINES_CACHE_DIR=/scratch/serve_r1/ocache/
export TRITON_HOME=/scratch/serve_r1/triton/
export GLOO_SOCKET_IFNAME="enp71s0"
export NCCL_SOCKET_IFNAME="enp71s0"

while getopts "m:e:r:h" opt; do
    case $opt in
        m) MODEL_PATH="$OPTARG" ;;
        e) CONDA_ENV="$OPTARG" ;;
        r) ROUTER_ADDRESS="$OPTARG" ;;
        h|?) echo "Usage: sbatch $0 [-m MODEL_PATH] [-e CONDA_ENV] [-r ROUTER_ADDRESS]"; exit 1 ;;
    esac
done

# TODO: Environment setup, adjust to your cluster configuration
module load cuda/12.4
source ~/.bashrc
source "$CONDA_PREFIX/etc/profile.d/conda.sh"
conda activate "$CONDA_ENV" || { echo "Failed to activate conda env $CONDA_ENV"; exit 1; }

FIRST_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
FIRST_NODE_IP=$(srun --nodes=1 --ntasks=1 -w "$FIRST_NODE" hostname --ip-address)

# Launch servers synchronously across all nodes
# (--max-running-requests=56 is rough estimate to avoid too many evicted/preempted 16k-long requests)
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 \
    bash -c "python -m sglang.launch_server \
        --model-path '$MODEL_PATH' \
        --tp 16 \
        --dist-init-addr '$FIRST_NODE_IP:$DIST_PORT' \
        --nnodes 2 \
        --node-rank \$SLURM_PROCID \
        --port '$SERVER_PORT' \
        --host 0.0.0.0 \
        --trust-remote-code \
        --max-running-requests 56 \
        --context-length 32768" &

# Wait for server with timeout
TIMEOUT=3600  # 1h, but model loading should take ~30min
START_TIME=$(date +%s)
echo "Waiting for SGLang server (http://$FIRST_NODE_IP:$SERVER_PORT)..."

while true; do
    if curl -s -o /dev/null -w "%{http_code}" "http://$FIRST_NODE_IP:$SERVER_PORT/health" >/dev/null 2>&1; then
        echo "Server is ready at http://$FIRST_NODE_IP:$SERVER_PORT"
        break
    fi

    CURRENT_TIME=$(date +%s)
    if [ $((CURRENT_TIME - START_TIME)) -gt $TIMEOUT ]; then
        echo "Error: Server failed to start within $TIMEOUT seconds"
        exit 1
    fi

    echo "Still waiting... ($(($CURRENT_TIME - $START_TIME)) seconds elapsed)"
    sleep 60
done

# Register with router only if address was provided
if [ -n "$ROUTER_ADDRESS" ]; then
    echo "Registering with router at $ROUTER_ADDRESS..."
    curl -X POST "http://$ROUTER_ADDRESS/add_worker?url=http://$FIRST_NODE_IP:$SERVER_PORT" || true
    sleep 10
fi

echo "Checking available models..."
curl "http://$FIRST_NODE_IP:$SERVER_PORT/v1/models"
sleep 10

echo "Executing sanity check..."
curl "http://$FIRST_NODE_IP:$SERVER_PORT/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"default\",
        \"prompt\": \"<｜begin▁of▁sentence｜><｜User｜>hi, how are you?<｜Assistant｜>\",
        \"max_tokens\": 2048,
        \"temperature\": 0.6
    }"

# Keep the job running with health checks
while true; do
    if ! curl -s -o /dev/null "http://$FIRST_NODE_IP:$SERVER_PORT/health"; then
        echo "Error: Server health check failed"
        exit 1
    fi
    sleep 300
done

================================================
FILE: slurm/serve_router.slurm
================================================
#!/bin/bash
#SBATCH --job-name=r1-router
#SBATCH --partition=hopper-cpu
#SBATCH --qos=high
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1875m
#SBATCH --output=./logs/%x_%j_%n.out
#SBATCH --error=./logs/%x_%j_%n.err
#SBATCH --time=30-00:00:00
#SBATCH --requeue

set -exuo pipefail

# TODO: Adjust these variables to your cluster configuration
CONDA_ENV="sglang124"
ROUTER_PORT=39876

trap 'scontrol requeue ${SLURM_JOB_ID}; exit 15' SIGUSR1

while getopts "e:h" opt; do
    case $opt in
        e) CONDA_ENV="$OPTARG" ;;
        h|?) echo "Usage: sbatch $0 [-e CONDA_ENV]"; exit 1 ;;
    esac
done

# TODO: Environment setup, adjust to your cluster configuration
source ~/.bashrc
source "$CONDA_PREFIX/etc/profile.d/conda.sh"
conda activate "$CONDA_ENV" || { echo "Failed to activate conda env $CONDA_ENV"; exit 1; }

python -m sglang_router.launch_router \
    --port "$ROUTER_PORT" \
    --host 0.0.0.0 \
    --worker-startup-timeout-secs 300

# Keep the job running with health checks
while true; do
    if ! curl -s -o /dev/null "http://localhost:$ROUTER_PORT/health"; then
        echo "Error: Router health check failed"
        exit 1
    fi
    sleep 300
done

================================================
FILE: slurm/train.slurm
================================================
#!/bin/bash
#SBATCH --job-name=open_r1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gres=gpu:8
#SBATCH --partition=hopper-prod  # Adjust this for your cluster
#SBATCH --output=./logs/%x-%j.out
#SBATCH --error=./logs/%x-%j.err
#SBATCH --requeue
#SBATCH --time=3-00:00:00


if [[ "$*" == *"--help"* ]]; then
  echo "Usage: sbatch slurm/train.slurm [options]"
  echo "Options:"
  echo "  --model MODEL            Model name"
  echo "  --task TASK              Task name (e.g. sft, grpo)"
  echo "  --config SUFFIX          Configuration suffix (e.g. demo, v00.00)"
  echo "  --accelerator CONFIG     Accelerator configuration name (e.g. zero3)"
  echo "  --dp N                   Data parallelism for vLLM server (default: 1)"
  echo "  --tp N                   Tensor parallelism for vLLM server (default: 1)"
  echo "  --args \"ARGS\"          Optional arguments to pass to the training script"
  exit 0
fi

# Specific configuration optimized for the Hugging Face Compute Cluster
module load cuda/12.4
set -x -e

source ~/.bashrc
source openr1/bin/activate
START_TIME=$(date +%s)
echo "START TIME: $(date)"

# Refresh Weka on h4 cache
echo "Refreshing Weka filesystem..."
find -L /fsx/h4/ -type f | xargs -d '\n' -r -n512 -P64 weka fs tier fetch

# Default values
MODEL=""
TASK=""
CONFIG_SUFFIX=""
ACCELERATOR=""
DP=1
TP=1
OPTIONAL_ARGS=""

# Parse command line arguments
while [[ $# -gt 0 ]]; do
  case $1 in
    --model)
      MODEL="$2"
      shift 2
      ;;
    --task)
      TASK="$2"
      shift 2
      ;;
    --config)
      CONFIG_SUFFIX="$2"
      shift 2
      ;;
    --accelerator)
      ACCELERATOR="$2"
      shift 2
      ;;
    --dp)
      DP="$2"
      shift 2
      ;;
    --tp)
      TP="$2"
      shift 2
      ;;
    --args)
      OPTIONAL_ARGS="$2"
      shift 2
      ;;
    *)
      echo "Unknown option: $1"
      echo "Use --help for usage information"
      exit 1
      ;;
  esac
done

# Validate required arguments
if [[ -z "$MODEL" || -z "$TASK" || -z "$CONFIG_SUFFIX" || -z "$ACCELERATOR" ]]; then
  echo "Error: Missing required arguments"
  echo "Run with --help for usage information"
  exit 1
fi


CONFIG_FILE=recipes/$MODEL/$TASK/config_$CONFIG_SUFFIX.yaml
GRAD_ACC_STEPS=$(grep 'gradient_accumulation_steps' $CONFIG_FILE | awk '{print $2}')

# Split the string into individual arguments
IFS=' ' read -ra ARGS <<< "$OPTIONAL_ARGS"
# Loop through the arguments and find the one with "--gradient_accumulation_steps"
for arg in "${ARGS[@]}"; do
    if [[ "$arg" == "--gradient_accumulation_steps="* ]]; then
        # Extract the value after the equals sign
        GRAD_ACC_STEPS="${arg#*=}"
        break  # Exit the loop once we find the desired argument
    fi
done

echo "Gradient accumulation steps: $GRAD_ACC_STEPS"

MODEL=$(grep 'model_name_or_path:' $CONFIG_FILE | awk '{print $2}')
REVISION=$(grep 'model_revision:' $CONFIG_FILE | head -n 1 | awk '{print $2}')

# Distributed configuration
NUM_NODES=$SLURM_NNODES
GPUS_PER_NODE=8
WORLD_SIZE=$(($NUM_NODES*$GPUS_PER_NODE))
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
MASTER_ADDR=${NODELIST[0]}  # First node for main process
MASTER_PORT=6000
TRAIN_NODES=("${NODELIST[@]}")

USE_VLLM="false"
if [[ -f "$CONFIG_FILE" ]] && grep -qE '^\s*use_vllm:\s*true' "$CONFIG_FILE"; then
    USE_VLLM="true"
fi
# if using vllm
if [[ "$USE_VLLM" == "true" ]]; then
     TRAIN_NODES=("${NODELIST[@]:0:$((NUM_NODES - 1))}")
     VLLM_NODE=${NODELIST[-1]} # Last node
     WORLD_SIZE=$((WORLD_SIZE - GPUS_PER_NODE))
     NUM_NODES=$((NUM_NODES - 1))
     srun --nodes=1 --ntasks=1 --nodelist=$VLLM_NODE trl vllm-serve --model $MODEL --revision $REVISION --tensor_parallel_size $TP --data_parallel_size $DP &

     OPTIONAL_ARGS="$OPTIONAL_ARGS --vllm_server_host=$VLLM_NODE"
fi

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1

export CMD=" \
    src/open_r1/$TASK.py --config $CONFIG_FILE $OPTIONAL_ARGS
    "

export LAUNCHER="ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch \
    --config_file recipes/accelerate_configs/$ACCELERATOR.yaml  \
    --gradient_accumulation_steps $GRAD_ACC_STEPS \
    --num_machines $NUM_NODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank $SLURM_PROCID \
    --rdzv_backend=c10d \
    --max_restarts 1 \
    --tee 3 \
    "
# srun error handling:
# --wait=60: wait 60 sec after the first task terminates before terminating all remaining tasks
# --kill-on-bad-exit=1: terminate a step if any task exits with a non-zero exit code
NODELIST=$(IFS=,; echo "${TRAIN_NODES[*]}")

SRUN_ARGS=" \
    --wait=60 \
    --kill-on-bad-exit=1 \
    --nodes=$NUM_NODES \
    --ntasks=$NUM_NODES \
    --nodelist=$NODELIST
    "
srun $SRUN_ARGS bash -c "$LAUNCHER $CMD" 2>&1

END_TIME=$(date +%s)
echo "END TIME: $(date)"
ELAPSED_SECONDS=$((END_TIME - START_TIME))
HOURS=$((ELAPSED_SECONDS / 3600))
MINUTES=$(( (ELAPSED_SECONDS % 3600) / 60 ))
SECONDS=$((ELAPSED_SECONDS % 60))
echo "TOTAL JOB TIME: ${HOURS}h ${MINUTES}m ${SECONDS}s (${ELAPSED_SECONDS} seconds)"


================================================
FILE: src/open_r1/__init__.py
================================================
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


================================================
FILE: src/open_r1/configs.py
================================================
# coding=utf-8
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass, field
from typing import Any, Literal, Optional

import trl


@dataclass
class DatasetConfig:
    """Configuration for a dataset in a mixture."""

    id: str
    config: Optional[str] = None
    split: str = "train"
    columns: Optional[list[str]] = None
    weight: Optional[float] = None


@dataclass
class DatasetMixtureConfig:
    """Configuration for a mixture of datasets."""

    datasets: list[DatasetConfig]
    seed: int = 0
    test_split_size: Optional[float] = None


@dataclass
class ScriptArguments(trl.ScriptArguments):
    """
    Extended version of ScriptArguments with support for dataset mixtures.

    Args:
        dataset_mixture (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
            Configuration for creating dataset mixtures with advanced options.
            Format:
              dataset_mixture:
                datasets:
                  - id: dataset_id1
                    config: config_name
                    columns:
                      - col1
                      - col2
                    weight: 0.5
                  - id: dataset_id2
                    config: config_name
                    columns:
                      - col1
                      - col2
                    weight: 0.5
                seed: 42
                test_split_size: 0.1
    """

    # Override the dataset_name to make it optional
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "Dataset name. Can be omitted if using dataset_mixture."}
    )
    dataset_mixture: Optional[dict[str, Any]] = field(
        default=None,
        metadata={"help": "Configuration for creating dataset mixtures with advanced options like shuffling."},
    )

    def __post_init__(self):
        if self.dataset_name is None and self.dataset_mixture is None:
            raise ValueError("Either `dataset_name` or `dataset_mixture` must be provided")

        if self.dataset_mixture is not None:
            if not isinstance(self.dataset_mixture, dict) or "datasets" not in self.dataset_mixture:
                raise ValueError(
                    "dataset_mixture must be a dictionary with a 'datasets' key. "
                    "Expected format: {'datasets': [...], 'seed': int}"
                )

            datasets_list = []
            datasets_data = self.dataset_mixture.get("datasets", [])

            if isinstance(datasets_data, list):
                for dataset_config in datasets_data:
                    datasets_list.append(
                        DatasetConfig(
                            id=dataset_config.get("id"),
                            config=dataset_config.get("config"),
                            split=dataset_config.get("split", "train"),
                            columns=dataset_config.get("columns"),
                            weight=dataset_config.get("weight", 1.0),
                        )
                    )
            else:
                raise ValueError("'datasets' must be a list of dataset configurations")

            self.dataset_mixture = DatasetMixtureConfig(
                datasets=datasets_list,
                seed=self.dataset_mixture.get("seed", 0),
                test_split_size=self.dataset_mixture.get("test_split_size", None),
            )

            # Check that column names are consistent across all dataset configs
            columns_sets = [set(dataset.columns) for dataset in datasets_list if dataset.columns is not None]
            if columns_sets:
                first_columns = columns_sets[0]
                if not all(columns == first_columns for columns in columns_sets):
                    raise ValueError(
                        "Column names must be consistent across all dataset configurations in a mixture. "
                        f"Found different column sets: {[list(cols) for cols in columns_sets]}"
                    )


# TODO: add the shared options with a mixin to reduce code duplication
@dataclass
class GRPOConfig(trl.GRPOConfig):
    """
    args for callbacks, benchmarks etc
    """

    benchmarks: list[str] = field(
        default_factory=lambda: [],
        metadata={"help": "The benchmarks to run after training."},
    )
    callbacks: list[str] = field(
        default_factory=lambda: [],
        metadata={"help": "The callbacks to run during training."},
    )
    chat_template: Optional[str] = field(default=None, metadata={"help": "The chat template to use."})
    hub_model_revision: Optional[str] = field(
        default="main", metadata={"help": "The Hub model branch to push the model to."}
    )
    num_completions_to_print: int = field(default=0, metadata={"help": "Number of completions to print."})
    overwrite_hub_revision: bool = field(default=False, metadata={"help": "Whether to overwrite the Hub revision."})
    push_to_hub_revision: bool = field(default=False, metadata={"help": "Whether to push to a Hub revision/branch."})
    system_prompt: Optional[str] = field(
        default=None,
        metadata={"help": "The optional system prompt to use."},
    )
    wandb_log_unique_prompts: bool = field(
        default=True,
        metadata={
            "help": ("Whether to log the unique prompts to wandb. This will create a new run for each unique prompt.")
        },
    )
    wandb_entity: Optional[str] = field(
        default=None,
        metadata={"help": ("The entity to store runs under.")},
    )
    wandb_project: Optional[str] = field(
        default=None,
        metadata={"help": ("The project to store runs under.")},
    )
    wandb_run_group: Optional[str] = field(
        default=None,
        metadata={"help": ("The group to store runs under.")},
    )


@dataclass
class SFTConfig(trl.SFTConfig):
    """
    args for callbacks, benchmarks etc
    """

    benchmarks: list[str] = field(
        default_factory=lambda: [],
        metadata={"help": "The benchmarks to run after training."},
    )
    callbacks: list[str] = field(
        default_factory=lambda: [],
        metadata={"help": "The callbacks to run during training."},
    )
    chat_template: Optional[str] = field(default=None, metadata={"help": "The chat template to use."})
    system_prompt: Optional[str] = field(
        default=None,
        metadata={"help": "The optional system prompt to use for benchmarking."},
    )
    hub_model_revision: Optional[str] = field(
        default="main",
        metadata={"help": "The Hub model branch to push the model to."},
    )
    overwrite_hub_revision: bool = field(default=False, metadata={"help": "Whether to overwrite the Hub revision."})
    push_to_hub_revision: bool = field(default=False, metadata={"help": "Whether to push to a Hub revision/branch."})
    wandb_entity: Optional[str] = field(
        default=None,
        metadata={"help": ("The entity to store runs under.")},
    )
    wandb_project: Optional[str] = field(
        default=None,
        metadata={"help": ("The project to store runs under.")},
    )
    wandb_run_group: Optional[str] = field(
        default=None,
        metadata={"help": ("The group to store runs under.")},
    )


@dataclass
class GRPOScriptArguments(ScriptArguments):
    """
    Script arguments for the GRPO training script.

    Args:
        reward_funcs (`list[str]`):
            List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty', 'length', 'tag_count', 'code', 'ioi_code', 'code_format', 'soft_overlong_punishment'.
        cosine_min_value_wrong (`float`):
            Minimum reward for cosine scaling for wrong answers.
        cosine_max_value_wrong (`float`):
            Maximum reward for cosine scaling for wrong answers.
        cosine_min_value_correct (`float`):
            Minimum reward for cosine scaling for correct answers.
        cosine_max_value_correct (`float`):
            Maximum reward for cosine scaling for correct answers.
        cosine_max_len (`int`):
            Maximum length for cosine scaling.
        code_language (`str`):
            Language for code format reward.
        max_completion_len (`int`):
            Maximum number of tokens in completion.
        soft_punish_cache (`int`):
            Minimum number of tokens in completion.
    """

    reward_funcs: list[str] = field(
        default_factory=lambda: ["accuracy", "format", "tag_count"],
        metadata={
            "help": "List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty', 'length', tag_count', 'code', 'code_format'"
        },
    )
    cosine_min_value_wrong: float = field(
        default=0.0,
        metadata={"help": "Minimum reward for wrong answers"},
    )
    cosine_max_value_wrong: float = field(
        default=-0.5,
        metadata={"help": "Maximum reward for wrong answers"},
    )
    cosine_min_value_correct: float = field(
        default=0.5,
        metadata={"help": "Minimum reward for correct answers"},
    )
    cosine_max_value_correct: float = field(
        default=1.0,
        metadata={"help": "Maximum reward for correct answers"},
    )
    cosine_max_len: int = field(
        default=1000,
        metadata={"help": "Maximum length for scaling"},
    )
    repetition_n_grams: int = field(
        default=3,
        metadata={"help": "Number of n-grams for repetition penalty reward"},
    )
    repetition_max_penalty: float = field(
        default=-1.0,
        metadata={"help": "Maximum (negative) penalty for for repetition penalty reward"},
    )
    code_language: str = field(
        default="python",
        # '(?:python|cpp)'
        metadata={
            "help": "Language for code format reward. Based on E2B supported languages https://e2b.dev/docs/code-interpreting/supported-languages",
            "choices": ["python", "javascript", "r", "java", "bash", "cpp"],
        },
    )
    code_eval_test_batch_size: int = field(
        default=1,
        metadata={
            "help": "for each generation, evaluate these many test cases in parallel, then check if any of them failed (0 score): if so stop evaluating; otherwise continue with the next batch of test cases. Useful to avoid overloading the eval server + save time on wrong solutions"
        },
    )
    code_eval_scoring_mode: Literal["pass_fail", "partial", "weighted_sum"] = field(
        default="weighted_sum",
        metadata={"help": "use fraction of passed test cases as reward. If false, use 0/1 scoring."},
    )
    parallel_code_exec_per_proc: int = field(
        default=2,
        metadata={
            "help": "Number of parallel E2B code executions per process. Default of 2 is suitable for the Free Hobby tier of E2B with 8 GPUs used for training."
        },
    )

    dataset_prompt_column: str = field(
        default="prompt",
        metadata={"help": "Column to use as prompts for training."},
    )

    e2b_router_url: Optional[str] = field(
        default=None,
        metadata={"help": "URL for the E2B router. See scripts/e2b_router.py"},
    )

    morph_router_url: Optional[str] = field(
        default=None,
        metadata={"help": "URL for the MorphCloud router. See scripts/morph_router.py"},
    )

    code_provider: Optional[str] = field(
        default="e2b",
        metadata={
            "help": "Provider for code execution. Options: 'e2b', 'local', 'morph'.",
            "choices": ["e2b", "local", "morph"],
        },
    )

    ioi_provider: Optional[str] = field(
        default="piston",
        metadata={
            "help": "Provider for IOI code execution. Options: 'piston', 'morph'.",
            "choices": ["piston", "morph"],
        },
    )

    max_completion_len: int = field(
        default=16384,
        metadata={"help": "Maximum number of characters in completion."},
    )
    soft_punish_cache: int = field(
        default=4096,
        metadata={"help": "Minimum number of characters in completion."},
    )


================================================
FILE: src/open_r1/generate.py
================================================
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Optional

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import StepResources
from distilabel.steps.tasks import TextGeneration


def build_distilabel_pipeline(
    model: str,
    base_url: str = "http://localhost:8000/v1",
    prompt_column: Optional[str] = None,
    prompt_template: str = "{{ instruction }}",
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    max_new_tokens: int = 8192,
    num_generations: int = 1,
    input_batch_size: int = 64,
    client_replicas: int = 1,
    timeout: int = 900,
    retries: int = 0,
) -> Pipeline:
    generation_kwargs = {"max_new_tokens": max_new_tokens}

    if temperature is not None:
        generation_kwargs["temperature"] = temperature

    if top_p is not None:
        generation_kwargs["top_p"] = top_p

    with Pipeline().ray() as pipeline:
        TextGeneration(
            llm=OpenAILLM(
                base_url=base_url,
                api_key="something",
                model=model,
                timeout=timeout,
                max_retries=retries,
                generation_kwargs=generation_kwargs,
            ),
            template=prompt_template,
            input_mappings=({"instruction": prompt_column} if prompt_column is not None else {}),
            input_batch_size=input_batch_size,
            num_generations=num_generations,
            group_generations=True,
            resources=StepResources(replicas=client_replicas),
        )

    return pipeline


if __name__ == "__main__":
    import argparse

    from datasets import load_dataset

    parser = argparse.ArgumentParser(description="Run distilabel pipeline for generating responses with DeepSeek R1")
    parser.add_argument(
        "--hf-dataset",
        type=str,
        required=True,
        help="HuggingFace dataset to load",
    )
    parser.add_argument(
        "--hf-dataset-config",
        type=str,
        required=False,
        help="Dataset config to use",
    )
    parser.add_argument(
        "--hf-dataset-split",
        type=str,
        default="train",
        help="Dataset split to use",
    )
    parser.add_argument(
        "--prompt-column",
        type=str,
        default="prompt",
    )
    parser.add_argument(
        "--prompt-template",
        type=str,
        default="{{ instruction }}",
        help="Template string for formatting prompts.",
    )
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Model name to use for generation",
    )
    parser.add_argument(
        "--vllm-server-url",
        type=str,
        default="http://localhost:8000/v1",
        help="URL of the vLLM server",
    )
    parser.add_argument(
        "--temperature",
        type=float,
        help="Temperature for generation",
    )
    parser.add_argument(
        "--top-p",
        type=float,
        help="Top-p value for generation",
    )
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=8192,
        help="Maximum number of new tokens to generate",
    )
    parser.add_argument(
        "--num-generations",
        type=int,
        default=1,
        help="Number of generations per problem",
    )
    parser.add_argument(
        "--input-batch-size",
        type=int,
        default=64,
        help="Batch size for input processing",
    )
    parser.add_argument(
        "--client-replicas",
        type=int,
        default=1,
        help="Number of client replicas for parallel processing",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=600,
        help="Request timeout in seconds (default: 600)",
    )
    parser.add_argument(
        "--retries",
        type=int,
        default=0,
        help="Number of retries for failed requests (default: 0)",
    )
    parser.add_argument(
        "--hf-output-dataset",
        type=str,
        required=False,
        help="HuggingFace repo to push results to",
    )
    parser.add_argument(
        "--private",
        action="store_true",
        help="Whether to make the output dataset private when pushing to HF Hub",
    )

    args = parser.parse_args()

    print("\nRunning with arguments:")
    for arg, value in vars(args).items():
        print(f"  {arg}: {value}")
    print()

    print(f"Loading '{args.hf_dataset}' (config: {args.hf_dataset_config}, split: {args.hf_dataset_split}) dataset...")
    dataset = load_dataset(args.hf_dataset, args.hf_dataset_config, split=args.hf_dataset_split)
    print("Dataset loaded!")

    pipeline = build_distilabel_pipeline(
        model=args.model,
        base_url=args.vllm_server_url,
        prompt_template=args.prompt_template,
        prompt_column=args.prompt_column,
        temperature=args.temperature,
        top_p=args.top_p,
        max_new_tokens=args.max_new_tokens,
        num_generations=args.num_generations,
        input_batch_size=args.input_batch_size,
        client_replicas=args.client_replicas,
        timeout=args.timeout,
        retries=args.retries,
    )

    print("Running generation pipeline...")
    distiset = pipeline.run(
        dataset=dataset,
        dataset_batch_size=args.input_batch_size * 1000,
        use_cache=False,
    )
    print("Generation pipeline finished!")

    if args.hf_output_dataset:
        print(f"Pushing resulting dataset to '{args.hf_output_dataset}'...")
        distiset.push_to_hub(args.hf_output_dataset, private=args.private)
        print("Dataset pushed!")


================================================
FILE: src/open_r1/grpo.py
================================================
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import os
import sys

import datasets
import transformers
from transformers import set_seed
from transformers.trainer_utils import get_last_checkpoint

from open_r1.configs import GRPOConfig, GRPOScriptArguments
from open_r1.rewards import get_reward_funcs
from open_r1.utils import get_dataset, get_model, get_tokenizer
from open_r1.utils.callbacks import get_callbacks
from open_r1.utils.wandb_logging import init_wandb_training
from trl import GRPOTrainer, ModelConfig, TrlParser, get_peft_config


logger = logging.getLogger(__name__)


def main(script_args, training_args, model_args):
    # Set seed for reproducibility
    set_seed(training_args.seed)

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process a small summary
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Model parameters {model_args}")
    logger.info(f"Script parameters {script_args}")
    logger.info(f"Training parameters {training_args}")

    # Check for last checkpoint
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir):
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
        logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")

    if "wandb" in training_args.report_to:
        init_wandb_training(training_args)

    # Load the dataset
    dataset = get_dataset(script_args)

    ################
    # Load tokenizer
    ################
    tokenizer = get_tokenizer(model_args, training_args)

    ##############
    # Load model #
    ##############
    logger.info("*** Loading model ***")
    model = get_model(model_args, training_args)

    # Get reward functions from the registry
    reward_funcs = get_reward_funcs(script_args)

    # Format into conversation
    def make_conversation(example, prompt_column: str = script_args.dataset_prompt_column):
        prompt = []

        if training_args.system_prompt is not None:
            prompt.append({"role": "system", "content": training_args.system_prompt})

        if prompt_column not in example:
            raise ValueError(f"Dataset Question Field Error: {prompt_column} is not supported.")

        prompt.append({"role": "user", "content": example[prompt_column]})
        return {"prompt": prompt}

    dataset = dataset.map(make_conversation)

    for split in dataset:
        if "messages" in dataset[split].column_names:
            dataset[split] = dataset[split].remove_columns("messages")

    #############################
    # Initialize the GRPO trainer
    #############################
    trainer = GRPOTrainer(
        model=model,
        reward_funcs=reward_funcs,
        args=training_args,
        train_dataset=dataset[script_args.dataset_train_split],
        eval_dataset=(dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None),
        peft_config=get_peft_config(model_args),
        callbacks=get_callbacks(training_args, model_args),
        processing_class=tokenizer,
    )

    ###############
    # Training loop
    ###############
    logger.info("*** Train ***")
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    metrics = train_result.metrics
    metrics["train_samples"] = len(dataset[script_args.dataset_train_split])
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    ##################################
    # Save model and create model card
    ##################################
    logger.info("*** Save model ***")
    # Align the model's generation config with the tokenizer's eos token
    # to avoid unbounded generation in the transformers `pipeline()` function
    trainer.model.generation_config.eos_token_id = tokenizer.eos_token_id
    trainer.save_model(training_args.output_dir)
    logger.info(f"Model saved to {training_args.output_dir}")

    # Save everything else on main process
    kwargs = {
        "dataset_name": script_args.dataset_name,
        "tags": ["open-r1"],
    }
    if trainer.accelerator.is_main_process:
        trainer.create_model_card(**kwargs)
        # Restore k,v cache for fast inference
        trainer.model.config.use_cache = True
        trainer.model.config.save_pretrained(training_args.output_dir)

    ##########
    # Evaluate
    ##########
    if training_args.do_eval:
        logger.info("*** Evaluate ***")
        metrics = trainer.evaluate()
        metrics["eval_samples"] = len(dataset[script_args.dataset_test_split])
        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    #############
    # push to hub
    #############
    if training_args.push_to_hub:
        logger.info("Pushing to hub...")
        trainer.push_to_hub(**kwargs)


if __name__ == "__main__":
    parser = TrlParser((GRPOScriptArguments, GRPOConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_and_config()
    main(script_args, training_args, model_args)


================================================
FILE: src/open_r1/rewards.py
================================================
# coding=utf-8
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Reward functions for GRPO training."""

import asyncio
import json
import math
import re
from functools import partial, update_wrapper
from typing import Callable, Dict, Literal, Optional

from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify

from .utils.code_providers import get_provider
from .utils.competitive_programming import (
    SubtaskResult,
    add_includes,
    get_morph_client_from_env,
    get_piston_client_from_env,
)
from .utils.competitive_programming import patch_code as cf_patch_code
from .utils.competitive_programming import score_submission as cf_score_submission
from .utils.competitive_programming import score_subtask


def accuracy_reward(completions: list[list[dict[str, str]]], solution: list[str], **kwargs) -> list[Optional[float]]:
    """Reward function that checks if the completion is the same as the ground truth."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        gold_parsed = parse(
            sol,
            extraction_mode="first_match",
        )
        if len(gold_parsed) != 0:
            # We require the answer to be provided in correct latex (no malformed operators)
            answer_parsed = parse(
                content,
                extraction_config=[
                    LatexExtractionConfig(
                        normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        # Ensures that boxed is tried first
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
                    )
                ],
                extraction_mode="first_match",
            )
            # Compute binary rewards if verifiable, `None` otherwise to skip this example
            try:
                reward = float(verify(gold_parsed, answer_parsed))
            except Exception as e:
                print(f"verify failed: {e}, answer: {answer_parsed}, gold: {gold_parsed}")
                reward = None
        else:
            # If the gold solution is not parseable, we assign `None` to skip this example
            reward = None
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)

    return rewards


def format_reward(completions, **kwargs):
    """Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]


def tag_count_reward(completions, **kwargs) -> list[float]:
    """Reward function that checks if we produce the desired number of think and answer tags associated with `format_reward()`.

    Adapted from: https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb#file-grpo_demo-py-L90
    """

    def count_tags(text: str) -> float:
        count = 0.0
        if text.count("<think>\n") == 1:
            count += 0.25
        if text.count("\n</think>\n") == 1:
            count += 0.25
        if text.count("\n<answer>\n") == 1:
            count += 0.25
        if text.count("\n</answer>") == 1:
            count += 0.25
        return count

    contents = [completion[0]["content"] for completion in completions]
    return [count_tags(c) for c in contents]


def reasoning_steps_reward(completions, **kwargs):
    r"""Reward function that checks for clear step-by-step reasoning.
    Regex pattern:
        Step \d+: - matches "Step 1:", "Step 2:", etc.
        ^\d+\. - matches numbered lists like "1.", "2.", etc. at start of line
        \n- - matches bullet points with hyphens
        \n\* - matches bullet points with asterisks
        First,|Second,|Next,|Finally, - matches transition words
    """
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    completion_contents = [completion[0]["con

Download .txt

gitextract_ohr6vcnk/

├── .github/
│   ├── dependabot.yml
│   └── workflows/
│       └── tests.yml
├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── recipes/
│   ├── DeepSeek-R1-Distill-Qwen-1.5B/
│   │   └── grpo/
│   │       └── config_demo.yaml
│   ├── OlympicCoder-32B/
│   │   └── sft/
│   │       └── config_v00.00.yaml
│   ├── OlympicCoder-7B/
│   │   └── sft/
│   │       └── config_v00.00.yaml
│   ├── OpenR1-Distill-7B/
│   │   └── sft/
│   │       └── config_distill.yaml
│   ├── Qwen2.5-1.5B-Instruct/
│   │   └── grpo/
│   │       ├── config_demo.yaml
│   │       ├── config_demo_code.yaml
│   │       └── config_demo_code_ioi.yaml
│   ├── Qwen2.5-Coder-7B-Instruct/
│   │   └── grpo/
│   │       └── config_codeforces.yaml
│   ├── README.md
│   ├── accelerate_configs/
│   │   ├── ddp.yaml
│   │   ├── fsdp.yaml
│   │   ├── zero2.yaml
│   │   └── zero3.yaml
│   └── dataset_filtering/
│       ├── config_demo.yaml
│       ├── filter_dapo.yaml
│       └── filter_python.yaml
├── scripts/
│   ├── benchmark_e2b.py
│   ├── decontaminate.py
│   ├── e2b_router.py
│   ├── generate_reasoning.py
│   ├── get_tensor_parallel_size.py
│   ├── morph_router.py
│   ├── pass_rate_filtering/
│   │   ├── README.md
│   │   ├── compute_pass_rate.py
│   │   └── launch_filtering.sh
│   ├── run_benchmarks.py
│   └── upload_details.py
├── setup.cfg
├── setup.py
├── slurm/
│   ├── README.md
│   ├── compute_pass_rate.slurm
│   ├── e2b_router.slurm
│   ├── evaluate.slurm
│   ├── experimental/
│   │   └── serve_r1_vllm.slurm
│   ├── generate.slurm
│   ├── morph_router.slurm
│   ├── piston/
│   │   ├── README.md
│   │   ├── launch_piston_workers.sh
│   │   └── launch_single_piston.sh
│   ├── serve_r1.slurm
│   ├── serve_router.slurm
│   └── train.slurm
├── src/
│   └── open_r1/
│       ├── __init__.py
│       ├── configs.py
│       ├── generate.py
│       ├── grpo.py
│       ├── rewards.py
│       ├── sft.py
│       └── utils/
│           ├── __init__.py
│           ├── callbacks.py
│           ├── code_providers.py
│           ├── competitive_programming/
│           │   ├── __init__.py
│           │   ├── cf_scoring.py
│           │   ├── code_patcher.py
│           │   ├── ioi_scoring.py
│           │   ├── ioi_utils.py
│           │   ├── morph_client.py
│           │   ├── piston_client.py
│           │   └── utils.py
│           ├── data.py
│           ├── evaluation.py
│           ├── hub.py
│           ├── import_utils.py
│           ├── model_utils.py
│           ├── routed_morph.py
│           ├── routed_sandbox.py
│           └── wandb_logging.py
└── tests/
    ├── __init__.py
    ├── slow/
    │   └── test_code_reward.py
    ├── test_rewards.py
    └── utils/
        └── test_data.py

Download .txt

SYMBOL INDEX (221 symbols across 35 files)

FILE: scripts/benchmark_e2b.py
  function benchmark_code_reward (line 33) | def benchmark_code_reward(example):

FILE: scripts/decontaminate.py
  function normalize_string (line 36) | def normalize_string(text: str) -> str:
  function word_ngrams (line 45) | def word_ngrams(text: str, n: int) -> list:
  function build_ngram_lookup (line 51) | def build_ngram_lookup(documents: list[str], ngram_size: int = 8) -> dic...
  function build_ngram_single (line 64) | def build_ngram_single(document: str, ngram_size: int = 8) -> set[str]:
  function find_contaminated (line 118) | def find_contaminated(row):
  function cleanup (line 127) | def cleanup(dataset: Dataset) -> Dataset:

FILE: scripts/e2b_router.py
  class BatchRequest (line 32) | class BatchRequest(BaseModel):
  class ScriptResult (line 47) | class ScriptResult(BaseModel):
  function create_app (line 66) | def create_app(args):
  function parse_args (line 139) | def parse_args():

FILE: scripts/generate_reasoning.py
  function generate_completion (line 21) | async def generate_completion(session, prompt, args):
  function process_example (line 45) | async def process_example(example, session, args, output_file, pbar):
  function load_processed_uuids (line 91) | async def load_processed_uuids(output_file, uuid_column):
  function main (line 104) | async def main():

FILE: scripts/get_tensor_parallel_size.py
  function get_tensor_parallel_size (line 5) | def get_tensor_parallel_size(model_name: str, revision: str = None, defa...

FILE: scripts/morph_router.py
  class BatchRequest (line 28) | class BatchRequest(BaseModel):
  class ScriptResult (line 43) | class ScriptResult(BaseModel):
  function create_app (line 59) | def create_app(args):
  function parse_args (line 143) | def parse_args():

FILE: scripts/pass_rate_filtering/compute_pass_rate.py
  class PassRateScriptArguments (line 38) | class PassRateScriptArguments(GRPOScriptArguments):
  function main (line 48) | def main(script_args, training_args, model_args):

FILE: scripts/run_benchmarks.py
  class ScriptArguments (line 23) | class ScriptArguments:
  function main (line 39) | def main():

FILE: scripts/upload_details.py
  class ScriptArguments (line 34) | class ScriptArguments:
  function main (line 40) | def main():

FILE: setup.py
  function deps_list (line 88) | def deps_list(*pkgs):

FILE: src/open_r1/configs.py
  class DatasetConfig (line 23) | class DatasetConfig:
  class DatasetMixtureConfig (line 34) | class DatasetMixtureConfig:
  class ScriptArguments (line 43) | class ScriptArguments(trl.ScriptArguments):
    method __post_init__ (line 78) | def __post_init__(self):
  class GRPOConfig (line 125) | class GRPOConfig(trl.GRPOConfig):
  class SFTConfig (line 170) | class SFTConfig(trl.SFTConfig):
  class GRPOScriptArguments (line 209) | class GRPOScriptArguments(ScriptArguments):

FILE: src/open_r1/generate.py
  function build_distilabel_pipeline (line 23) | def build_distilabel_pipeline(

FILE: src/open_r1/grpo.py
  function main (line 35) | def main(script_args, training_args, model_args):

FILE: src/open_r1/rewards.py
  function accuracy_reward (line 40) | def accuracy_reward(completions: list[list[dict[str, str]]], solution: l...
  function format_reward (line 85) | def format_reward(completions, **kwargs):
  function tag_count_reward (line 93) | def tag_count_reward(completions, **kwargs) -> list[float]:
  function reasoning_steps_reward (line 115) | def reasoning_steps_reward(completions, **kwargs):
  function len_reward (line 132) | def len_reward(completions: list[Dict[str, str]], solution: list[str], *...
  function get_cosine_scaled_reward (line 205) | def get_cosine_scaled_reward(
  function get_repetition_penalty_reward (line 285) | def get_repetition_penalty_reward(ngram_size: int, max_penalty: float, l...
  function _init_event_loop (line 357) | def _init_event_loop():
  function ioi_code_reward (line 367) | def ioi_code_reward(completions, test_batch_size: int = 1, provider_type...
  function cf_code_reward (line 420) | def cf_code_reward(
  function extract_code (line 476) | def extract_code(completion: str, language: str | None = "python") -> str:
  function binary_code_reward (line 485) | def binary_code_reward(
  function code_reward (line 511) | def code_reward(
  function get_code_format_reward (line 595) | def get_code_format_reward(language: str = "python"):
  function get_soft_overlong_punishment (line 620) | def get_soft_overlong_punishment(max_completion_len, soft_punish_cache):
  function get_reward_funcs (line 646) | def get_reward_funcs(script_args) -> list[Callable]:

FILE: src/open_r1/sft.py
  function main (line 55) | def main(script_args, training_args, model_args):

FILE: src/open_r1/utils/callbacks.py
  function is_slurm_available (line 28) | def is_slurm_available() -> bool:
  class DummyConfig (line 37) | class DummyConfig:
    method __init__ (line 38) | def __init__(self, **kwargs):
  class PushToHubRevisionCallback (line 43) | class PushToHubRevisionCallback(TrainerCallback):
    method __init__ (line 44) | def __init__(self, model_config) -> None:
    method on_save (line 47) | def on_save(
  function get_callbacks (line 85) | def get_callbacks(train_config, model_config) -> List[TrainerCallback]:

FILE: src/open_r1/utils/code_providers.py
  class CodeExecutionProvider (line 46) | class CodeExecutionProvider(abc.ABC):
    method execute_scripts (line 50) | def execute_scripts(self, scripts: List[str], languages: List[str]) ->...
  class E2BProvider (line 63) | class E2BProvider(CodeExecutionProvider):
    method __init__ (line 66) | def __init__(self, num_parallel: int = 2, e2b_router_url: Optional[str...
    method execute_scripts (line 82) | def execute_scripts(self, scripts: List[str], languages: List[str]) ->...
    method _run_async_from_sync (line 115) | def _run_async_from_sync(self, scripts: List[str], languages: List[str...
    method _run_async (line 125) | async def _run_async(self, scripts: List[str], languages: List[str], n...
    method _run_script (line 135) | async def _run_script(self, script: str, languages: List[str], semapho...
  class MorphProvider (line 169) | class MorphProvider(CodeExecutionProvider):
    method __init__ (line 172) | def __init__(self, num_parallel: int = 2, morph_router_url: Optional[s...
    method execute_scripts (line 211) | def execute_scripts(self, scripts: List[str], languages: List[str]) ->...
    method _run_async (line 253) | async def _run_async(self, scripts: List[str], languages: List[str], n...
    method _run_script (line 273) | async def _run_script(self, script: str, languages: List[str], semapho...
  function get_provider (line 339) | def get_provider(provider_type: str = "e2b", **kwargs) -> CodeExecutionP...

FILE: src/open_r1/utils/competitive_programming/cf_scoring.py
  function score_single_test_case (line 12) | async def score_single_test_case(
  function get_generated_contest_tests (line 59) | async def get_generated_contest_tests(contest_id: str) -> list[dict]:
  function get_generated_tests (line 89) | async def get_generated_tests(problem_id: str) -> list[dict]:
  function score_submission (line 94) | async def score_submission(

FILE: src/open_r1/utils/competitive_programming/code_patcher.py
  function fix_python3_imports (line 4) | def fix_python3_imports(source_code):
  function fix_cpp_includes (line 76) | def fix_cpp_includes(source_code):
  function is_patchable (line 85) | def is_patchable(lang):
  function patch_code (line 89) | def patch_code(text, lang):

FILE: src/open_r1/utils/competitive_programming/ioi_scoring.py
  class TestResult (line 11) | class TestResult:
  class SubtaskResult (line 29) | class SubtaskResult:
    method status (line 50) | def status(self):
    method score (line 62) | def score(self):
    method weighted_score (line 76) | def weighted_score(self):
    method to_dict (line 91) | def to_dict(self):
  function _extract_single_status (line 110) | def _extract_single_status(score: float, feedback: str) -> str:
  function score_single_test_case (line 138) | async def score_single_test_case(
  function score_subtask (line 164) | async def score_subtask(
  function score_subtasks (line 246) | async def score_subtasks(
  function run_submission (line 267) | async def run_submission(
  function execute_ioi (line 302) | async def execute_ioi(client, data) -> tuple[str, str]:

FILE: src/open_r1/utils/competitive_programming/ioi_utils.py
  function add_includes (line 7) | def add_includes(code: str, problem_id: str) -> str:
  function load_ioi_tests_for_year (line 26) | def load_ioi_tests_for_year(year: int) -> dict[str, dict[str, tuple[str,...
  function load_ioi_tests (line 37) | def load_ioi_tests(year: int, problem_id: str) -> dict[str, tuple[str, s...

FILE: src/open_r1/utils/competitive_programming/morph_client.py
  class MorphCloudError (line 26) | class MorphCloudError(Exception):
  class MorphCloudExecutionClient (line 30) | class MorphCloudExecutionClient:
    method __init__ (line 31) | def __init__(
    method _prepare_instance (line 49) | async def _prepare_instance(self, snapshot_id=None) -> Instance:
    method _prepare_files (line 82) | async def _prepare_files(self, data: Dict[str, Any], temp_dir: str) ->...
    method _upload_files (line 139) | async def _upload_files(self, instance: Instance, local_files: Dict[st...
    method _compile_code (line 166) | async def _compile_code(self, instance: Instance) -> InstanceExecRespo...
    method _run_tests (line 186) | async def _run_tests(self, instance: Instance, data: Dict[str, Any]) -...
    method _execute_with_instance (line 222) | async def _execute_with_instance(self, instance: Instance, data: Dict[...
    method _execute (line 250) | async def _execute(self, data: Dict[str, Any]) -> Tuple[str, str]:
    method execute (line 282) | async def execute(self, data: Dict[str, Any]) -> Tuple[str, str]:
    method _get_or_create_base_snapshot (line 337) | async def _get_or_create_base_snapshot(self):
    method _get_compile_script (line 408) | async def _get_compile_script(self):
    method _get_run_script (line 485) | async def _get_run_script(self):
  function get_morph_client_from_env (line 715) | def get_morph_client_from_env(session=None) -> MorphCloudExecutionClient:

FILE: src/open_r1/utils/competitive_programming/piston_client.py
  class PistonError (line 12) | class PistonError(Exception):
  function get_piston_client_from_env (line 17) | def get_piston_client_from_env(session=None):
  class PistonClient (line 36) | class PistonClient:
    method __init__ (line 59) | def __init__(
    method session (line 82) | def session(self):
    method _wait_for_endpoint (line 94) | async def _wait_for_endpoint(self):
    method _release_endpoint (line 98) | async def _release_endpoint(self, endpoint):
    method _send_request (line 101) | async def _send_request(self, endpoint, route, data=None, method="post"):
    method _send_to_all (line 107) | async def _send_to_all(self, route, data=None, method="post"):
    method _send_to_one (line 112) | async def _send_to_one(self, endpoint, route, data=None, method="post"):
    method install_package (line 115) | async def install_package(self, language, version):
    method uninstall_package (line 118) | async def uninstall_package(self, language, version):
    method get_supported_runtimes (line 121) | async def get_supported_runtimes(self):
    method _check_failed_endpoint (line 124) | async def _check_failed_endpoint(self, endpoint):
    method send_execute (line 137) | async def send_execute(self, data, language="cms_ioi", max_retries=5):
  function get_slurm_piston_endpoints (line 201) | def get_slurm_piston_endpoints():

FILE: src/open_r1/utils/competitive_programming/utils.py
  function batched (line 4) | def batched(iterable, n):

FILE: src/open_r1/utils/data.py
  function get_dataset (line 12) | def get_dataset(args: ScriptArguments) -> DatasetDict:

FILE: src/open_r1/utils/evaluation.py
  function register_lighteval_task (line 27) | def register_lighteval_task(
  function get_lighteval_tasks (line 62) | def get_lighteval_tasks():
  function run_lighteval_job (line 69) | def run_lighteval_job(
  function run_benchmark_jobs (line 106) | def run_benchmark_jobs(training_args: Union["SFTConfig", "GRPOConfig"], ...

FILE: src/open_r1/utils/hub.py
  function push_to_hub_revision (line 39) | def push_to_hub_revision(training_args: SFTConfig | GRPOConfig, extra_ig...
  function check_hub_revision_exists (line 70) | def check_hub_revision_exists(training_args: SFTConfig | GRPOConfig):
  function get_param_count_from_repo_id (line 89) | def get_param_count_from_repo_id(repo_id: str) -> int:
  function get_gpu_count_for_vllm (line 121) | def get_gpu_count_for_vllm(model_name: str, revision: str = "main", num_...

FILE: src/open_r1/utils/import_utils.py
  function is_e2b_available (line 22) | def is_e2b_available() -> bool:
  function is_morph_available (line 29) | def is_morph_available() -> bool:

FILE: src/open_r1/utils/model_utils.py
  function get_tokenizer (line 9) | def get_tokenizer(model_args: ModelConfig, training_args: SFTConfig | GR...
  function get_model (line 23) | def get_model(model_args: ModelConfig, training_args: SFTConfig | GRPOCo...

FILE: src/open_r1/utils/routed_morph.py
  class RoutedMorphSandbox (line 21) | class RoutedMorphSandbox:
    method __init__ (line 35) | def __init__(self, router_url: str, timeout: int = 300, request_timeou...
    method run_code (line 48) | def run_code(

FILE: src/open_r1/utils/routed_sandbox.py
  class RoutedSandbox (line 22) | class RoutedSandbox:
    method __init__ (line 32) | def __init__(self, router_url: str):
    method run_code (line 41) | def run_code(

FILE: src/open_r1/utils/wandb_logging.py
  function init_wandb_training (line 4) | def init_wandb_training(training_args):

FILE: tests/slow/test_code_reward.py
  class TestCodeRewards (line 26) | class TestCodeRewards(unittest.TestCase):
    method test_python_code_reward (line 27) | def test_python_code_reward(self):
    method test_e2b_router (line 38) | def test_e2b_router(self):
    method test_e2b_router_parallel (line 49) | def test_e2b_router_parallel(self):
    method test_ioi_code_reward (line 74) | def test_ioi_code_reward(self):
    method test_e2b_router_run_code_success (line 87) | def test_e2b_router_run_code_success(self):
    method test_e2b_router_run_code_with_error (line 105) | def test_e2b_router_run_code_with_error(self):
    method test_python_code_reward_morph (line 127) | def test_python_code_reward_morph(self):
    method test_morph_router (line 141) | def test_morph_router(self):
    method test_morph_router_parallel (line 156) | def test_morph_router_parallel(self):
    method test_morph_router_run_code_success (line 183) | def test_morph_router_run_code_success(self):
    method test_morph_router_run_code_with_error (line 200) | def test_morph_router_run_code_with_error(self):

FILE: tests/test_rewards.py
  class TestGetRewardFuncs (line 37) | class TestGetRewardFuncs(unittest.TestCase):
    method test_get_reward_funcs (line 38) | def test_get_reward_funcs(self):
  class TestRewards (line 78) | class TestRewards(unittest.TestCase):
    method test_accuracy_reward_correct_answer (line 79) | def test_accuracy_reward_correct_answer(self):
    method test_accuracy_reward_wrong_answer (line 86) | def test_accuracy_reward_wrong_answer(self):
    method test_accuracy_reward_wrong_answer_no_latex (line 93) | def test_accuracy_reward_wrong_answer_no_latex(self):
    method test_format_reward_correct (line 100) | def test_format_reward_correct(self):
    method test_format_reward_incorrect (line 106) | def test_format_reward_incorrect(self):
    method test_reasoning_steps_reward (line 121) | def test_reasoning_steps_reward(self):
    method test_multiple_completions (line 139) | def test_multiple_completions(self):
    method test_cosine_scaled_reward (line 152) | def test_cosine_scaled_reward(self):
    method test_format_reward_specific_multiline (line 200) | def test_format_reward_specific_multiline(self):
    method test_same_length_responses (line 207) | def test_same_length_responses(self):
    method test_different_lengths_correct_answers (line 218) | def test_different_lengths_correct_answers(self):
    method test_different_lengths_incorrect_answers (line 230) | def test_different_lengths_incorrect_answers(self):
    method test_mixed_correctness (line 243) | def test_mixed_correctness(self):
    method test_unparseable_solution (line 270) | def test_unparseable_solution(self):
  class TestRepetitionPenaltyReward (line 283) | class TestRepetitionPenaltyReward(unittest.TestCase):
    method test_positive_max_penalty_raises_value_error (line 284) | def test_positive_max_penalty_raises_value_error(self):
    method test_no_repetition (line 290) | def test_no_repetition(self):
    method test_full_repetition (line 296) | def test_full_repetition(self):
    method test_partial_repetition (line 304) | def test_partial_repetition(self):
    method test_multiple_completions (line 313) | def test_multiple_completions(self):
    method test_empty_completion (line 326) | def test_empty_completion(self):
    method test_different_ngram_size (line 332) | def test_different_ngram_size(self):
    method test_mixed_case (line 339) | def test_mixed_case(self):
    method test_one_word_completion (line 350) | def test_one_word_completion(self):
    method test_two_word_completion (line 357) | def test_two_word_completion(self):
    method test_three_word_completion (line 364) | def test_three_word_completion(self):
    method test_three_word_repetition_completion (line 371) | def test_three_word_repetition_completion(self):
    method test_four_word_completion_with_repetition (line 378) | def test_four_word_completion_with_repetition(self):
    method test_five_word_completion_with_repetition (line 386) | def test_five_word_completion_with_repetition(self):
    method test_six_word_completion_with_repetition (line 394) | def test_six_word_completion_with_repetition(self):
    method test_long_completion_with_repetition (line 401) | def test_long_completion_with_repetition(self):
    method test_long_completion_without_repetition (line 407) | def test_long_completion_without_repetition(self):
    method test_tag_count_rewards_all_correct (line 414) | def test_tag_count_rewards_all_correct(self):
    method test_tag_count_rewards_missing_think_begin (line 420) | def test_tag_count_rewards_missing_think_begin(self):
    method test_tag_count_rewards_missing_think_end (line 426) | def test_tag_count_rewards_missing_think_end(self):
    method test_tag_count_rewards_missing_answer_begin (line 432) | def test_tag_count_rewards_missing_answer_begin(self):
    method test_tag_count_rewards_missing_answer_end (line 438) | def test_tag_count_rewards_missing_answer_end(self):
    method test_tag_count_rewards_missing_all_tags (line 444) | def test_tag_count_rewards_missing_all_tags(self):
    method test_full_repetition_with_language (line 450) | def test_full_repetition_with_language(self):
    method test_soft_overlong_punishment_short_completion (line 461) | def test_soft_overlong_punishment_short_completion(self):
    method test_soft_overlong_punishment_long_completion (line 469) | def test_soft_overlong_punishment_long_completion(self):
    method test_soft_overlong_punishment_intermediate_completion (line 477) | def test_soft_overlong_punishment_intermediate_completion(self):
  class TestCodeFormat (line 485) | class TestCodeFormat(unittest.TestCase):
    method test_correct_python_format (line 486) | def test_correct_python_format(self):
    method test_incorrect_formats (line 499) | def test_incorrect_formats(self):
    method test_multiple_code_blocks (line 520) | def test_multiple_code_blocks(self):
    method test_different_languages (line 533) | def test_different_languages(self):
    method test_multiline_code (line 553) | def test_multiline_code(self):

FILE: tests/utils/test_data.py
  class TestGetDataset (line 23) | class TestGetDataset(unittest.TestCase):
    method setUpClass (line 25) | def setUpClass(cls):
    method test_dataset_and_config_name (line 30) | def test_dataset_and_config_name(self):
    method test_unweighted_mixture (line 37) | def test_unweighted_mixture(self):
    method test_weighted_mixture (line 52) | def test_weighted_mixture(self):
    method test_mixture_and_test_split (line 69) | def test_mixture_and_test_split(self):
    method test_mixture_column_selection (line 85) | def test_mixture_column_selection(self):
    method test_mixture_with_mismatched_columns (line 106) | def test_mixture_with_mismatched_columns(self):
    method test_no_dataset_name_or_mixture (line 122) | def test_no_dataset_name_or_mixture(self):

Download .json

Condensed preview — 77 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (381K chars).

[
  {
    "path": ".github/dependabot.yml",
    "chars": 205,
    "preview": "version: 2\nupdates:\n  - package-ecosystem: \"pip\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n  - package-e"
  },
  {
    "path": ".github/workflows/tests.yml",
    "chars": 667,
    "preview": "name: Tests\n\non:\n  push:\n    branches:\n      - main\n      - v*-release\n  pull_request:\n    branches:\n      - main\n\njobs:"
  },
  {
    "path": ".gitignore",
    "chars": 3498,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": "LICENSE",
    "chars": 11357,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "Makefile",
    "chars": 1687,
    "preview": ".PHONY: style quality\n\n# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes"
  },
  {
    "path": "README.md",
    "chars": 34945,
    "preview": "# Open R1\n\n*A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!*\n\n**Tabl"
  },
  {
    "path": "recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml",
    "chars": 3713,
    "preview": "# Model arguments\nmodel_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\nmodel_revision: main\ntorch_dtype: bfloat"
  },
  {
    "path": "recipes/OlympicCoder-32B/sft/config_v00.00.yaml",
    "chars": 1346,
    "preview": "# Config for 16 nodes of 8 H100s with FSDP1\n# Model arguments\nmodel_name_or_path: Qwen/Qwen2.5-Coder-32B-Instruct\nmodel_"
  },
  {
    "path": "recipes/OlympicCoder-7B/sft/config_v00.00.yaml",
    "chars": 1070,
    "preview": "# Config for 1 node of 8 H100s with DeepSpeed ZeRO-3\n# Model arguments\nmodel_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruc"
  },
  {
    "path": "recipes/OpenR1-Distill-7B/sft/config_distill.yaml",
    "chars": 6039,
    "preview": "# Config for 1 node of 8 x H100s (80GB)\n# Model arguments\nmodel_name_or_path: open-r1/Qwen2.5-Math-7B-RoPE-300k\nmodel_re"
  },
  {
    "path": "recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml",
    "chars": 1359,
    "preview": "# Model arguments\nmodel_name_or_path: Qwen/Qwen2.5-1.5B-Instruct\nmodel_revision: main\ntorch_dtype: bfloat16\nattn_impleme"
  },
  {
    "path": "recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml",
    "chars": 1440,
    "preview": "# Model arguments\nmodel_name_or_path: Qwen/Qwen2.5-1.5B-Instruct\nmodel_revision: main\ntorch_dtype: bfloat16\nattn_impleme"
  },
  {
    "path": "recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml",
    "chars": 1758,
    "preview": "# Model arguments\nmodel_name_or_path: Qwen/Qwen2.5-1.5B-Instruct\nmodel_revision: main\ntorch_dtype: bfloat16\nattn_impleme"
  },
  {
    "path": "recipes/Qwen2.5-Coder-7B-Instruct/grpo/config_codeforces.yaml",
    "chars": 2531,
    "preview": "# Model arguments\nmodel_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct\nmodel_revision: main\ntorch_dtype: bfloat16\nattn_imp"
  },
  {
    "path": "recipes/README.md",
    "chars": 641,
    "preview": "# Post-training recipes\n\n## OpenR1 Distill 7B\n\nTo train the OpenR1 Distill 7B model, run:\n\n```\nsbatch --nodes=1 slurm/tr"
  },
  {
    "path": "recipes/accelerate_configs/ddp.yaml",
    "chars": 319,
    "preview": "compute_environment: LOCAL_MACHINE\ndebug: false\ndistributed_type: MULTI_GPU\ndowncast_bf16: 'no'\ngpu_ids: all\nmachine_ran"
  },
  {
    "path": "recipes/accelerate_configs/fsdp.yaml",
    "chars": 774,
    "preview": "compute_environment: LOCAL_MACHINE\ndebug: false\ndistributed_type: FSDP\ndowncast_bf16: 'no'\nenable_cpu_affinity: false\nfs"
  },
  {
    "path": "recipes/accelerate_configs/zero2.yaml",
    "chars": 467,
    "preview": "compute_environment: LOCAL_MACHINE\ndebug: false\ndeepspeed_config:\n  deepspeed_multinode_launcher: standard\n  offload_opt"
  },
  {
    "path": "recipes/accelerate_configs/zero3.yaml",
    "chars": 498,
    "preview": "compute_environment: LOCAL_MACHINE\ndebug: false\ndeepspeed_config:\n  deepspeed_multinode_launcher: standard\n  offload_opt"
  },
  {
    "path": "recipes/dataset_filtering/config_demo.yaml",
    "chars": 3118,
    "preview": "# Model arguments\nmodel_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\nmodel_revision: main\ntorch_dtype: bfloat"
  },
  {
    "path": "recipes/dataset_filtering/filter_dapo.yaml",
    "chars": 889,
    "preview": "# Model arguments\nmodel_name_or_path: open-r1/R1-Distill-Qwen-Math-7B\nmodel_revision: v03.00-step-000008190\ntorch_dtype:"
  },
  {
    "path": "recipes/dataset_filtering/filter_python.yaml",
    "chars": 899,
    "preview": "# Model arguments\nmodel_name_or_path: open-r1/R1-Distill-Qwen-Math-7B-Merges\nmodel_revision: v00.00-step-000003660_v01.0"
  },
  {
    "path": "scripts/benchmark_e2b.py",
    "chars": 3527,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "scripts/decontaminate.py",
    "chars": 5994,
    "preview": "#!/usr/bin/env python\n# coding=utf-8\n# Copyright 2025 The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under"
  },
  {
    "path": "scripts/e2b_router.py",
    "chars": 6581,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "scripts/generate_reasoning.py",
    "chars": 6156,
    "preview": "import argparse\nimport asyncio\nimport hashlib\nimport json\nimport os\nimport random\nfrom asyncio import Lock\nfrom typing i"
  },
  {
    "path": "scripts/get_tensor_parallel_size.py",
    "chars": 1175,
    "preview": "import argparse\nfrom transformers import AutoConfig\nfrom math import gcd\n\ndef get_tensor_parallel_size(model_name: str, "
  },
  {
    "path": "scripts/morph_router.py",
    "chars": 6653,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "scripts/pass_rate_filtering/README.md",
    "chars": 1315,
    "preview": "# Pass rate filtering\n\nWe provide support to filter datasets by generating and computing pass rate on veriable tasks\n\nSe"
  },
  {
    "path": "scripts/pass_rate_filtering/compute_pass_rate.py",
    "chars": 8336,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "scripts/pass_rate_filtering/launch_filtering.sh",
    "chars": 383,
    "preview": "\n\n# a bash foor loop from 0 to 17,400 in chunks of 200\n\nfor i in {0..17000..200}\ndo\n  START=$i\n  END=$((i + 200))\n  echo"
  },
  {
    "path": "scripts/run_benchmarks.py",
    "chars": 2307,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "scripts/upload_details.py",
    "chars": 1749,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under the Apache License, V"
  },
  {
    "path": "setup.cfg",
    "chars": 696,
    "preview": "[isort]\ndefault_section = FIRSTPARTY\nensure_newline_before_comments = True\nforce_grid_wrap = 0\ninclude_trailing_comma = "
  },
  {
    "path": "setup.py",
    "chars": 5365,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "slurm/README.md",
    "chars": 938,
    "preview": "## Serving DeepSeek-R1 on 2x8 H100 SLURM nodes with SGLang \n\n1. Set up the environment (adjust for your cuda version):\n`"
  },
  {
    "path": "slurm/compute_pass_rate.slurm",
    "chars": 540,
    "preview": "#!/bin/bash\n\n#SBATCH --job-name=open-r1-compute-pass-rate\n#SBATCH --partition=hopper-prod\n#SBATCH --qos=normal\n#SBATCH -"
  },
  {
    "path": "slurm/e2b_router.slurm",
    "chars": 354,
    "preview": "#!/bin/bash\n\n#SBATCH --partition=hopper-cpu\n#SBATCH --mem=16g\n#SBATCH --cpus-per-task=16\n#SBATCH --output=/fsx/open-r1/l"
  },
  {
    "path": "slurm/evaluate.slurm",
    "chars": 3169,
    "preview": "#!/bin/bash\n#SBATCH --ntasks-per-node=1\n#SBATCH --gres=gpu:8\n#SBATCH --partition=hopper-prod\n#SBATCH --output=./logs/%x-"
  },
  {
    "path": "slurm/experimental/serve_r1_vllm.slurm",
    "chars": 3776,
    "preview": "#!/bin/bash\n#SBATCH --job-name=r1-vllm\n#SBATCH --partition=hopper-prod\n#SBATCH --qos=normal\n#SBATCH --nodes=4\n#SBATCH --"
  },
  {
    "path": "slurm/generate.slurm",
    "chars": 6821,
    "preview": "#!/bin/bash\n#SBATCH --job-name=deepseek-r1-generation\n#SBATCH --partition=hopper-prod\n#SBATCH --qos=normal\n#SBATCH --nod"
  },
  {
    "path": "slurm/morph_router.slurm",
    "chars": 395,
    "preview": "#!/bin/bash\n\n#SBATCH --partition=hopper-cpu\n#SBATCH --mem=16g\n#SBATCH --cpus-per-task=16\n#SBATCH --output=/fsx/open-r1/l"
  },
  {
    "path": "slurm/piston/README.md",
    "chars": 3821,
    "preview": "# Piston workers (slurm)\n\nWe have built a [piston](https://github.com/engineer-man/piston) package to run IOI problems.\n"
  },
  {
    "path": "slurm/piston/launch_piston_workers.sh",
    "chars": 607,
    "preview": "#!/bin/bash\n\n# this simple script will launch a bunch of piston workers on the HF science cluster\n\nN_INSTANCES=${1:-5}  "
  },
  {
    "path": "slurm/piston/launch_single_piston.sh",
    "chars": 1678,
    "preview": "#!/bin/bash\n#SBATCH --job-name=piston_worker\n#SBATCH --output=/fsx/open-r1/logs/piston/worker-logs/%x-%j.out\n#SBATCH --e"
  },
  {
    "path": "slurm/serve_r1.slurm",
    "chars": 3479,
    "preview": "#!/bin/bash\n#SBATCH --job-name=r1-server\n#SBATCH --partition=hopper-prod\n#SBATCH --qos=normal\n#SBATCH --nodes=2\n#SBATCH "
  },
  {
    "path": "slurm/serve_router.slurm",
    "chars": 1187,
    "preview": "#!/bin/bash\n#SBATCH --job-name=r1-router\n#SBATCH --partition=hopper-cpu\n#SBATCH --qos=high\n#SBATCH --nodes=1\n#SBATCH --c"
  },
  {
    "path": "slurm/train.slurm",
    "chars": 5287,
    "preview": "#!/bin/bash\n#SBATCH --job-name=open_r1\n#SBATCH --ntasks-per-node=1\n#SBATCH --exclusive\n#SBATCH --gres=gpu:8\n#SBATCH --pa"
  },
  {
    "path": "src/open_r1/__init__.py",
    "chars": 606,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "src/open_r1/configs.py",
    "chars": 12665,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "src/open_r1/generate.py",
    "chars": 6198,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "src/open_r1/grpo.py",
    "chars": 6633,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "src/open_r1/rewards.py",
    "chars": 26219,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "src/open_r1/sft.py",
    "chars": 6130,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "src/open_r1/utils/__init__.py",
    "chars": 243,
    "preview": "from .data import get_dataset\nfrom .import_utils import is_e2b_available, is_morph_available\nfrom .model_utils import ge"
  },
  {
    "path": "src/open_r1/utils/callbacks.py",
    "chars": 3187,
    "preview": "#!/usr/bin/env python\n# coding=utf-8\n# Copyright 2025 The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under"
  },
  {
    "path": "src/open_r1/utils/code_providers.py",
    "chars": 12997,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "src/open_r1/utils/competitive_programming/__init__.py",
    "chars": 561,
    "preview": "from .cf_scoring import score_submission\nfrom .code_patcher import patch_code\nfrom .ioi_scoring import SubtaskResult, sc"
  },
  {
    "path": "src/open_r1/utils/competitive_programming/cf_scoring.py",
    "chars": 5954,
    "preview": "import asyncio\nimport os\nfrom io import BytesIO\nfrom typing import Literal\n\nfrom async_lru import alru_cache\n\nfrom .pist"
  },
  {
    "path": "src/open_r1/utils/competitive_programming/code_patcher.py",
    "chars": 4396,
    "preview": "import re\n\n\ndef fix_python3_imports(source_code):\n    \"\"\"\n    Fix common import and function changes between Python 3 ve"
  },
  {
    "path": "src/open_r1/utils/competitive_programming/ioi_scoring.py",
    "chars": 11765,
    "preview": "import asyncio\nfrom dataclasses import asdict, dataclass, field\nfrom typing import Union\n\nfrom .ioi_utils import load_io"
  },
  {
    "path": "src/open_r1/utils/competitive_programming/ioi_utils.py",
    "chars": 1392,
    "preview": "from collections import defaultdict\nfrom functools import lru_cache\n\nfrom datasets import load_dataset\n\n\ndef add_include"
  },
  {
    "path": "src/open_r1/utils/competitive_programming/morph_client.py",
    "chars": 26492,
    "preview": "import asyncio\nimport json\nimport logging\nimport os\nimport tempfile\nfrom typing import Any, Dict, Optional, Tuple\n\nfrom "
  },
  {
    "path": "src/open_r1/utils/competitive_programming/piston_client.py",
    "chars": 9657,
    "preview": "import asyncio\nimport os\nimport random\nimport re\nimport subprocess\nfrom collections import Counter\nfrom functools import"
  },
  {
    "path": "src/open_r1/utils/competitive_programming/utils.py",
    "chars": 293,
    "preview": "from itertools import islice\n\n\ndef batched(iterable, n):\n    \"Batch data into lists of length n. The last batch may be s"
  },
  {
    "path": "src/open_r1/utils/data.py",
    "chars": 2661,
    "preview": "import logging\n\nimport datasets\nfrom datasets import DatasetDict, concatenate_datasets\n\nfrom ..configs import ScriptArgu"
  },
  {
    "path": "src/open_r1/utils/evaluation.py",
    "chars": 4654,
    "preview": "import subprocess\nfrom typing import TYPE_CHECKING, Dict, Union\n\nfrom .hub import get_gpu_count_for_vllm, get_param_coun"
  },
  {
    "path": "src/open_r1/utils/hub.py",
    "chars": 5478,
    "preview": "#!/usr/bin/env python\n# coding=utf-8\n# Copyright 2025 The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under"
  },
  {
    "path": "src/open_r1/utils/import_utils.py",
    "chars": 948,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "src/open_r1/utils/model_utils.py",
    "chars": 1598,
    "preview": "import torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizer\n\nfrom trl import ModelCon"
  },
  {
    "path": "src/open_r1/utils/routed_morph.py",
    "chars": 4500,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "src/open_r1/utils/routed_sandbox.py",
    "chars": 4363,
    "preview": "# coding=utf-8\n# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Versio"
  },
  {
    "path": "src/open_r1/utils/wandb_logging.py",
    "chars": 480,
    "preview": "import os\n\n\ndef init_wandb_training(training_args):\n    \"\"\"\n    Helper function for setting up Weights & Biases logging "
  },
  {
    "path": "tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/slow/test_code_reward.py",
    "chars": 9386,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "tests/test_rewards.py",
    "chars": 23741,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "tests/utils/test_data.py",
    "chars": 5641,
    "preview": "# Copyright 2025 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  }
]

About this extraction

This page contains the full source code of the huggingface/open-r1 GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 77 files (351.9 KB), approximately 89.1k tokens, and a symbol index with 221 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo