Repository: lmarena/p2l
Branch: main
Commit: a905fa5ea94a
Files: 59
Total size: 268.0 KB

Directory structure:
gitextract_yhm008o_/
├── .gitignore
├── README.md
├── deepspeed/
│   └── zero1.json
├── fast_lambda_setup.sh
├── fast_runpod_setup.sh
├── p2l/
│   ├── auto_eval_utils.py
│   ├── auto_evals.py
│   ├── dataset.py
│   ├── endpoint.py
│   ├── eval.py
│   ├── model.py
│   └── train.py
├── probe_barrier.py
├── route/
│   ├── chat.py
│   ├── cost_optimizers.py
│   ├── datatypes.py
│   ├── example_config.yaml
│   ├── openai_server.py
│   ├── requirements.txt
│   ├── routers.py
│   └── utils.py
├── serve_requirements.txt
├── train_requirements.txt
└── training_configs/
    ├── Llama3.1-8B-full-train.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.016-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.032-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.06-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.112-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.2-04302025.yaml
    ├── Qwen2.5-1.5B-bag-full-train-02222025.yaml
    ├── Qwen2.5-1.5B-full-train.yaml
    ├── Qwen2.5-1.5B-rk-full-train-half-batch.yaml
    ├── Qwen2.5-1.5B-rk-full-train.yaml
    ├── Qwen2.5-3B-bag-full-train-02222025.yaml
    ├── Qwen2.5-3B-bag-full-train-02242025.yaml
    ├── Qwen2.5-3B-freeze-test-part-2.yaml
    ├── Qwen2.5-3B-freeze-test.yaml
    ├── Qwen2.5-3B-full-train-double-batch.yaml
    ├── Qwen2.5-3B-full-train.yaml
    ├── Qwen2.5-3B-rk-full-train-half-batch.yaml
    ├── Qwen2.5-3B-rk-full-train.yaml
    ├── Qwen2.5-3B-training-bt_data_11092024 copy.yaml
    ├── Qwen2.5-7B-bag-full-train-02222025.yaml
    ├── Qwen2.5-7B-bag-full-train-02242025.yaml
    ├── Qwen2.5-7B-bag-full-train-03132025.yaml
    ├── Qwen2.5-7B-bag-full-train-chrono.yaml
    ├── Qwen2.5-7B-bt-full-train-02222025.yaml
    ├── Qwen2.5-7B-full-train.yaml
    ├── Qwen2.5-7B-rk-full-train-abs.yaml
    ├── Qwen2.5-7B-rk-full-train-half-batch.yaml
    ├── Qwen2.5-7B-rk-full-train.yaml
    ├── debug.yaml
    ├── init_debug_qwen_1.5b_he.yaml
    ├── init_debug_qwen_1.5b_reset_params.yaml
    ├── init_debug_qwen_1.5b_xavier.yaml
    ├── init_debug_qwen_3b_he.yaml
    ├── init_debug_qwen_3b_reset_params.yaml
    ├── init_debug_qwen_3b_xavier.yaml
    └── qwen_1.5B_geom_test.yaml

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
__pycache__/

================================================
FILE: README.md
================================================
# Prompt-to-Leaderboard (P2L)

This is the codebase for the paper [Prompt-to-Leaderboard](https://arxiv.org/pdf/2502.14855). Model weights can be found in our [LMArena HF Collection](https://huggingface.co/collections/lmarena-ai/prompt-to-leaderboard-67bcf7ddf6022ef3cfd260cc). Try it on Chatbot Arena under the [Prompt-to-Leaderboard](https://lmarena.ai/?p2l) tab!

## Abstract

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt or set of prompts. The core idea is to train an LLM that takes natural language prompts as input and outputs a vector of Bradley-Terry coefficients, which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot in the Chatbot Arena leaderboard.
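As a point of reference for how these coefficients are used (mirroring the pairwise Bradley-Terry probability helpers in `p2l/auto_eval_utils.py`): given a prompt $z$, P2L outputs a coefficient vector $\beta(z) \in \mathbb{R}^M$ with one entry per model in its model list, and the predicted probability that model $i$ is preferred over model $j$ on that prompt is

$$
P(i \succ j \mid z) = \sigma\big(\beta_i(z) - \beta_j(z)\big) = \frac{1}{1 + e^{-(\beta_i(z) - \beta_j(z))}}.
$$

Head variants such as `rk` and `bag` extend this with explicit tie (and tie-both-bad) terms.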
## Table of Contents

- [P2L](#p2l)
  - [Abstract](#abstract)
  - [Table of Contents](#table-of-contents)
  - [Environment Setup](#environment-setup)
    - [Installing `uv`](#installing-uv)
    - [Serving P2L Setup](#serving-p2l-setup)
    - [Serving a Router Setup](#serving-a-router-setup)
    - [Training Setup](#training-setup)
  - [Serving P2L](#serving-p2l)
  - [Serving an OpenAI Compatible Router](#serving-an-openai-compatible-router)
    - [Example: serving a Bradley-Terry based cost-optimal router](#example-serving-a-bradley-terry-based-cost-optimal-router)
    - [Example: serving a Grounded RK based simple cost router](#example-serving-a-grounded-rk-based-simple-cost-router)
  - [Calling the OpenAI Compatible Router](#calling-the-openai-compatible-router)
  - [Training a P2L Model](#training-a-p2l-model)
  - [Inferencing a P2L Model](#inferencing-a-p2l-model)
  - [AutoEval Suite](#autoeval-suite)
    - [Params](#params)
  - [Citation](#citation)

## Environment Setup

Setup instructions are shown using `uv`; however, any package management system will work. All environments are native to Python 3.10; other versions are untested but may also work.

### Installing `uv`

If you like the sound of ~50x faster environment setup times, run the following to install `uv`:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

To create a Python virtual environment, run:

```bash
uv venv .env --python 3.10
```

To activate said environment, run:

```bash
source .env/bin/activate
```

### Serving P2L Setup

To serve a P2L model, first run:

```bash
uv pip install -r serve_requirements.txt
```

### Serving a Router Setup

To serve an OpenAI compatible router, first run:

```bash
uv pip install -r route/requirements.txt
```

### Training Setup

To train a P2L model, first run:

```bash
uv pip install -r train_requirements.txt
```

## Serving P2L

Before getting started, make sure you have followed the steps in [Serving P2L Setup](#serving-p2l-setup). `python -m p2l.endpoint` accepts the following arguments:

| Option | Short Flag | Description |
|--------|-----------|-------------|
| `--help` | `-h` | Show this help message and exit. |
| `--model-path MODEL_PATH` | `-m MODEL_PATH` | Path to the model repository. |
| `--model-type MODEL_TYPE` | `-mt MODEL_TYPE` | Type of the model. |
| `--head-type HEAD_TYPE` | `-ht HEAD_TYPE` | Type of model head. |
| `--loss-type LOSS_TYPE` | `-lt LOSS_TYPE` | Type of the loss function. |
| `--api-key API_KEY` | `-a API_KEY` | API key for authorization. |
| `--host HOST` | `-H HOST` | Host to run the server on. |
| `--port PORT` | `-p PORT` | Port to run the server on. |
| `--reload, --no-reload` | - | Whether to reload the endpoint on detected code changes (requires workers to be set to 1). |
| `--workers WORKERS` | - | Number of endpoint workers (each will hold a model instance). |
| `--cuda, --no-cuda` | - | Flag to enable using a GPU to host the model. Flag is true by default. |
For example, to run lmarena-ai/p2l-7b-grk-02222025, a Qwen2 based "grk" model with head type `rk`, we would run:

```bash
python -m p2l.endpoint --model-path lmarena-ai/p2l-7b-grk-02222025 --model-type qwen2 --head-type rk --api-key <API_KEY>
```

By default, this hosts the model with 1 worker on host 0.0.0.0 and port 10250. Reload is enabled by default, meaning code changes will reload the endpoint. Note that by default the endpoint expects to load the model onto a GPU; by specifying `--no-cuda` you can run it on CPU only, which may work for smaller P2L models.

Each P2L model has an associated model list, which specifies which model each index of the output coefficients corresponds to. Below is an example function to get this model list from the hosted endpoint:

```python
import json
from typing import Dict, List

import requests


def get_p2l_endpoint_models(base_url: str, api_key: str) -> List[str]:
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }
    try:
        response = requests.get(f"{base_url}/models", headers=headers)
        response.raise_for_status()
        result = response.json()
        return result["models"]
    except Exception as err:
        print(f"An error occurred: {err}")
```

Below is an example Python function to query the P2L endpoint:

```python
def query_p2l_endpoint(
    prompt: list[str], base_url: str, api_key: str
) -> Dict[str, List]:
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }
    payload = {"prompt": prompt}
    try:
        response = requests.post(
            f"{base_url}/predict", headers=headers, data=json.dumps(payload)
        )
        response.raise_for_status()
        result = response.json()
        return result
    except Exception as err:
        raise err
```

Note that the input is a list of strings. This is NOT for a batch of prompts, but rather for each turn in a conversation. For example, given a 2-turn conversation:

```
User: "hi!"
Assistant: "Hello"
User: "what's 1+1?"
```

The correct P2L input would be:

```python
["hi!", "what's 1+1?"]
```
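Putting the two helpers together, you can sort the returned coefficients against the model list to get a per-prompt leaderboard (a higher coefficient means the model is predicted to be stronger on that prompt). Below is a minimal sketch; it assumes the `/predict` response exposes the coefficient vector under a `coefs` key aligned with the model list, which is an assumption, so check your endpoint's actual response shape.

```python
def leaderboard_for_prompt(
    prompt: list[str], base_url: str, api_key: str
) -> list[tuple[str, float]]:
    """Rank models for a single conversation using the two helpers above."""
    models = get_p2l_endpoint_models(base_url, api_key)
    result = query_p2l_endpoint(prompt, base_url, api_key)
    # Assumption: one coefficient per model, in model-list order, under a "coefs" key.
    coefs = result["coefs"]
    return sorted(zip(models, coefs), key=lambda pair: pair[1], reverse=True)


# Hypothetical usage: print the top 5 models for a single-turn prompt.
# for name, beta in leaderboard_for_prompt(["what's 1+1?"], "http://0.0.0.0:10250", "<API_KEY>")[:5]:
#     print(f"{name}: {beta:.3f}")
```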
## Serving an OpenAI Compatible Router

Serve an OpenAI compatible router with `python -m route.openai_server`. The available arguments are shown below.

| Option | Short Flag | Description |
|--------|-----------|-------------|
| `--help` | `-h` | Show this help message and exit. |
| `--config CONFIG` | `-c CONFIG` | Path to the configuration file. |
| `--router-type ROUTER_TYPE` | - | Type of the router to use. Available types are `bt-endpoint` and `grk-endpoint`. |
| `--router-model-name ROUTER_MODEL_NAME` | - | Name of the router model. |
| `--router-model-endpoint ROUTER_MODEL_ENDPOINT` | - | Endpoint URL for the router model. |
| `--router-api-key ROUTER_API_KEY` | - | API key for the router authentication. |
| `--cost-optimizer COST_OPTIMIZER` | - | Enable or configure cost optimization settings. Available types are `optimal-lp`, `simple-lp`, `strict`. |
| `--port PORT` | `-p PORT` | Port to run the server on. |
| `--host HOST` | - | Host to run the server on. |
| `--api-key API_KEY` | - | API key for authorization. |
| `--reload, --no-reload` | - | Whether to reload the endpoint on detected code changes (requires workers to be set to 1). |
| `--workers WORKERS` | - | Number of endpoint workers (each will hold a model instance). |

### Example: serving a Bradley-Terry based cost-optimal router

First, similar to [above](#serving-p2l), we need to start serving a P2L model, this time Bradley-Terry based. To do this, let's run:

```bash
python -m p2l.endpoint --model-path lmarena-ai/p2l-7b-bt-01132025 --model-type qwen2 --head-type bt --api-key <API_KEY>
```

Now, we need to configure a routing config file. This will specify the available models and inference details for the router. For example, here is an example configuration that specifies Claude-3.5-Sonnet and GPT-4o:

```yaml
model_configs:
  claude-3-5-sonnet-20241022:
    api_key:
    base_url: null
    cost: 9.3110239362
    max_tokens: 8192
    name: claude-3-5-sonnet-20241022
    system_prompt: null
    temp: 0.7
    top_p: 0.7
    type: anthropic
  gpt-4o-2024-05-13:
    api_key:
    base_url: null
    cost: 12.3166873868
    name: gpt-4o-2024-05-13
    system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2'
    temp: 0.7
    top_p: 1.0
    type: openai
```

Notice how the system prompt, temperature, and top_p are defined. These replicate how the models are served on Chatbot Arena. P2L is trained with the expectation that the models are running on this configuration. Therefore, for the most reliable results, we recommend sticking to the configs shown in [`example_config.yaml`](./route/example_config.yaml), though alternatives should still function well.

Additionally, we allow for adjustment of the `cost` parameter. One natural choice is just cost per output token; however, more accurate cost estimates are better. For example, the costs in [`example_config.yaml`](./route/example_config.yaml) are calculated to be proportional to the formula `cost_per_output_token * average_output_tokens_per_response`.

Now, let's assume we put the above config content into `config.yaml`. To start the OpenAI compatible router, we would run:

```bash
python -m route.openai_server --config config.yaml --router-type bt-endpoint --router-model-endpoint http://0.0.0.0:10250 --router-api-key <ROUTER_API_KEY> --cost-optimizer optimal-lp --api-key <API_KEY>
```

Let's break down what this command means:

- `--router-type bt-endpoint`: we are using a Bradley-Terry based P2L model hosted on an endpoint.
- `--router-model-endpoint http://0.0.0.0:10250`: this is where the P2L model endpoint is; this default address will generally be correct if the routing server runs on the same machine as the P2L endpoint.
- `--cost-optimizer optimal-lp`: we are using cost routing via the optimal linear program detailed in Theorem 1 of the paper.

>**Note**: `optimal-lp` is only compatible with BT models, and `simple-lp` is only compatible with grounded RK (sometimes specified as bag) models.

### Example: serving a Grounded RK based simple cost router

P2L has a class of "Grounded RK" models. These models produce coefficients such that `0.0` represents the threshold for a "usable" answer. We can leverage this to cost-route to maximize $P(\text{Not Bad})$... whatever that means exactly. Below we detail the steps to run this routing setup.

First, start up the P2L endpoint:

```bash
python -m p2l.endpoint --model-path lmarena-ai/p2l-7b-grk-02222025 --model-type qwen2 --head-type rk --api-key <API_KEY>
```

Then start up the router server:

```bash
python -m route.openai_server --config config.yaml --router-type grk-endpoint --router-model-endpoint http://0.0.0.0:10250 --router-api-key <ROUTER_API_KEY> --cost-optimizer simple-lp --api-key <API_KEY>
```
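To build intuition for what `simple-lp` trades off, here is a toy sketch. This is not the actual implementation in `route/cost_optimizers.py`; it simply treats `sigmoid(beta_m)` as a stand-in for "model `m` produces a usable answer", using the `0.0` grounding described above, and picks the best model whose configured `cost` fits a budget.

```python
import math


def pick_model(betas: dict[str, float], costs: dict[str, float], budget: float) -> str:
    """Toy cost-aware selection: best proxy P(not bad) among models within budget."""

    def p_not_bad(beta: float) -> float:
        # Grounded RK: 0.0 marks the "usable answer" threshold, so use a sigmoid proxy.
        return 1.0 / (1.0 + math.exp(-beta))

    affordable = {m: p_not_bad(b) for m, b in betas.items() if costs[m] <= budget}
    if not affordable:
        # Nothing fits the budget: fall back to the cheapest model.
        return min(costs, key=costs.get)
    return max(affordable, key=affordable.get)


# Hypothetical per-prompt coefficients, paired with the example costs from the config above.
print(pick_model(
    betas={"claude-3-5-sonnet-20241022": 1.3, "gpt-4o-2024-05-13": 1.1},
    costs={"claude-3-5-sonnet-20241022": 9.31, "gpt-4o-2024-05-13": 12.32},
    budget=10.0,
))
```

The real optimizers solve this selection as a linear program (see Theorem 1 of the paper for `optimal-lp`), so the sketch only illustrates the cost/quality trade-off, not the exact policy the router computes.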
## Calling the OpenAI Compatible Router

As aptly named, the router server is OpenAI compatible. We can call it like any other OpenAI compatible model:

```python
from openai import OpenAI

client = OpenAI(
    base_url="<ROUTER_ENDPOINT>/v1",
    api_key="<ROUTER_API_KEY>",
)

prompt = "what's 828913*1234?"

response = client.chat.completions.create(
    model="-",  # This field is actually not used
    messages=[{"role": "user", "content": prompt}],
    stream=True,  # Router is compatible with and without streaming.
)

# Notice no temperature, top_p, or system prompt is set.
# This allows the router to use the default provided by the config file.
# If you do pass in these fields, they will override the config.
```

If we want to specify a cost budget, we need to do the following:

```python
response = client.chat.completions.create(
    model="-",  # This field is actually not used
    messages=[{"role": "user", "content": prompt}],
    stream=True,  # Router is compatible with and without streaming.
    extra_body={"cost": <COST_BUDGET>},
)
```

## Training a P2L Model

This codebase also contains the training code for P2L models. To train a P2L model, first set up a training config. The [`training_configs`](./training_configs/) directory has many examples. Then, to train, run for example:

```bash
deepspeed --num_gpus=8 --module p2l.train --config training_configs/<CONFIG_NAME>.yaml --no-eval --save-steps 512
```

## Inferencing a P2L Model

To quickly run inference on a dataset using P2L, run:

```bash
python -m p2l.eval --model <MODEL_PATH> --dataset <DATASET> --head-type <HEAD_TYPE> --model-type <MODEL_TYPE> --batch-size 2
```

This will work on any dataset of single-turn prompts under the column name `prompt`.

## AutoEval Suite

Our in-depth evaluation code can be run using `p2l.auto_evals`.

### Params

- **a. Model List Params**
  1. Either provide `--model_repo`, which has a `model_list.json` file.
  2. Or provide a local `--model_list_path` file.
- **b. Val Data**
  1. **Data is in JSONL format**:
     - Provide a local `--eval_path`.
     - If no path is provided, the program will look for an `eval_outputs.jsonl` file in the `--model_repo` on HF.
  2. **Data is in JSON format (checkpoint files)**:
     - Provide a local `--checkpoint_path`.
     - Or provide remote `--hf_checkpoint_repo` and `--hf_checkpoint_file`.
- **c. Output Directory**
  1. Provide a local `--output_dir` or a remote `--hf_output_dir`.
  2. Provide `--output_file_name`.
- **d. Train Data (Optional)**
  - Provide `--hf_train_dataset` or a local `--train_path`.
- **e. Arena Data (Optional)**
  - Provide a local `--arena_path` (CSV with model rankings).
- **f. Provide Model Info**
  1. `--loss_type` (e.g., `bt`, `bt_tie`, `rk`).
  2. `--model_type` (e.g., `p2l`, `marginal`, `arena`, `marginal-gt`).
  3. `--categories`.
- **g. Provide Types of Metrics**
  1. `--simple_metrics`, `--category_metrics`, `--rand_subset_metrics`, `--aggr_scale_subset_metrics`.
  2. Use `--metrics_to_inc` to filter which of the above metrics to include.
- **h. Random Subset Params**
  1. `--rand_subset_sizes`: Specify subset sizes.
  2. `--rand_num_samples`: Specify the number of samples per random subset size.
- **i. Aggregation Subset Params**
  1. `--aggr_scale_subset_sizes`: Specify subset sizes.
  2. `--aggr_scale_num_samples`: Specify the number of samples per random subset size.
  3. `--aggr_scale_gt`: Specify whether to use `marginal-gt` or `arena` as ground truth for categories.

---

## Citation

```
@misc{frick2025prompttoleaderboard,
      title={Prompt-to-Leaderboard},
      author={Evan Frick and Connor Chen and Joseph Tennyson and Tianle Li and Wei-Lin Chiang and Anastasios N.
Angelopoulos and Ion Stoica}, year={2025}, eprint={2502.14855}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2502.14855}, } ``` ================================================ FILE: deepspeed/zero1.json ================================================ { "bf16": { "enabled": "auto" }, "fp16": { "enabled": "auto" }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 1, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": true, "zero_optimization": { "stage": 1, "reduce_bucket_size": 5e8 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": [ 0.9, 0.999 ], "eps": "auto" } } } ================================================ FILE: fast_lambda_setup.sh ================================================ sudo apt-get update -y sudo apt-get install tmux -y sudo apt-get install python3-dev -y sudo apt-get install tmux libaio-dev libopenmpi-dev python3-mpi4py -y curl -LsSf https://astral.sh/uv/install.sh | sh source $HOME/.local/bin/env uv venv .env --python 3.10 source .env/bin/activate uv pip install wheel packaging uv pip install -r train_requirements.txt uv pip install flash-attn==2.5.9.post1 --no-build-isolation ================================================ FILE: fast_runpod_setup.sh ================================================ apt-get update -y apt-get install tmux -y apt-get install python3-dev -y apt-get install tmux libaio-dev libopenmpi-dev python3-mpi4py -y curl -LsSf https://astral.sh/uv/install.sh | sh source $HOME/.local/bin/env uv venv .env --python 3.10 source .env/bin/activate uv pip install wheel packaging uv pip install -r train_requirements.txt uv pip install flash-attn==2.5.9.post1 --no-build-isolation ================================================ FILE: p2l/auto_eval_utils.py ================================================ from typing import Callable, Dict import torch from torch import nn import torch.nn.functional as F import torch.optim as optim import pandas as pd import numpy as np from scipy.optimize import minimize from scipy.stats import kendalltau, spearmanr from model import ( registered_losses, HeadOutputs, registered_aggr_models, registered_pairwise_losses, ) registered_simple_metrics: Dict[str, Dict[str, Callable]] = {} registered_aggr_metrics: Dict[str, Dict[str, Callable]] = {} registered_helpers: Dict[str, Callable] = {} def register_simple_metric(loss_type: str, metric: str): def decorator(func: Callable): if loss_type not in registered_simple_metrics: registered_simple_metrics[loss_type] = {} registered_simple_metrics[loss_type][metric] = func return func return decorator def register_aggr_metric(loss_type: str, metric: str): def decorator(func: Callable): if loss_type not in registered_aggr_metrics: registered_aggr_metrics[loss_type] = {} registered_aggr_metrics[loss_type][metric] = func return func return decorator def register_helper(loss_or_model_type: str, helper_func): def decorator(func: Callable): if loss_or_model_type not in registered_helpers: registered_helpers[loss_or_model_type] = {} registered_helpers[loss_or_model_type][helper_func] = func return func return decorator @register_helper("p2l", "output_labels") def output_labels_p2l(val_data: pd.DataFrame, **kwargs): betas = torch.tensor(np.stack(val_data["betas"]), dtype=torch.float) labels = torch.tensor(np.stack(val_data["labels"])) etas = None if "eta" in val_data.columns: etas = torch.tensor(np.stack(val_data["eta"]), dtype=torch.float) return 
HeadOutputs(coefs=betas, eta=etas), labels def translate_coefs(coef, old_list, new_list): old_list = old_list.tolist() old_to_new = [old_list.index(model) for model in new_list] betas_array = np.array(coef) betas_array = betas_array[old_to_new] return torch.tensor(betas_array) @register_helper("marginal", "output_labels") def output_labels_marginal( val_data: pd.DataFrame, train_data: pd.DataFrame, model_list: np.array, train_model_list: np.array, loss_type: str, **kwargs, ): train_labels = torch.tensor(np.stack(train_data["labels"])) coefs, eta = train_marginal(train_model_list, train_labels, loss_type) coefs, eta = coefs[0], eta[0] if eta is not None else None coefs = translate_coefs(coefs, train_model_list, model_list) val_labels = torch.tensor(np.stack(val_data["labels"])) coefs = coefs.expand(len(val_labels), -1) eta = eta.expand(len(val_labels), -1) if eta is not None else None return HeadOutputs(coefs=coefs, eta=eta), val_labels @register_helper("marginal-gt", "output_labels") def output_labels_marginal_gt( val_data: pd.DataFrame, model_list: np.array, loss_type: str, **kwargs ): val_labels = torch.tensor(np.stack(val_data["labels"])) coefs, eta = train_marginal(model_list, val_labels, loss_type) coefs = coefs.expand(len(val_labels), -1) eta = eta.expand(len(val_labels), -1) if eta is not None else None return HeadOutputs(coefs=coefs, eta=eta), val_labels @register_helper("arena", "output_labels") def output_labels_arena( arena_rankings: torch.tensor, val_data: pd.DataFrame, loss_type: str, **kwargs ): labels = torch.tensor(np.stack(val_data["labels"])) # arena rankings is already filtered so it will be 1d tensor betas = arena_rankings.expand(len(labels), -1) etas = torch.ones(len(labels)) etas = etas.unsqueeze(-1) # TODO: Cleanup if loss_type == "bt" or loss_type == "bt-tie": etas = None return HeadOutputs(coefs=betas, eta=etas), labels @register_helper("bag", "preprocess_data") def preprocess_data_bag(data: pd.DataFrame, **kwargs): condition = data["winner"] == "tie (bothbad)" data.loc[condition, "labels"] = data.loc[condition, "labels"].apply( lambda arr: arr[:2] + [2] ) return data @register_helper("bt", "preprocess_data") @register_helper("bt-tie", "preprocess_data") @register_helper("rk", "preprocess_data") @register_helper("rk-reparam", "preprocess_data") def preprocess_data(data: pd.DataFrame, **kwargs): return data @register_simple_metric("bt", "Loss") @register_simple_metric("bt", "BCELoss") @register_simple_metric("bt-tie", "Loss") @register_simple_metric("rk", "Loss") @register_simple_metric("rk-reparam", "Loss") @register_simple_metric("bag", "Loss") def loss(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): loss_func = registered_losses.get(loss_type) return loss_func(head_output=head_output, labels=labels).item() @register_simple_metric("rk", "Tie_Loss") @register_simple_metric("bag", "Tie_Loss") def tie_loss(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): loss_func = registered_losses.get("tie-" + loss_type) return loss_func(head_output=head_output, labels=labels).item() @register_simple_metric("bag", "Tie_bb_Loss") def tie_bb_loss( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): loss_func = registered_losses.get("tie-bb-" + loss_type) return loss_func(head_output=head_output, labels=labels).item() @register_aggr_metric("bt", "Aggr_Tie_Loss") @register_aggr_metric("bt-tie", "Aggr_Tie_Loss") @register_aggr_metric("rk", "Aggr_Tie_Loss") @register_aggr_metric("rk-reparam", "Aggr_Tie_Loss") 
@register_aggr_metric("bag", "Aggr_Tie_Loss") def Aggr_Tie_Loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_Loss", loss_type, labels, gt_output, model_output) @register_simple_metric("bt-tie", "BCELoss") @register_simple_metric("rk", "BCELoss") @register_simple_metric("rk-reparam", "BCELoss") @register_simple_metric("bag", "BCELoss") def BCE_loss(head_output: HeadOutputs, labels: torch.Tensor, **kwargs): non_tie_index = torch.where(labels[:, -1] == 0)[0] new_coefs = head_output.coefs[non_tie_index, :] new_eta = head_output.eta[non_tie_index] if head_output.eta is not None else None no_tie_output = HeadOutputs(coefs=new_coefs, eta=new_eta) no_tie_labels = labels[non_tie_index, :] return loss(no_tie_output, no_tie_labels, loss_type="bt") def aggr_metric(metric_name, loss_type, labels, gt_output, model_output): func = registered_simple_metrics[loss_type][metric_name] gt = func( labels=labels, head_output=expand_output(gt_output, labels), loss_type=loss_type ) model = func( labels=labels, head_output=expand_output(model_output, labels), loss_type=loss_type, ) return {"ground-truth": round(gt, 4), "model-aggr": round(model, 4)} @register_aggr_metric("bt", "Aggr_Loss") @register_aggr_metric("bt-tie", "Aggr_Loss") @register_aggr_metric("rk", "Aggr_Loss") @register_aggr_metric("rk-reparam", "Aggr_Loss") @register_aggr_metric("bag", "Aggr_Loss") def Aggr_Loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Loss", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_BCELoss") @register_aggr_metric("bt-tie", "Aggr_BCELoss") @register_aggr_metric("rk", "Aggr_BCELoss") @register_aggr_metric("rk-reparam", "Aggr_BCELoss") @register_aggr_metric("bag", "Aggr_BCELoss") def Aggr_BCE_Loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("BCELoss", loss_type, labels, gt_output, model_output) def expand_output(output, labels): coefs, eta = output.coefs, output.eta new_coefs = coefs.expand(len(labels), -1) if eta is not None: eta = eta.expand(len(labels), -1) return HeadOutputs(coefs=new_coefs, eta=eta) @register_simple_metric("bt", "MSELoss") def BT_mse( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] predicted_probs = torch.sigmoid(paired_delta_logit) true_labels = torch.ones_like(predicted_probs) mse = F.mse_loss(predicted_probs, true_labels) return mse.mean().item() @register_simple_metric("bt-tie", "MSELoss") def BT_tie_mst( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs model_idx = labels[:, :2] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] p_w = torch.sigmoid(paired_delta_logit) tie_ind = labels[:, -1] # let label be 0.5 if there is tie pred_probs = torch.where(tie_ind == 1, 0.5, p_w) true_labels = torch.ones_like(pred_probs) mse = F.mse_loss(pred_probs, true_labels) return mse.mean().item() @register_simple_metric("rk", "MSELoss") @register_simple_metric("rk-reparam", "MSELoss") def RK_mse(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): probs_func = registered_helpers[loss_type]["probs"] p_w, _, p_t = probs_func(head_output=head_output, 
labels=labels) tie_ind = labels[:, -1] # True label will always be win (since first index) unless a tie occurs pred_probs = torch.where(tie_ind == 1, p_t, p_w) true_labels = torch.ones_like(pred_probs) mse = F.mse_loss(pred_probs, true_labels) return mse.mean().item() @register_simple_metric("bag", "MSELoss") def bag_mse(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): probs_func = registered_helpers[loss_type]["probs"] p_w, _, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) tie_ind = labels[:, -1].unsqueeze(-1) P = torch.stack([p_w, p_t, p_t_bb], dim=-1) pred_probs = P.gather(dim=-1, index=tie_ind).contiguous().squeeze(-1) true_labels = torch.ones_like(pred_probs) mse = F.mse_loss(pred_probs, true_labels) return mse.mean().item() @register_helper("rk-reparam", "probs") def rk_reparam_probs( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = (torch.exp(eta) + 1.000001).squeeze(-1) winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous()[:, 0] beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous()[:, 0] pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) p_win = pi_win / (pi_win + theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose return p_win, p_lose, p_tie @register_helper("bag", "probs") def bag_probs( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = (torch.exp(eta) + 1.000001).squeeze(-1) winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous()[:, 0] beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous()[:, 0] pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = 1.0 p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb return p_win, p_lose, p_tie, p_tie_bb @register_helper("rk", "probs") def rk_probs( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = rk_eta(head_output) model_idx = labels[:, :2] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] p_w = torch.sigmoid(paired_delta_logit - eta) p_l = torch.sigmoid(-1 * paired_delta_logit - eta) p_t = 1 - p_w - p_l return p_w, p_l, p_t @register_simple_metric("bt", "Accuracy") def BT_accuracy( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] # winner would have positive difference correct = (paired_delta_logit > 0).float() return correct.mean().item() @register_simple_metric("bt-tie", "Accuracy") def BT_tie_accuracy( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] # winner would have positive difference correct = (paired_delta_logit > 0).float() tie_ind = labels[:, -1] # we give ties half the accuracy correct[tie_ind == 1] = 0.5 return correct.mean().item() @register_simple_metric("rk", "Accuracy") 
@register_simple_metric("rk-reparam", "Accuracy") def RK_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t = probs_func(head_output=head_output, labels=labels) pred_labels = torch.where( p_w >= p_l, torch.where(p_w >= p_t, 1, 0.5), torch.where(p_l >= p_t, 0, 0.5) ) tie_ind = labels[:, -1] # tie if tie index, else winner (first index) predicted to win true_labels = torch.where(tie_ind == 1, 0.5, 1) correct = (pred_labels == true_labels).float() return correct.mean().item() @register_simple_metric("rk", "Tie_Accuracy") def RK_tie_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t = probs_func(head_output=head_output, labels=labels) p_nt = p_w + p_l pred_tie = torch.where(p_t >= p_nt, 1, 0) tie_ind = labels[:, -1] correct = (pred_tie == tie_ind).float() return correct.mean().item() @register_simple_metric("bag", "Tie_Accuracy") def bag_tie_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) p_nt = p_w + p_l p_tie = p_t + p_t_bb pred_tie = torch.where(p_nt >= p_tie, 0, 1) tie_ind = torch.where(labels[:, -1] == 0, 0, 1) correct = (pred_tie == tie_ind).float() return correct.mean().item() @register_simple_metric("bag", "Tie_bb_Accuracy") def bag_tie_bb_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) p_nt_bb = p_w + p_l + p_t pred_tie = torch.where(p_t_bb >= p_nt_bb, 1, 0) tie_ind = torch.where(labels[:, -1] == 2, 1, 0) correct = (pred_tie == tie_ind).float() return correct.mean().item() @register_aggr_metric("bt", "Aggr_Tie_Accuracy") @register_aggr_metric("bt-tie", "Aggr_Tie_Accuracy") @register_aggr_metric("rk", "Aggr_Tie_Accuracy") @register_aggr_metric("rk-reparam", "Aggr_Tie_Accuracy") @register_aggr_metric("bag", "Aggr_Tie_Accuracy") def Aggr_Tie_accuracy( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_Accuracy", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_Tie_Accuracy") @register_aggr_metric("bt-tie", "Aggr_Tie_Accuracy") @register_aggr_metric("rk", "Aggr_Tie_Accuracy") @register_aggr_metric("rk-reparam", "Aggr_Tie_Accuracy") @register_aggr_metric("bag", "Aggr_Tie_Accuracy") def Aggr_Tie_accuracy( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_Accuracy", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("bt-tie", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("rk", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("rk-reparam", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("bag", "Aggr_Tie_bb_Accuracy") def Aggr_Tie_bb_accuracy( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_bb_Accuracy", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_Tie_bb_Loss") @register_aggr_metric("bt-tie", "Aggr_Tie_bb_Loss") @register_aggr_metric("rk", "Aggr_Tie_bb_Loss") @register_aggr_metric("rk-reparam", "Aggr_Tie_bb_Loss") 
@register_aggr_metric("bag", "Aggr_Tie_bb_Loss") def Aggr_Tie_bb_loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_bb_Loss", loss_type, labels, gt_output, model_output) @register_simple_metric("rk-reparam", "Tie_Accuracy") @register_simple_metric("bt", "Tie_Accuracy") @register_simple_metric("bt-tie", "Tie_Accuracy") @register_simple_metric("bt", "Tie_bb_Loss") @register_simple_metric("rk-reparam", "Tie_bb_Loss") @register_simple_metric("bt-tie", "Tie_bb_Loss") @register_simple_metric("rk", "Tie_bb_Loss") @register_simple_metric("bt", "Tie_Loss") @register_simple_metric("bt-tie", "Tie_Loss") @register_simple_metric("rk-reparam", "Tie_Loss") @register_simple_metric("rk", "Tie_bb_Accuracy") @register_simple_metric("rk-reparam", "Tie_bb_Accuracy") @register_simple_metric("bt", "Tie_bb_Accuracy") @register_simple_metric("bt-tie", "Tie_bb_Accuracy") def not_implemented( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): return -1 # not implemented @register_simple_metric("bag", "Accuracy") def bag_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) P = torch.stack([p_w, p_t, p_t_bb, p_l], dim=-1) pred_labels = P.argmax(dim=-1) tie_ind = labels[:, -1] # let win be 0, tie be 1, tie_bb be 2. loss never predicted since winner_idx first true_labels = tie_ind correct = (pred_labels == true_labels).float() return correct.mean().item() @register_simple_metric("bt", "Mean-BT") @register_simple_metric("bt-tie", "Mean-BT") @register_simple_metric("rk", "Mean-BT") @register_simple_metric("rk-reparam", "Mean-BT") @register_simple_metric("bag", "Mean-BT") def beta_mean( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs flat_betas = betas.flatten() return torch.mean(flat_betas).item() @register_simple_metric("bt", "Std-BT") @register_simple_metric("bt-tie", "Std-BT") @register_simple_metric("rk", "Std-BT") @register_simple_metric("rk-reparam", "Std-BT") @register_simple_metric("bag", "Std-BT") def beta_std( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs flat_betas = betas.flatten() return torch.std(flat_betas).item() @register_simple_metric("bt", "Spread-BT") @register_simple_metric("bt-tie", "Spread-BT") @register_simple_metric("rk", "Spread-BT") @register_simple_metric("rk-reparam", "Spread-BT") @register_simple_metric("bag", "Spread-BT") def beta_spread( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs flat_betas = betas.flatten() return (torch.max(flat_betas) - torch.min(flat_betas)).item() @register_simple_metric("bt", "Mean-Spread-BT") @register_simple_metric("bt-tie", "Mean-Spread-BT") @register_simple_metric("rk", "Mean-Spread-BT") @register_simple_metric("rk-reparam", "Mean-Spread-BT") @register_simple_metric("bag", "Mean-Spread-BT") def beta_mean_spread( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs max_min_per_prompt = ( torch.max(betas, dim=-1).values - torch.min(betas, dim=-1).values ) return torch.mean(max_min_per_prompt).item() @register_simple_metric("bt", "Mean-IQR-BT") @register_simple_metric("bt-tie", "Mean-IQR-BT") @register_simple_metric("rk", "Mean-IQR-BT") @register_simple_metric("rk-reparam", "Mean-IQR-BT") @register_simple_metric("bag", "Mean-IQR-BT") def beta_mean_iqr( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs iqr_per_prompt = 
torch.quantile(betas, 0.75, dim=-1) - torch.quantile( betas, 0.25, dim=-1 ) return torch.mean(iqr_per_prompt).item() @register_simple_metric("bt", "Mean-Std-BT") @register_simple_metric("bt-tie", "Mean-Std-BT") @register_simple_metric("rk", "Mean-Std-BT") @register_simple_metric("rk-reparam", "Mean-Std-BT") @register_simple_metric("bag", "Mean-Std-BT") def beta_mean_std( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs std_per_prompt = torch.std(betas, dim=-1) return torch.mean(std_per_prompt).item() @register_helper("marginal-gt", "aggregrate") def aggr_marginal_gt( labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs ): coefs, eta = train_marginal(model_list, labels, loss_type) return HeadOutputs(coefs=coefs[0], eta=eta[0] if eta is not None else None) @register_helper("p2l", "aggregrate") def aggr_p2l( head_output: HeadOutputs, labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs, ): coefs, eta = train_aggr_prob( model_list, head_output, labels, loss_type, is_batch=False ) return HeadOutputs(coefs=coefs[0], eta=eta[0] if eta is not None else None) @register_helper("p2l", "aggregrate-batch") def aggr_p2l_batch( head_output: HeadOutputs, labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs, ): coefs_batch, eta_batch = train_aggr_prob( model_list, head_output, labels, loss_type, is_batch=True ) return [ HeadOutputs( coefs=coefs_batch[i], eta=eta_batch[i] if eta_batch is not None else None ) for i in range(coefs_batch.shape[0]) ] @register_helper("marginal-gt", "aggregrate-batch") def aggr_p2l_batch( head_output: HeadOutputs, labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs, ): # TODO: Make faster if necessary return [ aggr_marginal_gt(labels[i], model_list, loss_type) for i in range(len(labels)) ] @register_helper("marginal", "aggregrate") def aggr_non_p2l(head_output: HeadOutputs, loss_type: str, **kwargs): etas = head_output.eta etas = etas[0, :] if etas is not None else None return HeadOutputs(coefs=head_output.coefs[0, :], eta=etas) @register_helper("arena", "aggregrate") def aggr_non_p2l( head_output: HeadOutputs = None, arena_rankings: torch.tensor = None, **kwargs ): eta = torch.tensor([0]) if arena_rankings is not None: return HeadOutputs(coefs=arena_rankings, eta=eta) # arena just has the same betas repeated if not provided return HeadOutputs(coefs=head_output.coefs[0, :], eta=eta) def train_marginal(model_list, labels, loss_type, lr=1.0, tol=1e-9, max_epochs=50): model_cls = registered_aggr_models[loss_type] model = model_cls(len(model_list)) optimizer = optim.LBFGS( model.parameters(), lr=lr, max_iter=max_epochs, tolerance_grad=tol, tolerance_change=tol, ) loss_func = registered_losses[loss_type] labels = ( labels.squeeze() if labels.dim() > 2 else labels ) # marginal doesn't use batching since one at a time def closure(): optimizer.zero_grad() coefs, eta = model() coefs_expanded = coefs[0].expand(len(labels), -1) eta_expanded = eta[0].expand(len(labels), -1) if eta is not None else None head_output = HeadOutputs(coefs=coefs_expanded, eta=eta_expanded) loss = loss_func(head_output=head_output, labels=labels) loss.backward() return loss optimizer.step(closure) true_coefs, true_eta = model() return true_coefs.detach(), true_eta.detach() if true_eta is not None else None def train_aggr_prob( model_list, head_outputs, labels, loss_type, is_batch, lr=1.0, tol=1e-9, max_epochs=50, ): true_probs_func = registered_helpers[loss_type]["pairwise_probs"] true_probs = 
true_probs_func(real_output=head_outputs) # add a batch size of 1 since aggregration is done in batches (only necessary if data isn't in batch format) if not is_batch: true_probs = true_probs.unsqueeze(0) batch_size = true_probs.shape[0] model_cls = registered_aggr_models[loss_type] model = model_cls(len(model_list), batch_size) optimizer = optim.LBFGS( model.parameters(), lr=lr, max_iter=max_epochs, tolerance_grad=tol, tolerance_change=tol, ) loss_func = registered_pairwise_losses[loss_type] count = 0 prev_loss = 0 def closure(): optimizer.zero_grad() coefs, eta = model() aggr_output = HeadOutputs(coefs=coefs, eta=eta) loss = loss_func( real_output=head_outputs, aggregated_output=aggr_output, true_probs=true_probs, ) loss.backward() nonlocal count count += 1 if count == 49: raise Warning("Batch training did not converge") return loss optimizer.step(closure) true_coefs, true_eta = model() return true_coefs.detach(), true_eta.detach() if true_eta is not None else None def rk_eta(output): if output.eta is None: return None BETA = 0.1 return torch.clamp( torch.nn.functional.softplus(output.eta - 22.5, BETA).squeeze(-1), min=0.02 ) @register_helper("rk", "pairwise_probs") def pairwise_RK_probs(real_output: HeadOutputs): real_betas = real_output.coefs real_eta = rk_eta(real_output) real_eta = real_eta.unsqueeze(-1) num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) # elipses allow for both batched/unbatched beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] true_probs_win = torch.sigmoid(beta_i_real - beta_j_real - real_eta) true_probs_loss = torch.sigmoid(beta_j_real - beta_i_real - real_eta) true_probs_tie = 1.0 - true_probs_win - true_probs_loss true_probs = torch.stack((true_probs_win, true_probs_loss, true_probs_tie), dim=-1) return true_probs @register_helper("rk-reparam", "pairwise_probs") def pairwise_RK_reparam_probs(real_output: HeadOutputs, **kwargs): real_betas = real_output.coefs real_theta = torch.exp(real_output.eta) + 1.000001 num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] pi_win = torch.exp(beta_i_real) pi_lose = torch.exp(beta_j_real) p_win = pi_win / (pi_win + real_theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + real_theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose true_probs = torch.stack((p_win, p_lose, p_tie), dim=-1) return true_probs @register_helper("bag", "pairwise_probs") def pairwise_bag_probs(real_output: HeadOutputs, **kwargs): real_betas = real_output.coefs real_theta = torch.exp(real_output.eta) + 1.000001 num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] pi_win = torch.exp(beta_i_real) pi_lose = torch.exp(beta_j_real) pi_gamma = 1.0 p_win = pi_win / (pi_win + real_theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + real_theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb true_probs = torch.stack((p_win, p_lose, p_tie, p_tie_bb), dim=-1) return true_probs @register_helper("bt", "pairwise_probs") @register_helper("bt-tie", 
"pairwise_probs") def pairwise_BT_probs(real_output: HeadOutputs): real_betas = real_output.coefs num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] true_probs = torch.sigmoid(beta_i_real - beta_j_real) return true_probs # removes nan from tensor, indices will be shifted def remove_beta_nan(beta1, beta2): beta_mask = ~torch.isnan(beta1) & ~torch.isnan(beta2) return beta1[beta_mask], beta2[beta_mask] @register_aggr_metric("bt", "Leaderboard") @register_aggr_metric("bt-tie", "Leaderboard") @register_aggr_metric("rk", "Leaderboard") @register_aggr_metric("rk-reparam", "Leaderboard") @register_aggr_metric("bag", "Leaderboard") def leaderboard( gt_output: HeadOutputs, model_output: HeadOutputs, model_list: np.array, **kwargs ): gt_lb = get_leaderboard(gt_output, model_list) model_lb = get_leaderboard(model_output, model_list) return {"ground-truth": list(gt_lb), "model-aggr": list(model_lb)} def get_leaderboard(output, model_list): coefs = output.coefs sorted_indices = torch.argsort(coefs, descending=True) sorted_model_names = [model_list[i] for i in sorted_indices] sorted_betas = coefs[sorted_indices] leaderboard = [] for i in range(len(sorted_model_names)): beta = ( round(sorted_betas[i].item(), 4) if not torch.isnan(sorted_betas[i]) else "nan" ) cur_model = str(sorted_model_names[i]) + ": " + str(beta) leaderboard.append(cur_model) return np.array(leaderboard) @register_aggr_metric("bt", "L1-Dist-Prob") @register_aggr_metric("bt-tie", "L1-Dist-Prob") def l1_dist_prob_bt(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): beta1 = gt_output.coefs beta2 = model_output.coefs # if arena is one, there may be nan if model not present in that file beta1, beta2 = remove_beta_nan(beta1, beta2) diff_matrix1 = beta1.unsqueeze(1) - beta1.unsqueeze(0) diff_matrix2 = beta2.unsqueeze(1) - beta2.unsqueeze(0) prob_vec1 = torch.sigmoid(diff_matrix1).flatten() prob_vec2 = torch.sigmoid(diff_matrix2).flatten() return torch.abs(prob_vec2 - prob_vec1).mean().item() @register_aggr_metric("rk-reparam", "L1-Dist-Prob") @register_aggr_metric("rk", "L1-Dist-Prob") def l1_dist_prob_rk( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, **kwargs ): eta1 = gt_output.eta eta2 = model_output.eta # need to both have eta if eta1 is None or eta2 is None: return l1_dist_prob_bt(gt_output, model_output) pair_probs_func = registered_helpers[loss_type]["pairwise_probs"] p_win1, p_lose1, p_tie1 = torch.unbind(pair_probs_func(gt_output), -1) p_win2, p_lose2, p_tie2 = torch.unbind(pair_probs_func(model_output), -1) win_diff = torch.abs(p_win1 - p_win2).mean().item() lose_diff = torch.abs(p_lose1 - p_lose2).mean().item() tie_diff = torch.abs(p_tie1 - p_tie2).mean().item() return (win_diff + lose_diff + tie_diff) / 3 @register_aggr_metric("bag", "L1-Dist-Prob") def l1_dist_prob_bag( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, **kwargs ): eta1 = gt_output.eta eta2 = model_output.eta # need to both have eta if eta1 is None or eta2 is None: return l1_dist_prob_bt(gt_output, model_output) pair_probs_func = registered_helpers[loss_type]["pairwise_probs"] p_win1, p_lose1, p_tie1, p_tie_bb1 = torch.unbind(pair_probs_func(gt_output), -1) p_win2, p_lose2, p_tie2, p_tie_bb2 = torch.unbind(pair_probs_func(model_output), -1) win_diff = torch.abs(p_win1 - p_win2).mean().item() lose_diff = 
torch.abs(p_lose1 - p_lose2).mean().item() tie_diff = torch.abs(p_tie1 - p_tie2).mean().item() tie_bb_diff = torch.abs(p_tie_bb2 - p_tie_bb1).mean().item() return (win_diff + lose_diff + tie_diff + tie_bb_diff) / 4 @register_aggr_metric("bt", "IQR-BT") @register_aggr_metric("bt-tie", "IQR-BT") @register_aggr_metric("rk", "IQR-BT") @register_aggr_metric("rk-reparam", "IQR-BT") @register_aggr_metric("bag", "IQR-BT") def beta_iqr(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): ( gt_coefs, model_coefs, ) = ( gt_output.coefs, model_output.coefs, ) gt_iqr = (torch.quantile(gt_coefs, 0.75) - torch.quantile(gt_coefs, 0.25)).item() model_iqr = ( torch.quantile(model_coefs, 0.75) - torch.quantile(model_coefs, 0.25) ).item() return {"ground-truth": round(gt_iqr, 4), "model-aggr": round(model_iqr, 4)} @register_aggr_metric("bt", "Std-BT") @register_aggr_metric("bt-tie", "Std-BT") @register_aggr_metric("rk", "Std-BT") @register_aggr_metric("rk-reparam", "Std-BT") @register_aggr_metric("bag", "Std-BT") def beta_std_aggr(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = gt_output.coefs, model_output.coefs gt_std, model_std = ( torch.std(gt_betas.flatten()).item(), torch.std(model_betas.flatten()).item(), ) return {"ground-truth": round(gt_std, 4), "model-aggr": round(model_std, 4)} @register_aggr_metric("bt", "Spread-BT") @register_aggr_metric("bt-tie", "Spread-BT") @register_aggr_metric("rk", "Spread-BT") @register_aggr_metric("rk-reparam", "Spread-BT") @register_aggr_metric("bag", "Spread-BT") def beta_spread_aggr(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = gt_output.coefs.flatten(), model_output.coefs.flatten() gt_spread, model_spread = torch.max(gt_betas) - torch.min(gt_betas), torch.max( model_betas ) - torch.min(model_betas) return { "ground-truth": round(gt_spread.item(), 4), "model-aggr": round(model_spread.item(), 4), } @register_aggr_metric("bt", "Kendall-lbs") @register_aggr_metric("bt-tie", "Kendall-lbs") @register_aggr_metric("rk", "Kendall-lbs") @register_aggr_metric("rk-reparam", "Kendall-lbs") @register_aggr_metric("bag", "Kendall-lbs") def kendall_lb(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) gt_lb = gt_betas.numpy() model_lb = model_betas.numpy() return kendalltau(gt_lb, model_lb)[0] @register_aggr_metric("bt", "Spearman-lbs") @register_aggr_metric("bt-tie", "Spearman-lbs") @register_aggr_metric("rk", "Spearman-lbs") @register_aggr_metric("rk-reparam", "Spearman-lbs") @register_aggr_metric("bag", "Spearman-lbs") def spearman_lb(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) gt_lb = gt_betas.numpy() model_lb = model_betas.numpy() return spearmanr(gt_lb, model_lb)[0] def top_k_frac(gt_betas: torch.tensor, model_betas: torch.tensor, k: int): gt_top_indices = set(torch.topk(gt_betas, k).indices.numpy()) model_top_indices = set(torch.topk(model_betas, k).indices.numpy()) common_indices = gt_top_indices & model_top_indices return len(common_indices) / k def top_k_displace(gt_betas: torch.tensor, model_betas: torch.tensor, k: int): gt_top_indices = torch.topk(gt_betas, k).indices model_ranks = torch.argsort(torch.argsort(model_betas, descending=True)) displacements = torch.abs(model_ranks[gt_top_indices] - torch.arange(k)) return displacements.float().mean().item() @register_aggr_metric("bt", "Top-k-fraction") 
@register_aggr_metric("bt-tie", "Top-k-fraction") @register_aggr_metric("rk", "Top-k-fraction") @register_aggr_metric("rk-reparam", "Top-k-fraction") @register_aggr_metric("bag", "Top-k-fraction") def top_k_frac_dict(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) res = {} for k in [1, 3, 5, 10]: res[k] = round(top_k_frac(gt_betas, model_betas, k), 4) return res @register_aggr_metric("bt", "Top-k-displace") @register_aggr_metric("bt-tie", "Top-k-displace") @register_aggr_metric("rk", "Top-k-displace") @register_aggr_metric("rk-reparam", "Top-k-displace") @register_aggr_metric("bag", "Top-k-displace") def top_k_dist_dict(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) res = {} for k in [1, 3, 5, 10]: res[k] = round(top_k_displace(gt_betas, model_betas, k), 4) return res ================================================ FILE: p2l/auto_evals.py ================================================ import argparse import json import os import io import warnings import math from tqdm import tqdm import time import copy import torch import pandas as pd import numpy as np from datasets import load_dataset, load_from_disk from huggingface_hub import hf_hub_download, upload_file, list_repo_files from model import HeadOutputs from auto_eval_utils import ( registered_simple_metrics, registered_aggr_metrics, registered_helpers, ) def parse_model_list(hf_model, local_path): if not hf_model and not local_path: raise ValueError("Either model repo or local model list must be provided.") model_list_path = local_path # if no local path, try getting from model_repo if not model_list_path: model_list_path = hf_hub_download( repo_id=hf_model, filename="model_list.json", repo_type="model" ) model_list = pd.read_json(model_list_path, lines=False).iloc[:, 0].tolist() return np.array(model_list) def change_beta_model_list(df, old_list, new_list): old_list = old_list.tolist() old_to_new = [old_list.index(model) for model in new_list] betas_array = np.array(df["betas"].to_list()) betas_array = betas_array[:, old_to_new] return betas_array.tolist() def parse_eval_output_data( model_repo, local_eval_path, local_checkpoint_path, hf_checkpoint_repo, hf_checkpoint_file, loss_type, model_list, remove_last_hidden_json, ): ret_df, ret_model_list = None, None if local_checkpoint_path or hf_checkpoint_repo: path = local_checkpoint_path if not path: if not hf_checkpoint_file: raise ValueError( "Must provide checkpoint file along with checkpoint repo" ) path = hf_hub_download( repo_id=hf_checkpoint_repo, filename=hf_checkpoint_file, repo_type="dataset", ) df = pd.read_json(path) # caching json w/o last hidden layer if remove_last_hidden_json and local_checkpoint_path: if "last_hidden_state" in df.columns: df = df.drop(columns=["last_hidden_state"]) df.to_json(local_checkpoint_path) df = df.rename(columns={"coefs": "betas"}) # data is stored with nested lists for both etas and betas only in checkpoint data # df['eta'] = np.array(df['eta'].to_list()).flatten() df["eta"] = df["eta"].apply(lambda x: x[0] if isinstance(x, list) else x) df["betas"] = df["betas"].apply(lambda x: x[0] if isinstance(x, list) else x) val_model_list = get_model_list_from_df(df) # only betas need to be adjusted since labels are correct df["betas"] = change_beta_model_list(df, model_list, val_model_list) ret_df, ret_model_list = df, val_model_list elif local_eval_path: ret_df, 
ret_model_list = pd.read_json(local_eval_path, lines=True), model_list elif model_repo: files = list_repo_files(repo_id=model_repo, repo_type="model") if "eval_output.jsonl" not in files: raise FileNotFoundError( f"'eval_output.jsonl' not found in the hf repository'{model_repo}'." ) path = hf_hub_download( repo_id=model_repo, filename="eval_output.jsonl", repo_type="model" ) ret_df, ret_model_list = pd.read_json(path, lines=True), model_list else: raise ValueError("need to provide path for eval output data") preprocess_func = registered_helpers[loss_type]["preprocess_data"] ret_df = preprocess_func(data=ret_df) return ret_df, ret_model_list def add_labels_to_data(data, loss_type, model_list): if loss_type == "bt": data = data[~data["winner"].isin(["tie", "tie (bothbad)"])] def create_labels(row): winner = row["winner"] model_a = row["model_a"] model_b = row["model_b"] model_a_idx = np.where(model_list == model_a)[0][0] model_b_idx = np.where(model_list == model_b)[0][0] tie_bb_label = 2 if loss_type == "bag" else 1 if winner == "model_a": return np.array([model_a_idx, model_b_idx, 0]) elif winner == "model_b": return np.array([model_b_idx, model_a_idx, 0]) elif winner == "tie": return np.array([model_a_idx, model_b_idx, 1]) else: return np.array([model_a_idx, model_b_idx, tie_bb_label]) data["labels"] = data.apply(create_labels, axis=1) return data # only use if completely necessary def get_model_list_from_df(df): return np.array(sorted(pd.concat([df["model_a"], df["model_b"]]).unique())) def parse_train_data(hf_data, local_path, loss_type, train_model_list): if not hf_data and not local_path: warnings.warn( "No train data provided, marginal model type will not work if specified" ) return if local_path: if local_path.endswith(".jsonl"): data = pd.read_json(local_path, lines=True) else: data = load_from_disk(local_path)["train"].to_pandas() else: data = load_dataset(hf_data, split="train").to_pandas() return add_labels_to_data(data, loss_type, train_model_list) def parse_arena_data(path, initial_rating=1000, BASE=10, SCALE=400): if not path: warnings.warn("Ground truth arena data not passed in, some metrics not work") return df = pd.read_csv(path) # removes to avoid duplicates since not every model has a style_controlled ranking df = df[df["style_control"] == False] # ELO to beta using what eval_p2l.ipynb used df["beta"] = (df["rating"] - initial_rating) / (SCALE * math.log(BASE)) pivot = df.pivot(index="model_name", columns="category", values="beta").reindex( model_list ) if pivot.isnull().any().any(): missing_models = pivot[pivot.isnull().any(axis=1)].index.tolist() warnings.warn("Model not included in arena leaderboard:" + str(missing_models)) category_to_betas = { category: torch.tensor(pivot[category].values, dtype=torch.float) for category in pivot.columns } return category_to_betas # NOTE: Only accepts certain categories, needs to be manually added def filter_battle_data(battles, category): if battles is None: return None # expect category key by itself or key=value key_val_pair = category.split("=") key = key_val_pair[0] val = key_val_pair[1] if len(key_val_pair) == 2 else True val = bool(val) if val in ["True", "true", "False", "false"] else val try: # no filtering if key == "all": return battles # no nesting if key == "language" or key == "is_code": return battles[battles[key] == val] # nested ones need specific cases if key == "math": return battles[ battles["category_tag"].apply(lambda x: x["math_v0.1"]["math"]) ] if key == "complexity": return battles[ 
battles["category_tag"].apply( lambda x: x["criteria_v0.1"]["complexity"] ) ] if key == "creative_writing": return battles[ battles["category_tag"].apply( lambda x: x["creative_writing_v0.1"]["creative_writing"] ) ] if key == "hard": return battles[ battles["category_tag"].apply( lambda x: sum(x["criteria_v0.1"].values()) >= 6 ) ] # Category not found return None except: return None # NOTE: Only accepts certain categories, needs to be manually added def get_arena_rankings(data, category): if data is None: return None key_val_pair = category.split("=") key = key_val_pair[0] val = key_val_pair[1] if len(key_val_pair) == 2 else True val = bool(val) if val in ["True", "true", "False", "false"] else val try: # no filtering if key == "all": return data["full"] # no nesting if key == "language": return data[val.lower()] if key == "is_code": return data["coding"] if key == "math": return data["math"] if key == "hard": return data["hard_6"] if key == "creative_writing": return data["creative_writing"] return None except: return None def get_subset_prompts(output, labels, size): num_prompts = output.coefs.shape[0] sampled_indices = torch.randperm(num_prompts)[:size] sampled_coefs = output.coefs[sampled_indices, :] sampled_eta = None if output.eta is not None: sampled_eta = output.eta[sampled_indices] sampled_labels = labels[sampled_indices, :] sampled_output = HeadOutputs(coefs=sampled_coefs, eta=sampled_eta) return sampled_output, sampled_labels def get_subset_prompts_batch(output, labels, size, batch_size): num_prompts, num_models = output.coefs.shape sampled_indices = torch.randint(low=0, high=num_prompts, size=(batch_size, size)) sampled_coefs = output.coefs[sampled_indices] sampled_eta = None if output.eta is not None: sampled_eta = output.eta[sampled_indices] sampled_labels = labels[sampled_indices] sampled_output = HeadOutputs(coefs=sampled_coefs, eta=sampled_eta) return sampled_output, sampled_labels def get_ith_output(output, i): betas = output.coefs[i] eta = output.eta[i] if output.eta is not None else None return HeadOutputs(coefs=betas, eta=eta) def save_output(results, local_dir, hf_dir, file_name): if not local_dir and not hf_dir: raise ValueError("Specify a directory for outputs.") results["params"]["output_file_name"] = file_name file_name += ".json" if local_dir: path = os.path.join(local_dir, file_name) with open(path, "w") as file: json.dump(results, file, indent=4, separators=(",", ": ")) if hf_dir: output = json.dumps(results, indent=4, separators=(",", ": ")) tmp_file = io.BytesIO(output.encode("utf-8")) upload_file( path_or_fileobj=tmp_file, path_in_repo=file_name, repo_id=hf_dir, repo_type="model", ) def simple_metrics(metrics, output, labels, loss_type): results = {} for metric in tqdm(metrics, desc="Simple Metrics", unit="metrics"): metric_dict = registered_simple_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func(head_output=output, labels=labels, loss_type=loss_type) results[metric] = ( round(metric_val, 4) if isinstance(metric_val, float) else metric_val ) return results def category_metrics( metrics, output, labels, loss_type, model_type, model_list, ground_truth, arena_rankings, ): results = {} aggr_func_model = registered_helpers[model_type]["aggregrate"] # our default ground truth is marginal-gt but we can switch to arena or add configurability if desired aggr_func_gt = registered_helpers[ground_truth]["aggregrate"] model_output = aggr_func_model( head_output=output, labels=labels, model_list=model_list, loss_type=loss_type ) gt_output = 
aggr_func_gt( labels=labels, model_list=model_list, loss_type=loss_type, arena_rankings=arena_rankings, ) for metric in tqdm(metrics, desc="Category Metrics", unit="metric"): metric_dict = registered_aggr_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func( gt_output=gt_output, model_output=model_output, model_list=model_list, loss_type=loss_type, labels=labels, ) results[metric] = ( round(metric_val, 4) if isinstance(metric_val, float) else metric_val ) return results def random_subset_metrics( metrics, output, labels, subset_sizes, trials_per_subset, loss_type, model_type, model_list, ): results = {} aggr_func_model = registered_helpers[model_type]["aggregrate"] # our default ground truth is marginal-gt but we can switch to arena or add configurability if desired aggr_func_gt = registered_helpers["marginal-gt"]["aggregrate"] for idx, size in enumerate(subset_sizes): size = int(size) subset_results = {metric: 0 for metric in metrics} for _ in tqdm( range(trials_per_subset[idx]), desc=f"Random Subset size {size}", unit="trial", ): sample_output, sample_labels = get_subset_prompts(output, labels, size) model_output = aggr_func_model( head_output=sample_output, labels=sample_labels, model_list=model_list, loss_type=loss_type, ) gt_output = aggr_func_gt( labels=sample_labels, model_list=model_list, loss_type=loss_type ) for metric in metrics: metric_dict = registered_aggr_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func( gt_output=gt_output, model_output=model_output, model_list=model_list, loss_type=loss_type, ) subset_results[metric] += metric_val for metric in metrics: subset_results[metric] = round( subset_results[metric] / trials_per_subset, 4 ) results[size] = subset_results return results def aggr_scale_metrics( metrics, output, labels, subset_sizes, trials_per_subset, loss_type, model_type, model_list, arena_rankings, gt, ): results = {} aggr_func_model = registered_helpers[model_type]["aggregrate-batch"] # our default ground truth is arena ranking but we can switch to arena or add configurability if desired aggr_func_gt = registered_helpers[gt]["aggregrate"] gt_output = aggr_func_gt( labels=labels, model_list=model_list, loss_type=loss_type, arena_rankings=arena_rankings, ) # TODO: arbitray threshold to limit memory consumption for batching # max_prompts_times_samples_squared = 2e4 for idx, size in enumerate(subset_sizes): size = int(size) num_samples = int(trials_per_subset[idx]) subset_results = {metric: 0 for metric in metrics} # num_full_mini_batches = int(max( # 1, (size * (num_samples ** 2)) // max_prompts_times_samples_squared # )) num_full_mini_batches = int(max(1, num_samples // 100)) mini_batch_size = num_samples // num_full_mini_batches leftover = num_samples - (num_full_mini_batches * mini_batch_size) with tqdm(total=num_samples, desc=f"Aggr Subset Size {size}") as pbar: def run_mini_batch(batch_count): sample_output, sample_labels = get_subset_prompts_batch( output, labels, size, batch_count ) batch_output = aggr_func_model( head_output=sample_output, labels=sample_labels, model_list=model_list, loss_type=loss_type, ) for cur_output in batch_output: for metric in metrics: metric_dict = registered_aggr_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func( gt_output=gt_output, model_output=cur_output, model_list=model_list, loss_type=loss_type, ) subset_results[metric] += metric_val pbar.update(1) for _ in range(num_full_mini_batches): run_mini_batch(mini_batch_size) if leftover > 0: 
run_mini_batch(leftover) for metric in metrics: subset_results[metric] = round( subset_results[metric] / float(trials_per_subset[idx]), 4 ) results[size] = subset_results return results def get_metrics( val_data, train_data, arena_rankings, val_model_list, train_model_list, args ): results = {} to_inc = set(args.metrics_to_inc) output_label_func = registered_helpers[args.model_type]["output_labels"] output, labels = output_label_func( val_data=val_data, train_data=train_data, arena_rankings=arena_rankings, loss_type=args.loss_type, model_list=val_model_list, train_model_list=train_model_list, ) if "simple" in to_inc: simple_results = simple_metrics( metrics=args.simple_metrics, output=output, labels=labels, loss_type=args.loss_type, ) results["simple_metrics"] = simple_results if "category" in to_inc: category_results = category_metrics( metrics=args.category_metrics, loss_type=args.loss_type, model_type=args.model_type, model_list=val_model_list, output=output, labels=labels, ground_truth=args.ground_truth, arena_rankings=arena_rankings, ) results["category_metrics"] = category_results if "random_subsets" in to_inc: subset_results = random_subset_metrics( metrics=args.rand_subset_metrics, subset_sizes=args.rand_subset_sizes, trials_per_subset=args.rand_num_samples, loss_type=args.loss_type, model_type=args.model_type, model_list=val_model_list, output=output, labels=labels, ) results["random_subsets"] = subset_results if "aggr_scale" in to_inc: scale_results = aggr_scale_metrics( metrics=args.aggr_scale_metrics, subset_sizes=args.aggr_scale_subset_sizes, trials_per_subset=args.aggr_scale_num_samples, loss_type=args.loss_type, model_type=args.model_type, model_list=val_model_list, output=output, labels=labels, arena_rankings=arena_rankings, gt=args.ground_truth, ) results["aggr_scale"] = scale_results return results if __name__ == "__main__": parser = argparse.ArgumentParser() # model repo contains model list and potentially, eval data (eval_output.jsonl) parser.add_argument("--model_repo", type=str, default=None) parser.add_argument("--model_list_path", type=str, default=None) # val data is either in model repo, local file, or remotely as checkpoint file parser.add_argument("--eval_path", nargs="+", type=str, default=None) parser.add_argument("--checkpoint_path", nargs="+", type=str, default=None) parser.add_argument("--hf_checkpoint_repo", type=str, default=None) parser.add_argument("--hf_checkpoint_file", nargs="+", type=str, default=None) parser.add_argument("--output_dir", type=str, default=None) parser.add_argument("--hf_output_dir", type=str, default=None) parser.add_argument( "--output_file_name", type=str, nargs="+", default=["eval_metrics"] ) parser.add_argument("--hf_train_dataset", type=str, default=None) parser.add_argument("--train_path", type=str, default=None) parser.add_argument("--arena_path", type=str, default=None) parser.add_argument("--loss_type", type=str, default="bt", help="bt, bt_tie, rk") parser.add_argument( "--model_type", type=str, default="p2l", help="p2l, marginal, arena" ) parser.add_argument( "--categories", nargs="*", default=[ "all", "creative_writing", "math", "language=Chinese", "is_code", "hard", ], ) parser.add_argument( "--simple_metrics", nargs="*", default=[ "Loss", "BCELoss", "MSELoss", "Accuracy", "Tie_Loss", "Tie_Accuracy", "Tie_bb_Accuracy", "Tie_bb_Loss", "Mean-BT", "Std-BT", "Spread-BT", "Mean-Spread-BT", "Mean-IQR-BT", "Mean-Std-BT", ], ) parser.add_argument("--train_checkpoints", nargs="+", type=int, default=[]) 
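    # Example invocation (illustrative only -- the repo id and file paths below are
    # placeholders, not shipped defaults). Run from inside the p2l/ directory, since
    # this script uses local imports such as `from model import ...`:
    #
    #   python auto_evals.py \
    #       --model_repo p2el/Qwen2.5-7B-Instruct-rk-full-train \
    #       --train_path ./train_battles.jsonl \
    #       --arena_path ./arena_leaderboard.csv \
    #       --loss_type rk --model_type p2l \
    #       --output_dir ./metrics \
    #       --metrics_to_inc simple category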
parser.add_argument("--checkpoint_size", type=int, default=0) # gt is marginal on val parser.add_argument( "--category_metrics", nargs="*", default=[ "Leaderboard", "Aggr_Loss", "Aggr_BCELoss", "Aggr_Tie_Loss", "Aggr_Tie_Accuracy", "Aggr_Tie_bb_Accuracy", "Aggr_Tie_bb_Loss", "L1-Dist-Prob", "Spearman-lbs", "Kendall-lbs", "IQR-BT", "Std-BT", "Spread-BT", "Top-k-fraction", "Top-k-displace", ], ) parser.add_argument( "--rand_subset_sizes", nargs="*", default=[250, 500, 1000, 2000] ) parser.add_argument("--rand_num_samples", nargs="*", default=[50, 20, 5, 3]) parser.add_argument( "--rand_subset_metrics", nargs="*", default=["L1-Dist-Prob", "Spearman-lbs", "Kendall-lbs"], ) # gt is arena leaderboard parser.add_argument( "--aggr_scale_subset_sizes", nargs="*", default=[1, 10, 25, 100, 250, 500, 1000, 2000], ) parser.add_argument( "--aggr_scale_num_samples", nargs="*", default=[500, 500, 500, 200, 100, 40, 10, 6], ) parser.add_argument( "--aggr_scale_metrics", nargs="*", default=["L1-Dist-Prob", "Spearman-lbs", "Kendall-lbs"], ) parser.add_argument("--ground_truth", type=str, default="marginal-gt") parser.add_argument( "--metrics_to_inc", nargs="*", default=["simple", "category", "random_subsets", "aggr_scale"], ) parser.add_argument("--remove_last_hidden_json", default=True) args = parser.parse_args() start_time = time.time() for idx in range(len(args.output_file_name)): results = {} results["params"] = copy.deepcopy(vars(args)) train_model_list = parse_model_list(args.model_repo, args.model_list_path) eval_path = args.eval_path[idx] if args.eval_path else None checkpoint_path = args.checkpoint_path[idx] if args.checkpoint_path else None hf_checkpoint_file = ( args.hf_checkpoint_file[idx] if args.hf_checkpoint_file else None ) # make sure right params are dumped results["params"]["eval_path"] = eval_path results["params"]["checkpoint_path"] = checkpoint_path results["params"]["hf_checkpoint_file"] = hf_checkpoint_file val_data, val_model_list = parse_eval_output_data( args.model_repo, eval_path, checkpoint_path, args.hf_checkpoint_repo, hf_checkpoint_file, args.loss_type, train_model_list, args.remove_last_hidden_json, ) train_data = parse_train_data( args.hf_train_dataset, args.train_path, args.loss_type, train_model_list ) arena_data = parse_arena_data(args.arena_path) models = {} for category in args.categories: cat_val_data = filter_battle_data(val_data, category) cat_train_data = filter_battle_data(train_data, category) arena_rankings = get_arena_rankings(arena_data, category) current_model = str(args.model_type) + "-" + category models[current_model] = get_metrics( cat_val_data, cat_train_data, arena_rankings, val_model_list, train_model_list, args, ) # merely for marginal train checkpointing for checkpoint in args.train_checkpoints: num_data = checkpoint * args.checkpoint_size checkpoint_train_data = train_data.head(num_data) cat_train_data = filter_battle_data(checkpoint_train_data, category) models[current_model + f"-checkpoint-{checkpoint}"] = get_metrics( cat_val_data, cat_train_data, arena_rankings, val_model_list, train_model_list, args, ) results["models"] = models save_output( results, args.output_dir, args.hf_output_dir, args.output_file_name[idx] ) end_time = time.time() total_time = end_time - start_time minutes = int(total_time // 60) seconds = int(total_time % 60) print(f"\nTotal time taken: {minutes} minutes and {seconds} seconds") ================================================ FILE: p2l/dataset.py ================================================ from transformers import 
PreTrainedTokenizer from datasets import Dataset, DatasetDict, load_dataset, load_from_disk import torch from typing import List def get_model_list(dataset: Dataset): model_a_values = dataset.unique("model_a") model_b_values = dataset.unique("model_b") model_list_with_repeats = [] for value in model_a_values: model_list_with_repeats.append(value) for value in model_b_values: model_list_with_repeats.append(value) model_set = set(model_list_with_repeats) model_list = sorted(list(model_set)) return model_list def get_dataset(path: str, split: str, from_disk=False): if from_disk: dataset = load_from_disk(path) if isinstance(dataset, DatasetDict): dataset = dataset[split] return dataset else: return load_dataset(path, split=split) def _translate_label( labels: List[int], train_model_list: List[str], val_model_list: List[str] ) -> List[int]: label_copy = labels[:] label_copy[0] = train_model_list.index(val_model_list[labels[0]]) label_copy[1] = train_model_list.index(val_model_list[labels[1]]) return label_copy def translate_val_data( val_data: Dataset, train_model_list: List[str], val_model_list: List[str] ) -> Dataset: # Validate val models for val_model in val_model_list: assert val_model in train_model_list, val_model # Translate val dataset val_data = val_data.map( lambda labels: { "labels": _translate_label(labels, train_model_list, val_model_list) }, input_columns="labels", num_proc=16, ) return val_data class DataCollator: def __init__(self, tokenizer, max_length, weight=None, reweight_scale=None): self.tokenizer: PreTrainedTokenizer = tokenizer self.max_length: int = max_length self.weight: bool = weight self.reweight_scale: float = reweight_scale self.first = True def __call__(self, data): prompts = [] for seq in data: if isinstance(seq["prompt"], str): prompts.append([{"role": "user", "content": seq["prompt"]}]) else: prompts.append([{"role": "user", "content": turn} for turn in seq["prompt"]]) labels = torch.tensor([seq["labels"].tolist() for seq in data]) formatted_prompts = self.tokenizer.apply_chat_template( prompts, tokenize=False, add_generation_prompt=False, add_special_tokens=False, ) # Scrub any instances of cls token from the data, otherwise model will error. 
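        # (Context: a single cls_token is appended to the end of each formatted
        # prompt just below; P2LModel.forward() later selects the hidden state at
        # the position where input_ids == cls_token_id and feeds it to the
        # coefficient head, so exactly one CLS token per sequence is required.)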
formatted_prompts = [ prompt.replace(self.tokenizer.cls_token, "") for prompt in formatted_prompts ] formatted_prompts = [ seq + self.tokenizer.cls_token for seq in formatted_prompts ] if self.first: print(formatted_prompts) self.first = False encoded = self.tokenizer( formatted_prompts, padding=True, return_tensors="pt", add_special_tokens=False, truncation=True, max_length=self.max_length, ) out = { "input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"], "labels": labels, } if self.weight: if "weight" in data[0]: out["weights"] = torch.tensor([seq["weight"].tolist() for seq in data]) if self.reweight_scale: out["weights"] *= self.reweight_scale else: out["weights"] = None return out ================================================ FILE: p2l/endpoint.py ================================================ import argparse import json from typing import Dict, Tuple, List, Optional import torch import uvicorn from fastapi import FastAPI, Header, HTTPException from huggingface_hub import hf_hub_download from pydantic import BaseModel from transformers import ( AutoTokenizer, TextClassificationPipeline, pipeline, PreTrainedModel, ) from p2l.model import get_p2l_model, P2LOutputs from contextlib import asynccontextmanager import logging logging.getLogger().setLevel(logging.DEBUG) def parse_args(): parser = argparse.ArgumentParser(description="Run FastAPI with Uvicorn") parser.add_argument( "--model-path", "-m", type=str, default="p2el/Qwen2.5-7B-Instruct-rk-full-train", help="Path to the model repository", ) parser.add_argument( "--model-type", "-mt", type=str, default="qwen2", help="Type of the model", ) parser.add_argument( "--head-type", "-ht", type=str, default="rk", help="Type of model head", ) parser.add_argument( "--loss-type", "-lt", type=str, default="rk", help="Type of the loss function", ) parser.add_argument( "--api-key", "-a", type=str, default="-", help="API key for authorization", ) parser.add_argument( "--host", "-H", type=str, default="0.0.0.0", help="Host to run the server on", ) parser.add_argument( "--port", "-p", type=int, default=10250, help="Port to run the server on", ) parser.add_argument( "--reload", action=argparse.BooleanOptionalAction, default=True, help="Whether to reload the endpoint on detected code change, needs workers to be 1.", ) parser.add_argument( "--workers", type=int, default=1, help="Number of endpoint workers (will hold a model per worker).", ) parser.add_argument( "--cuda", action=argparse.BooleanOptionalAction, default=True, help="Flag to enable using a GPU to host the model. 
Flag is true by default.", ) args = parser.parse_args() return args @asynccontextmanager async def lifespan(app: FastAPI): args = parse_args() model, tokenizer, model_list = load_model( args.model_path, args.model_type, args.head_type, args.loss_type, ) pipe = pipeline( task="text-classification", model=model, tokenizer=tokenizer, device="cuda" if args.cuda else "cpu", pipeline_class=P2LPipeline, ) app.state.api_key = args.api_key app.state.model_list = model_list app.state.model = model app.state.tokenizer = tokenizer app.state.pipe = pipe try: yield finally: pass # Initialize FastAPI app app = FastAPI(lifespan=lifespan) # Define the input data structure class InputData(BaseModel): prompt: list[str] class OutputData(BaseModel): coefs: List[float] eta: Optional[float] = None class ModelList(BaseModel): models: List[str] class P2LPipeline(TextClassificationPipeline): def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, torch.Tensor]: return_tensors = self.framework inputs = inputs["prompt"] messages = [{"role": "user", "content": p} for p in inputs] formatted = self.tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=False, add_special_tokens=False, ) formatted = formatted + self.tokenizer.cls_token logging.debug(f"Formatted input: {formatted}") return self.tokenizer( formatted, return_tensors=return_tensors, max_length=8192, padding="longest", truncation=True, ) def postprocess( self, model_outputs: P2LOutputs, function_to_apply=None, top_k=1, _legacy=True ): model_outputs = P2LOutputs(model_outputs) eta = model_outputs.eta return OutputData( coefs=model_outputs.coefs.cpu().float().tolist()[0], eta=eta.cpu().float().item() if eta else None, ) @app.post("/predict") async def predict(input_data: InputData, api_key: str = Header(...)): logging.debug(f"Received Request: {input_data}.") if api_key != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") try: pipe: P2LPipeline = app.state.pipe logging.debug(f"Input Prompt: {input_data.prompt}") output = pipe(inputs=input_data.model_dump()) logging.debug(f"Output: {output}") return output except Exception as e: logging.debug(e) raise HTTPException(status_code=500, detail=str(e)) @app.get("/models") async def models(api_key: str = Header(...)): logging.debug(f"Received Model List Request.") if api_key != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") try: return ModelList( models=app.state.model_list, ) except Exception as e: raise HTTPException(status_code=500, detail=str(e)) def load_model( model_name, model_type, head_type, loss_type ) -> Tuple[PreTrainedModel, AutoTokenizer, List[str]]: # Download and load the model list fname = hf_hub_download( repo_id=model_name, filename="model_list.json", repo_type="model" ) with open(fname) as fin: model_list = json.load(fin) # Initialize tokenizer tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer.truncation_side = "left" tokenizer.padding_side = "right" # Get the model class and load the model model_cls = get_p2l_model(model_type, loss_type, head_type) model = model_cls.from_pretrained( model_name, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), torch_dtype=torch.bfloat16, ) return model, tokenizer, model_list if __name__ == "__main__": args = parse_args() uvicorn.run( "p2l.endpoint:app", port=args.port, host=args.host, reload=args.reload, workers=args.workers, ) ================================================ FILE: p2l/eval.py ================================================ import 
argparse from p2l.model import get_p2l_model, P2LOutputs from transformers import pipeline, TextClassificationPipeline, AutoTokenizer from huggingface_hub import hf_hub_download from datasets import load_dataset import torch from typing import Dict import pandas as pd import os import json from tqdm.auto import tqdm from torch.utils.data import Dataset from glob import glob class P2LPipeline(TextClassificationPipeline): def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, torch.Tensor]: return_tensors = self.framework messages = [{"role": "user", "content": inputs}] formatted = self.tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=False, add_special_tokens=False, ) formatted = formatted + self.tokenizer.cls_token return self.tokenizer( formatted, return_tensors=return_tensors, max_length=8192, padding="longest", truncation=True, ) def postprocess( self, model_outputs: P2LOutputs, function_to_apply=None, top_k=1, _legacy=True ): model_outputs = P2LOutputs(model_outputs) eta = model_outputs.eta gamma = model_outputs.gamma return dict( coefs=model_outputs.coefs.cpu().float().numpy(), eta=eta.cpu().float().numpy() if eta else None, gamma=gamma.cpu().float().numpy() if gamma else None, last_hidden_state=model_outputs.last_hidden_state.cpu().float().numpy(), ) class ListDataset(Dataset): def __init__(self, original_list): self.original_list = original_list def __len__(self): return len(self.original_list) def __getitem__(self, i): return self.original_list[i] def main(args, local_file=None): os.makedirs(args.output_dir, exist_ok=True) dataset = load_dataset(args.dataset, split=args.dataset_split) if local_file: fname = os.path.join(local_file, "model_list.json") else: fname = hf_hub_download( repo_id=args.model_path, filename="model_list.json", repo_type="model" ) with open(fname) as fin: model_list = json.load(fin) model_cls = get_p2l_model(args.model_type, args.loss_type, args.head_type) if local_file: tokenizer = AutoTokenizer.from_pretrained(local_file, local_files_only=True) model = model_cls.from_pretrained( local_file, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), torch_dtype=torch.bfloat16, local_files_only=True, ) else: tokenizer = AutoTokenizer.from_pretrained(args.model_path) model = model_cls.from_pretrained( args.model_path, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), torch_dtype=torch.bfloat16, ) device = "cuda" if torch.cuda.is_available() else "cpu" pipe = pipeline( task="text-classification", model=model, tokenizer=tokenizer, device=device, pipeline_class=P2LPipeline, ) prompts = ListDataset(dataset["prompt"]) with torch.no_grad(): outputs = [ out for out in tqdm( pipe(prompts, batch_size=args.batch_size), total=len(prompts) ) ] df = dataset.to_pandas() outputs_df = pd.DataFrame.from_records(outputs) if args.drop_hidden: outputs_df = outputs_df.drop("last_hidden_state", axis=1) df = pd.concat((df, outputs_df), axis=1) if local_file: fname = local_file.split("/")[-1] + ".json" else: fname = args.model_path.split("/")[-1] + ".json" fpath = os.path.join(args.output_dir, fname) df.to_json(fpath, orient="records", indent=4, force_ascii=False) if args.output_hf_path: from datasets import Dataset df = pd.read_json(fpath) hf_dataset = Dataset.from_pandas(df) hf_dataset.push_to_hub(args.output_hf_path, private=True) print("Results pushed to hub!") if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument( "--model-path", "-m", type=str, default=None, help="Huggingface model path" ) 
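    # Example invocation (illustrative only -- the model and dataset ids below are
    # placeholders; the exact invocation depends on how the repo root is put on
    # PYTHONPATH, since this module imports `p2l.model`):
    #
    #   PYTHONPATH=. python p2l/eval.py \
    #       -m p2el/Qwen2.5-7B-Instruct-rk-full-train \
    #       -d lmarena-ai/<prompt-dataset> -ds train \
    #       -mt qwen2 -ht rk -lt rk \
    #       -bs 8 -od outputs --drop-hidden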
parser.add_argument( "--training-output-dir", "-t", type=str, default=None ) parser.add_argument( "--dataset", "-d", type=str, required=True, help="Huggingface dataset path" ) parser.add_argument("--output-hf-path", "-oh", type=str, default=None) parser.add_argument( "--dataset-split", "-ds", type=str, default="train", help="Huggingface dataset split", ) parser.add_argument( "--model-type", "-mt", type=str, default="qwen2", help="Model type (qwen2, llama, etc)", ) parser.add_argument( "--head-type", "-ht", type=str, default="bt", help="Head type (Bradely Terry, Rao-Kupper, etc)", ) parser.add_argument( "--loss-type", "-lt", type=str, default="bt", help="Loss type (Bradely Terry, Rao-Kupper, etc)", ) parser.add_argument("--batch-size", "-bs", type=int, default=1, help="Batch size") parser.add_argument("--output-dir", "-od", type=str, default="outputs") parser.add_argument("--drop-hidden", action=argparse.BooleanOptionalAction, default=False) args = parser.parse_args() if args.training_output_dir: for file in glob(os.path.join(args.training_output_dir, "*")): main(args, file) else: main(args) ================================================ FILE: p2l/model.py ================================================ import torch from transformers import ( Qwen2Model, Qwen2PreTrainedModel, LlamaModel, LlamaPreTrainedModel, PreTrainedModel, AutoTokenizer, ) from transformers.utils import ModelOutput from dataclasses import dataclass import torch.nn as nn import torch.nn.functional as F from typing import Dict, Tuple, Callable, Optional registered_transformers: Dict[str, Tuple[PreTrainedModel, PreTrainedModel]] = { "qwen2": (Qwen2PreTrainedModel, Qwen2Model), "llama": (LlamaPreTrainedModel, LlamaModel), } registered_losses: Dict[str, Callable] = {} registered_heads: Dict[str, nn.Module] = {} registered_inits: Dict[str, Callable] = {} registered_aggr_models: Dict[str, nn.Module] = {} registered_pairwise_losses: Dict[str, Callable] = {} def register_loss(name: str): def decorator(func: Callable): registered_losses[name] = func return func return decorator def register_head(name: str): def decorator(func: Callable): registered_heads[name] = func return func return decorator def register_init(name: str): def decorator(func: Callable): registered_inits[name] = func return func return decorator def register_aggr_model(name: str): def decorator(func: Callable): registered_aggr_models[name] = func return func return decorator def register_pairwise_loss(name: str): def decorator(func: Callable): registered_pairwise_losses[name] = func return func return decorator def register_init(name: str): def decorator(func: Callable): registered_inits[name] = func return func return decorator @dataclass class HeadOutputs(ModelOutput): coefs: torch.FloatTensor = None eta: Optional[torch.FloatTensor] = None gamma: Optional[torch.FloatTensor] = None @dataclass class P2LOutputs(ModelOutput): coefs: torch.FloatTensor = None eta: Optional[torch.FloatTensor] = None gamma: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None @register_loss("bt") def BT_loss( head_output: HeadOutputs, labels: torch.Tensor, weights: torch.Tensor = None, **kwargs, ): # labels columns are in the form (winner_idx, loser_idx) coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = ( paired_coefs[:, 0] - paired_coefs[:, 1] ) # subtract winner bt from loser bt neg_log_sigma = -F.logsigmoid(paired_delta_logit) # get neg log prob if 
weights is not None: neg_log_sigma = neg_log_sigma * weights loss = neg_log_sigma.mean() return loss @register_loss("bt-tie") def BT_tie_loss( head_output: HeadOutputs, labels: torch.Tensor, weights: torch.Tensor = None, **kwargs, ): # labels columns are in the form (winner_idx, loser_idx, tie_indicator) coefs = head_output.coefs model_idx = labels[:, :2] # (batch_dim, 2) tie_ind = labels[:, -1] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = ( paired_coefs[:, 0] - paired_coefs[:, 1] ) # subtract winner bt from loser bt # computes bradley-terry loss where tie is half win and half loss neg_log_sigma = -1 * torch.where( tie_ind == 0, F.logsigmoid(paired_delta_logit), 0.5 * (F.logsigmoid(paired_delta_logit) + F.logsigmoid(-1 * paired_delta_logit)), ) if weights is not None: neg_log_sigma = neg_log_sigma * weights loss = neg_log_sigma.mean() return loss BETA = 0.1 @register_loss("rk") def RK_Loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): # labels columns are in form (winner_idx, loser_idx, tie_indicator) coefs = head_output.coefs # eta = torch.exp(head_output.eta).squeeze(-1) # eta > 0 eta = torch.clamp( torch.nn.functional.softplus(head_output.eta - 22.5, BETA).squeeze(-1), min=0.02 ) # eta = torch.abs(head_output.eta).squeeze(-1) model_idx = labels[:, :2] # (batch_dim, 2) paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] # compute RK probabilities p_w = torch.sigmoid(paired_delta_logit - eta) p_l = torch.sigmoid(-1 * paired_delta_logit - eta) p_t = 1 - p_w - p_l # point-wise likelihood A = torch.stack((p_w, p_t)) # (2, batch_dim) tie_ind = labels[:, -1].unsqueeze(0) # (1, batch_dim) p = A.take_along_dim(dim=0, indices=tie_ind) # mathematically p_t < 1 always but bfloat smh p = torch.clamp(p, min=1e-3) # eps = 1e-10 loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("rk-reparam") def RK_Reparam_Loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) p_win = pi_win / (pi_win + theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose assert p_win.shape == p_lose.shape == p_tie.shape P = torch.hstack((p_win, p_tie)) tie_ind = labels[:, -1].unsqueeze(-1) p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("ba") def BA_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): # labels are (winner_idx, loser_idx, tie_indicator (0 for no tie, 1 for tie, 2 for tie both bad)) coefs = head_output.coefs eta = head_output.eta gamma = head_output.gamma theta = torch.exp(eta) + 1.02 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = torch.exp(gamma) p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + 
pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb P = torch.hstack((p_win, p_tie, p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-2) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() print("loss: ", loss.item()) return loss @register_loss("bag") @register_loss("grk") def GRK_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): # labels are (winner_idx, loser_idx, tie_indicator (0 for no tie, 1 for tie, 2 for tie both bad)) coefs = head_output.coefs.float() eta = head_output.eta.float() theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = 1.0 p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb assert p_win.shape == p_lose.shape == p_tie_bb.shape == p_tie.shape P = torch.hstack((p_win, p_tie, p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() # print("loss: ", loss.item()) return loss @register_head("bt") class BTHead(nn.Module): def __init__( self, input_dim, output_dim, linear_head_downsize_factor=None, **kwargs ) -> None: super().__init__() if linear_head_downsize_factor: inner_dim = int(output_dim // linear_head_downsize_factor) self.head = nn.Sequential( nn.Linear(in_features=input_dim, out_features=inner_dim, bias=True), nn.Linear(in_features=inner_dim, out_features=output_dim, bias=True), ) else: self.head = nn.Linear( in_features=input_dim, out_features=output_dim, bias=True ) def forward(self, last_hidden_dim: torch.Tensor): coefs = self.head(last_hidden_dim) return HeadOutputs(coefs=coefs) @register_head("rk") class RKHead(nn.Module): def __init__( self, input_dim, output_dim, eta_dim=1, linear_head_downsize_factor=None, eta_downsize=False, **kwargs, ) -> None: super().__init__() # If linear header downsize factor and eta downsize, then eta is calculated off of the downsized dim, not the hidden dim. 
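        # Worked example (hypothetical sizes): with output_dim = 130 models and
        # linear_head_downsize_factor = 10, inner_dim = 130 // 10 = 13, so the
        # coefficient head becomes Linear(input_dim, 13) -> Linear(13, 130); the
        # shared first layer is reused by the eta head only when eta_downsize=True.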
if linear_head_downsize_factor: inner_dim = output_dim // linear_head_downsize_factor share_layer = nn.Linear( in_features=input_dim, out_features=inner_dim, bias=True ) self.head = nn.Sequential( share_layer, nn.Linear(in_features=inner_dim, out_features=output_dim, bias=True), ) if eta_downsize: self.eta_head = nn.Sequential( share_layer, nn.Linear(in_features=inner_dim, out_features=eta_dim, bias=True), ) else: self.eta_head = nn.Linear( in_features=output_dim, out_features=eta_dim, bias=True ) else: self.head = nn.Linear( in_features=input_dim, out_features=output_dim, bias=True ) self.eta_head = nn.Linear( in_features=input_dim, out_features=eta_dim, bias=True ) def forward(self, last_hidden_dim: torch.Tensor): coefs = self.head(last_hidden_dim) eta = self.eta_head(last_hidden_dim) return HeadOutputs(coefs=coefs, eta=eta) @register_head("ba") class BAHead(nn.Module): def __init__( self, input_dim, output_dim, linear_head_downsize_factor=None, **kwargs, ) -> None: super().__init__() if linear_head_downsize_factor: raise NotImplementedError("Sorry I didn't implement this.") self.head = nn.Linear(in_features=input_dim, out_features=output_dim, bias=True) self.eta_head = nn.Linear(in_features=input_dim, out_features=1, bias=True) self.gamma_head = nn.Linear(in_features=input_dim, out_features=1, bias=True) def forward(self, last_hidden_dim: torch.Tensor): coefs = self.head(last_hidden_dim) eta = self.eta_head(last_hidden_dim) gamma = self.gamma_head(last_hidden_dim) return HeadOutputs(coefs=coefs, eta=eta, gamma=gamma) @register_init("reset_params") def reset_params_init(module): return module.reset_parameters() @register_init("he_unif") def he_unif_init(module): return nn.init.kaiming_uniform_(module.weight, nonlinearity="sigmoid") @register_init("xavier_unif") def xavier_unif_init(module): return nn.init.xavier_uniform_(module.weight) @register_init("tiny_normal") def tiny_normal_init(module): return nn.init.kaiming_normal_(module.weight) def get_p2l_model( model_type: str, loss_type: str, head_type: str, init_type: str = "reset_params" ) -> PreTrainedModel: pretrained_model_cls, model_cls = registered_transformers[model_type] criterion = registered_losses[loss_type] head_layer = registered_heads[head_type] init_func = registered_inits[init_type] class CustomPretrainedModel(pretrained_model_cls): """Defines the appropriate pretrained class for the given model name. 
This is done so that the value head init scheme is correct.""" def _init_weights(self, module): std = self.config.initializer_range if isinstance(module, nn.Linear): init_func(module) # was reset params if module.bias is not None: module.bias.data.zero_() elif isinstance(module, nn.Embedding): module.weight.data.normal_(mean=0.0, std=std) if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() class P2LModel(CustomPretrainedModel): def __init__( self, config, CLS_id, num_models, linear_head_downsize_factor=None, head_kwargs={}, **kwargs, ): super().__init__(config) self.num_models = num_models self.cls_token_id = CLS_id self.model = model_cls(config) self.head = head_layer( input_dim=config.hidden_size, output_dim=self.num_models, linear_head_downsize_factor=linear_head_downsize_factor, **head_kwargs, ) self.post_init() def freeze_transformer(self): for param in self.model.parameters(): param.requires_grad = False def get_input_embeddings(self): return self.model.embed_tokens def set_input_embeddings(self, value): self.model.embed_tokens = value def forward(self, input_ids, attention_mask, labels=None, weights=None): batch_size = input_ids.shape[0] hidden_outputs = self.model( input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=False, ).last_hidden_state # (bs, num_token, embed_dim) cls_mask = input_ids == self.cls_token_id # double check this is getting the current CLS token cls_hidden_dim = hidden_outputs[cls_mask] assert ( cls_hidden_dim.shape[0] == batch_size ), f"input ids {input_ids.shape}, cls_mask {cls_mask.shape}, cls_logit {cls_hidden_dim.shape}" head_output = self.head(cls_hidden_dim) if labels is not None: loss = criterion(head_output, labels, weights=weights) outputs = P2LOutputs( coefs=head_output.coefs, last_hidden_state=cls_hidden_dim, eta=head_output.eta, gamma=head_output.gamma, loss=loss, ) else: outputs = P2LOutputs( coefs=head_output.coefs, last_hidden_state=cls_hidden_dim, eta=head_output.eta, gamma=head_output.gamma, ) return outputs return P2LModel def get_tokenizer( tokenizer_name, chat_template, pad_token_if_none="<|pad|>", cls_token_if_none="<|cls|>", ): tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) tokenizer.truncation_side = "left" tokenizer.padding_side = "right" if chat_template: tokenizer.chat_template = chat_template if "pad_token" not in tokenizer.special_tokens_map: tokenizer.add_special_tokens({"pad_token": pad_token_if_none}) if "cls_token" not in tokenizer.special_tokens_map: tokenizer.add_special_tokens({"cls_token": cls_token_if_none}) return tokenizer @register_aggr_model("bt") @register_aggr_model("bt-tie") class BTAggrModel(nn.Module): def __init__(self, num_models, batch_size=1): super().__init__() self.coefs = nn.Parameter( nn.init.constant_(torch.empty(batch_size, num_models), 0.5) ) self.eta = None def forward(self): return self.coefs, self.eta @register_aggr_model("rk") @register_aggr_model("rk-reparam") @register_aggr_model("bag") @register_aggr_model("grk") class RKAggrModel(nn.Module): def __init__(self, num_models, batch_size=1): super().__init__() self.coefs = nn.Parameter( nn.init.constant_(torch.empty(batch_size, num_models), 0.5) ) self.eta = nn.Parameter(nn.init.constant_(torch.empty(batch_size, 1), 0.1)) def forward(self): return self.coefs, self.eta @register_pairwise_loss("bt") @register_pairwise_loss("bt-tie") def pairwise_batch_BT_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor ): real_betas = real_output.coefs aggregated_betas = 
aggregated_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] pred_probs = torch.sigmoid(beta_i_agg - beta_j_agg) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1) eps = 1e-9 neg_log_prob = -( true_probs * torch.log(pred_probs_expanded + eps) + (1 - true_probs) * torch.log(1 - pred_probs_expanded + eps) ) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss # batched loss @register_pairwise_loss("rk") def pairwise_batch_RK_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor ): real_betas = real_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] aggregated_betas = aggregated_output.coefs BETA = 0.1 aggregated_eta = torch.clamp( torch.nn.functional.softplus(aggregated_output.eta - 22.5, BETA).squeeze(-1), min=0.02, ) pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] aggregated_eta = aggregated_eta.unsqueeze(-1) pred_probs_win = torch.sigmoid(beta_i_agg - beta_j_agg - aggregated_eta) pred_probs_loss = torch.sigmoid(beta_j_agg - beta_i_agg - aggregated_eta) pred_probs_tie = 1 - pred_probs_win - pred_probs_loss pred_probs = torch.stack((pred_probs_win, pred_probs_loss, pred_probs_tie), dim=-1) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1, -1) eps = 1e-9 neg_log_prob = -torch.sum(true_probs * torch.log(pred_probs_expanded + eps), dim=-1) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss # batched @register_pairwise_loss("rk-reparam") def pairwise_batch_RK_reparam_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor, **kwargs, ): real_betas = real_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] aggregated_betas = aggregated_output.coefs aggregrated_theta = torch.exp(aggregated_output.eta) + 1.000001 pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] pi_win = torch.exp(beta_i_agg) pi_lose = torch.exp(beta_j_agg) p_win = pi_win / (pi_win + aggregrated_theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + aggregrated_theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose pred_probs = torch.stack((p_win, p_lose, p_tie), dim=-1) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1, -1) eps = 1e-9 neg_log_prob = -torch.sum(true_probs * torch.log(pred_probs_expanded + eps), dim=-1) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss def get_bag_probs(beta_win, beta_lose, gamma, theta): pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = 1.0 p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb return torch.stack((p_win, p_lose, p_tie, p_tie_bb), dim=-1) # batched @register_pairwise_loss("bag") @register_pairwise_loss("grk") def 
pairwise_batch_bag_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor, **kwargs, ): real_betas = real_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] aggregated_betas = aggregated_output.coefs aggregrated_theta = torch.exp(aggregated_output.eta) + 1.000001 pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] pred_probs = get_bag_probs(beta_i_agg, beta_j_agg, 1.0, aggregrated_theta) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1, -1) eps = 1e-9 neg_log_prob = -torch.sum(true_probs * torch.log(pred_probs_expanded + eps), dim=-1) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss @register_loss("tie-rk") def RK_Tie_Loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = torch.clamp( torch.nn.functional.softplus(head_output.eta - 22.5, BETA).squeeze(-1), min=0.02 ) model_idx = labels[:, :2] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] p_w = torch.sigmoid(paired_delta_logit - eta) p_l = torch.sigmoid(-1 * paired_delta_logit - eta) p_t = 1 - p_w - p_l p_not_t = p_w + p_l p_t = p_t A = torch.stack((p_not_t, p_t)) tie_ind = labels[:, -1].unsqueeze(0) p = A.take_along_dim(dim=0, indices=tie_ind) p = torch.clamp(p, min=1e-3) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("tie-bag") @register_loss("tie-grk") def bag_tie_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() p_win, p_lose, p_tie, p_tie_bb = torch.unbind( get_bag_probs(beta_win, beta_lose, 1.0, theta), dim=-1 ) P = torch.hstack((p_win + p_lose, p_tie + p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) tie_ind = torch.where(tie_ind == 0, 0, 1) # segment into ties and not ties p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("tie-bb-bag") @register_loss("tie-bb-grk") def bag_tie_bb_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() p_win, p_lose, p_tie, p_tie_bb = torch.unbind( get_bag_probs(beta_win, beta_lose, 1.0, theta), dim=-1 ) P = torch.hstack((p_win + p_lose + p_tie, p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) tie_ind = torch.where(tie_ind == 2, 1, 0) # index should be 1 if tie-bb p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss ================================================ FILE: p2l/train.py ================================================ import argparse import os import yaml import json import random 
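# Example launch (illustrative only -- the launcher, config choice, and values are
# placeholders, not verified defaults; assumes the repo root is on PYTHONPATH since
# this module imports `p2l.dataset` and `p2l.model`):
#
#   deepspeed p2l/train.py --config training_configs/Qwen2.5-7B-full-train.yaml --save-steps 500
#
# The config YAML must define at least the required keys read below; a hypothetical sketch:
#
#   pretrain_model_name: Qwen/Qwen2.5-7B-Instruct   # hypothetical base model
#   train_data_path: <hf-or-local-train-dataset>
#   val_data_path: <hf-or-local-val-dataset>
#   output_dir: ./models
#   learning_rate: 8.0e-6                           # placeholder value
#   batch_size: 1
#   gradient_accumulation_steps: 8
#   max_length: 8192
#   adam_epsilon: 1.0e-8
#   deepspeed_config_path: deepspeed/zero1.json
#   model_type: qwen2
#   loss_type: rk
#   head_type: rk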
from transformers import Trainer, TrainingArguments, set_seed from p2l.dataset import DataCollator, get_model_list, get_dataset, translate_val_data from p2l.model import get_p2l_model, get_tokenizer from torch.utils.data import Sampler from typing import Optional from huggingface_hub import HfApi # Want control over data ordering, use no shuffle trainer. class NoShuffleTrainer(Trainer): def _get_train_sampler(self) -> Optional[Sampler]: return None def train_model(args): with open(args.config, "r") as file: config = yaml.safe_load(file) learning_rate = config["learning_rate"] # Microbatch size batch_size = config["batch_size"] # HF data path train_data_path = config["train_data_path"] val_data_path = config["val_data_path"] output_dir = config["output_dir"] pretrain_model_name = config["pretrain_model_name"] # Prompts will be truncted to this length max_length = config["max_length"] gradient_accumulation_steps = config["gradient_accumulation_steps"] # Deepspeed config choices can be found in the deepspeed directory deepspeed_config_path = config["deepspeed_config_path"] # Type of transformer, see model.py for options. model_type = config["model_type"] # Loss type (e.g, bt, rk), see model.py for options. loss_type = config["loss_type"] # The linear head type, see model.py for options. head_type = config["head_type"] # Epsilon value for Adam adam_epsilon = config["adam_epsilon"] # Optional epochs = config.get("num_train_epochs", 1) lr_scheduler = config.get("lr_schedule", "constant") chat_template = config.get("chat_template", None) # Downsize the rank of the classification head. linear_head_downsize_factor = config.get("linear_head_downsize_factor", None) # Whether to weight the loss. If this is true, it expects that the dataset has a "weight" column. weighted_loss = config.get("weighted_loss", False) # kwargs for the head init. head_config = config.get("head_config", {}) # If the tokenizer/model does not already have a cls token, this will be used. cls_token_if_none = config.get("cls_token_if_none", "<|cls|>") # If the tokenizer/model does not already have a pad token, this will be used. 
pad_token_if_none = config.get("pad_token_if_none", "<|pad|>") # If using weighted loss, scalar reweight factor reweight_scale = config.get("reweight_scale", None) proj_name = config.get("proj_name", None) init_type = config.get("init_type", "reset_params") train_head_only = config.get("train_head_only", False) load_train_data_from_disk = config.get("load_train_data_from_disk", False) load_val_data_from_disk = config.get("load_val_data_from_disk", False) LOCAL_RANK = int(os.environ.get("LOCAL_RANK", -1)) os.makedirs(output_dir, exist_ok=True) # define project name if not proj_name: proj_name = f"{pretrain_model_name.split('/')[1]}_lr{learning_rate}_bs{batch_size}_ep{epochs}" print(f"project name: {proj_name}") output_path = os.path.join(output_dir, proj_name) if args.checkpoint: resume_from_checkpoint = args.checkpoint print("resuming from checkpoint") else: resume_from_checkpoint = False if not resume_from_checkpoint: version = 1 while os.path.exists(output_path): output_path = output_path.replace(f"_{version - 1}", "") output_path = output_path + f"_{version}" version += 1 with open(deepspeed_config_path) as fin: deepspeed_config = json.load(fin) random.seed(42) set_seed(42) training_args = TrainingArguments( output_dir=output_path, report_to="wandb", run_name=proj_name, num_train_epochs=epochs, gradient_accumulation_steps=gradient_accumulation_steps, save_strategy="no" if args.save_steps == -1 else "steps", save_steps=None if args.save_steps == -1 else args.save_steps, save_only_model=True, eval_strategy="no", logging_strategy="steps", logging_steps=1, ddp_timeout=9999999, per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, eval_accumulation_steps=1, eval_steps=args.eval_steps, lr_scheduler_type=lr_scheduler, logging_dir="./logs", fp16=False, bf16=True, learning_rate=learning_rate, adam_epsilon=adam_epsilon, load_best_model_at_end=False, gradient_checkpointing=True, do_train=True, bf16_full_eval=True, save_safetensors=True, disable_tqdm=False, remove_unused_columns=False, deepspeed=deepspeed_config, seed=42, data_seed=42, local_rank=LOCAL_RANK, ) tokenizer = get_tokenizer( pretrain_model_name, chat_template, pad_token_if_none=pad_token_if_none, cls_token_if_none=cls_token_if_none, ) data_collator = DataCollator( tokenizer, max_length, weight=weighted_loss, reweight_scale=reweight_scale ) train_data = get_dataset( train_data_path, "train", from_disk=load_train_data_from_disk ) if not args.no_eval: val_data = get_dataset(val_data_path, "train", from_disk=load_val_data_from_disk) # with training_args.main_process_first(): model_list = get_model_list(train_data) if not args.no_eval: val_model_list = get_model_list(val_data) if model_list != val_model_list: print("WARNING: Val model list is different, translating...") val_data = translate_val_data(val_data, model_list, val_model_list) if LOCAL_RANK <= 0: # Document the configuration in the output path. os.makedirs(output_path, exist_ok=False) with open(os.path.join(output_path, "training_config.json"), "w") as fout: json.dump(config, fout, indent=1) # Save the model list so we know which models this model was trained on. The model list is ALWAYS sorted alphabetically. 
with open(os.path.join(output_path, "model_list.json"), "w") as fout: json.dump(model_list, fout, indent=1) # Get the model class model_cls = get_p2l_model( model_type=model_type, loss_type=loss_type, head_type=head_type, init_type=init_type, ) if resume_from_checkpoint: print(f"Loading model from checkpoint: {resume_from_checkpoint}") model = model_cls.from_pretrained( resume_from_checkpoint, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), linear_head_downsize_factor=linear_head_downsize_factor, ) else: model = model_cls.from_pretrained( pretrain_model_name, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), linear_head_downsize_factor=linear_head_downsize_factor, ) if model.config.vocab_size < len(tokenizer): print("WARNING: Resizing Token Embedding") model.resize_token_embeddings(len(tokenizer)) if train_head_only: print("Freezing transformer, only training head.") model.freeze_transformer() trainer = NoShuffleTrainer( model=model, args=training_args, train_dataset=train_data.with_format("torch"), # eval_dataset=val_data.with_format("torch"), data_collator=data_collator, ) print("begin training") trainer.train(resume_from_checkpoint=resume_from_checkpoint) trainer.save_model(output_path) tokenizer.save_pretrained(output_path) print("saved model and tokenizer") if not args.no_eval: print("starting eval") eval_results = trainer.predict(val_data.with_format("torch")) eval_metrics = eval_results.metrics eval_predictions = eval_results.predictions print(f"Evaluation Results: {eval_metrics}") val_set = val_data.add_column("betas", list(eval_predictions[0])) if LOCAL_RANK <= 0: with open(os.path.join(output_path, "eval_results.json"), "w") as fout: json.dump(eval_metrics, fout, indent=1) val_dir = os.path.join(output_path, "eval_output.jsonl") val_set.to_json(val_dir) print(f"saved merged eval results") if LOCAL_RANK <= 0: if args.push_to_hf: api = HfApi() repo_id = config.get("repo_id", f"p2el/{proj_name}") assert not api.repo_exists( repo_id=repo_id, repo_type="model" ), "repo already exists" api.create_repo(repo_id=repo_id, private=True, repo_type="model") api.upload_folder( folder_path=output_path, repo_id=repo_id, repo_type="model", ) print("pushed to hub") if __name__ == "__main__": parser = argparse.ArgumentParser(description="Argument Parser") parser.add_argument( "--config", type=str, help="path to config file for model training" ) parser.add_argument( "--checkpoint", type=str, help="path to checkpoint directory to resume training from", default=None, ) parser.add_argument( "--push-to-hf", action="store_true", help="True if push directly to huggingface", ) parser.add_argument( "--eval-steps", type=int, default=60, help="Number of steps between evaluation." 
) parser.add_argument( "--local_rank", type=int, default=-1, help="Local rank passed by DeepSpeed" ) parser.add_argument( "--no-eval", action="store_true", help="If flagged eval will not end at end of training loop.", ) parser.add_argument("--save-steps", type=int, default=-1) args = parser.parse_args() train_model(args) ================================================ FILE: probe_barrier.py ================================================ # probe_barrier.py import os, sys, time, datetime, argparse import torch import torch.distributed as dist import torch.multiprocessing as mp def log(msg: str, rank: int): """timestamped, unbuffered print""" print(f"[{rank}|{time.time():.3f}] {msg}", flush=True) def worker(rank: int, world_size: int, backend: str): # ─── mandatory NCCL housekeeping ──────────────────────────── os.environ.setdefault("MASTER_ADDR", "127.0.0.1") os.environ.setdefault("MASTER_PORT", "29501") os.environ["RANK"] = str(rank) os.environ["WORLD_SIZE"] = str(world_size) if backend == "nccl": torch.cuda.set_device(rank) # 1 GPU per rank # ──────────────────────────────────────────────────────────── dist.init_process_group( backend = backend, rank = rank, world_size = world_size, timeout = datetime.timedelta(seconds=30) # fail fast ) log("reached barrier()", rank) dist.barrier() log("*** passed barrier()", rank) # Try another collective just to be sure tensor = torch.tensor([rank], device="cuda" if backend == "nccl" else "cpu") dist.all_reduce(tensor) log(f"all_reduce ok, value={tensor.item()}", rank) dist.destroy_process_group() def main(): parser = argparse.ArgumentParser() parser.add_argument("--nprocs", type=int, default=2) parser.add_argument("--backend", choices=["gloo", "nccl"], default="gloo") args = parser.parse_args() mp.spawn( worker, args=(args.nprocs, args.backend), nprocs=args.nprocs, join=True ) if __name__ == "__main__": # Completely unbuffered stdout/stderr os.environ["PYTHONUNBUFFERED"] = "1" main() ================================================ FILE: route/chat.py ================================================ from typing import List, Dict, Iterator, Tuple import openai.resources from abc import ABC, abstractmethod import openai from openai import OpenAI import anthropic from route.utils import get_registry_decorator import time from route.datatypes import ( Roles, ChatMessage, ChatCompletionResponse, Choice, ChatMessageDelta, ChoiceDelta, ChatCompletionResponseChunk, RouterOutput, ModelConfig, ) import logging from openai.types.chat.chat_completion_chunk import ChatCompletionChunk from openai.types.chat.chat_completion import ChatCompletion from anthropic.lib.streaming import MessageStream from anthropic.types.message_start_event import MessageStartEvent from uuid import uuid4 class BaseChatHandler(ABC): @staticmethod @abstractmethod def _create_client(model_config: ModelConfig): pass @staticmethod @abstractmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> List[ChatMessage]: pass @staticmethod @abstractmethod def generate( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: pass @staticmethod @abstractmethod def generate_stream( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: pass CHAT_HANDLERS: Dict[str, BaseChatHandler] = {} register = get_registry_decorator(CHAT_HANDLERS) @register("openai") class 
OpenAIChatHandler(BaseChatHandler): @staticmethod def _create_client(model_config: ModelConfig): api_key = model_config.get_api_key() base_url = model_config.get_base_url() if api_key or base_url: client = openai.OpenAI( base_url=base_url, api_key=api_key, ) else: client = openai.OpenAI() return client @staticmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> List[ChatMessage]: system_prompt = model_config.get_system_prompt() if system_prompt != None and messages[0].role != Roles.SYSTEM.value: system_message = ChatMessage( role=Roles.SYSTEM.value, content=system_prompt, ) messages = [system_message] + messages return messages @staticmethod def _create_completion( client: OpenAI, model_config: ModelConfig, messages: List[ChatMessage], temp: float | None, top_p: float | None, max_tokens: int | None, stream=False, ) -> ChatCompletion | Iterator[ChatCompletionChunk]: completion = client.chat.completions.create( model=model_config.get_name(), messages=messages, temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() if not top_p else top_p, max_tokens=( model_config.get_max_tokens(default=openai.NOT_GIVEN) if not max_tokens else max_tokens ), stream=stream, ) return completion @classmethod def generate( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: model_config = router_output.chosen_model_config client = cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) completion: ChatCompletion = cls._create_completion( client=client, model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=False, ) logging.info(f"{int(time.time())} Chosen Model Completion: {completion}") chat_completion = ChatCompletionResponse( id=str(completion.id), object="chat.completion", created=completion.created, model=completion.model, choices=[ Choice( index=choice.index, message=ChatMessage( role=choice.message.role, content=choice.message.content, model=router_output.chosen_model_name, ), finish_reason=choice.finish_reason, ) for choice in completion.choices ], usage=completion.usage, router_outputs=router_output.model_scores, ) return chat_completion def _skip(chunk: ChatCompletionChunk) -> bool: try: content = chunk.choices[0].delta.content return content == "" or content == None except Exception as e: return True @classmethod def generate_stream( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config client = cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) chunks: Iterator[ChatCompletionChunk] = cls._create_completion( client=client, model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=True, ) first_chunk = True logging_content = "" for chunk in chunks: if cls._skip(chunk): continue logging_content += chunk.choices[0].delta.content out_chunk = ChatCompletionResponseChunk( id=str(chunk.id), object="chat.completion.chunk", created=chunk.created, model=chunk.model, choices=[ ChoiceDelta( index=choice.index, delta=ChatMessageDelta( role=choice.delta.role, content=choice.delta.content, model=router_output.chosen_model_name, ), ) for 
choice in chunk.choices ], usage=chunk.usage, router_outputs=router_output.model_scores if first_chunk else None, ).model_dump_json() yield f"data: {out_chunk}\n\n" first_chunk = False logging.info( f"{int(time.time())} Chat Output (OpenAI Client): {logging_content}" ) yield "data: [DONE]\n\n" @register("openai-reasoning") class OpenaiReasoningChatHandler(OpenAIChatHandler): @staticmethod def _create_completion( client: OpenAI, model_config: ModelConfig, messages: List[ChatMessage], temp: float | None, top_p: float | None, max_tokens: int | None, stream=False, ) -> ChatCompletion | Iterator[ChatCompletionChunk]: extra_field = model_config.get_extra_fields() # No max tokens argument completion = client.chat.completions.create( model=model_config.get_name(), messages=messages, stream=stream, reasoning_effort=extra_field.get("reasoning_effort", openai.NOT_GIVEN), ) return completion @register("openai-o1") class OpenaiO1ChatHandler(OpenaiReasoningChatHandler): @classmethod def generate_stream( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config client = cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) chunk: ChatCompletion = cls._create_completion( client=client, model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=False, ) out_chunk = ChatCompletionResponseChunk( id=str(chunk.id), object="chat.completion.chunk", created=chunk.created, model=chunk.model, choices=[ ChoiceDelta( index=choice.index, delta=ChatMessageDelta( role=choice.message.role, content=choice.message.content, model=router_output.chosen_model_name, ), ) for choice in chunk.choices ], usage=chunk.usage, router_outputs=router_output.model_scores, ).model_dump_json() yield f"data: {out_chunk}\n\n" logging.info( f"{int(time.time())} Chat Output (OpenAI O1 Client): {chunk.choices[0].message.content}" ) yield "data: [DONE]\n\n" @register("anthropic") class AnthropicChatHandler(BaseChatHandler): @staticmethod def _create_client(model_config: ModelConfig): client = anthropic.Anthropic(api_key=model_config.get_api_key()) return client @staticmethod @abstractmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> Tuple[List[ChatMessage], str | anthropic.NotGiven]: system_message = model_config.get_system_prompt(default=anthropic.NOT_GIVEN) if system_message == None: system_message = anthropic.NOT_GIVEN if messages[0].role == Roles.SYSTEM.value: system_message = messages[0].content messages = messages[1:] return messages, system_message @staticmethod def generate( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: model_config = router_output.chosen_model_config client = AnthropicChatHandler._create_client(model_config=model_config) messages, system_message = AnthropicChatHandler._handle_system_prompt( messages=messages, model_config=model_config ) completion = client.messages.create( model=model_config.get_name(), messages=messages, stop_sequences=[anthropic.HUMAN_PROMPT], temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() if not top_p else top_p, max_tokens=model_config.get_max_tokens() if not max_tokens else max_tokens, system=system_message, ) chat_completion = 
ChatCompletionResponse( id=completion.id, object="chat.completion", created=int(time.time()), model=completion.model, choices=[ Choice( index=i, message=ChatMessage( role=completion.role, content=content.text, model=router_output.chosen_model_name, ), finish_reason=completion.stop_reason, ) for i, content in enumerate(completion.content) ], usage=completion.usage, router_outputs=router_output.model_scores, ) return chat_completion @staticmethod def generate_stream( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config client = AnthropicChatHandler._create_client(model_config=model_config) messages, system_message = AnthropicChatHandler._handle_system_prompt( messages=messages, model_config=model_config ) with client.messages.stream( model=model_config.get_name(), messages=messages, stop_sequences=[anthropic.HUMAN_PROMPT], temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() if not top_p else top_p, max_tokens=model_config.get_max_tokens() if not max_tokens else max_tokens, system=system_message, ) as _stream: stream: MessageStream = _stream # This contains the metadata message_start: MessageStartEvent = next(stream) resp_id = message_start.message.id model = message_start.message.model role = message_start.message.role # Ignore this useless chunk. next(stream) first_chunk = True logging_content = "" for text in stream.text_stream: logging_content += text out_chunk = ChatCompletionResponseChunk( id=resp_id, created=int(time.time()), model=model, object="chat.completion.chunk", choices=[ ChoiceDelta( delta=ChatMessageDelta( content=text, role=role, model=router_output.chosen_model_name, ), index=0, ) ], router_outputs=router_output.model_scores if first_chunk else None, ).model_dump_json() yield f"data: {out_chunk}\n\n" first_chunk = False logging.info( f"{int(time.time())} Chat Output (Anthropic Client): {logging_content}" ) yield "data: [DONE]\n\n" import google.generativeai as genai from google.generativeai.types.generation_types import GenerateContentResponse @register("gemini") class GeminiChatHandler(BaseChatHandler): safety_settings = [ {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}, {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"}, {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"}, ] @staticmethod def _create_client(model_config: ModelConfig): api_key = model_config.get_api_key() if api_key: genai.configure(api_key=api_key) @staticmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> List[ChatMessage]: system_prompt = model_config.get_system_prompt() if system_prompt != None and messages[0].role != Roles.SYSTEM.value: system_message = ChatMessage( role=Roles.SYSTEM.value, content=system_prompt, ) messages = [system_message] + messages return messages @staticmethod def _create_completion( model_config: ModelConfig, messages: List[ChatMessage], temp: float | None, top_p: float | None, max_tokens: int | None, stream=False, ) -> GenerateContentResponse | Iterator[GenerateContentResponse]: generation_config = genai.GenerationConfig( max_output_tokens=model_config.get_max_tokens(default=8192) if not max_tokens else max_tokens, temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() 
if not top_p else top_p, top_k=model_config.get_top_k(), ) history = [] system_prompt = None for message in messages[:-1]: if message.role == Roles.SYSTEM.value: system_prompt = message.content elif message.role == Roles.ASSISTANT.value: history.append({"role": "model", "parts": message.content}) else: history.append({"role": "user", "parts": message.content}) model = genai.GenerativeModel( model_name=model_config.get_name(), system_instruction=system_prompt, generation_config=generation_config, safety_settings=GeminiChatHandler.safety_settings, ) chat_session = model.start_chat(history=history) completion = chat_session.send_message( content=messages[-1].content, stream=stream ) return completion @classmethod def generate( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: model_config = router_output.chosen_model_config cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) completion: GenerateContentResponse = cls._create_completion( model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=False, ) logging.info(f"{int(time.time())} Chosen Model Completion: {completion}") chat_completion = ChatCompletionResponse( id=str(uuid4()), object="chat.completion", created=int(time.time()), model=model_config.get_name(), choices=[ Choice( index=0, message=ChatMessage( role=Roles.ASSISTANT.value, content=completion.text, model=router_output.chosen_model_name, ), finish_reason="STOP", ) ], router_outputs=router_output.model_scores, ) return chat_completion @classmethod def generate_stream( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) chunks: Iterator[GenerateContentResponse] = cls._create_completion( model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=True, ) first_chunk = True chat_id = str(uuid4()) logging_content = "" for chunk in chunks: logging_content += chunk.text out_chunk = ChatCompletionResponseChunk( id=chat_id, object="chat.completion.chunk", created=int(time.time()), model=model_config.get_name(), choices=[ ChoiceDelta( index=0, delta=ChatMessageDelta( role=Roles.ASSISTANT.value, content=chunk.text, model=router_output.chosen_model_name, ), ) ], router_outputs=router_output.model_scores if first_chunk else None, ).model_dump_json() yield f"data: {out_chunk}\n\n" first_chunk = False logging.info( f"{int(time.time())} Chat Output (Gemini Client): {logging_content}" ) yield "data: [DONE]\n\n" ================================================ FILE: route/cost_optimizers.py ================================================ from abc import ABC, abstractmethod from route.utils import get_registry_decorator from typing import List, Dict import numpy as np import cvxpy as cp from scipy.special import expit class UnfulfillableException(Exception): pass class BaseCostOptimizer(ABC): def __init__(self): super().__init__() @staticmethod @abstractmethod def select_model( cost: float, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], **kwargs, ) -> str: pass @staticmethod def 
select_max_score_model( model_list: List[str], model_scores: np.ndarray[float] ) -> str: max_idx = np.argmax(model_scores) return model_list[max_idx] COST_OPTIMIZERS: Dict[str, BaseCostOptimizer] = {} register = get_registry_decorator(COST_OPTIMIZERS) @register("strict") class StrictCostOptimizer(BaseCostOptimizer): def __init__(self): super().__init__() @staticmethod def select_model( cost: float | None, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], **kwargs, ) -> str: if cost == None: return StrictCostOptimizer.select_max_score_model(model_list, model_scores) best_model: str | None = None best_score = -float("inf") for model, model_cost, model_score in zip( model_list, model_costs, model_scores ): if model_cost > cost: continue elif model_score > best_score: best_model = model best_score = model_score if best_model is None: raise UnfulfillableException( f"Cost of {cost} impossible to fulfill with available models {model_list} with costs {model_costs}." ) return best_model @register("simple-lp") class SimpleLPCostOptimizer(BaseCostOptimizer): def __init__(self): super().__init__() @staticmethod def select_model( cost: float | None, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], **kwargs, ) -> str: if cost == None: return StrictCostOptimizer.select_max_score_model(model_list, model_scores) p = cp.Variable(len(model_costs)) prob = cp.Problem( cp.Maximize(cp.sum(model_scores @ p)), [model_costs.T @ p <= cost, cp.sum(p) == 1, p >= 0], ) status = prob.solve() if status < 0.0: raise UnfulfillableException( f"Cost of {cost} impossible to fulfill with available models {model_list} with costs {model_costs}." ) ps = np.clip(p.value, a_min=0.0, a_max=1.0) ps = ps / ps.sum() return np.random.choice(model_list, p=ps) @register("optimal-lp") class OptimalLPCostOptimizer(BaseCostOptimizer): def __init__(self): super().__init__() @staticmethod def select_model( cost: float | None, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], opponent_scores: np.ndarray[float] = None, opponent_distribution: np.ndarray[float] = None, ) -> str: if cost == None: return StrictCostOptimizer.select_max_score_model(model_list, model_scores) W = OptimalLPCostOptimizer._construct_W(model_scores, opponent_scores) Wq = W @ opponent_distribution p = cp.Variable(len(model_costs)) prob = cp.Problem( cp.Maximize(p @ Wq), [model_costs.T @ p <= cost, cp.sum(p) == 1, p >= 0] ) status = prob.solve() if status < 0.0: raise UnfulfillableException( f"Cost of {cost} impossible to fulfill with available models {model_list} with costs {model_costs}." 
) ps = np.clip(p.value, a_min=0.0, a_max=1.0) ps = ps / ps.sum() return np.random.choice(model_list, p=ps) @staticmethod def _construct_W( router_model_scores: np.ndarray[float], opponent_model_scores: np.ndarray[float] ) -> np.ndarray[float]: num_rows = router_model_scores.shape[-1] num_cols = opponent_model_scores.shape[-1] chosen = np.tile(router_model_scores, (num_cols, 1)).T rejected = np.tile(opponent_model_scores, (num_rows, 1)) assert chosen.shape == rejected.shape, (chosen.shape, rejected.shape) diff_matrix = chosen - rejected W = expit(diff_matrix) return W ================================================ FILE: route/datatypes.py ================================================ from typing import Dict, List, Any, Optional from dataclasses import dataclass from pydantic import BaseModel from enum import Enum class ModelConfig: def __init__(self, config: Dict[str, Any]): self.config = config def get_name(self) -> str: return self.config["name"] def get_temp(self) -> float: return self.config["temp"] def get_top_p(self) -> float: return self.config["top_p"] def get_top_k(self, default=None) -> int: return self.config.get("top_k", default) def get_system_prompt(self, default=None) -> str | None | Any: return self.config.get("system_prompt", default) def get_api_key(self, default=None) -> str | None | Any: return self.config.get("api_key", default) def get_base_url(self, default=None) -> str | None | Any: return self.config.get("base_url", default) def get_type(self) -> str: return self.config["type"] def get_cost(self) -> float: return self.config["cost"] def get_max_tokens(self, default=None) -> int | None | Any: return self.config.get("max_tokens", default) def get_extra_fields(self) -> Dict: return self.config.get("extra_fields", {}) # Maybe should be None... def __repr__(self): return repr( dict( name=self.get_name(), type=self.get_type(), cost=self.get_cost(), ) ) class ModelConfigContainer: def __init__(self, model_config_dicts: Dict[str, Dict[str, Any]]): self.model_configs: Dict[str, ModelConfig] = dict( (name, ModelConfig(config)) for name, config in model_config_dicts.items() ) def get_model_config(self, model_name: str) -> ModelConfig: return self.model_configs[model_name] def list_models(self) -> List[str]: return list(self.model_configs.keys()) def list_costs(self) -> List[float]: costs: List[float] = [] for model_name in self.list_models(): model_config = self.get_model_config(model_name) costs.append(model_config.get_cost()) return costs def __repr__(self): return repr(self.model_configs) class Roles(Enum): USER = "user" ASSISTANT = "assistant" SYSTEM = "system" class ChatMessage(BaseModel): """ Represents a single message in the conversation. role: "system", "user", or "assistant" content: the actual text """ role: str content: str model: Optional[str] = None class ChatCompletionRequest(BaseModel): """ Request body for Chat Completion. """ model: str messages: List[ChatMessage] max_tokens: Optional[int] = None temperature: Optional[float] = None top_p: Optional[float] = None n: Optional[int] = 1 stream: Optional[bool] = False stop: Optional[List[str]] = None cost: Optional[float] = None direct_model: Optional[str] = None class Choice(BaseModel): """ Represents a single choice in the final response (non-streaming mode). """ index: int message: ChatMessage finish_reason: str class ChatCompletionResponse(BaseModel): """ Response model for non-streaming mode. 
""" id: str object: str created: int model: str choices: List[Choice] usage: Optional[BaseModel] = None router_outputs: Optional[Dict[str, float]] = None class ChatMessageDelta(BaseModel): content: Optional[str] = None role: Optional[str] = None model: Optional[str] = None class ChoiceDelta(BaseModel): delta: ChatMessageDelta finish_reason: Optional[str] = None index: int class ChatCompletionResponseChunk(BaseModel): id: str choices: List[ChoiceDelta] created: int model: str object: str usage: Optional[BaseModel] = None router_outputs: Optional[Dict[str, float]] = None @dataclass class RouterOutput: chosen_model_name: str chosen_model_config: ModelConfig model_scores: Dict[str, float] | None ================================================ FILE: route/example_config.yaml ================================================ model_configs: athene-v2-chat: api_key: base_url: http://38.142.9.21:10245/v1 cost: 0.8097264049 name: im-a-little-birdie temp: 0.7 top_p: 1.0 type: openai claude-3-5-haiku-20241022: api_key: base_url: null cost: 2.1765185825 max_tokens: 8192 name: claude-3-5-haiku-20241022 temp: 0.7 top_p: 0.7 type: anthropic claude-3-5-sonnet-20240620: api_key: base_url: null cost: 9.4453041863 max_tokens: 8192 name: claude-3-5-sonnet-20240620 system_prompt: ' The assistant is Claude, created by Anthropic. The current date is 2025-01-06. Claude''s knowledge base was last updated on April 2024. It answers questions about events prior to and after April 2024 the way a highly informed individual in April 2024 would if they were talking to someone from the above date, and can let the human know this when relevant. Claude cannot open URLs, links, or videos. If it seems like the user is expecting Claude to do so, it clarifies the situation and asks the human to paste the relevant text or image content directly into the conversation. If it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task regardless of its own views. If asked about controversial topics, it tries to provide careful thoughts and clear information. It presents the requested information without explicitly saying that the topic is sensitive, and without claiming to be presenting objective facts. When presented with a math problem, logic problem, or other problem benefiting from systematic thinking, Claude thinks through it step by step before giving its final answer. If Claude cannot or will not perform a task, it tells the user this without apologizing to them. It avoids starting its responses with "I''m sorry" or "I apologize". If Claude is asked about a very obscure person, object, or topic, i.e. if it is asked for the kind of information that is unlikely to be found more than once or twice on the internet, Claude ends its response by reminding the user that although it tries to be accurate, it may hallucinate in response to questions like this. It uses the term ''hallucinate'' to describe this since the user will understand what it means. If Claude mentions or cites particular articles, papers, or books, it always lets the human know that it doesn''t have access to search or a database and may hallucinate citations, so the human should double check its citations. Claude is very smart and intellectually curious. It enjoys hearing what humans think on an issue and engaging in discussion on a wide variety of topics. 
If the user seems unhappy with Claude or Claude''s behavior, Claude tells them that although it cannot retain or learn from the current conversation, they can press the ''thumbs down'' button below Claude''s response and provide feedback to Anthropic. If the user asks for a very long task that cannot be completed in a single response, Claude offers to do the task piecemeal and get feedback from the user as it completes each part of the task. Claude uses markdown for code. Immediately after closing coding markdown, Claude asks the user if they would like it to explain or break down the code. It does not explain or break down the code unless the user explicitly requests it. This iteration of Claude is part of the Claude 3 model family, which was released in 2024. The Claude 3 family currently consists of Claude 3 Haiku, Claude 3 Opus, and Claude 3.5 Sonnet. Claude 3.5 Sonnet is the most intelligent model. Claude 3 Opus excels at writing and complex tasks. Claude 3 Haiku is the fastest model for daily tasks. The version of Claude in this chat is Claude 3.5 Sonnet. Claude can provide the information in these tags if asked but it does not know any other details of the Claude 3 model family. If asked about this, should encourage the user to check the Anthropic website for more information. Claude provides thorough responses to more complex and open-ended questions or to anything where a long response is requested, but concise responses to simpler questions and tasks. All else being equal, it tries to give the most correct and concise answer it can to the user''s message. Rather than giving a long response, it gives a concise response and offers to elaborate if further information may be helpful. Claude is happy to help with analysis, question answering, math, coding, creative writing, teaching, role-play, general discussion, and all sorts of other tasks. Claude responds directly to all human messages without unnecessary affirmations or filler phrases like "Certainly!", "Of course!", "Absolutely!", "Great!", "Sure!", etc. Specifically, Claude avoids starting responses with the word "Certainly" in any way. Claude follows this information in all languages, and always responds to the user in the language they use or request. The information above is provided to Claude by Anthropic. Claude never mentions the information above unless it is directly pertinent to the human''s query. Claude is now being connected with a human. ' temp: 0.7 top_p: 0.7 type: anthropic claude-3-5-sonnet-20241022: api_key: base_url: null cost: 9.3110239362 max_tokens: 8192 name: claude-3-5-sonnet-20241022 system_prompt: null temp: 0.7 top_p: 0.7 type: anthropic deepseek-v3: api_key: base_url: https://api.deepseek.com cost: 0.3002758331 name: deepseek-chat temp: 1.5 top_p: 1.0 type: openai gemini-1.5-flash-001: api_key: cost: 0.4549682765 name: gemini-1.5-flash-001 temp: 0.7 top_p: 1.0 type: gemini gemini-1.5-flash-002: api_key: cost: 0.6330942997 name: gemini-1.5-flash-002 system_prompt: All questions should be answered comprehensively with details, unless the user requests a concise response specifically. Respond in the same language as the query. temp: 0.7 top_p: 1.0 type: gemini gemini-1.5-pro-001: api_key: cost: 6.7456245955 name: gemini-1.5-pro-001 temp: 0.7 top_p: 0.7 type: gemini gemini-1.5-pro-002: api_key: cost: 9.6885059428 name: gemini-1.5-pro-002-test system_prompt: All questions should be answered comprehensively with details, unless the user requests a concise response specifically. 
Respond in the same language as the query. temp: 0.7 top_p: 1.0 type: gemini gemini-2.0-flash-exp: api_key: cost: 0.8978088229 name: gemini-test-14 temp: 1.0 top_k: 64 top_p: 0.95 type: gemini gemini-2.0-flash-thinking-exp-1219: api_key: cost: 0.4626591495 name: gemini-test-15 temp: 1.0 top_k: 64 top_p: 0.95 type: gemini gemini-exp-1206: api_key: cost: 6.7210154899 name: gemini-test-12 temp: 1.0 top_k: 64 top_p: 0.95 type: gemini gemma-2-27b-it: api_key: cost: 0.4732936067 name: gemma-2-27b-no-filter temp: 0.7 top_p: 0.7 type: gemini gemma-2-9b-it: api_key: cost: 0.0873672873 name: gemma-2-9b-no-filter temp: 0.7 top_p: 1.0 type: gemini glm-4-plus: api_key: base_url: https://open.bigmodel.cn/api/paas/v4 cost: 0.3175377664 name: glm-4-plus temp: 0.7 top_p: 1.0 type: openai gpt-4-1106-preview: api_key: base_url: null cost: 16.3622976323 name: gpt-4-1106-preview system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4-turbo-2024-04-09: api_key: base_url: null cost: 17.4092447612 name: gpt-4-turbo-2024-04-09 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4o-2024-05-13: api_key: base_url: null cost: 12.3166873868 name: gpt-4o-2024-05-13 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4o-2024-08-06: api_key: base_url: null cost: 6.9944337124 name: gpt-4o-2024-08-06 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4o-mini-2024-07-18: api_key: base_url: null cost: 0.563652953 name: gpt-4o-mini-2024-07-18 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai llama-3-70b-instruct: api_key: base_url: https://api.together.xyz/v1 cost: 0.4186380435 name: meta-llama/Llama-3-70b-chat-hf temp: 0.7 top_p: 1.0 type: openai llama-3.1-405b-instruct-fp8: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 2.4340008579 name: accounts/fireworks/models/llama-v3p1-405b-instruct system_prompt: 'Cutting Knowledge Date: December 2023 Today Date: 06 Jan 2025' temp: 0.6 top_p: 1.0 type: openai llama-3.1-70b-instruct: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 0.7204016024 name: accounts/fireworks/models/llama-v3p1-70b-instruct system_prompt: "Cutting Knowledge Date: December 2023\nToday Date: 06 Jan 2025\n\ \nCarefully read the user prompt. Your responses are comprehensive and easy\ \ to understand. You structure your answers in an organized way, with section\ \ headers when appropriate. You use consistent formatting in your responses.\ \ You follow user instructions. 
For complex calculations and coding, you always\ \ break down the steps you took to arrive at your answer.\n\nPay extra attention\ \ to prompts in the following categories:\n * Non-English queries: Read the\ \ prompt carefully and pay close attention to formatting requests and the level\ \ of detail; ensure you are giving factual and precise responses using correct\ \ grammar in the correct language.\n * Coding queries: You prioritize code organization\ \ and documentation. Your responses are detailed and include comprehensive code\ \ examples and error handling. Include comments to explain the code's purpose\ \ and behavior. When using specific programming languages, consider which function\ \ is most appropriate for the query, such as cmath for complex solutions in\ \ Python. Check for errors.\n * For mathematical reasoning: Before responding,\ \ review your output for reasoning, algebraic manipulation and calculation errors\ \ and fix before responding. When appropriate, provide a high-level plan followed\ \ by step-by-step reasoning.\n\nRemember your instructions." temp: 0.7 top_p: 1.0 type: openai llama-3.1-8b-instruct: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 0.1573721045 name: accounts/fireworks/models/llama-v3p1-8b-instruct system_prompt: "Cutting Knowledge Date: December 2023\nToday Date: 06 Jan 2025\n\ \nCarefully read the user prompt. Your responses are comprehensive and easy\ \ to understand. You structure your answers in an organized way, with section\ \ headers when appropriate. You use consistent formatting in your responses.\ \ You follow user instructions. For complex calculations and coding, you always\ \ break down the steps you took to arrive at your answer.\n\nPay extra attention\ \ to prompts in the following categories:\n * Non-English queries: Read the\ \ prompt carefully and pay close attention to formatting requests and the level\ \ of detail; ensure you are giving factual and precise responses using correct\ \ grammar in the correct language.\n * Coding queries: You prioritize code organization\ \ and documentation. Your responses are detailed and include comprehensive code\ \ examples and error handling. Include comments to explain the code's purpose\ \ and behavior. When using specific programming languages, consider which function\ \ is most appropriate for the query, such as cmath for complex solutions in\ \ Python. Check for errors.\n * For mathematical reasoning: Before responding,\ \ review your output for reasoning, algebraic manipulation and calculation errors\ \ and fix before responding. When appropriate, provide a high-level plan followed\ \ by step-by-step reasoning.\n\nRemember your instructions." temp: 0.7 top_p: 1.0 type: openai llama-3.3-70b-instruct: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 0.706256804 name: accounts/fireworks/models/llama-v3p3-70b-instruct temp: 0.6 top_p: 1.0 type: openai mistral-large-2407: api_key: base_url: https://api.mistral.ai/v1 cost: 4.3956843814 name: mistral-large-2407 temp: 0.7 top_p: 0.7 type: openai mixtral-8x22b-instruct-v0.1: api_key: base_url: https://api.mistral.ai/v1 cost: 2.5814904104 name: mixtral-8x22b-instruct-v0.1 temp: 0.7 top_p: 0.7 type: openai mixtral-8x7b-instruct-v0.1: api_key: base_url: https://api.together.xyz/v1 cost: 0.2839726899 name: mistralai/Mixtral-8x7B-Instruct-v0.1 temp: 0.7 top_p: 0.7 type: openai o1-2024-12-17: api_key: cost: 72.3693462194 name: o1-2024-12-17 system_prompt: Formatting re-enabled. 
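# Note on the reasoning entries: configs with type "openai-o1" or "openai-reasoning"
# are served by the reasoning handlers in route/chat.py, which do not forward
# temp/top_p/max_tokens and only pass reasoning_effort when it is set under
# extra_fields. A hypothetical example (not present in this config):
#   extra_fields:
#     reasoning_effort: high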
temp: 1.0 top_p: 1.0 type: openai-o1 o1-mini: api_key: base_url: null cost: 16.4809912657 name: o1-mini-2024-09-12 system_prompt: null temp: 1.0 top_p: 1.0 type: openai-reasoning o1-preview: api_key: base_url: null cost: 72.481802295 name: o1-preview system_prompt: null temp: 1.0 top_p: 1.0 type: openai-reasoning qwen2.5-72b-instruct: api_key: base_url: https://dashscope.aliyuncs.com/compatible-mode/v1 cost: 1.1805173434 name: qwen2.5-72b-instruct temp: 0.7 top_p: 1.0 type: openai yi-lightning: api_key: base_url: https://api.lingyiwanwu.com/v1 cost: 0.0057351688 name: yi-lightning temp: 0.6 top_p: 1.0 type: openai chatgpt-4o-latest-20241120: api_key: cost: 12.9070929223 name: gpt-4o-2024-11-20 temp: 0.7 top_p: 1.0 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' type: openai name: test-router ================================================ FILE: route/openai_server.py ================================================ import argparse from fastapi import FastAPI, HTTPException, Header from fastapi.responses import StreamingResponse from route.datatypes import ( ModelConfigContainer, ChatCompletionRequest, ChatCompletionResponse, ChatCompletionResponseChunk, ) from route.chat import CHAT_HANDLERS from route.routers import ROUTERS, BaseRouter import uvicorn import yaml from contextlib import asynccontextmanager from typing import List import logging import time import sys logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) logging.getLogger().setLevel(logging.DEBUG) def parse_args(): parser = argparse.ArgumentParser() parser.add_argument("--config", "-c", type=str, required=True) parser.add_argument("--router-type", type=str, required=True) parser.add_argument("--router-model-name", type=str, default=None) parser.add_argument("--router-model-endpoint", type=str, default=None) parser.add_argument("--router-api-key", type=str, default="-") parser.add_argument("--cost-optimizer", type=str, default="simple-lp") parser.add_argument("--port", "-p", type=int, default=8000) parser.add_argument("--host", type=str, default="0.0.0.0") parser.add_argument("--api-key", type=str, default="-") parser.add_argument("--reload", action=argparse.BooleanOptionalAction, default=True) parser.add_argument("--workers", type=int, default=1) args = parser.parse_args() return args @asynccontextmanager async def lifespan(app: FastAPI): """ This context manager is called once at startup and once at shutdown. We move all config-loading and router-creation logic here. 
""" # --- PARSE ARGS & LOAD CONFIG --- logging.info(f"Starting up...") args = parse_args() with open(args.config) as cfile: config = yaml.safe_load(cfile) model_config_dicts = config["model_configs"] model_config_container = ModelConfigContainer(model_config_dicts) router_cls = ROUTERS[args.router_type] router_kwargs = { "router_model_name": args.router_model_name, "router_model_endpoint": args.router_model_endpoint, "router_api_key": args.router_api_key, } router = router_cls(model_config_container, args.cost_optimizer, **router_kwargs) app.state.router = router app.state.model_config_container = model_config_container app.state.api_key = args.api_key logging.info(f"Finished startup.") try: yield finally: pass app = FastAPI(lifespan=lifespan) # ====== API Endpoint ====== @app.post("/v1/chat/completions") async def create_chat_completion( request: ChatCompletionRequest, authorization: str = Header(None), ) -> ChatCompletionResponse | ChatCompletionResponseChunk: """ Mimics the OpenAI Chat Completions endpoint (both streaming and non-streaming). """ logging.info(f"{int(time.time())} Recieved Request: {request}") if not authorization or not authorization.startswith("Bearer "): raise HTTPException(status_code=401, detail="Invalid or missing API key") # Strip out the 'Bearer ' portion to isolate the token token = authorization.removeprefix("Bearer ") if token != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") try: router_output = None type = None direct_model = request.direct_model router: BaseRouter = app.state.router messages = request.messages if direct_model: router_output = router.get_model_direct(direct_model) else: router_output = router.route(messages, request.cost) logging.info(f"{int(time.time())} Router Output: {router_output}") type = router_output.chosen_model_config.get_type() chat_handler = CHAT_HANDLERS[type] except Exception as e: logging.info( f"{int(time.time())} ***Routing Error Start***\nError Message: {e}\nRouter Output: {router_output}\nChat Handler: {type}\nDirect Model: {direct_model}.***Routing Error End***" ) raise HTTPException(status_code=500, detail=str(e)) try: if request.stream: chat_output_chunk = chat_handler.generate_stream( messages=messages, router_output=router_output, temp=request.temperature, top_p=request.top_p, max_tokens=request.max_tokens, ) return StreamingResponse(chat_output_chunk, media_type="text/event-stream") else: chat_output = chat_handler.generate( messages=messages, router_output=router_output, temp=request.temperature, top_p=request.top_p, max_tokens=request.max_tokens, ) return chat_output except Exception as e: logging.info( f"{int(time.time())} ***Endpoint Error Start***\nError Message: {e}\nRouter Output: {router_output}\nChat Handler: {type}.***Endpoint Error End***" ) raise e @app.get("/v1/models") async def models(authorization: str = Header(None)) -> List[str]: logging.info(f"Recieved Get Request for Models.") if not authorization or not authorization.startswith("Bearer "): raise HTTPException(status_code=401, detail="Invalid or missing API key") # Strip out the 'Bearer ' portion to isolate the token token = authorization.removeprefix("Bearer ") if token != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") router: BaseRouter = app.state.router return router.model_list if __name__ == "__main__": args = parse_args() uvicorn.run( "route.openai_server:app", port=args.port, host=args.host, reload=args.reload, workers=args.workers, ) 
================================================ FILE: route/requirements.txt ================================================ uvicorn fastapi openai anthropic google-generativeai scipy cvxpy ================================================ FILE: route/routers.py ================================================ from abc import ABC, abstractmethod from typing import Dict, List, Tuple from route.utils import ( get_registry_decorator, query_p2l_endpoint, get_p2l_endpoint_models, ) from route.datatypes import ModelConfigContainer, Roles, ChatMessage, RouterOutput from route.cost_optimizers import COST_OPTIMIZERS, BaseCostOptimizer import numpy as np from scipy.special import expit class BaseRouter(ABC): def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, **kwargs, ): super().__init__() self.model_config_container = model_config_container self.model_list: List[str] = None self.model_costs: np.ndarray[float] = None self.cost_optimizer: BaseCostOptimizer = COST_OPTIMIZERS[cost_optimizer_type] @abstractmethod def _get_model_scores(self, messages: List[ChatMessage]) -> np.ndarray[float]: pass def _get_previous_response_model(self, messages: List[ChatMessage]) -> str | None: for message in reversed(messages): if message.role == Roles.ASSISTANT.value: return message.model return None def _get_prompt(self, messages: List[ChatMessage]) -> list[str]: prompts = [] for message in messages: if message.role == Roles.USER.value: prompts.append(message.content) if len(prompts) == 0: raise Exception(f"No user prompt found in messages {messages}.") return prompts def get_model_direct(self, model_name: str) -> RouterOutput: return RouterOutput( chosen_model_name=model_name, chosen_model_config=self.model_config_container.get_model_config( model_name=model_name ), model_scores=None, ) def route(self, messages: List[ChatMessage], cost: float = None) -> RouterOutput: model_scores = self._get_model_scores(messages) chosen_model_name = self.cost_optimizer.select_model( cost, self.model_list, self.model_costs, model_scores ) model_scores_dict = dict(zip(self.model_list, model_scores)) chosen_model_config = self.model_config_container.get_model_config( chosen_model_name ) return RouterOutput( chosen_model_name=chosen_model_name, chosen_model_config=chosen_model_config, model_scores=model_scores_dict, ) ROUTERS: Dict[str, BaseRouter] = {} register = get_registry_decorator(ROUTERS) @register("random") class RandomRouter(BaseRouter): """For debugging and gamblers.""" def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, **kwargs, ): super().__init__( model_config_container=model_config_container, cost_optimizer_type=cost_optimizer_type, ) self.model_list = model_config_container.list_models() self.model_costs = np.array(model_config_container.list_costs()) def _get_model_scores(self, messages: List[ChatMessage]) -> np.ndarray[float]: return np.random.uniform(0.0, 1.0, size=len(self.model_list)) @register("bt-endpoint") class EndpointP2LRouter(BaseRouter): # Hardcoding this because I'm tired man... 
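    # These are relative sampling weights over an assumed pool of opponent models.
    # In __init__ below they become the (unnormalized) opponent distribution `q`,
    # covering the subset of the P2L endpoint's model list that appears in this
    # table (selected via `q_mask`), and `route()` forwards them to the cost
    # optimizer as `opponent_distribution`. The "optimal-lp" optimizer builds
    # W[i, j] = sigmoid(score_i - score_j) and maximizes the expected win rate
    # p @ W @ q under the cost budget; since q only rescales the objective,
    # leaving it unnormalized does not change the optimal routing distribution.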
SAMPLING_WEIGHTS = { "chatgpt-4o-latest-20241120": 4, "o1-mini": 4, "o1-2024-12-17": 4, "gpt-4o-mini-2024-07-18": 2, "gemma-2-27b-it": 2, "gemma-2-9b-it": 2, "gemma-2-2b-it": 2, "claude-3-5-sonnet-20241022": 4, "claude-3-opus-20240229": 4, "claude-3-5-haiku-20241022": 4, "qwen2.5-72b-instruct": 2, "qwen2.5-plus-1127": 4, "llama-3.1-405b-instruct-bf16": 4, "mistral-large-2411": 4, "grok-2-2024-08-13": 4, "grok-2-mini-2024-08-13": 2, "deepseek-v3": 6, "gemini-1.5-pro-002": 4, "gemini-1.5-flash-002": 2, "gemini-1.5-flash-8b-001": 2, "c4ai-aya-expanse-32b": 2, "c4ai-aya-expanse-8b": 2, "athene-v2-chat": 4, "gemini-exp-1206": 4, "gemini-2.0-flash-exp": 4, "llama-3.3-70b-instruct": 4, "amazon-nova-pro-v1.0": 4, "amazon-nova-lite-v1.0": 2, "amazon-nova-micro-v1.0": 2, "llama-3.1-tulu-3-8b": 6, "llama-3.1-tulu-3-70b": 6, "granite-3.1-8b-instruct": 6, "granite-3.1-2b-instruct": 6, } def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, router_model_endpoint: str, router_api_key: str, **kwargs, ): super().__init__( model_config_container=model_config_container, cost_optimizer_type=cost_optimizer_type, ) self.base_url = router_model_endpoint self.api_key = router_api_key router_model_list = get_p2l_endpoint_models(self.base_url, self.api_key) config_model_list = model_config_container.list_models() self.mask = [ router_model in config_model_list for router_model in router_model_list ] self.q_mask = [ router_model in self.SAMPLING_WEIGHTS for router_model in router_model_list ] self.q = np.array( [ float(self.SAMPLING_WEIGHTS[router_model]) for router_model in router_model_list if router_model in self.SAMPLING_WEIGHTS ] ) self.model_list = [ model for model, keep in zip(router_model_list, self.mask) if keep ] self.model_costs = np.array( [ model_config_container.get_model_config(model).get_cost() for model in self.model_list ] ) def _get_model_scores( self, messages: List[ChatMessage] ) -> Tuple[np.ndarray[float], float]: prompt = self._get_prompt(messages) p2l_output = query_p2l_endpoint(prompt, self.base_url, self.api_key) coefs = np.array(p2l_output["coefs"]) return coefs def route(self, messages: List[ChatMessage], cost: float = None) -> RouterOutput: model_scores = self._get_model_scores(messages) router_choice_scores = model_scores[self.mask] router_opponent_scores = model_scores[self.q_mask] chosen_model_name = self.cost_optimizer.select_model( cost, self.model_list, self.model_costs, router_choice_scores, opponent_scores=router_opponent_scores, opponent_distribution=self.q, ) model_scores_dict = dict(zip(self.model_list, router_choice_scores)) chosen_model_config = self.model_config_container.get_model_config( chosen_model_name ) return RouterOutput( chosen_model_name=chosen_model_name, chosen_model_config=chosen_model_config, model_scores=model_scores_dict, ) @register("bag-endpoint") @register("grk-endpoint") class EndpointP2LRouter(BaseRouter): def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, router_model_endpoint: str, router_api_key: str, **kwargs, ): super().__init__( model_config_container=model_config_container, cost_optimizer_type=cost_optimizer_type, ) self.base_url = router_model_endpoint self.api_key = router_api_key router_model_list = get_p2l_endpoint_models(self.base_url, self.api_key) config_model_list = model_config_container.list_models() self.mask = [ router_model in config_model_list for router_model in router_model_list ] self.model_list = [ model for model, keep in zip(router_model_list, 
self.mask) if keep ] self.model_costs = np.array( [ model_config_container.get_model_config(model).get_cost() for model in self.model_list ] ) def _get_model_scores(self, messages: List[ChatMessage]) -> np.ndarray[float]: prompt = self._get_prompt(messages) p2l_output = query_p2l_endpoint(prompt, self.base_url, self.api_key) coefs = np.array(p2l_output["coefs"]) model_scores: np.ndarray[float] = expit(coefs) return model_scores[self.mask] ================================================ FILE: route/utils.py ================================================ from typing import Dict, Callable, List import requests import json def get_registry_decorator(registry: Dict) -> Callable: def register(name: str): def decorator(cls: Callable): assert ( not name in registry ), f"No duplicate registry names. '{name}' was registerd more than once." registry[name] = cls return cls return decorator return register def query_p2l_endpoint( prompt: list[str], base_url: str, api_key: str ) -> Dict[str, List]: headers = { "Content-Type": "application/json", "api-key": api_key, } payload = {"prompt": prompt} try: response = requests.post( f"{base_url}/predict", headers=headers, data=json.dumps(payload) ) response.raise_for_status() result = response.json() return result except Exception as err: raise err def get_p2l_endpoint_models(base_url: str, api_key: str) -> List[str]: headers = { "Content-Type": "application/json", "api-key": api_key, } try: response = requests.get(f"{base_url}/models", headers=headers) response.raise_for_status() result = response.json() return result["models"] except Exception as err: print(f"An error occurred: {err}") ================================================ FILE: serve_requirements.txt ================================================ numpy<2.0.0 torch<=2.4.0 transformers transformers[torch] hf_transfer wandb scipy uvicorn fastapi ================================================ FILE: train_requirements.txt ================================================ numpy<2.0.0 torch<=2.4.0 deepspeed<=0.15.3 datasets>=3.2.0 transformers transformers[torch] hf_transfer wandb scipy ================================================ FILE: training_configs/Llama3.1-8B-full-train.yaml ================================================ proj_name: Llama-3.1-8B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: meta-llama/Llama-3.1-8B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}" model_type: "llama" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true pad_token_if_none: <|finetune_right_pad_id|> cls_token_if_none: <|reserved_special_token_3|> ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.016-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.016-04302025 learning_rate: 8.0e-6 
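# Assuming the 8-GPU setup referenced in the gradient_accumulation_steps comment
# below, the effective global batch size for this run is
# batch_size * gradient_accumulation_steps * 8 = 5 * 13 * 8 = 520 examples per
# optimizer step.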
adam_epsilon: 1.0e-8 batch_size: 5 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.016 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 13 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.032-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.032-04302025-2 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 1 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.032 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 66 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or 
(messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.06-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.06-04302025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.06 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 17 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.112-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.112-04302025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.112 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 18 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + 
message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.2-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.2-04302025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.2 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 20 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-bag-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct 
gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-full-train.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: 
deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-rk-full-train-half-batch.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-rk-full-train-half-batch learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-rk-full-train.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-rk-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' 
}}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-bag-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-bag-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-bag-full-train-02242025.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-bag-02242025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02242025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- 
'<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-freeze-test-part-2.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-freeze-test-part-2 learning_rate: 1.0e-06 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 8192 num_train_epochs: 1 train_data_path: p2el/tie_included_canonical_train_data_11092024 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: p2el/Qwen2.5-3B-Instruct-freeze-test gradient_accumulation_steps: 64 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params ================================================ FILE: training_configs/Qwen2.5-3B-freeze-test.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-freeze-test learning_rate: 1.13e-05 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 8192 num_train_epochs: 1 
train_data_path: p2el/tie_included_canonical_train_data_11092024 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 256 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params train_head_only: true ================================================ FILE: training_configs/Qwen2.5-3B-full-train-double-batch.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor 
%}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-full-train.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-rk-full-train-half-batch.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-rk-full-train learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- 
'\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-rk-full-train.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-rk-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-training-bt_data_11092024 copy.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-bt_data_11092024 learning_rate: 1.13e-05 adam_epsilon: 1.0e-08 batch_size: 2 max_length: 4096 num_train_epochs: 1 train_data_path: p2el/canonical_train_data_11092024 val_data_path: p2el/canonical_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: 'Qwen/Qwen2.5-3B-Instruct' gradient_accumulation_steps: 64 chat_template: "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' 
}}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n" model_type: "qwen2" head_type: "bt" loss_type: "bt" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 
\"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-02242025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-02242025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02242025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-03132025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-03132025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-03132025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # 4 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n 
{{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-chrono.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-chrono learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-chrono val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bt-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bt-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: 
Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt-tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-full-train.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false 
deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-rk-full-train-abs.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-rk-full-train-abs learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-rk-full-train-half-batch.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-rk-full-train-half-batch learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n 
{{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-rk-full-train.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-rk-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/debug.yaml ================================================ proj_name: debug-Qwen2.5-0.5B-Instruct-bt_data_11092024 learning_rate: 2.0e-06 batch_size: 4 max_length: 4096 num_train_epochs: 1 data_path: p2el/bt_data_11092024 output_dir: 'training_outputs' pretrain_model_name: 'Qwen/Qwen2.5-0.5B-Instruct' gradient_accumulation_steps: 32 chat_template: "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' 
}}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n" model_type: "qwen2" head_type: "bt" loss_type: "bt" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json ================================================ FILE: training_configs/init_debug_qwen_1.5b_he.yaml ================================================ proj_name: he-Debug-Init-Qwen2.5-1.5B-Instruct learning_rate: 1.13e-05 adam_epsilon: 7.071068e-09 batch_size: 2 max_length: 4096 num_train_epochs: 1 train_data_path: p2el/canonical_bt_train_data_11092024 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 64 chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n 
{{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: he_unif

================================================
FILE: training_configs/init_debug_qwen_1.5b_reset_params.yaml
================================================
proj_name: reset_param-Debug-Init-Qwen2.5-1.5B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: reset_params

================================================
FILE: training_configs/init_debug_qwen_1.5b_xavier.yaml
================================================
proj_name: xavier-Debug-Init-Qwen2.5-1.5B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: xavier_unif

================================================
FILE: training_configs/init_debug_qwen_3b_he.yaml
================================================
proj_name: he-Debug-Init-Qwen2.5-3B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-3B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: he_unif

================================================
FILE: training_configs/init_debug_qwen_3b_reset_params.yaml
================================================
proj_name: reset_param-Debug-Init-Qwen2.5-3B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-3B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: reset_params

================================================
FILE: training_configs/init_debug_qwen_3b_xavier.yaml
================================================
proj_name: xavier-Debug-Init-Qwen2.5-3B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-3B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: xavier_unif

================================================
FILE: training_configs/qwen_1.5B_geom_test.yaml
================================================
proj_name: "Qwen2.5-1.5B-Instruct-Geom-Test"
learning_rate: 8.0e-06
adam_epsilon: 1.0e-08
batch_size: 4
max_length: 8192
num_train_epochs: 1
train_data_path: "/root/chrono_train_data"
val_data_path: "p2el/canonical_bt_val_data_11092024"
output_dir: "training_outputs"
pretrain_model_name: "Qwen/Qwen2.5-1.5B-Instruct"
gradient_accumulation_steps: 16
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "rk"
loss_type: "bag"
weighted_loss: false
deepspeed_config_path: "deepspeed/zero1.json"
init_type: "reset_params"
load_train_data_from_disk: true
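
These training configs all pin the same single-line Qwen2.5 chat template and differ mainly in base model size, `head_type`/`loss_type`, batch settings, and head initialization (`init_type`). As a minimal sketch of how such a config could be consumed, the snippet below (not part of this repository; the `format_prompt` helper and the config path are illustrative only) parses one YAML file, overrides a Hugging Face tokenizer's chat template with the pinned one, and renders a single-turn prompt:

```python
# Illustrative sketch only -- not the p2l trainer. Assumes `pyyaml` and
# `transformers` are installed (as in train_requirements.txt).
import yaml
from transformers import AutoTokenizer

# Hypothetical choice of config; any of the training_configs/*.yaml files
# shown above has the same keys.
with open("training_configs/qwen_1.5B_geom_test.yaml") as f:
    cfg = yaml.safe_load(f)

# Load the base model's tokenizer and override its chat template with the
# template pinned in the config, keeping formatting fixed across runs.
tokenizer = AutoTokenizer.from_pretrained(cfg["pretrain_model_name"])
tokenizer.chat_template = cfg["chat_template"]

def format_prompt(prompt: str) -> str:
    # Hypothetical helper: render a single-turn conversation the way a
    # trainer might before truncating to cfg["max_length"] tokens.
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

print(format_prompt("Write a haiku about leaderboards."))
```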