Repository: lmarena/p2l
Branch: main
Commit: a905fa5ea94a
Files: 59
Total size: 268.0 KB

Directory structure:
gitextract_yhm008o_/
├── .gitignore
├── README.md
├── deepspeed/
│   └── zero1.json
├── fast_lambda_setup.sh
├── fast_runpod_setup.sh
├── p2l/
│   ├── auto_eval_utils.py
│   ├── auto_evals.py
│   ├── dataset.py
│   ├── endpoint.py
│   ├── eval.py
│   ├── model.py
│   └── train.py
├── probe_barrier.py
├── route/
│   ├── chat.py
│   ├── cost_optimizers.py
│   ├── datatypes.py
│   ├── example_config.yaml
│   ├── openai_server.py
│   ├── requirements.txt
│   ├── routers.py
│   └── utils.py
├── serve_requirements.txt
├── train_requirements.txt
└── training_configs/
    ├── Llama3.1-8B-full-train.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.016-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.032-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.06-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.112-04302025.yaml
    ├── Qwen2.5-1.5B-bag-chrono-eps-0.2-04302025.yaml
    ├── Qwen2.5-1.5B-bag-full-train-02222025.yaml
    ├── Qwen2.5-1.5B-full-train.yaml
    ├── Qwen2.5-1.5B-rk-full-train-half-batch.yaml
    ├── Qwen2.5-1.5B-rk-full-train.yaml
    ├── Qwen2.5-3B-bag-full-train-02222025.yaml
    ├── Qwen2.5-3B-bag-full-train-02242025.yaml
    ├── Qwen2.5-3B-freeze-test-part-2.yaml
    ├── Qwen2.5-3B-freeze-test.yaml
    ├── Qwen2.5-3B-full-train-double-batch.yaml
    ├── Qwen2.5-3B-full-train.yaml
    ├── Qwen2.5-3B-rk-full-train-half-batch.yaml
    ├── Qwen2.5-3B-rk-full-train.yaml
    ├── Qwen2.5-3B-training-bt_data_11092024 copy.yaml
    ├── Qwen2.5-7B-bag-full-train-02222025.yaml
    ├── Qwen2.5-7B-bag-full-train-02242025.yaml
    ├── Qwen2.5-7B-bag-full-train-03132025.yaml
    ├── Qwen2.5-7B-bag-full-train-chrono.yaml
    ├── Qwen2.5-7B-bt-full-train-02222025.yaml
    ├── Qwen2.5-7B-full-train.yaml
    ├── Qwen2.5-7B-rk-full-train-abs.yaml
    ├── Qwen2.5-7B-rk-full-train-half-batch.yaml
    ├── Qwen2.5-7B-rk-full-train.yaml
    ├── debug.yaml
    ├── init_debug_qwen_1.5b_he.yaml
    ├── init_debug_qwen_1.5b_reset_params.yaml
    ├── init_debug_qwen_1.5b_xavier.yaml
    ├── init_debug_qwen_3b_he.yaml
    ├── init_debug_qwen_3b_reset_params.yaml
    ├── init_debug_qwen_3b_xavier.yaml
    └── qwen_1.5B_geom_test.yaml

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
__pycache__/

================================================
FILE: README.md
================================================
# Prompt-to-Leaderboard (P2L)

This is the codebase for the paper [Prompt-to-Leaderboard](https://arxiv.org/pdf/2502.14855). Model weights can be found in our [LMArena HF Collection](https://huggingface.co/collections/lmarena-ai/prompt-to-leaderboard-67bcf7ddf6022ef3cfd260cc). Try it on Chatbot Arena under the [Prompt-to-Leaderboard](https://lmarena.ai/?p2l) tab!

## Abstract

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt or set of prompts. The core idea is to train an LLM that takes natural language prompts as input and outputs a vector of Bradley-Terry coefficients, which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot in the Chatbot Arena leaderboard.
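As a point of reference for how these coefficients are used (mirroring the pairwise Bradley-Terry probability helpers in `p2l/auto_eval_utils.py`): given a prompt $z$, P2L outputs a coefficient vector $\beta(z) \in \mathbb{R}^M$ with one entry per model in its model list, and the predicted probability that model $i$ is preferred over model $j$ on that prompt is

$$
P(i \succ j \mid z) = \sigma\big(\beta_i(z) - \beta_j(z)\big) = \frac{1}{1 + e^{-(\beta_i(z) - \beta_j(z))}}.
$$

Head variants such as `rk` and `bag` extend this with explicit tie (and tie-both-bad) terms.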
## Table of Contents

- [P2L](#p2l)
  - [Abstract](#abstract)
  - [Table of Contents](#table-of-contents)
  - [Environment Setup](#environment-setup)
    - [Installing `uv`](#installing-uv)
    - [Serving P2L Setup](#serving-p2l-setup)
    - [Serving a Router Setup](#serving-a-router-setup)
    - [Training Setup](#training-setup)
  - [Serving P2L](#serving-p2l)
  - [Serving an OpenAI Compatible Router](#serving-an-openai-compatible-router)
    - [Example: serving a Bradley-Terry based cost-optimal router](#example-serving-a-bradley-terry-based-cost-optimal-router)
    - [Example: serving a Grounded RK based simple cost router](#example-serving-a-grounded-rk-based-simple-cost-router)
  - [Calling the OpenAI Compatible Router](#calling-the-openai-compatible-router)
  - [Training a P2L Model](#training-a-p2l-model)
  - [Inferencing a P2L Model](#inferencing-a-p2l-model)
  - [AutoEval Suite](#autoeval-suite)
    - [Params](#params)
  - [Citation](#citation)

## Environment Setup

Setup instructions are shown using `uv`; however, any package management system will work. All environments are native to Python 3.10; other versions are untested but may also work.

### Installing `uv`

If you like the sound of ~50x faster environment setup times, run the following to install `uv`:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

To create a Python virtual environment, run:

```bash
uv venv .env --python 3.10
```

To activate said environment, run:

```bash
source .env/bin/activate
```

### Serving P2L Setup

To serve a P2L model, first run:

```bash
uv pip install -r serve_requirements.txt
```

### Serving a Router Setup

To serve an OpenAI compatible router, first run:

```bash
uv pip install -r route/requirements.txt
```

### Training Setup

To train a P2L model, first run:

```bash
uv pip install -r train_requirements.txt
```

## Serving P2L

Before getting started, make sure you have followed the steps in [Serving P2L Setup](#serving-p2l-setup). `python -m p2l.endpoint` accepts the following arguments:

| Option | Short Flag | Description |
|--------|-----------|-------------|
| `--help` | `-h` | Show this help message and exit. |
| `--model-path MODEL_PATH` | `-m MODEL_PATH` | Path to the model repository. |
| `--model-type MODEL_TYPE` | `-mt MODEL_TYPE` | Type of the model. |
| `--head-type HEAD_TYPE` | `-ht HEAD_TYPE` | Type of model head. |
| `--loss-type LOSS_TYPE` | `-lt LOSS_TYPE` | Type of the loss function. |
| `--api-key API_KEY` | `-a API_KEY` | API key for authorization. |
| `--host HOST` | `-H HOST` | Host to run the server on. |
| `--port PORT` | `-p PORT` | Port to run the server on. |
| `--reload, --no-reload` | - | Whether to reload the endpoint on detected code changes (requires workers to be set to 1). |
| `--workers WORKERS` | - | Number of endpoint workers (each will hold a model instance). |
| `--cuda, --no-cuda` | - | Flag to enable using a GPU to host the model. Flag is true by default. |
For example, to run lmarena-ai/p2l-7b-grk-02222025, a Qwen2 based "grk" model with head type `rk`, we would run:

```bash
python -m p2l.endpoint --model-path lmarena-ai/p2l-7b-grk-02222025 --model-type qwen2 --head-type rk --api-key <API_KEY>
```

By default, this hosts the model with 1 worker on host 0.0.0.0 and port 10250. Reload is enabled by default, meaning code changes will reload the endpoint. Note that by default the endpoint expects to load the model onto a GPU; by specifying `--no-cuda` you can run it on CPU only, which may work for smaller P2L models.

Each P2L model has an associated model list, which specifies which model each index of the output coefficients corresponds to. Below is an example function to get this model list from the hosted endpoint:

```python
import json
from typing import Dict, List

import requests


def get_p2l_endpoint_models(base_url: str, api_key: str) -> List[str]:
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }
    try:
        response = requests.get(f"{base_url}/models", headers=headers)
        response.raise_for_status()
        result = response.json()
        return result["models"]
    except Exception as err:
        print(f"An error occurred: {err}")
```

Below is an example Python function to query the P2L endpoint:

```python
def query_p2l_endpoint(
    prompt: list[str], base_url: str, api_key: str
) -> Dict[str, List]:
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }
    payload = {"prompt": prompt}
    try:
        response = requests.post(
            f"{base_url}/predict", headers=headers, data=json.dumps(payload)
        )
        response.raise_for_status()
        result = response.json()
        return result
    except Exception as err:
        raise err
```

Note that the input is a list of strings. This is NOT for a batch of prompts, but rather for each turn in a conversation. For example, given a 2-turn conversation:

```
User: "hi!"
Assistant: "Hello"
User: "what's 1+1?"
```

The correct P2L input would be:

```python
["hi!", "what's 1+1?"]
```
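Putting the two helpers together, you can sort the returned coefficients against the model list to get a per-prompt leaderboard (a higher coefficient means the model is predicted to be stronger on that prompt). Below is a minimal sketch; it assumes the `/predict` response exposes the coefficient vector under a `coefs` key aligned with the model list, which is an assumption, so check your endpoint's actual response shape.

```python
def leaderboard_for_prompt(
    prompt: list[str], base_url: str, api_key: str
) -> list[tuple[str, float]]:
    """Rank models for a single conversation using the two helpers above."""
    models = get_p2l_endpoint_models(base_url, api_key)
    result = query_p2l_endpoint(prompt, base_url, api_key)
    # Assumption: one coefficient per model, in model-list order, under a "coefs" key.
    coefs = result["coefs"]
    return sorted(zip(models, coefs), key=lambda pair: pair[1], reverse=True)


# Hypothetical usage: print the top 5 models for a single-turn prompt.
# for name, beta in leaderboard_for_prompt(["what's 1+1?"], "http://0.0.0.0:10250", "<API_KEY>")[:5]:
#     print(f"{name}: {beta:.3f}")
```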
## Serving an OpenAI Compatible Router

Serve an OpenAI compatible router with `python -m route.openai_server`. The available arguments are shown below.

| Option | Short Flag | Description |
|--------|-----------|-------------|
| `--help` | `-h` | Show this help message and exit. |
| `--config CONFIG` | `-c CONFIG` | Path to the configuration file. |
| `--router-type ROUTER_TYPE` | - | Type of the router to use. Available types are `bt-endpoint` and `grk-endpoint`. |
| `--router-model-name ROUTER_MODEL_NAME` | - | Name of the router model. |
| `--router-model-endpoint ROUTER_MODEL_ENDPOINT` | - | Endpoint URL for the router model. |
| `--router-api-key ROUTER_API_KEY` | - | API key for the router authentication. |
| `--cost-optimizer COST_OPTIMIZER` | - | Enable or configure cost optimization settings. Available types are `optimal-lp`, `simple-lp`, `strict`. |
| `--port PORT` | `-p PORT` | Port to run the server on. |
| `--host HOST` | - | Host to run the server on. |
| `--api-key API_KEY` | - | API key for authorization. |
| `--reload, --no-reload` | - | Whether to reload the endpoint on detected code changes (requires workers to be set to 1). |
| `--workers WORKERS` | - | Number of endpoint workers (each will hold a model instance). |

### Example: serving a Bradley-Terry based cost-optimal router

First, similar to [above](#serving-p2l), we need to start serving a P2L model, this time Bradley-Terry based. To do this, let's run:

```bash
python -m p2l.endpoint --model-path lmarena-ai/p2l-7b-bt-01132025 --model-type qwen2 --head-type bt --api-key <API_KEY>
```

Now, we need to configure a routing config file. This will specify the available models and inference details for the router. For example, here is an example configuration that specifies Claude-3.5-Sonnet and GPT-4o:

```yaml
model_configs:
  claude-3-5-sonnet-20241022:
    api_key:
    base_url: null
    cost: 9.3110239362
    max_tokens: 8192
    name: claude-3-5-sonnet-20241022
    system_prompt: null
    temp: 0.7
    top_p: 0.7
    type: anthropic
  gpt-4o-2024-05-13:
    api_key:
    base_url: null
    cost: 12.3166873868
    name: gpt-4o-2024-05-13
    system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2'
    temp: 0.7
    top_p: 1.0
    type: openai
```

Notice how the system prompt, temperature, and top_p are defined. These replicate how the models are served on Chatbot Arena. P2L is trained with the expectation that the models are running on this configuration. Therefore, for the most reliable results, we recommend sticking to the configs shown in [`example_config.yaml`](./route/example_config.yaml), though alternatives should still function well.

Additionally, we allow for adjustment of the `cost` parameter. One natural choice is just cost per output token; however, more accurate cost estimates are better. For example, the costs in [`example_config.yaml`](./route/example_config.yaml) are calculated to be proportional to the formula `cost_per_output_token * average_output_tokens_per_response`.

Now, let's assume we put the above config content into `config.yaml`. To start the OpenAI compatible router, we would run:

```bash
python -m route.openai_server --config config.yaml --router-type bt-endpoint --router-model-endpoint http://0.0.0.0:10250 --router-api-key <ROUTER_API_KEY> --cost-optimizer optimal-lp --api-key <API_KEY>
```

Let's break down what this command means:

- `--router-type bt-endpoint`: we are using a Bradley-Terry based P2L model hosted on an endpoint.
- `--router-model-endpoint http://0.0.0.0:10250`: this is where the P2L model endpoint is; this default address will generally be correct if the routing server runs on the same machine as the P2L endpoint.
- `--cost-optimizer optimal-lp`: we are using cost routing via the optimal linear program detailed in Theorem 1 of the paper.

>**Note**: `optimal-lp` is only compatible with BT models, and `simple-lp` is only compatible with grounded RK (sometimes specified as bag) models.

### Example: serving a Grounded RK based simple cost router

P2L has a class of "Grounded RK" models. These models produce coefficients such that `0.0` represents the threshold for a "usable" answer. We can leverage this to cost-route to maximize $P(\text{Not Bad})$... whatever that means exactly. Below we detail the steps to run this routing setup.

First, start up the P2L endpoint:

```bash
python -m p2l.endpoint --model-path lmarena-ai/p2l-7b-grk-02222025 --model-type qwen2 --head-type rk --api-key <API_KEY>
```

Then start up the router server:

```bash
python -m route.openai_server --config config.yaml --router-type grk-endpoint --router-model-endpoint http://0.0.0.0:10250 --router-api-key <ROUTER_API_KEY> --cost-optimizer simple-lp --api-key <API_KEY>
```
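To build intuition for what `simple-lp` trades off, here is a toy sketch. This is not the actual implementation in `route/cost_optimizers.py`; it simply treats `sigmoid(beta_m)` as a stand-in for "model `m` produces a usable answer", using the `0.0` grounding described above, and picks the best model whose configured `cost` fits a budget.

```python
import math


def pick_model(betas: dict[str, float], costs: dict[str, float], budget: float) -> str:
    """Toy cost-aware selection: best proxy P(not bad) among models within budget."""

    def p_not_bad(beta: float) -> float:
        # Grounded RK: 0.0 marks the "usable answer" threshold, so use a sigmoid proxy.
        return 1.0 / (1.0 + math.exp(-beta))

    affordable = {m: p_not_bad(b) for m, b in betas.items() if costs[m] <= budget}
    if not affordable:
        # Nothing fits the budget: fall back to the cheapest model.
        return min(costs, key=costs.get)
    return max(affordable, key=affordable.get)


# Hypothetical per-prompt coefficients, paired with the example costs from the config above.
print(pick_model(
    betas={"claude-3-5-sonnet-20241022": 1.3, "gpt-4o-2024-05-13": 1.1},
    costs={"claude-3-5-sonnet-20241022": 9.31, "gpt-4o-2024-05-13": 12.32},
    budget=10.0,
))
```

The real optimizers solve this selection as a linear program (see Theorem 1 of the paper for `optimal-lp`), so the sketch only illustrates the cost/quality trade-off, not the exact policy the router computes.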
## Calling the OpenAI Compatible Router

As aptly named, the router server is OpenAI compatible. We can call it like any other OpenAI compatible model:

```python
from openai import OpenAI

client = OpenAI(
    base_url="<ROUTER_ENDPOINT>/v1",
    api_key="<ROUTER_API_KEY>",
)

prompt = "what's 828913*1234?"

response = client.chat.completions.create(
    model="-",  # This field is actually not used
    messages=[{"role": "user", "content": prompt}],
    stream=True,  # Router is compatible with and without streaming.
)

# Notice no temperature, top_p, or system prompt is set.
# This allows the router to use the default provided by the config file.
# If you do pass in these fields, they will override the config.
```

If we want to specify a cost budget, we need to do the following:

```python
response = client.chat.completions.create(
    model="-",  # This field is actually not used
    messages=[{"role": "user", "content": prompt}],
    stream=True,  # Router is compatible with and without streaming.
    extra_body={"cost": <COST_BUDGET>},
)
```

## Training a P2L Model

This codebase also contains the training code for P2L models. To train a P2L model, first set up a training config. The [`training_configs`](./training_configs/) directory has many examples. Then, to train, run for example:

```bash
deepspeed --num_gpus=8 --module p2l.train --config training_configs/<CONFIG_NAME>.yaml --no-eval --save-steps 512
```

## Inferencing a P2L Model

To quickly run inference on a dataset using P2L, run:

```bash
python -m p2l.eval --model <MODEL_PATH> --dataset <DATASET> --head-type <HEAD_TYPE> --model-type <MODEL_TYPE> --batch-size 2
```

This will work on any dataset of single-turn prompts under the column name `prompt`.

## AutoEval Suite

Our in-depth evaluation code can be run using `p2l.auto_evals`.

### Params

- **a. Model List Params**
  1. Either provide `--model_repo`, which has a `model_list.json` file.
  2. Or provide a local `--model_list_path` file.
- **b. Val Data**
  1. **Data is in JSONL format**:
     - Provide a local `--eval_path`.
     - If no path is provided, the program will look for an `eval_outputs.jsonl` file in the `--model_repo` on HF.
  2. **Data is in JSON format (checkpoint files)**:
     - Provide a local `--checkpoint_path`.
     - Or provide remote `--hf_checkpoint_repo` and `--hf_checkpoint_file`.
- **c. Output Directory**
  1. Provide a local `--output_dir` or a remote `--hf_output_dir`.
  2. Provide `--output_file_name`.
- **d. Train Data (Optional)**
  - Provide `--hf_train_dataset` or a local `--train_path`.
- **e. Arena Data (Optional)**
  - Provide a local `--arena_path` (CSV with model rankings).
- **f. Provide Model Info**
  1. `--loss_type` (e.g., `bt`, `bt_tie`, `rk`).
  2. `--model_type` (e.g., `p2l`, `marginal`, `arena`, `marginal-gt`).
  3. `--categories`.
- **g. Provide Types of Metrics**
  1. `--simple_metrics`, `--category_metrics`, `--rand_subset_metrics`, `--aggr_scale_subset_metrics`.
  2. Use `--metrics_to_inc` to filter which of the above metrics to include.
- **h. Random Subset Params**
  1. `--rand_subset_sizes`: Specify subset sizes.
  2. `--rand_num_samples`: Specify the number of samples per random subset size.
- **i. Aggregation Subset Params**
  1. `--aggr_scale_subset_sizes`: Specify subset sizes.
  2. `--aggr_scale_num_samples`: Specify the number of samples per random subset size.
  3. `--aggr_scale_gt`: Specify whether to use `marginal-gt` or `arena` as ground truth for categories.

---

## Citation

```
@misc{frick2025prompttoleaderboard,
      title={Prompt-to-Leaderboard},
      author={Evan Frick and Connor Chen and Joseph Tennyson and Tianle Li and Wei-Lin Chiang and Anastasios N.
Angelopoulos and Ion Stoica}, year={2025}, eprint={2502.14855}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2502.14855}, } ``` ================================================ FILE: deepspeed/zero1.json ================================================ { "bf16": { "enabled": "auto" }, "fp16": { "enabled": "auto" }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 1, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": true, "zero_optimization": { "stage": 1, "reduce_bucket_size": 5e8 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": [ 0.9, 0.999 ], "eps": "auto" } } } ================================================ FILE: fast_lambda_setup.sh ================================================ sudo apt-get update -y sudo apt-get install tmux -y sudo apt-get install python3-dev -y sudo apt-get install tmux libaio-dev libopenmpi-dev python3-mpi4py -y curl -LsSf https://astral.sh/uv/install.sh | sh source $HOME/.local/bin/env uv venv .env --python 3.10 source .env/bin/activate uv pip install wheel packaging uv pip install -r train_requirements.txt uv pip install flash-attn==2.5.9.post1 --no-build-isolation ================================================ FILE: fast_runpod_setup.sh ================================================ apt-get update -y apt-get install tmux -y apt-get install python3-dev -y apt-get install tmux libaio-dev libopenmpi-dev python3-mpi4py -y curl -LsSf https://astral.sh/uv/install.sh | sh source $HOME/.local/bin/env uv venv .env --python 3.10 source .env/bin/activate uv pip install wheel packaging uv pip install -r train_requirements.txt uv pip install flash-attn==2.5.9.post1 --no-build-isolation ================================================ FILE: p2l/auto_eval_utils.py ================================================ from typing import Callable, Dict import torch from torch import nn import torch.nn.functional as F import torch.optim as optim import pandas as pd import numpy as np from scipy.optimize import minimize from scipy.stats import kendalltau, spearmanr from model import ( registered_losses, HeadOutputs, registered_aggr_models, registered_pairwise_losses, ) registered_simple_metrics: Dict[str, Dict[str, Callable]] = {} registered_aggr_metrics: Dict[str, Dict[str, Callable]] = {} registered_helpers: Dict[str, Callable] = {} def register_simple_metric(loss_type: str, metric: str): def decorator(func: Callable): if loss_type not in registered_simple_metrics: registered_simple_metrics[loss_type] = {} registered_simple_metrics[loss_type][metric] = func return func return decorator def register_aggr_metric(loss_type: str, metric: str): def decorator(func: Callable): if loss_type not in registered_aggr_metrics: registered_aggr_metrics[loss_type] = {} registered_aggr_metrics[loss_type][metric] = func return func return decorator def register_helper(loss_or_model_type: str, helper_func): def decorator(func: Callable): if loss_or_model_type not in registered_helpers: registered_helpers[loss_or_model_type] = {} registered_helpers[loss_or_model_type][helper_func] = func return func return decorator @register_helper("p2l", "output_labels") def output_labels_p2l(val_data: pd.DataFrame, **kwargs): betas = torch.tensor(np.stack(val_data["betas"]), dtype=torch.float) labels = torch.tensor(np.stack(val_data["labels"])) etas = None if "eta" in val_data.columns: etas = torch.tensor(np.stack(val_data["eta"]), dtype=torch.float) return 
HeadOutputs(coefs=betas, eta=etas), labels def translate_coefs(coef, old_list, new_list): old_list = old_list.tolist() old_to_new = [old_list.index(model) for model in new_list] betas_array = np.array(coef) betas_array = betas_array[old_to_new] return torch.tensor(betas_array) @register_helper("marginal", "output_labels") def output_labels_marginal( val_data: pd.DataFrame, train_data: pd.DataFrame, model_list: np.array, train_model_list: np.array, loss_type: str, **kwargs, ): train_labels = torch.tensor(np.stack(train_data["labels"])) coefs, eta = train_marginal(train_model_list, train_labels, loss_type) coefs, eta = coefs[0], eta[0] if eta is not None else None coefs = translate_coefs(coefs, train_model_list, model_list) val_labels = torch.tensor(np.stack(val_data["labels"])) coefs = coefs.expand(len(val_labels), -1) eta = eta.expand(len(val_labels), -1) if eta is not None else None return HeadOutputs(coefs=coefs, eta=eta), val_labels @register_helper("marginal-gt", "output_labels") def output_labels_marginal_gt( val_data: pd.DataFrame, model_list: np.array, loss_type: str, **kwargs ): val_labels = torch.tensor(np.stack(val_data["labels"])) coefs, eta = train_marginal(model_list, val_labels, loss_type) coefs = coefs.expand(len(val_labels), -1) eta = eta.expand(len(val_labels), -1) if eta is not None else None return HeadOutputs(coefs=coefs, eta=eta), val_labels @register_helper("arena", "output_labels") def output_labels_arena( arena_rankings: torch.tensor, val_data: pd.DataFrame, loss_type: str, **kwargs ): labels = torch.tensor(np.stack(val_data["labels"])) # arena rankings is already filtered so it will be 1d tensor betas = arena_rankings.expand(len(labels), -1) etas = torch.ones(len(labels)) etas = etas.unsqueeze(-1) # TODO: Cleanup if loss_type == "bt" or loss_type == "bt-tie": etas = None return HeadOutputs(coefs=betas, eta=etas), labels @register_helper("bag", "preprocess_data") def preprocess_data_bag(data: pd.DataFrame, **kwargs): condition = data["winner"] == "tie (bothbad)" data.loc[condition, "labels"] = data.loc[condition, "labels"].apply( lambda arr: arr[:2] + [2] ) return data @register_helper("bt", "preprocess_data") @register_helper("bt-tie", "preprocess_data") @register_helper("rk", "preprocess_data") @register_helper("rk-reparam", "preprocess_data") def preprocess_data(data: pd.DataFrame, **kwargs): return data @register_simple_metric("bt", "Loss") @register_simple_metric("bt", "BCELoss") @register_simple_metric("bt-tie", "Loss") @register_simple_metric("rk", "Loss") @register_simple_metric("rk-reparam", "Loss") @register_simple_metric("bag", "Loss") def loss(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): loss_func = registered_losses.get(loss_type) return loss_func(head_output=head_output, labels=labels).item() @register_simple_metric("rk", "Tie_Loss") @register_simple_metric("bag", "Tie_Loss") def tie_loss(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): loss_func = registered_losses.get("tie-" + loss_type) return loss_func(head_output=head_output, labels=labels).item() @register_simple_metric("bag", "Tie_bb_Loss") def tie_bb_loss( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): loss_func = registered_losses.get("tie-bb-" + loss_type) return loss_func(head_output=head_output, labels=labels).item() @register_aggr_metric("bt", "Aggr_Tie_Loss") @register_aggr_metric("bt-tie", "Aggr_Tie_Loss") @register_aggr_metric("rk", "Aggr_Tie_Loss") @register_aggr_metric("rk-reparam", "Aggr_Tie_Loss") 
@register_aggr_metric("bag", "Aggr_Tie_Loss") def Aggr_Tie_Loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_Loss", loss_type, labels, gt_output, model_output) @register_simple_metric("bt-tie", "BCELoss") @register_simple_metric("rk", "BCELoss") @register_simple_metric("rk-reparam", "BCELoss") @register_simple_metric("bag", "BCELoss") def BCE_loss(head_output: HeadOutputs, labels: torch.Tensor, **kwargs): non_tie_index = torch.where(labels[:, -1] == 0)[0] new_coefs = head_output.coefs[non_tie_index, :] new_eta = head_output.eta[non_tie_index] if head_output.eta is not None else None no_tie_output = HeadOutputs(coefs=new_coefs, eta=new_eta) no_tie_labels = labels[non_tie_index, :] return loss(no_tie_output, no_tie_labels, loss_type="bt") def aggr_metric(metric_name, loss_type, labels, gt_output, model_output): func = registered_simple_metrics[loss_type][metric_name] gt = func( labels=labels, head_output=expand_output(gt_output, labels), loss_type=loss_type ) model = func( labels=labels, head_output=expand_output(model_output, labels), loss_type=loss_type, ) return {"ground-truth": round(gt, 4), "model-aggr": round(model, 4)} @register_aggr_metric("bt", "Aggr_Loss") @register_aggr_metric("bt-tie", "Aggr_Loss") @register_aggr_metric("rk", "Aggr_Loss") @register_aggr_metric("rk-reparam", "Aggr_Loss") @register_aggr_metric("bag", "Aggr_Loss") def Aggr_Loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Loss", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_BCELoss") @register_aggr_metric("bt-tie", "Aggr_BCELoss") @register_aggr_metric("rk", "Aggr_BCELoss") @register_aggr_metric("rk-reparam", "Aggr_BCELoss") @register_aggr_metric("bag", "Aggr_BCELoss") def Aggr_BCE_Loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("BCELoss", loss_type, labels, gt_output, model_output) def expand_output(output, labels): coefs, eta = output.coefs, output.eta new_coefs = coefs.expand(len(labels), -1) if eta is not None: eta = eta.expand(len(labels), -1) return HeadOutputs(coefs=new_coefs, eta=eta) @register_simple_metric("bt", "MSELoss") def BT_mse( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] predicted_probs = torch.sigmoid(paired_delta_logit) true_labels = torch.ones_like(predicted_probs) mse = F.mse_loss(predicted_probs, true_labels) return mse.mean().item() @register_simple_metric("bt-tie", "MSELoss") def BT_tie_mst( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs model_idx = labels[:, :2] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] p_w = torch.sigmoid(paired_delta_logit) tie_ind = labels[:, -1] # let label be 0.5 if there is tie pred_probs = torch.where(tie_ind == 1, 0.5, p_w) true_labels = torch.ones_like(pred_probs) mse = F.mse_loss(pred_probs, true_labels) return mse.mean().item() @register_simple_metric("rk", "MSELoss") @register_simple_metric("rk-reparam", "MSELoss") def RK_mse(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): probs_func = registered_helpers[loss_type]["probs"] p_w, _, p_t = probs_func(head_output=head_output, 
labels=labels) tie_ind = labels[:, -1] # True label will always be win (since first index) unless a tie occurs pred_probs = torch.where(tie_ind == 1, p_t, p_w) true_labels = torch.ones_like(pred_probs) mse = F.mse_loss(pred_probs, true_labels) return mse.mean().item() @register_simple_metric("bag", "MSELoss") def bag_mse(head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs): probs_func = registered_helpers[loss_type]["probs"] p_w, _, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) tie_ind = labels[:, -1].unsqueeze(-1) P = torch.stack([p_w, p_t, p_t_bb], dim=-1) pred_probs = P.gather(dim=-1, index=tie_ind).contiguous().squeeze(-1) true_labels = torch.ones_like(pred_probs) mse = F.mse_loss(pred_probs, true_labels) return mse.mean().item() @register_helper("rk-reparam", "probs") def rk_reparam_probs( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = (torch.exp(eta) + 1.000001).squeeze(-1) winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous()[:, 0] beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous()[:, 0] pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) p_win = pi_win / (pi_win + theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose return p_win, p_lose, p_tie @register_helper("bag", "probs") def bag_probs( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = (torch.exp(eta) + 1.000001).squeeze(-1) winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous()[:, 0] beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous()[:, 0] pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = 1.0 p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb return p_win, p_lose, p_tie, p_tie_bb @register_helper("rk", "probs") def rk_probs( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = rk_eta(head_output) model_idx = labels[:, :2] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] p_w = torch.sigmoid(paired_delta_logit - eta) p_l = torch.sigmoid(-1 * paired_delta_logit - eta) p_t = 1 - p_w - p_l return p_w, p_l, p_t @register_simple_metric("bt", "Accuracy") def BT_accuracy( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] # winner would have positive difference correct = (paired_delta_logit > 0).float() return correct.mean().item() @register_simple_metric("bt-tie", "Accuracy") def BT_tie_accuracy( head_output: HeadOutputs, labels: torch.Tensor, **kwargs, ): coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] # winner would have positive difference correct = (paired_delta_logit > 0).float() tie_ind = labels[:, -1] # we give ties half the accuracy correct[tie_ind == 1] = 0.5 return correct.mean().item() @register_simple_metric("rk", "Accuracy") 
@register_simple_metric("rk-reparam", "Accuracy") def RK_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t = probs_func(head_output=head_output, labels=labels) pred_labels = torch.where( p_w >= p_l, torch.where(p_w >= p_t, 1, 0.5), torch.where(p_l >= p_t, 0, 0.5) ) tie_ind = labels[:, -1] # tie if tie index, else winner (first index) predicted to win true_labels = torch.where(tie_ind == 1, 0.5, 1) correct = (pred_labels == true_labels).float() return correct.mean().item() @register_simple_metric("rk", "Tie_Accuracy") def RK_tie_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t = probs_func(head_output=head_output, labels=labels) p_nt = p_w + p_l pred_tie = torch.where(p_t >= p_nt, 1, 0) tie_ind = labels[:, -1] correct = (pred_tie == tie_ind).float() return correct.mean().item() @register_simple_metric("bag", "Tie_Accuracy") def bag_tie_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) p_nt = p_w + p_l p_tie = p_t + p_t_bb pred_tie = torch.where(p_nt >= p_tie, 0, 1) tie_ind = torch.where(labels[:, -1] == 0, 0, 1) correct = (pred_tie == tie_ind).float() return correct.mean().item() @register_simple_metric("bag", "Tie_bb_Accuracy") def bag_tie_bb_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) p_nt_bb = p_w + p_l + p_t pred_tie = torch.where(p_t_bb >= p_nt_bb, 1, 0) tie_ind = torch.where(labels[:, -1] == 2, 1, 0) correct = (pred_tie == tie_ind).float() return correct.mean().item() @register_aggr_metric("bt", "Aggr_Tie_Accuracy") @register_aggr_metric("bt-tie", "Aggr_Tie_Accuracy") @register_aggr_metric("rk", "Aggr_Tie_Accuracy") @register_aggr_metric("rk-reparam", "Aggr_Tie_Accuracy") @register_aggr_metric("bag", "Aggr_Tie_Accuracy") def Aggr_Tie_accuracy( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_Accuracy", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_Tie_Accuracy") @register_aggr_metric("bt-tie", "Aggr_Tie_Accuracy") @register_aggr_metric("rk", "Aggr_Tie_Accuracy") @register_aggr_metric("rk-reparam", "Aggr_Tie_Accuracy") @register_aggr_metric("bag", "Aggr_Tie_Accuracy") def Aggr_Tie_accuracy( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_Accuracy", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("bt-tie", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("rk", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("rk-reparam", "Aggr_Tie_bb_Accuracy") @register_aggr_metric("bag", "Aggr_Tie_bb_Accuracy") def Aggr_Tie_bb_accuracy( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_bb_Accuracy", loss_type, labels, gt_output, model_output) @register_aggr_metric("bt", "Aggr_Tie_bb_Loss") @register_aggr_metric("bt-tie", "Aggr_Tie_bb_Loss") @register_aggr_metric("rk", "Aggr_Tie_bb_Loss") @register_aggr_metric("rk-reparam", "Aggr_Tie_bb_Loss") 
@register_aggr_metric("bag", "Aggr_Tie_bb_Loss") def Aggr_Tie_bb_loss( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, labels: torch.tensor, **kwargs, ): return aggr_metric("Tie_bb_Loss", loss_type, labels, gt_output, model_output) @register_simple_metric("rk-reparam", "Tie_Accuracy") @register_simple_metric("bt", "Tie_Accuracy") @register_simple_metric("bt-tie", "Tie_Accuracy") @register_simple_metric("bt", "Tie_bb_Loss") @register_simple_metric("rk-reparam", "Tie_bb_Loss") @register_simple_metric("bt-tie", "Tie_bb_Loss") @register_simple_metric("rk", "Tie_bb_Loss") @register_simple_metric("bt", "Tie_Loss") @register_simple_metric("bt-tie", "Tie_Loss") @register_simple_metric("rk-reparam", "Tie_Loss") @register_simple_metric("rk", "Tie_bb_Accuracy") @register_simple_metric("rk-reparam", "Tie_bb_Accuracy") @register_simple_metric("bt", "Tie_bb_Accuracy") @register_simple_metric("bt-tie", "Tie_bb_Accuracy") def not_implemented( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): return -1 # not implemented @register_simple_metric("bag", "Accuracy") def bag_accuracy( head_output: HeadOutputs, labels: torch.Tensor, loss_type: str, **kwargs ): probs_func = registered_helpers[loss_type]["probs"] p_w, p_l, p_t, p_t_bb = probs_func(head_output=head_output, labels=labels) P = torch.stack([p_w, p_t, p_t_bb, p_l], dim=-1) pred_labels = P.argmax(dim=-1) tie_ind = labels[:, -1] # let win be 0, tie be 1, tie_bb be 2. loss never predicted since winner_idx first true_labels = tie_ind correct = (pred_labels == true_labels).float() return correct.mean().item() @register_simple_metric("bt", "Mean-BT") @register_simple_metric("bt-tie", "Mean-BT") @register_simple_metric("rk", "Mean-BT") @register_simple_metric("rk-reparam", "Mean-BT") @register_simple_metric("bag", "Mean-BT") def beta_mean( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs flat_betas = betas.flatten() return torch.mean(flat_betas).item() @register_simple_metric("bt", "Std-BT") @register_simple_metric("bt-tie", "Std-BT") @register_simple_metric("rk", "Std-BT") @register_simple_metric("rk-reparam", "Std-BT") @register_simple_metric("bag", "Std-BT") def beta_std( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs flat_betas = betas.flatten() return torch.std(flat_betas).item() @register_simple_metric("bt", "Spread-BT") @register_simple_metric("bt-tie", "Spread-BT") @register_simple_metric("rk", "Spread-BT") @register_simple_metric("rk-reparam", "Spread-BT") @register_simple_metric("bag", "Spread-BT") def beta_spread( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs flat_betas = betas.flatten() return (torch.max(flat_betas) - torch.min(flat_betas)).item() @register_simple_metric("bt", "Mean-Spread-BT") @register_simple_metric("bt-tie", "Mean-Spread-BT") @register_simple_metric("rk", "Mean-Spread-BT") @register_simple_metric("rk-reparam", "Mean-Spread-BT") @register_simple_metric("bag", "Mean-Spread-BT") def beta_mean_spread( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs max_min_per_prompt = ( torch.max(betas, dim=-1).values - torch.min(betas, dim=-1).values ) return torch.mean(max_min_per_prompt).item() @register_simple_metric("bt", "Mean-IQR-BT") @register_simple_metric("bt-tie", "Mean-IQR-BT") @register_simple_metric("rk", "Mean-IQR-BT") @register_simple_metric("rk-reparam", "Mean-IQR-BT") @register_simple_metric("bag", "Mean-IQR-BT") def beta_mean_iqr( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs iqr_per_prompt = 
torch.quantile(betas, 0.75, dim=-1) - torch.quantile( betas, 0.25, dim=-1 ) return torch.mean(iqr_per_prompt).item() @register_simple_metric("bt", "Mean-Std-BT") @register_simple_metric("bt-tie", "Mean-Std-BT") @register_simple_metric("rk", "Mean-Std-BT") @register_simple_metric("rk-reparam", "Mean-Std-BT") @register_simple_metric("bag", "Mean-Std-BT") def beta_mean_std( head_output: HeadOutputs, **kwargs, ): betas = head_output.coefs std_per_prompt = torch.std(betas, dim=-1) return torch.mean(std_per_prompt).item() @register_helper("marginal-gt", "aggregrate") def aggr_marginal_gt( labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs ): coefs, eta = train_marginal(model_list, labels, loss_type) return HeadOutputs(coefs=coefs[0], eta=eta[0] if eta is not None else None) @register_helper("p2l", "aggregrate") def aggr_p2l( head_output: HeadOutputs, labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs, ): coefs, eta = train_aggr_prob( model_list, head_output, labels, loss_type, is_batch=False ) return HeadOutputs(coefs=coefs[0], eta=eta[0] if eta is not None else None) @register_helper("p2l", "aggregrate-batch") def aggr_p2l_batch( head_output: HeadOutputs, labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs, ): coefs_batch, eta_batch = train_aggr_prob( model_list, head_output, labels, loss_type, is_batch=True ) return [ HeadOutputs( coefs=coefs_batch[i], eta=eta_batch[i] if eta_batch is not None else None ) for i in range(coefs_batch.shape[0]) ] @register_helper("marginal-gt", "aggregrate-batch") def aggr_p2l_batch( head_output: HeadOutputs, labels: torch.Tensor, model_list: torch.Tensor, loss_type: str, **kwargs, ): # TODO: Make faster if necessary return [ aggr_marginal_gt(labels[i], model_list, loss_type) for i in range(len(labels)) ] @register_helper("marginal", "aggregrate") def aggr_non_p2l(head_output: HeadOutputs, loss_type: str, **kwargs): etas = head_output.eta etas = etas[0, :] if etas is not None else None return HeadOutputs(coefs=head_output.coefs[0, :], eta=etas) @register_helper("arena", "aggregrate") def aggr_non_p2l( head_output: HeadOutputs = None, arena_rankings: torch.tensor = None, **kwargs ): eta = torch.tensor([0]) if arena_rankings is not None: return HeadOutputs(coefs=arena_rankings, eta=eta) # arena just has the same betas repeated if not provided return HeadOutputs(coefs=head_output.coefs[0, :], eta=eta) def train_marginal(model_list, labels, loss_type, lr=1.0, tol=1e-9, max_epochs=50): model_cls = registered_aggr_models[loss_type] model = model_cls(len(model_list)) optimizer = optim.LBFGS( model.parameters(), lr=lr, max_iter=max_epochs, tolerance_grad=tol, tolerance_change=tol, ) loss_func = registered_losses[loss_type] labels = ( labels.squeeze() if labels.dim() > 2 else labels ) # marginal doesn't use batching since one at a time def closure(): optimizer.zero_grad() coefs, eta = model() coefs_expanded = coefs[0].expand(len(labels), -1) eta_expanded = eta[0].expand(len(labels), -1) if eta is not None else None head_output = HeadOutputs(coefs=coefs_expanded, eta=eta_expanded) loss = loss_func(head_output=head_output, labels=labels) loss.backward() return loss optimizer.step(closure) true_coefs, true_eta = model() return true_coefs.detach(), true_eta.detach() if true_eta is not None else None def train_aggr_prob( model_list, head_outputs, labels, loss_type, is_batch, lr=1.0, tol=1e-9, max_epochs=50, ): true_probs_func = registered_helpers[loss_type]["pairwise_probs"] true_probs = 
true_probs_func(real_output=head_outputs) # add a batch size of 1 since aggregration is done in batches (only necessary if data isn't in batch format) if not is_batch: true_probs = true_probs.unsqueeze(0) batch_size = true_probs.shape[0] model_cls = registered_aggr_models[loss_type] model = model_cls(len(model_list), batch_size) optimizer = optim.LBFGS( model.parameters(), lr=lr, max_iter=max_epochs, tolerance_grad=tol, tolerance_change=tol, ) loss_func = registered_pairwise_losses[loss_type] count = 0 prev_loss = 0 def closure(): optimizer.zero_grad() coefs, eta = model() aggr_output = HeadOutputs(coefs=coefs, eta=eta) loss = loss_func( real_output=head_outputs, aggregated_output=aggr_output, true_probs=true_probs, ) loss.backward() nonlocal count count += 1 if count == 49: raise Warning("Batch training did not converge") return loss optimizer.step(closure) true_coefs, true_eta = model() return true_coefs.detach(), true_eta.detach() if true_eta is not None else None def rk_eta(output): if output.eta is None: return None BETA = 0.1 return torch.clamp( torch.nn.functional.softplus(output.eta - 22.5, BETA).squeeze(-1), min=0.02 ) @register_helper("rk", "pairwise_probs") def pairwise_RK_probs(real_output: HeadOutputs): real_betas = real_output.coefs real_eta = rk_eta(real_output) real_eta = real_eta.unsqueeze(-1) num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) # elipses allow for both batched/unbatched beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] true_probs_win = torch.sigmoid(beta_i_real - beta_j_real - real_eta) true_probs_loss = torch.sigmoid(beta_j_real - beta_i_real - real_eta) true_probs_tie = 1.0 - true_probs_win - true_probs_loss true_probs = torch.stack((true_probs_win, true_probs_loss, true_probs_tie), dim=-1) return true_probs @register_helper("rk-reparam", "pairwise_probs") def pairwise_RK_reparam_probs(real_output: HeadOutputs, **kwargs): real_betas = real_output.coefs real_theta = torch.exp(real_output.eta) + 1.000001 num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] pi_win = torch.exp(beta_i_real) pi_lose = torch.exp(beta_j_real) p_win = pi_win / (pi_win + real_theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + real_theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose true_probs = torch.stack((p_win, p_lose, p_tie), dim=-1) return true_probs @register_helper("bag", "pairwise_probs") def pairwise_bag_probs(real_output: HeadOutputs, **kwargs): real_betas = real_output.coefs real_theta = torch.exp(real_output.eta) + 1.000001 num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] pi_win = torch.exp(beta_i_real) pi_lose = torch.exp(beta_j_real) pi_gamma = 1.0 p_win = pi_win / (pi_win + real_theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + real_theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb true_probs = torch.stack((p_win, p_lose, p_tie, p_tie_bb), dim=-1) return true_probs @register_helper("bt", "pairwise_probs") @register_helper("bt-tie", 
"pairwise_probs") def pairwise_BT_probs(real_output: HeadOutputs): real_betas = real_output.coefs num_models = real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_real = real_betas[..., pair_indices[:, 0]] beta_j_real = real_betas[..., pair_indices[:, 1]] true_probs = torch.sigmoid(beta_i_real - beta_j_real) return true_probs # removes nan from tensor, indices will be shifted def remove_beta_nan(beta1, beta2): beta_mask = ~torch.isnan(beta1) & ~torch.isnan(beta2) return beta1[beta_mask], beta2[beta_mask] @register_aggr_metric("bt", "Leaderboard") @register_aggr_metric("bt-tie", "Leaderboard") @register_aggr_metric("rk", "Leaderboard") @register_aggr_metric("rk-reparam", "Leaderboard") @register_aggr_metric("bag", "Leaderboard") def leaderboard( gt_output: HeadOutputs, model_output: HeadOutputs, model_list: np.array, **kwargs ): gt_lb = get_leaderboard(gt_output, model_list) model_lb = get_leaderboard(model_output, model_list) return {"ground-truth": list(gt_lb), "model-aggr": list(model_lb)} def get_leaderboard(output, model_list): coefs = output.coefs sorted_indices = torch.argsort(coefs, descending=True) sorted_model_names = [model_list[i] for i in sorted_indices] sorted_betas = coefs[sorted_indices] leaderboard = [] for i in range(len(sorted_model_names)): beta = ( round(sorted_betas[i].item(), 4) if not torch.isnan(sorted_betas[i]) else "nan" ) cur_model = str(sorted_model_names[i]) + ": " + str(beta) leaderboard.append(cur_model) return np.array(leaderboard) @register_aggr_metric("bt", "L1-Dist-Prob") @register_aggr_metric("bt-tie", "L1-Dist-Prob") def l1_dist_prob_bt(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): beta1 = gt_output.coefs beta2 = model_output.coefs # if arena is one, there may be nan if model not present in that file beta1, beta2 = remove_beta_nan(beta1, beta2) diff_matrix1 = beta1.unsqueeze(1) - beta1.unsqueeze(0) diff_matrix2 = beta2.unsqueeze(1) - beta2.unsqueeze(0) prob_vec1 = torch.sigmoid(diff_matrix1).flatten() prob_vec2 = torch.sigmoid(diff_matrix2).flatten() return torch.abs(prob_vec2 - prob_vec1).mean().item() @register_aggr_metric("rk-reparam", "L1-Dist-Prob") @register_aggr_metric("rk", "L1-Dist-Prob") def l1_dist_prob_rk( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, **kwargs ): eta1 = gt_output.eta eta2 = model_output.eta # need to both have eta if eta1 is None or eta2 is None: return l1_dist_prob_bt(gt_output, model_output) pair_probs_func = registered_helpers[loss_type]["pairwise_probs"] p_win1, p_lose1, p_tie1 = torch.unbind(pair_probs_func(gt_output), -1) p_win2, p_lose2, p_tie2 = torch.unbind(pair_probs_func(model_output), -1) win_diff = torch.abs(p_win1 - p_win2).mean().item() lose_diff = torch.abs(p_lose1 - p_lose2).mean().item() tie_diff = torch.abs(p_tie1 - p_tie2).mean().item() return (win_diff + lose_diff + tie_diff) / 3 @register_aggr_metric("bag", "L1-Dist-Prob") def l1_dist_prob_bag( gt_output: HeadOutputs, model_output: HeadOutputs, loss_type: str, **kwargs ): eta1 = gt_output.eta eta2 = model_output.eta # need to both have eta if eta1 is None or eta2 is None: return l1_dist_prob_bt(gt_output, model_output) pair_probs_func = registered_helpers[loss_type]["pairwise_probs"] p_win1, p_lose1, p_tie1, p_tie_bb1 = torch.unbind(pair_probs_func(gt_output), -1) p_win2, p_lose2, p_tie2, p_tie_bb2 = torch.unbind(pair_probs_func(model_output), -1) win_diff = torch.abs(p_win1 - p_win2).mean().item() lose_diff = 
torch.abs(p_lose1 - p_lose2).mean().item() tie_diff = torch.abs(p_tie1 - p_tie2).mean().item() tie_bb_diff = torch.abs(p_tie_bb2 - p_tie_bb1).mean().item() return (win_diff + lose_diff + tie_diff + tie_bb_diff) / 4 @register_aggr_metric("bt", "IQR-BT") @register_aggr_metric("bt-tie", "IQR-BT") @register_aggr_metric("rk", "IQR-BT") @register_aggr_metric("rk-reparam", "IQR-BT") @register_aggr_metric("bag", "IQR-BT") def beta_iqr(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): ( gt_coefs, model_coefs, ) = ( gt_output.coefs, model_output.coefs, ) gt_iqr = (torch.quantile(gt_coefs, 0.75) - torch.quantile(gt_coefs, 0.25)).item() model_iqr = ( torch.quantile(model_coefs, 0.75) - torch.quantile(model_coefs, 0.25) ).item() return {"ground-truth": round(gt_iqr, 4), "model-aggr": round(model_iqr, 4)} @register_aggr_metric("bt", "Std-BT") @register_aggr_metric("bt-tie", "Std-BT") @register_aggr_metric("rk", "Std-BT") @register_aggr_metric("rk-reparam", "Std-BT") @register_aggr_metric("bag", "Std-BT") def beta_std_aggr(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = gt_output.coefs, model_output.coefs gt_std, model_std = ( torch.std(gt_betas.flatten()).item(), torch.std(model_betas.flatten()).item(), ) return {"ground-truth": round(gt_std, 4), "model-aggr": round(model_std, 4)} @register_aggr_metric("bt", "Spread-BT") @register_aggr_metric("bt-tie", "Spread-BT") @register_aggr_metric("rk", "Spread-BT") @register_aggr_metric("rk-reparam", "Spread-BT") @register_aggr_metric("bag", "Spread-BT") def beta_spread_aggr(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = gt_output.coefs.flatten(), model_output.coefs.flatten() gt_spread, model_spread = torch.max(gt_betas) - torch.min(gt_betas), torch.max( model_betas ) - torch.min(model_betas) return { "ground-truth": round(gt_spread.item(), 4), "model-aggr": round(model_spread.item(), 4), } @register_aggr_metric("bt", "Kendall-lbs") @register_aggr_metric("bt-tie", "Kendall-lbs") @register_aggr_metric("rk", "Kendall-lbs") @register_aggr_metric("rk-reparam", "Kendall-lbs") @register_aggr_metric("bag", "Kendall-lbs") def kendall_lb(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) gt_lb = gt_betas.numpy() model_lb = model_betas.numpy() return kendalltau(gt_lb, model_lb)[0] @register_aggr_metric("bt", "Spearman-lbs") @register_aggr_metric("bt-tie", "Spearman-lbs") @register_aggr_metric("rk", "Spearman-lbs") @register_aggr_metric("rk-reparam", "Spearman-lbs") @register_aggr_metric("bag", "Spearman-lbs") def spearman_lb(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) gt_lb = gt_betas.numpy() model_lb = model_betas.numpy() return spearmanr(gt_lb, model_lb)[0] def top_k_frac(gt_betas: torch.tensor, model_betas: torch.tensor, k: int): gt_top_indices = set(torch.topk(gt_betas, k).indices.numpy()) model_top_indices = set(torch.topk(model_betas, k).indices.numpy()) common_indices = gt_top_indices & model_top_indices return len(common_indices) / k def top_k_displace(gt_betas: torch.tensor, model_betas: torch.tensor, k: int): gt_top_indices = torch.topk(gt_betas, k).indices model_ranks = torch.argsort(torch.argsort(model_betas, descending=True)) displacements = torch.abs(model_ranks[gt_top_indices] - torch.arange(k)) return displacements.float().mean().item() @register_aggr_metric("bt", "Top-k-fraction") 
@register_aggr_metric("bt-tie", "Top-k-fraction") @register_aggr_metric("rk", "Top-k-fraction") @register_aggr_metric("rk-reparam", "Top-k-fraction") @register_aggr_metric("bag", "Top-k-fraction") def top_k_frac_dict(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) res = {} for k in [1, 3, 5, 10]: res[k] = round(top_k_frac(gt_betas, model_betas, k), 4) return res @register_aggr_metric("bt", "Top-k-displace") @register_aggr_metric("bt-tie", "Top-k-displace") @register_aggr_metric("rk", "Top-k-displace") @register_aggr_metric("rk-reparam", "Top-k-displace") @register_aggr_metric("bag", "Top-k-displace") def top_k_dist_dict(gt_output: HeadOutputs, model_output: HeadOutputs, **kwargs): gt_betas, model_betas = remove_beta_nan(gt_output.coefs, model_output.coefs) res = {} for k in [1, 3, 5, 10]: res[k] = round(top_k_displace(gt_betas, model_betas, k), 4) return res ================================================ FILE: p2l/auto_evals.py ================================================ import argparse import json import os import io import warnings import math from tqdm import tqdm import time import copy import torch import pandas as pd import numpy as np from datasets import load_dataset, load_from_disk from huggingface_hub import hf_hub_download, upload_file, list_repo_files from model import HeadOutputs from auto_eval_utils import ( registered_simple_metrics, registered_aggr_metrics, registered_helpers, ) def parse_model_list(hf_model, local_path): if not hf_model and not local_path: raise ValueError("Either model repo or local model list must be provided.") model_list_path = local_path # if no local path, try getting from model_repo if not model_list_path: model_list_path = hf_hub_download( repo_id=hf_model, filename="model_list.json", repo_type="model" ) model_list = pd.read_json(model_list_path, lines=False).iloc[:, 0].tolist() return np.array(model_list) def change_beta_model_list(df, old_list, new_list): old_list = old_list.tolist() old_to_new = [old_list.index(model) for model in new_list] betas_array = np.array(df["betas"].to_list()) betas_array = betas_array[:, old_to_new] return betas_array.tolist() def parse_eval_output_data( model_repo, local_eval_path, local_checkpoint_path, hf_checkpoint_repo, hf_checkpoint_file, loss_type, model_list, remove_last_hidden_json, ): ret_df, ret_model_list = None, None if local_checkpoint_path or hf_checkpoint_repo: path = local_checkpoint_path if not path: if not hf_checkpoint_file: raise ValueError( "Must provide checkpoint file along with checkpoint repo" ) path = hf_hub_download( repo_id=hf_checkpoint_repo, filename=hf_checkpoint_file, repo_type="dataset", ) df = pd.read_json(path) # caching json w/o last hidden layer if remove_last_hidden_json and local_checkpoint_path: if "last_hidden_state" in df.columns: df = df.drop(columns=["last_hidden_state"]) df.to_json(local_checkpoint_path) df = df.rename(columns={"coefs": "betas"}) # data is stored with nested lists for both etas and betas only in checkpoint data # df['eta'] = np.array(df['eta'].to_list()).flatten() df["eta"] = df["eta"].apply(lambda x: x[0] if isinstance(x, list) else x) df["betas"] = df["betas"].apply(lambda x: x[0] if isinstance(x, list) else x) val_model_list = get_model_list_from_df(df) # only betas need to be adjusted since labels are correct df["betas"] = change_beta_model_list(df, model_list, val_model_list) ret_df, ret_model_list = df, val_model_list elif local_eval_path: ret_df, 
ret_model_list = pd.read_json(local_eval_path, lines=True), model_list elif model_repo: files = list_repo_files(repo_id=model_repo, repo_type="model") if "eval_output.jsonl" not in files: raise FileNotFoundError( f"'eval_output.jsonl' not found in the hf repository'{model_repo}'." ) path = hf_hub_download( repo_id=model_repo, filename="eval_output.jsonl", repo_type="model" ) ret_df, ret_model_list = pd.read_json(path, lines=True), model_list else: raise ValueError("need to provide path for eval output data") preprocess_func = registered_helpers[loss_type]["preprocess_data"] ret_df = preprocess_func(data=ret_df) return ret_df, ret_model_list def add_labels_to_data(data, loss_type, model_list): if loss_type == "bt": data = data[~data["winner"].isin(["tie", "tie (bothbad)"])] def create_labels(row): winner = row["winner"] model_a = row["model_a"] model_b = row["model_b"] model_a_idx = np.where(model_list == model_a)[0][0] model_b_idx = np.where(model_list == model_b)[0][0] tie_bb_label = 2 if loss_type == "bag" else 1 if winner == "model_a": return np.array([model_a_idx, model_b_idx, 0]) elif winner == "model_b": return np.array([model_b_idx, model_a_idx, 0]) elif winner == "tie": return np.array([model_a_idx, model_b_idx, 1]) else: return np.array([model_a_idx, model_b_idx, tie_bb_label]) data["labels"] = data.apply(create_labels, axis=1) return data # only use if completely necessary def get_model_list_from_df(df): return np.array(sorted(pd.concat([df["model_a"], df["model_b"]]).unique())) def parse_train_data(hf_data, local_path, loss_type, train_model_list): if not hf_data and not local_path: warnings.warn( "No train data provided, marginal model type will not work if specified" ) return if local_path: if local_path.endswith(".jsonl"): data = pd.read_json(local_path, lines=True) else: data = load_from_disk(local_path)["train"].to_pandas() else: data = load_dataset(hf_data, split="train").to_pandas() return add_labels_to_data(data, loss_type, train_model_list) def parse_arena_data(path, initial_rating=1000, BASE=10, SCALE=400): if not path: warnings.warn("Ground truth arena data not passed in, some metrics not work") return df = pd.read_csv(path) # removes to avoid duplicates since not every model has a style_controlled ranking df = df[df["style_control"] == False] # ELO to beta using what eval_p2l.ipynb used df["beta"] = (df["rating"] - initial_rating) / (SCALE * math.log(BASE)) pivot = df.pivot(index="model_name", columns="category", values="beta").reindex( model_list ) if pivot.isnull().any().any(): missing_models = pivot[pivot.isnull().any(axis=1)].index.tolist() warnings.warn("Model not included in arena leaderboard:" + str(missing_models)) category_to_betas = { category: torch.tensor(pivot[category].values, dtype=torch.float) for category in pivot.columns } return category_to_betas # NOTE: Only accepts certain categories, needs to be manually added def filter_battle_data(battles, category): if battles is None: return None # expect category key by itself or key=value key_val_pair = category.split("=") key = key_val_pair[0] val = key_val_pair[1] if len(key_val_pair) == 2 else True val = bool(val) if val in ["True", "true", "False", "false"] else val try: # no filtering if key == "all": return battles # no nesting if key == "language" or key == "is_code": return battles[battles[key] == val] # nested ones need specific cases if key == "math": return battles[ battles["category_tag"].apply(lambda x: x["math_v0.1"]["math"]) ] if key == "complexity": return battles[ 
battles["category_tag"].apply( lambda x: x["criteria_v0.1"]["complexity"] ) ] if key == "creative_writing": return battles[ battles["category_tag"].apply( lambda x: x["creative_writing_v0.1"]["creative_writing"] ) ] if key == "hard": return battles[ battles["category_tag"].apply( lambda x: sum(x["criteria_v0.1"].values()) >= 6 ) ] # Category not found return None except: return None # NOTE: Only accepts certain categories, needs to be manually added def get_arena_rankings(data, category): if data is None: return None key_val_pair = category.split("=") key = key_val_pair[0] val = key_val_pair[1] if len(key_val_pair) == 2 else True val = bool(val) if val in ["True", "true", "False", "false"] else val try: # no filtering if key == "all": return data["full"] # no nesting if key == "language": return data[val.lower()] if key == "is_code": return data["coding"] if key == "math": return data["math"] if key == "hard": return data["hard_6"] if key == "creative_writing": return data["creative_writing"] return None except: return None def get_subset_prompts(output, labels, size): num_prompts = output.coefs.shape[0] sampled_indices = torch.randperm(num_prompts)[:size] sampled_coefs = output.coefs[sampled_indices, :] sampled_eta = None if output.eta is not None: sampled_eta = output.eta[sampled_indices] sampled_labels = labels[sampled_indices, :] sampled_output = HeadOutputs(coefs=sampled_coefs, eta=sampled_eta) return sampled_output, sampled_labels def get_subset_prompts_batch(output, labels, size, batch_size): num_prompts, num_models = output.coefs.shape sampled_indices = torch.randint(low=0, high=num_prompts, size=(batch_size, size)) sampled_coefs = output.coefs[sampled_indices] sampled_eta = None if output.eta is not None: sampled_eta = output.eta[sampled_indices] sampled_labels = labels[sampled_indices] sampled_output = HeadOutputs(coefs=sampled_coefs, eta=sampled_eta) return sampled_output, sampled_labels def get_ith_output(output, i): betas = output.coefs[i] eta = output.eta[i] if output.eta is not None else None return HeadOutputs(coefs=betas, eta=eta) def save_output(results, local_dir, hf_dir, file_name): if not local_dir and not hf_dir: raise ValueError("Specify a directory for outputs.") results["params"]["output_file_name"] = file_name file_name += ".json" if local_dir: path = os.path.join(local_dir, file_name) with open(path, "w") as file: json.dump(results, file, indent=4, separators=(",", ": ")) if hf_dir: output = json.dumps(results, indent=4, separators=(",", ": ")) tmp_file = io.BytesIO(output.encode("utf-8")) upload_file( path_or_fileobj=tmp_file, path_in_repo=file_name, repo_id=hf_dir, repo_type="model", ) def simple_metrics(metrics, output, labels, loss_type): results = {} for metric in tqdm(metrics, desc="Simple Metrics", unit="metrics"): metric_dict = registered_simple_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func(head_output=output, labels=labels, loss_type=loss_type) results[metric] = ( round(metric_val, 4) if isinstance(metric_val, float) else metric_val ) return results def category_metrics( metrics, output, labels, loss_type, model_type, model_list, ground_truth, arena_rankings, ): results = {} aggr_func_model = registered_helpers[model_type]["aggregrate"] # our default ground truth is marginal-gt but we can switch to arena or add configurability if desired aggr_func_gt = registered_helpers[ground_truth]["aggregrate"] model_output = aggr_func_model( head_output=output, labels=labels, model_list=model_list, loss_type=loss_type ) gt_output = 
aggr_func_gt( labels=labels, model_list=model_list, loss_type=loss_type, arena_rankings=arena_rankings, ) for metric in tqdm(metrics, desc="Category Metrics", unit="metric"): metric_dict = registered_aggr_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func( gt_output=gt_output, model_output=model_output, model_list=model_list, loss_type=loss_type, labels=labels, ) results[metric] = ( round(metric_val, 4) if isinstance(metric_val, float) else metric_val ) return results def random_subset_metrics( metrics, output, labels, subset_sizes, trials_per_subset, loss_type, model_type, model_list, ): results = {} aggr_func_model = registered_helpers[model_type]["aggregrate"] # our default ground truth is marginal-gt but we can switch to arena or add configurability if desired aggr_func_gt = registered_helpers["marginal-gt"]["aggregrate"] for idx, size in enumerate(subset_sizes): size = int(size) subset_results = {metric: 0 for metric in metrics} for _ in tqdm( range(trials_per_subset[idx]), desc=f"Random Subset size {size}", unit="trial", ): sample_output, sample_labels = get_subset_prompts(output, labels, size) model_output = aggr_func_model( head_output=sample_output, labels=sample_labels, model_list=model_list, loss_type=loss_type, ) gt_output = aggr_func_gt( labels=sample_labels, model_list=model_list, loss_type=loss_type ) for metric in metrics: metric_dict = registered_aggr_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func( gt_output=gt_output, model_output=model_output, model_list=model_list, loss_type=loss_type, ) subset_results[metric] += metric_val for metric in metrics: subset_results[metric] = round( subset_results[metric] / trials_per_subset, 4 ) results[size] = subset_results return results def aggr_scale_metrics( metrics, output, labels, subset_sizes, trials_per_subset, loss_type, model_type, model_list, arena_rankings, gt, ): results = {} aggr_func_model = registered_helpers[model_type]["aggregrate-batch"] # our default ground truth is arena ranking but we can switch to arena or add configurability if desired aggr_func_gt = registered_helpers[gt]["aggregrate"] gt_output = aggr_func_gt( labels=labels, model_list=model_list, loss_type=loss_type, arena_rankings=arena_rankings, ) # TODO: arbitray threshold to limit memory consumption for batching # max_prompts_times_samples_squared = 2e4 for idx, size in enumerate(subset_sizes): size = int(size) num_samples = int(trials_per_subset[idx]) subset_results = {metric: 0 for metric in metrics} # num_full_mini_batches = int(max( # 1, (size * (num_samples ** 2)) // max_prompts_times_samples_squared # )) num_full_mini_batches = int(max(1, num_samples // 100)) mini_batch_size = num_samples // num_full_mini_batches leftover = num_samples - (num_full_mini_batches * mini_batch_size) with tqdm(total=num_samples, desc=f"Aggr Subset Size {size}") as pbar: def run_mini_batch(batch_count): sample_output, sample_labels = get_subset_prompts_batch( output, labels, size, batch_count ) batch_output = aggr_func_model( head_output=sample_output, labels=sample_labels, model_list=model_list, loss_type=loss_type, ) for cur_output in batch_output: for metric in metrics: metric_dict = registered_aggr_metrics[loss_type] metric_func = metric_dict[metric] metric_val = metric_func( gt_output=gt_output, model_output=cur_output, model_list=model_list, loss_type=loss_type, ) subset_results[metric] += metric_val pbar.update(1) for _ in range(num_full_mini_batches): run_mini_batch(mini_batch_size) if leftover > 0: 
run_mini_batch(leftover) for metric in metrics: subset_results[metric] = round( subset_results[metric] / float(trials_per_subset[idx]), 4 ) results[size] = subset_results return results def get_metrics( val_data, train_data, arena_rankings, val_model_list, train_model_list, args ): results = {} to_inc = set(args.metrics_to_inc) output_label_func = registered_helpers[args.model_type]["output_labels"] output, labels = output_label_func( val_data=val_data, train_data=train_data, arena_rankings=arena_rankings, loss_type=args.loss_type, model_list=val_model_list, train_model_list=train_model_list, ) if "simple" in to_inc: simple_results = simple_metrics( metrics=args.simple_metrics, output=output, labels=labels, loss_type=args.loss_type, ) results["simple_metrics"] = simple_results if "category" in to_inc: category_results = category_metrics( metrics=args.category_metrics, loss_type=args.loss_type, model_type=args.model_type, model_list=val_model_list, output=output, labels=labels, ground_truth=args.ground_truth, arena_rankings=arena_rankings, ) results["category_metrics"] = category_results if "random_subsets" in to_inc: subset_results = random_subset_metrics( metrics=args.rand_subset_metrics, subset_sizes=args.rand_subset_sizes, trials_per_subset=args.rand_num_samples, loss_type=args.loss_type, model_type=args.model_type, model_list=val_model_list, output=output, labels=labels, ) results["random_subsets"] = subset_results if "aggr_scale" in to_inc: scale_results = aggr_scale_metrics( metrics=args.aggr_scale_metrics, subset_sizes=args.aggr_scale_subset_sizes, trials_per_subset=args.aggr_scale_num_samples, loss_type=args.loss_type, model_type=args.model_type, model_list=val_model_list, output=output, labels=labels, arena_rankings=arena_rankings, gt=args.ground_truth, ) results["aggr_scale"] = scale_results return results if __name__ == "__main__": parser = argparse.ArgumentParser() # model repo contains model list and potentially, eval data (eval_output.jsonl) parser.add_argument("--model_repo", type=str, default=None) parser.add_argument("--model_list_path", type=str, default=None) # val data is either in model repo, local file, or remotely as checkpoint file parser.add_argument("--eval_path", nargs="+", type=str, default=None) parser.add_argument("--checkpoint_path", nargs="+", type=str, default=None) parser.add_argument("--hf_checkpoint_repo", type=str, default=None) parser.add_argument("--hf_checkpoint_file", nargs="+", type=str, default=None) parser.add_argument("--output_dir", type=str, default=None) parser.add_argument("--hf_output_dir", type=str, default=None) parser.add_argument( "--output_file_name", type=str, nargs="+", default=["eval_metrics"] ) parser.add_argument("--hf_train_dataset", type=str, default=None) parser.add_argument("--train_path", type=str, default=None) parser.add_argument("--arena_path", type=str, default=None) parser.add_argument("--loss_type", type=str, default="bt", help="bt, bt_tie, rk") parser.add_argument( "--model_type", type=str, default="p2l", help="p2l, marginal, arena" ) parser.add_argument( "--categories", nargs="*", default=[ "all", "creative_writing", "math", "language=Chinese", "is_code", "hard", ], ) parser.add_argument( "--simple_metrics", nargs="*", default=[ "Loss", "BCELoss", "MSELoss", "Accuracy", "Tie_Loss", "Tie_Accuracy", "Tie_bb_Accuracy", "Tie_bb_Loss", "Mean-BT", "Std-BT", "Spread-BT", "Mean-Spread-BT", "Mean-IQR-BT", "Mean-Std-BT", ], ) parser.add_argument("--train_checkpoints", nargs="+", type=int, default=[]) 
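    # Example invocation (illustrative only -- the repo id and file paths below are
    # placeholders, not shipped defaults). Run from inside the p2l/ directory, since
    # this script uses local imports such as `from model import ...`:
    #
    #   python auto_evals.py \
    #       --model_repo p2el/Qwen2.5-7B-Instruct-rk-full-train \
    #       --train_path ./train_battles.jsonl \
    #       --arena_path ./arena_leaderboard.csv \
    #       --loss_type rk --model_type p2l \
    #       --output_dir ./metrics \
    #       --metrics_to_inc simple category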
parser.add_argument("--checkpoint_size", type=int, default=0) # gt is marginal on val parser.add_argument( "--category_metrics", nargs="*", default=[ "Leaderboard", "Aggr_Loss", "Aggr_BCELoss", "Aggr_Tie_Loss", "Aggr_Tie_Accuracy", "Aggr_Tie_bb_Accuracy", "Aggr_Tie_bb_Loss", "L1-Dist-Prob", "Spearman-lbs", "Kendall-lbs", "IQR-BT", "Std-BT", "Spread-BT", "Top-k-fraction", "Top-k-displace", ], ) parser.add_argument( "--rand_subset_sizes", nargs="*", default=[250, 500, 1000, 2000] ) parser.add_argument("--rand_num_samples", nargs="*", default=[50, 20, 5, 3]) parser.add_argument( "--rand_subset_metrics", nargs="*", default=["L1-Dist-Prob", "Spearman-lbs", "Kendall-lbs"], ) # gt is arena leaderboard parser.add_argument( "--aggr_scale_subset_sizes", nargs="*", default=[1, 10, 25, 100, 250, 500, 1000, 2000], ) parser.add_argument( "--aggr_scale_num_samples", nargs="*", default=[500, 500, 500, 200, 100, 40, 10, 6], ) parser.add_argument( "--aggr_scale_metrics", nargs="*", default=["L1-Dist-Prob", "Spearman-lbs", "Kendall-lbs"], ) parser.add_argument("--ground_truth", type=str, default="marginal-gt") parser.add_argument( "--metrics_to_inc", nargs="*", default=["simple", "category", "random_subsets", "aggr_scale"], ) parser.add_argument("--remove_last_hidden_json", default=True) args = parser.parse_args() start_time = time.time() for idx in range(len(args.output_file_name)): results = {} results["params"] = copy.deepcopy(vars(args)) train_model_list = parse_model_list(args.model_repo, args.model_list_path) eval_path = args.eval_path[idx] if args.eval_path else None checkpoint_path = args.checkpoint_path[idx] if args.checkpoint_path else None hf_checkpoint_file = ( args.hf_checkpoint_file[idx] if args.hf_checkpoint_file else None ) # make sure right params are dumped results["params"]["eval_path"] = eval_path results["params"]["checkpoint_path"] = checkpoint_path results["params"]["hf_checkpoint_file"] = hf_checkpoint_file val_data, val_model_list = parse_eval_output_data( args.model_repo, eval_path, checkpoint_path, args.hf_checkpoint_repo, hf_checkpoint_file, args.loss_type, train_model_list, args.remove_last_hidden_json, ) train_data = parse_train_data( args.hf_train_dataset, args.train_path, args.loss_type, train_model_list ) arena_data = parse_arena_data(args.arena_path) models = {} for category in args.categories: cat_val_data = filter_battle_data(val_data, category) cat_train_data = filter_battle_data(train_data, category) arena_rankings = get_arena_rankings(arena_data, category) current_model = str(args.model_type) + "-" + category models[current_model] = get_metrics( cat_val_data, cat_train_data, arena_rankings, val_model_list, train_model_list, args, ) # merely for marginal train checkpointing for checkpoint in args.train_checkpoints: num_data = checkpoint * args.checkpoint_size checkpoint_train_data = train_data.head(num_data) cat_train_data = filter_battle_data(checkpoint_train_data, category) models[current_model + f"-checkpoint-{checkpoint}"] = get_metrics( cat_val_data, cat_train_data, arena_rankings, val_model_list, train_model_list, args, ) results["models"] = models save_output( results, args.output_dir, args.hf_output_dir, args.output_file_name[idx] ) end_time = time.time() total_time = end_time - start_time minutes = int(total_time // 60) seconds = int(total_time % 60) print(f"\nTotal time taken: {minutes} minutes and {seconds} seconds") ================================================ FILE: p2l/dataset.py ================================================ from transformers import 
PreTrainedTokenizer from datasets import Dataset, DatasetDict, load_dataset, load_from_disk import torch from typing import List def get_model_list(dataset: Dataset): model_a_values = dataset.unique("model_a") model_b_values = dataset.unique("model_b") model_list_with_repeats = [] for value in model_a_values: model_list_with_repeats.append(value) for value in model_b_values: model_list_with_repeats.append(value) model_set = set(model_list_with_repeats) model_list = sorted(list(model_set)) return model_list def get_dataset(path: str, split: str, from_disk=False): if from_disk: dataset = load_from_disk(path) if isinstance(dataset, DatasetDict): dataset = dataset[split] return dataset else: return load_dataset(path, split=split) def _translate_label( labels: List[int], train_model_list: List[str], val_model_list: List[str] ) -> List[int]: label_copy = labels[:] label_copy[0] = train_model_list.index(val_model_list[labels[0]]) label_copy[1] = train_model_list.index(val_model_list[labels[1]]) return label_copy def translate_val_data( val_data: Dataset, train_model_list: List[str], val_model_list: List[str] ) -> Dataset: # Validate val models for val_model in val_model_list: assert val_model in train_model_list, val_model # Translate val dataset val_data = val_data.map( lambda labels: { "labels": _translate_label(labels, train_model_list, val_model_list) }, input_columns="labels", num_proc=16, ) return val_data class DataCollator: def __init__(self, tokenizer, max_length, weight=None, reweight_scale=None): self.tokenizer: PreTrainedTokenizer = tokenizer self.max_length: int = max_length self.weight: bool = weight self.reweight_scale: float = reweight_scale self.first = True def __call__(self, data): prompts = [] for seq in data: if isinstance(seq["prompt"], str): prompts.append([{"role": "user", "content": seq["prompt"]}]) else: prompts.append([{"role": "user", "content": turn} for turn in seq["prompt"]]) labels = torch.tensor([seq["labels"].tolist() for seq in data]) formatted_prompts = self.tokenizer.apply_chat_template( prompts, tokenize=False, add_generation_prompt=False, add_special_tokens=False, ) # Scrub any instances of cls token from the data, otherwise model will error. 
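        # (Context: a single cls_token is appended to the end of each formatted
        # prompt just below; P2LModel.forward() later selects the hidden state at
        # the position where input_ids == cls_token_id and feeds it to the
        # coefficient head, so exactly one CLS token per sequence is required.)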
formatted_prompts = [ prompt.replace(self.tokenizer.cls_token, "") for prompt in formatted_prompts ] formatted_prompts = [ seq + self.tokenizer.cls_token for seq in formatted_prompts ] if self.first: print(formatted_prompts) self.first = False encoded = self.tokenizer( formatted_prompts, padding=True, return_tensors="pt", add_special_tokens=False, truncation=True, max_length=self.max_length, ) out = { "input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"], "labels": labels, } if self.weight: if "weight" in data[0]: out["weights"] = torch.tensor([seq["weight"].tolist() for seq in data]) if self.reweight_scale: out["weights"] *= self.reweight_scale else: out["weights"] = None return out ================================================ FILE: p2l/endpoint.py ================================================ import argparse import json from typing import Dict, Tuple, List, Optional import torch import uvicorn from fastapi import FastAPI, Header, HTTPException from huggingface_hub import hf_hub_download from pydantic import BaseModel from transformers import ( AutoTokenizer, TextClassificationPipeline, pipeline, PreTrainedModel, ) from p2l.model import get_p2l_model, P2LOutputs from contextlib import asynccontextmanager import logging logging.getLogger().setLevel(logging.DEBUG) def parse_args(): parser = argparse.ArgumentParser(description="Run FastAPI with Uvicorn") parser.add_argument( "--model-path", "-m", type=str, default="p2el/Qwen2.5-7B-Instruct-rk-full-train", help="Path to the model repository", ) parser.add_argument( "--model-type", "-mt", type=str, default="qwen2", help="Type of the model", ) parser.add_argument( "--head-type", "-ht", type=str, default="rk", help="Type of model head", ) parser.add_argument( "--loss-type", "-lt", type=str, default="rk", help="Type of the loss function", ) parser.add_argument( "--api-key", "-a", type=str, default="-", help="API key for authorization", ) parser.add_argument( "--host", "-H", type=str, default="0.0.0.0", help="Host to run the server on", ) parser.add_argument( "--port", "-p", type=int, default=10250, help="Port to run the server on", ) parser.add_argument( "--reload", action=argparse.BooleanOptionalAction, default=True, help="Whether to reload the endpoint on detected code change, needs workers to be 1.", ) parser.add_argument( "--workers", type=int, default=1, help="Number of endpoint workers (will hold a model per worker).", ) parser.add_argument( "--cuda", action=argparse.BooleanOptionalAction, default=True, help="Flag to enable using a GPU to host the model. 
Flag is true by default.", ) args = parser.parse_args() return args @asynccontextmanager async def lifespan(app: FastAPI): args = parse_args() model, tokenizer, model_list = load_model( args.model_path, args.model_type, args.head_type, args.loss_type, ) pipe = pipeline( task="text-classification", model=model, tokenizer=tokenizer, device="cuda" if args.cuda else "cpu", pipeline_class=P2LPipeline, ) app.state.api_key = args.api_key app.state.model_list = model_list app.state.model = model app.state.tokenizer = tokenizer app.state.pipe = pipe try: yield finally: pass # Initialize FastAPI app app = FastAPI(lifespan=lifespan) # Define the input data structure class InputData(BaseModel): prompt: list[str] class OutputData(BaseModel): coefs: List[float] eta: Optional[float] = None class ModelList(BaseModel): models: List[str] class P2LPipeline(TextClassificationPipeline): def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, torch.Tensor]: return_tensors = self.framework inputs = inputs["prompt"] messages = [{"role": "user", "content": p} for p in inputs] formatted = self.tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=False, add_special_tokens=False, ) formatted = formatted + self.tokenizer.cls_token logging.debug(f"Formatted input: {formatted}") return self.tokenizer( formatted, return_tensors=return_tensors, max_length=8192, padding="longest", truncation=True, ) def postprocess( self, model_outputs: P2LOutputs, function_to_apply=None, top_k=1, _legacy=True ): model_outputs = P2LOutputs(model_outputs) eta = model_outputs.eta return OutputData( coefs=model_outputs.coefs.cpu().float().tolist()[0], eta=eta.cpu().float().item() if eta else None, ) @app.post("/predict") async def predict(input_data: InputData, api_key: str = Header(...)): logging.debug(f"Received Request: {input_data}.") if api_key != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") try: pipe: P2LPipeline = app.state.pipe logging.debug(f"Input Prompt: {input_data.prompt}") output = pipe(inputs=input_data.model_dump()) logging.debug(f"Output: {output}") return output except Exception as e: logging.debug(e) raise HTTPException(status_code=500, detail=str(e)) @app.get("/models") async def models(api_key: str = Header(...)): logging.debug(f"Received Model List Request.") if api_key != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") try: return ModelList( models=app.state.model_list, ) except Exception as e: raise HTTPException(status_code=500, detail=str(e)) def load_model( model_name, model_type, head_type, loss_type ) -> Tuple[PreTrainedModel, AutoTokenizer, List[str]]: # Download and load the model list fname = hf_hub_download( repo_id=model_name, filename="model_list.json", repo_type="model" ) with open(fname) as fin: model_list = json.load(fin) # Initialize tokenizer tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer.truncation_side = "left" tokenizer.padding_side = "right" # Get the model class and load the model model_cls = get_p2l_model(model_type, loss_type, head_type) model = model_cls.from_pretrained( model_name, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), torch_dtype=torch.bfloat16, ) return model, tokenizer, model_list if __name__ == "__main__": args = parse_args() uvicorn.run( "p2l.endpoint:app", port=args.port, host=args.host, reload=args.reload, workers=args.workers, ) ================================================ FILE: p2l/eval.py ================================================ import 
argparse from p2l.model import get_p2l_model, P2LOutputs from transformers import pipeline, TextClassificationPipeline, AutoTokenizer from huggingface_hub import hf_hub_download from datasets import load_dataset import torch from typing import Dict import pandas as pd import os import json from tqdm.auto import tqdm from torch.utils.data import Dataset from glob import glob class P2LPipeline(TextClassificationPipeline): def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, torch.Tensor]: return_tensors = self.framework messages = [{"role": "user", "content": inputs}] formatted = self.tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=False, add_special_tokens=False, ) formatted = formatted + self.tokenizer.cls_token return self.tokenizer( formatted, return_tensors=return_tensors, max_length=8192, padding="longest", truncation=True, ) def postprocess( self, model_outputs: P2LOutputs, function_to_apply=None, top_k=1, _legacy=True ): model_outputs = P2LOutputs(model_outputs) eta = model_outputs.eta gamma = model_outputs.gamma return dict( coefs=model_outputs.coefs.cpu().float().numpy(), eta=eta.cpu().float().numpy() if eta else None, gamma=gamma.cpu().float().numpy() if gamma else None, last_hidden_state=model_outputs.last_hidden_state.cpu().float().numpy(), ) class ListDataset(Dataset): def __init__(self, original_list): self.original_list = original_list def __len__(self): return len(self.original_list) def __getitem__(self, i): return self.original_list[i] def main(args, local_file=None): os.makedirs(args.output_dir, exist_ok=True) dataset = load_dataset(args.dataset, split=args.dataset_split) if local_file: fname = os.path.join(local_file, "model_list.json") else: fname = hf_hub_download( repo_id=args.model_path, filename="model_list.json", repo_type="model" ) with open(fname) as fin: model_list = json.load(fin) model_cls = get_p2l_model(args.model_type, args.loss_type, args.head_type) if local_file: tokenizer = AutoTokenizer.from_pretrained(local_file, local_files_only=True) model = model_cls.from_pretrained( local_file, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), torch_dtype=torch.bfloat16, local_files_only=True, ) else: tokenizer = AutoTokenizer.from_pretrained(args.model_path) model = model_cls.from_pretrained( args.model_path, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), torch_dtype=torch.bfloat16, ) device = "cuda" if torch.cuda.is_available() else "cpu" pipe = pipeline( task="text-classification", model=model, tokenizer=tokenizer, device=device, pipeline_class=P2LPipeline, ) prompts = ListDataset(dataset["prompt"]) with torch.no_grad(): outputs = [ out for out in tqdm( pipe(prompts, batch_size=args.batch_size), total=len(prompts) ) ] df = dataset.to_pandas() outputs_df = pd.DataFrame.from_records(outputs) if args.drop_hidden: outputs_df = outputs_df.drop("last_hidden_state", axis=1) df = pd.concat((df, outputs_df), axis=1) if local_file: fname = local_file.split("/")[-1] + ".json" else: fname = args.model_path.split("/")[-1] + ".json" fpath = os.path.join(args.output_dir, fname) df.to_json(fpath, orient="records", indent=4, force_ascii=False) if args.output_hf_path: from datasets import Dataset df = pd.read_json(fpath) hf_dataset = Dataset.from_pandas(df) hf_dataset.push_to_hub(args.output_hf_path, private=True) print("Results pushed to hub!") if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument( "--model-path", "-m", type=str, default=None, help="Huggingface model path" ) 
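    # Example invocation (illustrative only -- the model and dataset ids below are
    # placeholders; the exact invocation depends on how the repo root is put on
    # PYTHONPATH, since this module imports `p2l.model`):
    #
    #   PYTHONPATH=. python p2l/eval.py \
    #       -m p2el/Qwen2.5-7B-Instruct-rk-full-train \
    #       -d lmarena-ai/<prompt-dataset> -ds train \
    #       -mt qwen2 -ht rk -lt rk \
    #       -bs 8 -od outputs --drop-hidden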
parser.add_argument( "--training-output-dir", "-t", type=str, default=None ) parser.add_argument( "--dataset", "-d", type=str, required=True, help="Huggingface dataset path" ) parser.add_argument("--output-hf-path", "-oh", type=str, default=None) parser.add_argument( "--dataset-split", "-ds", type=str, default="train", help="Huggingface dataset split", ) parser.add_argument( "--model-type", "-mt", type=str, default="qwen2", help="Model type (qwen2, llama, etc)", ) parser.add_argument( "--head-type", "-ht", type=str, default="bt", help="Head type (Bradely Terry, Rao-Kupper, etc)", ) parser.add_argument( "--loss-type", "-lt", type=str, default="bt", help="Loss type (Bradely Terry, Rao-Kupper, etc)", ) parser.add_argument("--batch-size", "-bs", type=int, default=1, help="Batch size") parser.add_argument("--output-dir", "-od", type=str, default="outputs") parser.add_argument("--drop-hidden", action=argparse.BooleanOptionalAction, default=False) args = parser.parse_args() if args.training_output_dir: for file in glob(os.path.join(args.training_output_dir, "*")): main(args, file) else: main(args) ================================================ FILE: p2l/model.py ================================================ import torch from transformers import ( Qwen2Model, Qwen2PreTrainedModel, LlamaModel, LlamaPreTrainedModel, PreTrainedModel, AutoTokenizer, ) from transformers.utils import ModelOutput from dataclasses import dataclass import torch.nn as nn import torch.nn.functional as F from typing import Dict, Tuple, Callable, Optional registered_transformers: Dict[str, Tuple[PreTrainedModel, PreTrainedModel]] = { "qwen2": (Qwen2PreTrainedModel, Qwen2Model), "llama": (LlamaPreTrainedModel, LlamaModel), } registered_losses: Dict[str, Callable] = {} registered_heads: Dict[str, nn.Module] = {} registered_inits: Dict[str, Callable] = {} registered_aggr_models: Dict[str, nn.Module] = {} registered_pairwise_losses: Dict[str, Callable] = {} def register_loss(name: str): def decorator(func: Callable): registered_losses[name] = func return func return decorator def register_head(name: str): def decorator(func: Callable): registered_heads[name] = func return func return decorator def register_init(name: str): def decorator(func: Callable): registered_inits[name] = func return func return decorator def register_aggr_model(name: str): def decorator(func: Callable): registered_aggr_models[name] = func return func return decorator def register_pairwise_loss(name: str): def decorator(func: Callable): registered_pairwise_losses[name] = func return func return decorator def register_init(name: str): def decorator(func: Callable): registered_inits[name] = func return func return decorator @dataclass class HeadOutputs(ModelOutput): coefs: torch.FloatTensor = None eta: Optional[torch.FloatTensor] = None gamma: Optional[torch.FloatTensor] = None @dataclass class P2LOutputs(ModelOutput): coefs: torch.FloatTensor = None eta: Optional[torch.FloatTensor] = None gamma: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None @register_loss("bt") def BT_loss( head_output: HeadOutputs, labels: torch.Tensor, weights: torch.Tensor = None, **kwargs, ): # labels columns are in the form (winner_idx, loser_idx) coefs = head_output.coefs paired_coefs = coefs.gather(dim=-1, index=labels).contiguous() paired_delta_logit = ( paired_coefs[:, 0] - paired_coefs[:, 1] ) # subtract winner bt from loser bt neg_log_sigma = -F.logsigmoid(paired_delta_logit) # get neg log prob if 
weights is not None: neg_log_sigma = neg_log_sigma * weights loss = neg_log_sigma.mean() return loss @register_loss("bt-tie") def BT_tie_loss( head_output: HeadOutputs, labels: torch.Tensor, weights: torch.Tensor = None, **kwargs, ): # labels columns are in the form (winner_idx, loser_idx, tie_indicator) coefs = head_output.coefs model_idx = labels[:, :2] # (batch_dim, 2) tie_ind = labels[:, -1] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = ( paired_coefs[:, 0] - paired_coefs[:, 1] ) # subtract winner bt from loser bt # computes bradley-terry loss where tie is half win and half loss neg_log_sigma = -1 * torch.where( tie_ind == 0, F.logsigmoid(paired_delta_logit), 0.5 * (F.logsigmoid(paired_delta_logit) + F.logsigmoid(-1 * paired_delta_logit)), ) if weights is not None: neg_log_sigma = neg_log_sigma * weights loss = neg_log_sigma.mean() return loss BETA = 0.1 @register_loss("rk") def RK_Loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): # labels columns are in form (winner_idx, loser_idx, tie_indicator) coefs = head_output.coefs # eta = torch.exp(head_output.eta).squeeze(-1) # eta > 0 eta = torch.clamp( torch.nn.functional.softplus(head_output.eta - 22.5, BETA).squeeze(-1), min=0.02 ) # eta = torch.abs(head_output.eta).squeeze(-1) model_idx = labels[:, :2] # (batch_dim, 2) paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] # compute RK probabilities p_w = torch.sigmoid(paired_delta_logit - eta) p_l = torch.sigmoid(-1 * paired_delta_logit - eta) p_t = 1 - p_w - p_l # point-wise likelihood A = torch.stack((p_w, p_t)) # (2, batch_dim) tie_ind = labels[:, -1].unsqueeze(0) # (1, batch_dim) p = A.take_along_dim(dim=0, indices=tie_ind) # mathematically p_t < 1 always but bfloat smh p = torch.clamp(p, min=1e-3) # eps = 1e-10 loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("rk-reparam") def RK_Reparam_Loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) p_win = pi_win / (pi_win + theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose assert p_win.shape == p_lose.shape == p_tie.shape P = torch.hstack((p_win, p_tie)) tie_ind = labels[:, -1].unsqueeze(-1) p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("ba") def BA_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): # labels are (winner_idx, loser_idx, tie_indicator (0 for no tie, 1 for tie, 2 for tie both bad)) coefs = head_output.coefs eta = head_output.eta gamma = head_output.gamma theta = torch.exp(eta) + 1.02 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = torch.exp(gamma) p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + 
pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb P = torch.hstack((p_win, p_tie, p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-2) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() print("loss: ", loss.item()) return loss @register_loss("bag") @register_loss("grk") def GRK_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): # labels are (winner_idx, loser_idx, tie_indicator (0 for no tie, 1 for tie, 2 for tie both bad)) coefs = head_output.coefs.float() eta = head_output.eta.float() theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = 1.0 p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb assert p_win.shape == p_lose.shape == p_tie_bb.shape == p_tie.shape P = torch.hstack((p_win, p_tie, p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() # print("loss: ", loss.item()) return loss @register_head("bt") class BTHead(nn.Module): def __init__( self, input_dim, output_dim, linear_head_downsize_factor=None, **kwargs ) -> None: super().__init__() if linear_head_downsize_factor: inner_dim = int(output_dim // linear_head_downsize_factor) self.head = nn.Sequential( nn.Linear(in_features=input_dim, out_features=inner_dim, bias=True), nn.Linear(in_features=inner_dim, out_features=output_dim, bias=True), ) else: self.head = nn.Linear( in_features=input_dim, out_features=output_dim, bias=True ) def forward(self, last_hidden_dim: torch.Tensor): coefs = self.head(last_hidden_dim) return HeadOutputs(coefs=coefs) @register_head("rk") class RKHead(nn.Module): def __init__( self, input_dim, output_dim, eta_dim=1, linear_head_downsize_factor=None, eta_downsize=False, **kwargs, ) -> None: super().__init__() # If linear header downsize factor and eta downsize, then eta is calculated off of the downsized dim, not the hidden dim. 
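        # Worked example (hypothetical sizes): with output_dim = 130 models and
        # linear_head_downsize_factor = 10, inner_dim = 130 // 10 = 13, so the
        # coefficient head becomes Linear(input_dim, 13) -> Linear(13, 130); the
        # shared first layer is reused by the eta head only when eta_downsize=True.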
if linear_head_downsize_factor: inner_dim = output_dim // linear_head_downsize_factor share_layer = nn.Linear( in_features=input_dim, out_features=inner_dim, bias=True ) self.head = nn.Sequential( share_layer, nn.Linear(in_features=inner_dim, out_features=output_dim, bias=True), ) if eta_downsize: self.eta_head = nn.Sequential( share_layer, nn.Linear(in_features=inner_dim, out_features=eta_dim, bias=True), ) else: self.eta_head = nn.Linear( in_features=output_dim, out_features=eta_dim, bias=True ) else: self.head = nn.Linear( in_features=input_dim, out_features=output_dim, bias=True ) self.eta_head = nn.Linear( in_features=input_dim, out_features=eta_dim, bias=True ) def forward(self, last_hidden_dim: torch.Tensor): coefs = self.head(last_hidden_dim) eta = self.eta_head(last_hidden_dim) return HeadOutputs(coefs=coefs, eta=eta) @register_head("ba") class BAHead(nn.Module): def __init__( self, input_dim, output_dim, linear_head_downsize_factor=None, **kwargs, ) -> None: super().__init__() if linear_head_downsize_factor: raise NotImplementedError("Sorry I didn't implement this.") self.head = nn.Linear(in_features=input_dim, out_features=output_dim, bias=True) self.eta_head = nn.Linear(in_features=input_dim, out_features=1, bias=True) self.gamma_head = nn.Linear(in_features=input_dim, out_features=1, bias=True) def forward(self, last_hidden_dim: torch.Tensor): coefs = self.head(last_hidden_dim) eta = self.eta_head(last_hidden_dim) gamma = self.gamma_head(last_hidden_dim) return HeadOutputs(coefs=coefs, eta=eta, gamma=gamma) @register_init("reset_params") def reset_params_init(module): return module.reset_parameters() @register_init("he_unif") def he_unif_init(module): return nn.init.kaiming_uniform_(module.weight, nonlinearity="sigmoid") @register_init("xavier_unif") def xavier_unif_init(module): return nn.init.xavier_uniform_(module.weight) @register_init("tiny_normal") def tiny_normal_init(module): return nn.init.kaiming_normal_(module.weight) def get_p2l_model( model_type: str, loss_type: str, head_type: str, init_type: str = "reset_params" ) -> PreTrainedModel: pretrained_model_cls, model_cls = registered_transformers[model_type] criterion = registered_losses[loss_type] head_layer = registered_heads[head_type] init_func = registered_inits[init_type] class CustomPretrainedModel(pretrained_model_cls): """Defines the appropriate pretrained class for the given model name. 
This is done so that the value head init scheme is correct.""" def _init_weights(self, module): std = self.config.initializer_range if isinstance(module, nn.Linear): init_func(module) # was reset params if module.bias is not None: module.bias.data.zero_() elif isinstance(module, nn.Embedding): module.weight.data.normal_(mean=0.0, std=std) if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() class P2LModel(CustomPretrainedModel): def __init__( self, config, CLS_id, num_models, linear_head_downsize_factor=None, head_kwargs={}, **kwargs, ): super().__init__(config) self.num_models = num_models self.cls_token_id = CLS_id self.model = model_cls(config) self.head = head_layer( input_dim=config.hidden_size, output_dim=self.num_models, linear_head_downsize_factor=linear_head_downsize_factor, **head_kwargs, ) self.post_init() def freeze_transformer(self): for param in self.model.parameters(): param.requires_grad = False def get_input_embeddings(self): return self.model.embed_tokens def set_input_embeddings(self, value): self.model.embed_tokens = value def forward(self, input_ids, attention_mask, labels=None, weights=None): batch_size = input_ids.shape[0] hidden_outputs = self.model( input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=False, ).last_hidden_state # (bs, num_token, embed_dim) cls_mask = input_ids == self.cls_token_id # double check this is getting the current CLS token cls_hidden_dim = hidden_outputs[cls_mask] assert ( cls_hidden_dim.shape[0] == batch_size ), f"input ids {input_ids.shape}, cls_mask {cls_mask.shape}, cls_logit {cls_hidden_dim.shape}" head_output = self.head(cls_hidden_dim) if labels is not None: loss = criterion(head_output, labels, weights=weights) outputs = P2LOutputs( coefs=head_output.coefs, last_hidden_state=cls_hidden_dim, eta=head_output.eta, gamma=head_output.gamma, loss=loss, ) else: outputs = P2LOutputs( coefs=head_output.coefs, last_hidden_state=cls_hidden_dim, eta=head_output.eta, gamma=head_output.gamma, ) return outputs return P2LModel def get_tokenizer( tokenizer_name, chat_template, pad_token_if_none="<|pad|>", cls_token_if_none="<|cls|>", ): tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) tokenizer.truncation_side = "left" tokenizer.padding_side = "right" if chat_template: tokenizer.chat_template = chat_template if "pad_token" not in tokenizer.special_tokens_map: tokenizer.add_special_tokens({"pad_token": pad_token_if_none}) if "cls_token" not in tokenizer.special_tokens_map: tokenizer.add_special_tokens({"cls_token": cls_token_if_none}) return tokenizer @register_aggr_model("bt") @register_aggr_model("bt-tie") class BTAggrModel(nn.Module): def __init__(self, num_models, batch_size=1): super().__init__() self.coefs = nn.Parameter( nn.init.constant_(torch.empty(batch_size, num_models), 0.5) ) self.eta = None def forward(self): return self.coefs, self.eta @register_aggr_model("rk") @register_aggr_model("rk-reparam") @register_aggr_model("bag") @register_aggr_model("grk") class RKAggrModel(nn.Module): def __init__(self, num_models, batch_size=1): super().__init__() self.coefs = nn.Parameter( nn.init.constant_(torch.empty(batch_size, num_models), 0.5) ) self.eta = nn.Parameter(nn.init.constant_(torch.empty(batch_size, 1), 0.1)) def forward(self): return self.coefs, self.eta @register_pairwise_loss("bt") @register_pairwise_loss("bt-tie") def pairwise_batch_BT_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor ): real_betas = real_output.coefs aggregated_betas = 
aggregated_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] pred_probs = torch.sigmoid(beta_i_agg - beta_j_agg) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1) eps = 1e-9 neg_log_prob = -( true_probs * torch.log(pred_probs_expanded + eps) + (1 - true_probs) * torch.log(1 - pred_probs_expanded + eps) ) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss # batched loss @register_pairwise_loss("rk") def pairwise_batch_RK_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor ): real_betas = real_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] aggregated_betas = aggregated_output.coefs BETA = 0.1 aggregated_eta = torch.clamp( torch.nn.functional.softplus(aggregated_output.eta - 22.5, BETA).squeeze(-1), min=0.02, ) pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] aggregated_eta = aggregated_eta.unsqueeze(-1) pred_probs_win = torch.sigmoid(beta_i_agg - beta_j_agg - aggregated_eta) pred_probs_loss = torch.sigmoid(beta_j_agg - beta_i_agg - aggregated_eta) pred_probs_tie = 1 - pred_probs_win - pred_probs_loss pred_probs = torch.stack((pred_probs_win, pred_probs_loss, pred_probs_tie), dim=-1) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1, -1) eps = 1e-9 neg_log_prob = -torch.sum(true_probs * torch.log(pred_probs_expanded + eps), dim=-1) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss # batched @register_pairwise_loss("rk-reparam") def pairwise_batch_RK_reparam_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor, **kwargs, ): real_betas = real_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] aggregated_betas = aggregated_output.coefs aggregrated_theta = torch.exp(aggregated_output.eta) + 1.000001 pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] pi_win = torch.exp(beta_i_agg) pi_lose = torch.exp(beta_j_agg) p_win = pi_win / (pi_win + aggregrated_theta * pi_lose + 1.0) p_lose = pi_lose / (pi_lose + aggregrated_theta * pi_win + 1.0) p_tie = 1.0 - p_win - p_lose pred_probs = torch.stack((p_win, p_lose, p_tie), dim=-1) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1, -1) eps = 1e-9 neg_log_prob = -torch.sum(true_probs * torch.log(pred_probs_expanded + eps), dim=-1) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss def get_bag_probs(beta_win, beta_lose, gamma, theta): pi_win = torch.exp(beta_win) pi_lose = torch.exp(beta_lose) pi_gamma = 1.0 p_win = pi_win / (pi_win + theta * pi_lose + pi_gamma) p_lose = pi_lose / (pi_lose + theta * pi_win + pi_gamma) p_tie_bb = pi_gamma / (pi_gamma + pi_win + pi_lose) p_tie = 1.0 - p_win - p_lose - p_tie_bb return torch.stack((p_win, p_lose, p_tie, p_tie_bb), dim=-1) # batched @register_pairwise_loss("bag") @register_pairwise_loss("grk") def 
pairwise_batch_bag_loss( real_output: HeadOutputs, aggregated_output: HeadOutputs, true_probs: torch.tensor, **kwargs, ): real_betas = real_output.coefs num_prompts, num_models = real_betas.shape[-2], real_betas.shape[-1] aggregated_betas = aggregated_output.coefs aggregrated_theta = torch.exp(aggregated_output.eta) + 1.000001 pair_indices = torch.tensor( [(i, j) for i in range(num_models) for j in range(i + 1, num_models)], dtype=torch.long, ) beta_i_agg = aggregated_betas[:, pair_indices[:, 0]] beta_j_agg = aggregated_betas[:, pair_indices[:, 1]] pred_probs = get_bag_probs(beta_i_agg, beta_j_agg, 1.0, aggregrated_theta) pred_probs_expanded = pred_probs.unsqueeze(1).expand(-1, num_prompts, -1, -1) eps = 1e-9 neg_log_prob = -torch.sum(true_probs * torch.log(pred_probs_expanded + eps), dim=-1) batch_losses = neg_log_prob.mean(dim=(1, 2)) loss = batch_losses.mean() return loss @register_loss("tie-rk") def RK_Tie_Loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = torch.clamp( torch.nn.functional.softplus(head_output.eta - 22.5, BETA).squeeze(-1), min=0.02 ) model_idx = labels[:, :2] paired_coefs = coefs.gather(dim=-1, index=model_idx).contiguous() paired_delta_logit = paired_coefs[:, 0] - paired_coefs[:, 1] p_w = torch.sigmoid(paired_delta_logit - eta) p_l = torch.sigmoid(-1 * paired_delta_logit - eta) p_t = 1 - p_w - p_l p_not_t = p_w + p_l p_t = p_t A = torch.stack((p_not_t, p_t)) tie_ind = labels[:, -1].unsqueeze(0) p = A.take_along_dim(dim=0, indices=tie_ind) p = torch.clamp(p, min=1e-3) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("tie-bag") @register_loss("tie-grk") def bag_tie_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() p_win, p_lose, p_tie, p_tie_bb = torch.unbind( get_bag_probs(beta_win, beta_lose, 1.0, theta), dim=-1 ) P = torch.hstack((p_win + p_lose, p_tie + p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) tie_ind = torch.where(tie_ind == 0, 0, 1) # segment into ties and not ties p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss @register_loss("tie-bb-bag") @register_loss("tie-bb-grk") def bag_tie_bb_loss( head_output: HeadOutputs, labels: Dict, weights: torch.Tensor = None, **kwargs ): coefs = head_output.coefs eta = head_output.eta theta = torch.exp(eta) + 1.000001 winner_idx = labels[:, 0:1] loser_idx = labels[:, 1:2] beta_win = coefs.gather(dim=-1, index=winner_idx).contiguous() beta_lose = coefs.gather(dim=-1, index=loser_idx).contiguous() p_win, p_lose, p_tie, p_tie_bb = torch.unbind( get_bag_probs(beta_win, beta_lose, 1.0, theta), dim=-1 ) P = torch.hstack((p_win + p_lose + p_tie, p_tie_bb)) tie_ind = labels[:, -1].unsqueeze(-1) tie_ind = torch.where(tie_ind == 2, 1, 0) # index should be 1 if tie-bb p = P.gather(dim=-1, index=tie_ind).contiguous() p = torch.clamp(p, min=1e-6) loss = -torch.log(p) if weights: loss = loss * weights loss = loss.mean() return loss ================================================ FILE: p2l/train.py ================================================ import argparse import os import yaml import json import random 
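# Example launch (illustrative only -- the launcher, config choice, and values are
# placeholders, not verified defaults; assumes the repo root is on PYTHONPATH since
# this module imports `p2l.dataset` and `p2l.model`):
#
#   deepspeed p2l/train.py --config training_configs/Qwen2.5-7B-full-train.yaml --save-steps 500
#
# The config YAML must define at least the required keys read below; a hypothetical sketch:
#
#   pretrain_model_name: Qwen/Qwen2.5-7B-Instruct   # hypothetical base model
#   train_data_path: <hf-or-local-train-dataset>
#   val_data_path: <hf-or-local-val-dataset>
#   output_dir: ./models
#   learning_rate: 8.0e-6                           # placeholder value
#   batch_size: 1
#   gradient_accumulation_steps: 8
#   max_length: 8192
#   adam_epsilon: 1.0e-8
#   deepspeed_config_path: deepspeed/zero1.json
#   model_type: qwen2
#   loss_type: rk
#   head_type: rk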
from transformers import Trainer, TrainingArguments, set_seed from p2l.dataset import DataCollator, get_model_list, get_dataset, translate_val_data from p2l.model import get_p2l_model, get_tokenizer from torch.utils.data import Sampler from typing import Optional from huggingface_hub import HfApi # Want control over data ordering, use no shuffle trainer. class NoShuffleTrainer(Trainer): def _get_train_sampler(self) -> Optional[Sampler]: return None def train_model(args): with open(args.config, "r") as file: config = yaml.safe_load(file) learning_rate = config["learning_rate"] # Microbatch size batch_size = config["batch_size"] # HF data path train_data_path = config["train_data_path"] val_data_path = config["val_data_path"] output_dir = config["output_dir"] pretrain_model_name = config["pretrain_model_name"] # Prompts will be truncted to this length max_length = config["max_length"] gradient_accumulation_steps = config["gradient_accumulation_steps"] # Deepspeed config choices can be found in the deepspeed directory deepspeed_config_path = config["deepspeed_config_path"] # Type of transformer, see model.py for options. model_type = config["model_type"] # Loss type (e.g, bt, rk), see model.py for options. loss_type = config["loss_type"] # The linear head type, see model.py for options. head_type = config["head_type"] # Epsilon value for Adam adam_epsilon = config["adam_epsilon"] # Optional epochs = config.get("num_train_epochs", 1) lr_scheduler = config.get("lr_schedule", "constant") chat_template = config.get("chat_template", None) # Downsize the rank of the classification head. linear_head_downsize_factor = config.get("linear_head_downsize_factor", None) # Whether to weight the loss. If this is true, it expects that the dataset has a "weight" column. weighted_loss = config.get("weighted_loss", False) # kwargs for the head init. head_config = config.get("head_config", {}) # If the tokenizer/model does not already have a cls token, this will be used. cls_token_if_none = config.get("cls_token_if_none", "<|cls|>") # If the tokenizer/model does not already have a pad token, this will be used. 
pad_token_if_none = config.get("pad_token_if_none", "<|pad|>") # If using weighted loss, scalar reweight factor reweight_scale = config.get("reweight_scale", None) proj_name = config.get("proj_name", None) init_type = config.get("init_type", "reset_params") train_head_only = config.get("train_head_only", False) load_train_data_from_disk = config.get("load_train_data_from_disk", False) load_val_data_from_disk = config.get("load_val_data_from_disk", False) LOCAL_RANK = int(os.environ.get("LOCAL_RANK", -1)) os.makedirs(output_dir, exist_ok=True) # define project name if not proj_name: proj_name = f"{pretrain_model_name.split('/')[1]}_lr{learning_rate}_bs{batch_size}_ep{epochs}" print(f"project name: {proj_name}") output_path = os.path.join(output_dir, proj_name) if args.checkpoint: resume_from_checkpoint = args.checkpoint print("resuming from checkpoint") else: resume_from_checkpoint = False if not resume_from_checkpoint: version = 1 while os.path.exists(output_path): output_path = output_path.replace(f"_{version - 1}", "") output_path = output_path + f"_{version}" version += 1 with open(deepspeed_config_path) as fin: deepspeed_config = json.load(fin) random.seed(42) set_seed(42) training_args = TrainingArguments( output_dir=output_path, report_to="wandb", run_name=proj_name, num_train_epochs=epochs, gradient_accumulation_steps=gradient_accumulation_steps, save_strategy="no" if args.save_steps == -1 else "steps", save_steps=None if args.save_steps == -1 else args.save_steps, save_only_model=True, eval_strategy="no", logging_strategy="steps", logging_steps=1, ddp_timeout=9999999, per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, eval_accumulation_steps=1, eval_steps=args.eval_steps, lr_scheduler_type=lr_scheduler, logging_dir="./logs", fp16=False, bf16=True, learning_rate=learning_rate, adam_epsilon=adam_epsilon, load_best_model_at_end=False, gradient_checkpointing=True, do_train=True, bf16_full_eval=True, save_safetensors=True, disable_tqdm=False, remove_unused_columns=False, deepspeed=deepspeed_config, seed=42, data_seed=42, local_rank=LOCAL_RANK, ) tokenizer = get_tokenizer( pretrain_model_name, chat_template, pad_token_if_none=pad_token_if_none, cls_token_if_none=cls_token_if_none, ) data_collator = DataCollator( tokenizer, max_length, weight=weighted_loss, reweight_scale=reweight_scale ) train_data = get_dataset( train_data_path, "train", from_disk=load_train_data_from_disk ) if not args.no_eval: val_data = get_dataset(val_data_path, "train", from_disk=load_val_data_from_disk) # with training_args.main_process_first(): model_list = get_model_list(train_data) if not args.no_eval: val_model_list = get_model_list(val_data) if model_list != val_model_list: print("WARNING: Val model list is different, translating...") val_data = translate_val_data(val_data, model_list, val_model_list) if LOCAL_RANK <= 0: # Document the configuration in the output path. os.makedirs(output_path, exist_ok=False) with open(os.path.join(output_path, "training_config.json"), "w") as fout: json.dump(config, fout, indent=1) # Save the model list so we know which models this model was trained on. The model list is ALWAYS sorted alphabetically. 
with open(os.path.join(output_path, "model_list.json"), "w") as fout: json.dump(model_list, fout, indent=1) # Get the model class model_cls = get_p2l_model( model_type=model_type, loss_type=loss_type, head_type=head_type, init_type=init_type, ) if resume_from_checkpoint: print(f"Loading model from checkpoint: {resume_from_checkpoint}") model = model_cls.from_pretrained( resume_from_checkpoint, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), linear_head_downsize_factor=linear_head_downsize_factor, ) else: model = model_cls.from_pretrained( pretrain_model_name, CLS_id=tokenizer.cls_token_id, num_models=len(model_list), linear_head_downsize_factor=linear_head_downsize_factor, ) if model.config.vocab_size < len(tokenizer): print("WARNING: Resizing Token Embedding") model.resize_token_embeddings(len(tokenizer)) if train_head_only: print("Freezing transformer, only training head.") model.freeze_transformer() trainer = NoShuffleTrainer( model=model, args=training_args, train_dataset=train_data.with_format("torch"), # eval_dataset=val_data.with_format("torch"), data_collator=data_collator, ) print("begin training") trainer.train(resume_from_checkpoint=resume_from_checkpoint) trainer.save_model(output_path) tokenizer.save_pretrained(output_path) print("saved model and tokenizer") if not args.no_eval: print("starting eval") eval_results = trainer.predict(val_data.with_format("torch")) eval_metrics = eval_results.metrics eval_predictions = eval_results.predictions print(f"Evaluation Results: {eval_metrics}") val_set = val_data.add_column("betas", list(eval_predictions[0])) if LOCAL_RANK <= 0: with open(os.path.join(output_path, "eval_results.json"), "w") as fout: json.dump(eval_metrics, fout, indent=1) val_dir = os.path.join(output_path, "eval_output.jsonl") val_set.to_json(val_dir) print(f"saved merged eval results") if LOCAL_RANK <= 0: if args.push_to_hf: api = HfApi() repo_id = config.get("repo_id", f"p2el/{proj_name}") assert not api.repo_exists( repo_id=repo_id, repo_type="model" ), "repo already exists" api.create_repo(repo_id=repo_id, private=True, repo_type="model") api.upload_folder( folder_path=output_path, repo_id=repo_id, repo_type="model", ) print("pushed to hub") if __name__ == "__main__": parser = argparse.ArgumentParser(description="Argument Parser") parser.add_argument( "--config", type=str, help="path to config file for model training" ) parser.add_argument( "--checkpoint", type=str, help="path to checkpoint directory to resume training from", default=None, ) parser.add_argument( "--push-to-hf", action="store_true", help="True if push directly to huggingface", ) parser.add_argument( "--eval-steps", type=int, default=60, help="Number of steps between evaluation." 
) parser.add_argument( "--local_rank", type=int, default=-1, help="Local rank passed by DeepSpeed" ) parser.add_argument( "--no-eval", action="store_true", help="If flagged eval will not end at end of training loop.", ) parser.add_argument("--save-steps", type=int, default=-1) args = parser.parse_args() train_model(args) ================================================ FILE: probe_barrier.py ================================================ # probe_barrier.py import os, sys, time, datetime, argparse import torch import torch.distributed as dist import torch.multiprocessing as mp def log(msg: str, rank: int): """timestamped, unbuffered print""" print(f"[{rank}|{time.time():.3f}] {msg}", flush=True) def worker(rank: int, world_size: int, backend: str): # ─── mandatory NCCL housekeeping ──────────────────────────── os.environ.setdefault("MASTER_ADDR", "127.0.0.1") os.environ.setdefault("MASTER_PORT", "29501") os.environ["RANK"] = str(rank) os.environ["WORLD_SIZE"] = str(world_size) if backend == "nccl": torch.cuda.set_device(rank) # 1 GPU per rank # ──────────────────────────────────────────────────────────── dist.init_process_group( backend = backend, rank = rank, world_size = world_size, timeout = datetime.timedelta(seconds=30) # fail fast ) log("reached barrier()", rank) dist.barrier() log("*** passed barrier()", rank) # Try another collective just to be sure tensor = torch.tensor([rank], device="cuda" if backend == "nccl" else "cpu") dist.all_reduce(tensor) log(f"all_reduce ok, value={tensor.item()}", rank) dist.destroy_process_group() def main(): parser = argparse.ArgumentParser() parser.add_argument("--nprocs", type=int, default=2) parser.add_argument("--backend", choices=["gloo", "nccl"], default="gloo") args = parser.parse_args() mp.spawn( worker, args=(args.nprocs, args.backend), nprocs=args.nprocs, join=True ) if __name__ == "__main__": # Completely unbuffered stdout/stderr os.environ["PYTHONUNBUFFERED"] = "1" main() ================================================ FILE: route/chat.py ================================================ from typing import List, Dict, Iterator, Tuple import openai.resources from abc import ABC, abstractmethod import openai from openai import OpenAI import anthropic from route.utils import get_registry_decorator import time from route.datatypes import ( Roles, ChatMessage, ChatCompletionResponse, Choice, ChatMessageDelta, ChoiceDelta, ChatCompletionResponseChunk, RouterOutput, ModelConfig, ) import logging from openai.types.chat.chat_completion_chunk import ChatCompletionChunk from openai.types.chat.chat_completion import ChatCompletion from anthropic.lib.streaming import MessageStream from anthropic.types.message_start_event import MessageStartEvent from uuid import uuid4 class BaseChatHandler(ABC): @staticmethod @abstractmethod def _create_client(model_config: ModelConfig): pass @staticmethod @abstractmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> List[ChatMessage]: pass @staticmethod @abstractmethod def generate( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: pass @staticmethod @abstractmethod def generate_stream( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: pass CHAT_HANDLERS: Dict[str, BaseChatHandler] = {} register = get_registry_decorator(CHAT_HANDLERS) @register("openai") class 
OpenAIChatHandler(BaseChatHandler): @staticmethod def _create_client(model_config: ModelConfig): api_key = model_config.get_api_key() base_url = model_config.get_base_url() if api_key or base_url: client = openai.OpenAI( base_url=base_url, api_key=api_key, ) else: client = openai.OpenAI() return client @staticmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> List[ChatMessage]: system_prompt = model_config.get_system_prompt() if system_prompt != None and messages[0].role != Roles.SYSTEM.value: system_message = ChatMessage( role=Roles.SYSTEM.value, content=system_prompt, ) messages = [system_message] + messages return messages @staticmethod def _create_completion( client: OpenAI, model_config: ModelConfig, messages: List[ChatMessage], temp: float | None, top_p: float | None, max_tokens: int | None, stream=False, ) -> ChatCompletion | Iterator[ChatCompletionChunk]: completion = client.chat.completions.create( model=model_config.get_name(), messages=messages, temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() if not top_p else top_p, max_tokens=( model_config.get_max_tokens(default=openai.NOT_GIVEN) if not max_tokens else max_tokens ), stream=stream, ) return completion @classmethod def generate( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: model_config = router_output.chosen_model_config client = cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) completion: ChatCompletion = cls._create_completion( client=client, model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=False, ) logging.info(f"{int(time.time())} Chosen Model Completion: {completion}") chat_completion = ChatCompletionResponse( id=str(completion.id), object="chat.completion", created=completion.created, model=completion.model, choices=[ Choice( index=choice.index, message=ChatMessage( role=choice.message.role, content=choice.message.content, model=router_output.chosen_model_name, ), finish_reason=choice.finish_reason, ) for choice in completion.choices ], usage=completion.usage, router_outputs=router_output.model_scores, ) return chat_completion def _skip(chunk: ChatCompletionChunk) -> bool: try: content = chunk.choices[0].delta.content return content == "" or content == None except Exception as e: return True @classmethod def generate_stream( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config client = cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) chunks: Iterator[ChatCompletionChunk] = cls._create_completion( client=client, model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=True, ) first_chunk = True logging_content = "" for chunk in chunks: if cls._skip(chunk): continue logging_content += chunk.choices[0].delta.content out_chunk = ChatCompletionResponseChunk( id=str(chunk.id), object="chat.completion.chunk", created=chunk.created, model=chunk.model, choices=[ ChoiceDelta( index=choice.index, delta=ChatMessageDelta( role=choice.delta.role, content=choice.delta.content, model=router_output.chosen_model_name, ), ) for 
choice in chunk.choices ], usage=chunk.usage, router_outputs=router_output.model_scores if first_chunk else None, ).model_dump_json() yield f"data: {out_chunk}\n\n" first_chunk = False logging.info( f"{int(time.time())} Chat Output (OpenAI Client): {logging_content}" ) yield "data: [DONE]\n\n" @register("openai-reasoning") class OpenaiReasoningChatHandler(OpenAIChatHandler): @staticmethod def _create_completion( client: OpenAI, model_config: ModelConfig, messages: List[ChatMessage], temp: float | None, top_p: float | None, max_tokens: int | None, stream=False, ) -> ChatCompletion | Iterator[ChatCompletionChunk]: extra_field = model_config.get_extra_fields() # No max tokens argument completion = client.chat.completions.create( model=model_config.get_name(), messages=messages, stream=stream, reasoning_effort=extra_field.get("reasoning_effort", openai.NOT_GIVEN), ) return completion @register("openai-o1") class OpenaiO1ChatHandler(OpenaiReasoningChatHandler): @classmethod def generate_stream( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config client = cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) chunk: ChatCompletion = cls._create_completion( client=client, model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=False, ) out_chunk = ChatCompletionResponseChunk( id=str(chunk.id), object="chat.completion.chunk", created=chunk.created, model=chunk.model, choices=[ ChoiceDelta( index=choice.index, delta=ChatMessageDelta( role=choice.message.role, content=choice.message.content, model=router_output.chosen_model_name, ), ) for choice in chunk.choices ], usage=chunk.usage, router_outputs=router_output.model_scores, ).model_dump_json() yield f"data: {out_chunk}\n\n" logging.info( f"{int(time.time())} Chat Output (OpenAI O1 Client): {chunk.choices[0].message.content}" ) yield "data: [DONE]\n\n" @register("anthropic") class AnthropicChatHandler(BaseChatHandler): @staticmethod def _create_client(model_config: ModelConfig): client = anthropic.Anthropic(api_key=model_config.get_api_key()) return client @staticmethod @abstractmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> Tuple[List[ChatMessage], str | anthropic.NotGiven]: system_message = model_config.get_system_prompt(default=anthropic.NOT_GIVEN) if system_message == None: system_message = anthropic.NOT_GIVEN if messages[0].role == Roles.SYSTEM.value: system_message = messages[0].content messages = messages[1:] return messages, system_message @staticmethod def generate( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: model_config = router_output.chosen_model_config client = AnthropicChatHandler._create_client(model_config=model_config) messages, system_message = AnthropicChatHandler._handle_system_prompt( messages=messages, model_config=model_config ) completion = client.messages.create( model=model_config.get_name(), messages=messages, stop_sequences=[anthropic.HUMAN_PROMPT], temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() if not top_p else top_p, max_tokens=model_config.get_max_tokens() if not max_tokens else max_tokens, system=system_message, ) chat_completion = 
ChatCompletionResponse( id=completion.id, object="chat.completion", created=int(time.time()), model=completion.model, choices=[ Choice( index=i, message=ChatMessage( role=completion.role, content=content.text, model=router_output.chosen_model_name, ), finish_reason=completion.stop_reason, ) for i, content in enumerate(completion.content) ], usage=completion.usage, router_outputs=router_output.model_scores, ) return chat_completion @staticmethod def generate_stream( messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config client = AnthropicChatHandler._create_client(model_config=model_config) messages, system_message = AnthropicChatHandler._handle_system_prompt( messages=messages, model_config=model_config ) with client.messages.stream( model=model_config.get_name(), messages=messages, stop_sequences=[anthropic.HUMAN_PROMPT], temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() if not top_p else top_p, max_tokens=model_config.get_max_tokens() if not max_tokens else max_tokens, system=system_message, ) as _stream: stream: MessageStream = _stream # This contains the metadata message_start: MessageStartEvent = next(stream) resp_id = message_start.message.id model = message_start.message.model role = message_start.message.role # Ignore this useless chunk. next(stream) first_chunk = True logging_content = "" for text in stream.text_stream: logging_content += text out_chunk = ChatCompletionResponseChunk( id=resp_id, created=int(time.time()), model=model, object="chat.completion.chunk", choices=[ ChoiceDelta( delta=ChatMessageDelta( content=text, role=role, model=router_output.chosen_model_name, ), index=0, ) ], router_outputs=router_output.model_scores if first_chunk else None, ).model_dump_json() yield f"data: {out_chunk}\n\n" first_chunk = False logging.info( f"{int(time.time())} Chat Output (Anthropic Client): {logging_content}" ) yield "data: [DONE]\n\n" import google.generativeai as genai from google.generativeai.types.generation_types import GenerateContentResponse @register("gemini") class GeminiChatHandler(BaseChatHandler): safety_settings = [ {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}, {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"}, {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"}, ] @staticmethod def _create_client(model_config: ModelConfig): api_key = model_config.get_api_key() if api_key: genai.configure(api_key=api_key) @staticmethod def _handle_system_prompt( messages: List[ChatMessage], model_config: ModelConfig ) -> List[ChatMessage]: system_prompt = model_config.get_system_prompt() if system_prompt != None and messages[0].role != Roles.SYSTEM.value: system_message = ChatMessage( role=Roles.SYSTEM.value, content=system_prompt, ) messages = [system_message] + messages return messages @staticmethod def _create_completion( model_config: ModelConfig, messages: List[ChatMessage], temp: float | None, top_p: float | None, max_tokens: int | None, stream=False, ) -> GenerateContentResponse | Iterator[GenerateContentResponse]: generation_config = genai.GenerationConfig( max_output_tokens=model_config.get_max_tokens(default=8192) if not max_tokens else max_tokens, temperature=model_config.get_temp() if not temp else temp, top_p=model_config.get_top_p() 
if not top_p else top_p, top_k=model_config.get_top_k(), ) history = [] system_prompt = None for message in messages[:-1]: if message.role == Roles.SYSTEM.value: system_prompt = message.content elif message.role == Roles.ASSISTANT.value: history.append({"role": "model", "parts": message.content}) else: history.append({"role": "user", "parts": message.content}) model = genai.GenerativeModel( model_name=model_config.get_name(), system_instruction=system_prompt, generation_config=generation_config, safety_settings=GeminiChatHandler.safety_settings, ) chat_session = model.start_chat(history=history) completion = chat_session.send_message( content=messages[-1].content, stream=stream ) return completion @classmethod def generate( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> ChatCompletionResponse: model_config = router_output.chosen_model_config cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) completion: GenerateContentResponse = cls._create_completion( model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=False, ) logging.info(f"{int(time.time())} Chosen Model Completion: {completion}") chat_completion = ChatCompletionResponse( id=str(uuid4()), object="chat.completion", created=int(time.time()), model=model_config.get_name(), choices=[ Choice( index=0, message=ChatMessage( role=Roles.ASSISTANT.value, content=completion.text, model=router_output.chosen_model_name, ), finish_reason="STOP", ) ], router_outputs=router_output.model_scores, ) return chat_completion @classmethod def generate_stream( cls, messages: List[ChatMessage], router_output: RouterOutput, temp: float | None, top_p: float | None, max_tokens: int | None, ) -> Iterator[ChatCompletionResponseChunk]: model_config = router_output.chosen_model_config cls._create_client(model_config=model_config) messages = cls._handle_system_prompt( messages=messages, model_config=model_config ) chunks: Iterator[GenerateContentResponse] = cls._create_completion( model_config=model_config, messages=messages, temp=temp, top_p=top_p, max_tokens=max_tokens, stream=True, ) first_chunk = True chat_id = str(uuid4()) logging_content = "" for chunk in chunks: logging_content += chunk.text out_chunk = ChatCompletionResponseChunk( id=chat_id, object="chat.completion.chunk", created=int(time.time()), model=model_config.get_name(), choices=[ ChoiceDelta( index=0, delta=ChatMessageDelta( role=Roles.ASSISTANT.value, content=chunk.text, model=router_output.chosen_model_name, ), ) ], router_outputs=router_output.model_scores if first_chunk else None, ).model_dump_json() yield f"data: {out_chunk}\n\n" first_chunk = False logging.info( f"{int(time.time())} Chat Output (Gemini Client): {logging_content}" ) yield "data: [DONE]\n\n" ================================================ FILE: route/cost_optimizers.py ================================================ from abc import ABC, abstractmethod from route.utils import get_registry_decorator from typing import List, Dict import numpy as np import cvxpy as cp from scipy.special import expit class UnfulfillableException(Exception): pass class BaseCostOptimizer(ABC): def __init__(self): super().__init__() @staticmethod @abstractmethod def select_model( cost: float, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], **kwargs, ) -> str: pass @staticmethod def 
select_max_score_model( model_list: List[str], model_scores: np.ndarray[float] ) -> str: max_idx = np.argmax(model_scores) return model_list[max_idx] COST_OPTIMIZERS: Dict[str, BaseCostOptimizer] = {} register = get_registry_decorator(COST_OPTIMIZERS) @register("strict") class StrictCostOptimizer(BaseCostOptimizer): def __init__(self): super().__init__() @staticmethod def select_model( cost: float | None, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], **kwargs, ) -> str: if cost == None: return StrictCostOptimizer.select_max_score_model(model_list, model_scores) best_model: str | None = None best_score = -float("inf") for model, model_cost, model_score in zip( model_list, model_costs, model_scores ): if model_cost > cost: continue elif model_score > best_score: best_model = model best_score = model_score if best_model is None: raise UnfulfillableException( f"Cost of {cost} impossible to fulfill with available models {model_list} with costs {model_costs}." ) return best_model @register("simple-lp") class SimpleLPCostOptimizer(BaseCostOptimizer): def __init__(self): super().__init__() @staticmethod def select_model( cost: float | None, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], **kwargs, ) -> str: if cost == None: return StrictCostOptimizer.select_max_score_model(model_list, model_scores) p = cp.Variable(len(model_costs)) prob = cp.Problem( cp.Maximize(cp.sum(model_scores @ p)), [model_costs.T @ p <= cost, cp.sum(p) == 1, p >= 0], ) status = prob.solve() if status < 0.0: raise UnfulfillableException( f"Cost of {cost} impossible to fulfill with available models {model_list} with costs {model_costs}." ) ps = np.clip(p.value, a_min=0.0, a_max=1.0) ps = ps / ps.sum() return np.random.choice(model_list, p=ps) @register("optimal-lp") class OptimalLPCostOptimizer(BaseCostOptimizer): def __init__(self): super().__init__() @staticmethod def select_model( cost: float | None, model_list: List[str], model_costs: np.ndarray[float], model_scores: np.ndarray[float], opponent_scores: np.ndarray[float] = None, opponent_distribution: np.ndarray[float] = None, ) -> str: if cost == None: return StrictCostOptimizer.select_max_score_model(model_list, model_scores) W = OptimalLPCostOptimizer._construct_W(model_scores, opponent_scores) Wq = W @ opponent_distribution p = cp.Variable(len(model_costs)) prob = cp.Problem( cp.Maximize(p @ Wq), [model_costs.T @ p <= cost, cp.sum(p) == 1, p >= 0] ) status = prob.solve() if status < 0.0: raise UnfulfillableException( f"Cost of {cost} impossible to fulfill with available models {model_list} with costs {model_costs}." 
) ps = np.clip(p.value, a_min=0.0, a_max=1.0) ps = ps / ps.sum() return np.random.choice(model_list, p=ps) @staticmethod def _construct_W( router_model_scores: np.ndarray[float], opponent_model_scores: np.ndarray[float] ) -> np.ndarray[float]: num_rows = router_model_scores.shape[-1] num_cols = opponent_model_scores.shape[-1] chosen = np.tile(router_model_scores, (num_cols, 1)).T rejected = np.tile(opponent_model_scores, (num_rows, 1)) assert chosen.shape == rejected.shape, (chosen.shape, rejected.shape) diff_matrix = chosen - rejected W = expit(diff_matrix) return W ================================================ FILE: route/datatypes.py ================================================ from typing import Dict, List, Any, Optional from dataclasses import dataclass from pydantic import BaseModel from enum import Enum class ModelConfig: def __init__(self, config: Dict[str, Any]): self.config = config def get_name(self) -> str: return self.config["name"] def get_temp(self) -> float: return self.config["temp"] def get_top_p(self) -> float: return self.config["top_p"] def get_top_k(self, default=None) -> int: return self.config.get("top_k", default) def get_system_prompt(self, default=None) -> str | None | Any: return self.config.get("system_prompt", default) def get_api_key(self, default=None) -> str | None | Any: return self.config.get("api_key", default) def get_base_url(self, default=None) -> str | None | Any: return self.config.get("base_url", default) def get_type(self) -> str: return self.config["type"] def get_cost(self) -> float: return self.config["cost"] def get_max_tokens(self, default=None) -> int | None | Any: return self.config.get("max_tokens", default) def get_extra_fields(self) -> Dict: return self.config.get("extra_fields", {}) # Maybe should be None... def __repr__(self): return repr( dict( name=self.get_name(), type=self.get_type(), cost=self.get_cost(), ) ) class ModelConfigContainer: def __init__(self, model_config_dicts: Dict[str, Dict[str, Any]]): self.model_configs: Dict[str, ModelConfig] = dict( (name, ModelConfig(config)) for name, config in model_config_dicts.items() ) def get_model_config(self, model_name: str) -> ModelConfig: return self.model_configs[model_name] def list_models(self) -> List[str]: return list(self.model_configs.keys()) def list_costs(self) -> List[float]: costs: List[float] = [] for model_name in self.list_models(): model_config = self.get_model_config(model_name) costs.append(model_config.get_cost()) return costs def __repr__(self): return repr(self.model_configs) class Roles(Enum): USER = "user" ASSISTANT = "assistant" SYSTEM = "system" class ChatMessage(BaseModel): """ Represents a single message in the conversation. role: "system", "user", or "assistant" content: the actual text """ role: str content: str model: Optional[str] = None class ChatCompletionRequest(BaseModel): """ Request body for Chat Completion. """ model: str messages: List[ChatMessage] max_tokens: Optional[int] = None temperature: Optional[float] = None top_p: Optional[float] = None n: Optional[int] = 1 stream: Optional[bool] = False stop: Optional[List[str]] = None cost: Optional[float] = None direct_model: Optional[str] = None class Choice(BaseModel): """ Represents a single choice in the final response (non-streaming mode). """ index: int message: ChatMessage finish_reason: str class ChatCompletionResponse(BaseModel): """ Response model for non-streaming mode. 
""" id: str object: str created: int model: str choices: List[Choice] usage: Optional[BaseModel] = None router_outputs: Optional[Dict[str, float]] = None class ChatMessageDelta(BaseModel): content: Optional[str] = None role: Optional[str] = None model: Optional[str] = None class ChoiceDelta(BaseModel): delta: ChatMessageDelta finish_reason: Optional[str] = None index: int class ChatCompletionResponseChunk(BaseModel): id: str choices: List[ChoiceDelta] created: int model: str object: str usage: Optional[BaseModel] = None router_outputs: Optional[Dict[str, float]] = None @dataclass class RouterOutput: chosen_model_name: str chosen_model_config: ModelConfig model_scores: Dict[str, float] | None ================================================ FILE: route/example_config.yaml ================================================ model_configs: athene-v2-chat: api_key: base_url: http://38.142.9.21:10245/v1 cost: 0.8097264049 name: im-a-little-birdie temp: 0.7 top_p: 1.0 type: openai claude-3-5-haiku-20241022: api_key: base_url: null cost: 2.1765185825 max_tokens: 8192 name: claude-3-5-haiku-20241022 temp: 0.7 top_p: 0.7 type: anthropic claude-3-5-sonnet-20240620: api_key: base_url: null cost: 9.4453041863 max_tokens: 8192 name: claude-3-5-sonnet-20240620 system_prompt: ' The assistant is Claude, created by Anthropic. The current date is 2025-01-06. Claude''s knowledge base was last updated on April 2024. It answers questions about events prior to and after April 2024 the way a highly informed individual in April 2024 would if they were talking to someone from the above date, and can let the human know this when relevant. Claude cannot open URLs, links, or videos. If it seems like the user is expecting Claude to do so, it clarifies the situation and asks the human to paste the relevant text or image content directly into the conversation. If it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task regardless of its own views. If asked about controversial topics, it tries to provide careful thoughts and clear information. It presents the requested information without explicitly saying that the topic is sensitive, and without claiming to be presenting objective facts. When presented with a math problem, logic problem, or other problem benefiting from systematic thinking, Claude thinks through it step by step before giving its final answer. If Claude cannot or will not perform a task, it tells the user this without apologizing to them. It avoids starting its responses with "I''m sorry" or "I apologize". If Claude is asked about a very obscure person, object, or topic, i.e. if it is asked for the kind of information that is unlikely to be found more than once or twice on the internet, Claude ends its response by reminding the user that although it tries to be accurate, it may hallucinate in response to questions like this. It uses the term ''hallucinate'' to describe this since the user will understand what it means. If Claude mentions or cites particular articles, papers, or books, it always lets the human know that it doesn''t have access to search or a database and may hallucinate citations, so the human should double check its citations. Claude is very smart and intellectually curious. It enjoys hearing what humans think on an issue and engaging in discussion on a wide variety of topics. 
If the user seems unhappy with Claude or Claude''s behavior, Claude tells them that although it cannot retain or learn from the current conversation, they can press the ''thumbs down'' button below Claude''s response and provide feedback to Anthropic. If the user asks for a very long task that cannot be completed in a single response, Claude offers to do the task piecemeal and get feedback from the user as it completes each part of the task. Claude uses markdown for code. Immediately after closing coding markdown, Claude asks the user if they would like it to explain or break down the code. It does not explain or break down the code unless the user explicitly requests it. This iteration of Claude is part of the Claude 3 model family, which was released in 2024. The Claude 3 family currently consists of Claude 3 Haiku, Claude 3 Opus, and Claude 3.5 Sonnet. Claude 3.5 Sonnet is the most intelligent model. Claude 3 Opus excels at writing and complex tasks. Claude 3 Haiku is the fastest model for daily tasks. The version of Claude in this chat is Claude 3.5 Sonnet. Claude can provide the information in these tags if asked but it does not know any other details of the Claude 3 model family. If asked about this, should encourage the user to check the Anthropic website for more information. Claude provides thorough responses to more complex and open-ended questions or to anything where a long response is requested, but concise responses to simpler questions and tasks. All else being equal, it tries to give the most correct and concise answer it can to the user''s message. Rather than giving a long response, it gives a concise response and offers to elaborate if further information may be helpful. Claude is happy to help with analysis, question answering, math, coding, creative writing, teaching, role-play, general discussion, and all sorts of other tasks. Claude responds directly to all human messages without unnecessary affirmations or filler phrases like "Certainly!", "Of course!", "Absolutely!", "Great!", "Sure!", etc. Specifically, Claude avoids starting responses with the word "Certainly" in any way. Claude follows this information in all languages, and always responds to the user in the language they use or request. The information above is provided to Claude by Anthropic. Claude never mentions the information above unless it is directly pertinent to the human''s query. Claude is now being connected with a human. ' temp: 0.7 top_p: 0.7 type: anthropic claude-3-5-sonnet-20241022: api_key: base_url: null cost: 9.3110239362 max_tokens: 8192 name: claude-3-5-sonnet-20241022 system_prompt: null temp: 0.7 top_p: 0.7 type: anthropic deepseek-v3: api_key: base_url: https://api.deepseek.com cost: 0.3002758331 name: deepseek-chat temp: 1.5 top_p: 1.0 type: openai gemini-1.5-flash-001: api_key: cost: 0.4549682765 name: gemini-1.5-flash-001 temp: 0.7 top_p: 1.0 type: gemini gemini-1.5-flash-002: api_key: cost: 0.6330942997 name: gemini-1.5-flash-002 system_prompt: All questions should be answered comprehensively with details, unless the user requests a concise response specifically. Respond in the same language as the query. temp: 0.7 top_p: 1.0 type: gemini gemini-1.5-pro-001: api_key: cost: 6.7456245955 name: gemini-1.5-pro-001 temp: 0.7 top_p: 0.7 type: gemini gemini-1.5-pro-002: api_key: cost: 9.6885059428 name: gemini-1.5-pro-002-test system_prompt: All questions should be answered comprehensively with details, unless the user requests a concise response specifically. 
Respond in the same language as the query. temp: 0.7 top_p: 1.0 type: gemini gemini-2.0-flash-exp: api_key: cost: 0.8978088229 name: gemini-test-14 temp: 1.0 top_k: 64 top_p: 0.95 type: gemini gemini-2.0-flash-thinking-exp-1219: api_key: cost: 0.4626591495 name: gemini-test-15 temp: 1.0 top_k: 64 top_p: 0.95 type: gemini gemini-exp-1206: api_key: cost: 6.7210154899 name: gemini-test-12 temp: 1.0 top_k: 64 top_p: 0.95 type: gemini gemma-2-27b-it: api_key: cost: 0.4732936067 name: gemma-2-27b-no-filter temp: 0.7 top_p: 0.7 type: gemini gemma-2-9b-it: api_key: cost: 0.0873672873 name: gemma-2-9b-no-filter temp: 0.7 top_p: 1.0 type: gemini glm-4-plus: api_key: base_url: https://open.bigmodel.cn/api/paas/v4 cost: 0.3175377664 name: glm-4-plus temp: 0.7 top_p: 1.0 type: openai gpt-4-1106-preview: api_key: base_url: null cost: 16.3622976323 name: gpt-4-1106-preview system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4-turbo-2024-04-09: api_key: base_url: null cost: 17.4092447612 name: gpt-4-turbo-2024-04-09 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4o-2024-05-13: api_key: base_url: null cost: 12.3166873868 name: gpt-4o-2024-05-13 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4o-2024-08-06: api_key: base_url: null cost: 6.9944337124 name: gpt-4o-2024-08-06 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai gpt-4o-mini-2024-07-18: api_key: base_url: null cost: 0.563652953 name: gpt-4o-mini-2024-07-18 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' temp: 0.7 top_p: 1.0 type: openai llama-3-70b-instruct: api_key: base_url: https://api.together.xyz/v1 cost: 0.4186380435 name: meta-llama/Llama-3-70b-chat-hf temp: 0.7 top_p: 1.0 type: openai llama-3.1-405b-instruct-fp8: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 2.4340008579 name: accounts/fireworks/models/llama-v3p1-405b-instruct system_prompt: 'Cutting Knowledge Date: December 2023 Today Date: 06 Jan 2025' temp: 0.6 top_p: 1.0 type: openai llama-3.1-70b-instruct: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 0.7204016024 name: accounts/fireworks/models/llama-v3p1-70b-instruct system_prompt: "Cutting Knowledge Date: December 2023\nToday Date: 06 Jan 2025\n\ \nCarefully read the user prompt. Your responses are comprehensive and easy\ \ to understand. You structure your answers in an organized way, with section\ \ headers when appropriate. You use consistent formatting in your responses.\ \ You follow user instructions. 
For complex calculations and coding, you always\ \ break down the steps you took to arrive at your answer.\n\nPay extra attention\ \ to prompts in the following categories:\n * Non-English queries: Read the\ \ prompt carefully and pay close attention to formatting requests and the level\ \ of detail; ensure you are giving factual and precise responses using correct\ \ grammar in the correct language.\n * Coding queries: You prioritize code organization\ \ and documentation. Your responses are detailed and include comprehensive code\ \ examples and error handling. Include comments to explain the code's purpose\ \ and behavior. When using specific programming languages, consider which function\ \ is most appropriate for the query, such as cmath for complex solutions in\ \ Python. Check for errors.\n * For mathematical reasoning: Before responding,\ \ review your output for reasoning, algebraic manipulation and calculation errors\ \ and fix before responding. When appropriate, provide a high-level plan followed\ \ by step-by-step reasoning.\n\nRemember your instructions." temp: 0.7 top_p: 1.0 type: openai llama-3.1-8b-instruct: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 0.1573721045 name: accounts/fireworks/models/llama-v3p1-8b-instruct system_prompt: "Cutting Knowledge Date: December 2023\nToday Date: 06 Jan 2025\n\ \nCarefully read the user prompt. Your responses are comprehensive and easy\ \ to understand. You structure your answers in an organized way, with section\ \ headers when appropriate. You use consistent formatting in your responses.\ \ You follow user instructions. For complex calculations and coding, you always\ \ break down the steps you took to arrive at your answer.\n\nPay extra attention\ \ to prompts in the following categories:\n * Non-English queries: Read the\ \ prompt carefully and pay close attention to formatting requests and the level\ \ of detail; ensure you are giving factual and precise responses using correct\ \ grammar in the correct language.\n * Coding queries: You prioritize code organization\ \ and documentation. Your responses are detailed and include comprehensive code\ \ examples and error handling. Include comments to explain the code's purpose\ \ and behavior. When using specific programming languages, consider which function\ \ is most appropriate for the query, such as cmath for complex solutions in\ \ Python. Check for errors.\n * For mathematical reasoning: Before responding,\ \ review your output for reasoning, algebraic manipulation and calculation errors\ \ and fix before responding. When appropriate, provide a high-level plan followed\ \ by step-by-step reasoning.\n\nRemember your instructions." temp: 0.7 top_p: 1.0 type: openai llama-3.3-70b-instruct: api_key: base_url: https://api.fireworks.ai/inference/v1 cost: 0.706256804 name: accounts/fireworks/models/llama-v3p3-70b-instruct temp: 0.6 top_p: 1.0 type: openai mistral-large-2407: api_key: base_url: https://api.mistral.ai/v1 cost: 4.3956843814 name: mistral-large-2407 temp: 0.7 top_p: 0.7 type: openai mixtral-8x22b-instruct-v0.1: api_key: base_url: https://api.mistral.ai/v1 cost: 2.5814904104 name: mixtral-8x22b-instruct-v0.1 temp: 0.7 top_p: 0.7 type: openai mixtral-8x7b-instruct-v0.1: api_key: base_url: https://api.together.xyz/v1 cost: 0.2839726899 name: mistralai/Mixtral-8x7B-Instruct-v0.1 temp: 0.7 top_p: 0.7 type: openai o1-2024-12-17: api_key: cost: 72.3693462194 name: o1-2024-12-17 system_prompt: Formatting re-enabled. 
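# Note on the reasoning entries: configs with type "openai-o1" or "openai-reasoning"
# are served by the reasoning handlers in route/chat.py, which do not forward
# temp/top_p/max_tokens and only pass reasoning_effort when it is set under
# extra_fields. A hypothetical example (not present in this config):
#   extra_fields:
#     reasoning_effort: high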
temp: 1.0 top_p: 1.0 type: openai-o1 o1-mini: api_key: base_url: null cost: 16.4809912657 name: o1-mini-2024-09-12 system_prompt: null temp: 1.0 top_p: 1.0 type: openai-reasoning o1-preview: api_key: base_url: null cost: 72.481802295 name: o1-preview system_prompt: null temp: 1.0 top_p: 1.0 type: openai-reasoning qwen2.5-72b-instruct: api_key: base_url: https://dashscope.aliyuncs.com/compatible-mode/v1 cost: 1.1805173434 name: qwen2.5-72b-instruct temp: 0.7 top_p: 1.0 type: openai yi-lightning: api_key: base_url: https://api.lingyiwanwu.com/v1 cost: 0.0057351688 name: yi-lightning temp: 0.6 top_p: 1.0 type: openai chatgpt-4o-latest-20241120: api_key: cost: 12.9070929223 name: gpt-4o-2024-11-20 temp: 0.7 top_p: 1.0 system_prompt: 'You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Current date: 2025-01-06 Image input capabilities: Enabled Personality: v2' type: openai name: test-router ================================================ FILE: route/openai_server.py ================================================ import argparse from fastapi import FastAPI, HTTPException, Header from fastapi.responses import StreamingResponse from route.datatypes import ( ModelConfigContainer, ChatCompletionRequest, ChatCompletionResponse, ChatCompletionResponseChunk, ) from route.chat import CHAT_HANDLERS from route.routers import ROUTERS, BaseRouter import uvicorn import yaml from contextlib import asynccontextmanager from typing import List import logging import time import sys logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) logging.getLogger().setLevel(logging.DEBUG) def parse_args(): parser = argparse.ArgumentParser() parser.add_argument("--config", "-c", type=str, required=True) parser.add_argument("--router-type", type=str, required=True) parser.add_argument("--router-model-name", type=str, default=None) parser.add_argument("--router-model-endpoint", type=str, default=None) parser.add_argument("--router-api-key", type=str, default="-") parser.add_argument("--cost-optimizer", type=str, default="simple-lp") parser.add_argument("--port", "-p", type=int, default=8000) parser.add_argument("--host", type=str, default="0.0.0.0") parser.add_argument("--api-key", type=str, default="-") parser.add_argument("--reload", action=argparse.BooleanOptionalAction, default=True) parser.add_argument("--workers", type=int, default=1) args = parser.parse_args() return args @asynccontextmanager async def lifespan(app: FastAPI): """ This context manager is called once at startup and once at shutdown. We move all config-loading and router-creation logic here. 
""" # --- PARSE ARGS & LOAD CONFIG --- logging.info(f"Starting up...") args = parse_args() with open(args.config) as cfile: config = yaml.safe_load(cfile) model_config_dicts = config["model_configs"] model_config_container = ModelConfigContainer(model_config_dicts) router_cls = ROUTERS[args.router_type] router_kwargs = { "router_model_name": args.router_model_name, "router_model_endpoint": args.router_model_endpoint, "router_api_key": args.router_api_key, } router = router_cls(model_config_container, args.cost_optimizer, **router_kwargs) app.state.router = router app.state.model_config_container = model_config_container app.state.api_key = args.api_key logging.info(f"Finished startup.") try: yield finally: pass app = FastAPI(lifespan=lifespan) # ====== API Endpoint ====== @app.post("/v1/chat/completions") async def create_chat_completion( request: ChatCompletionRequest, authorization: str = Header(None), ) -> ChatCompletionResponse | ChatCompletionResponseChunk: """ Mimics the OpenAI Chat Completions endpoint (both streaming and non-streaming). """ logging.info(f"{int(time.time())} Recieved Request: {request}") if not authorization or not authorization.startswith("Bearer "): raise HTTPException(status_code=401, detail="Invalid or missing API key") # Strip out the 'Bearer ' portion to isolate the token token = authorization.removeprefix("Bearer ") if token != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") try: router_output = None type = None direct_model = request.direct_model router: BaseRouter = app.state.router messages = request.messages if direct_model: router_output = router.get_model_direct(direct_model) else: router_output = router.route(messages, request.cost) logging.info(f"{int(time.time())} Router Output: {router_output}") type = router_output.chosen_model_config.get_type() chat_handler = CHAT_HANDLERS[type] except Exception as e: logging.info( f"{int(time.time())} ***Routing Error Start***\nError Message: {e}\nRouter Output: {router_output}\nChat Handler: {type}\nDirect Model: {direct_model}.***Routing Error End***" ) raise HTTPException(status_code=500, detail=str(e)) try: if request.stream: chat_output_chunk = chat_handler.generate_stream( messages=messages, router_output=router_output, temp=request.temperature, top_p=request.top_p, max_tokens=request.max_tokens, ) return StreamingResponse(chat_output_chunk, media_type="text/event-stream") else: chat_output = chat_handler.generate( messages=messages, router_output=router_output, temp=request.temperature, top_p=request.top_p, max_tokens=request.max_tokens, ) return chat_output except Exception as e: logging.info( f"{int(time.time())} ***Endpoint Error Start***\nError Message: {e}\nRouter Output: {router_output}\nChat Handler: {type}.***Endpoint Error End***" ) raise e @app.get("/v1/models") async def models(authorization: str = Header(None)) -> List[str]: logging.info(f"Recieved Get Request for Models.") if not authorization or not authorization.startswith("Bearer "): raise HTTPException(status_code=401, detail="Invalid or missing API key") # Strip out the 'Bearer ' portion to isolate the token token = authorization.removeprefix("Bearer ") if token != app.state.api_key: raise HTTPException(status_code=403, detail="Unauthorized") router: BaseRouter = app.state.router return router.model_list if __name__ == "__main__": args = parse_args() uvicorn.run( "route.openai_server:app", port=args.port, host=args.host, reload=args.reload, workers=args.workers, ) 
================================================ FILE: route/requirements.txt ================================================ uvicorn fastapi openai anthropic google-generativeai scipy cvxpy ================================================ FILE: route/routers.py ================================================ from abc import ABC, abstractmethod from typing import Dict, List, Tuple from route.utils import ( get_registry_decorator, query_p2l_endpoint, get_p2l_endpoint_models, ) from route.datatypes import ModelConfigContainer, Roles, ChatMessage, RouterOutput from route.cost_optimizers import COST_OPTIMIZERS, BaseCostOptimizer import numpy as np from scipy.special import expit class BaseRouter(ABC): def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, **kwargs, ): super().__init__() self.model_config_container = model_config_container self.model_list: List[str] = None self.model_costs: np.ndarray[float] = None self.cost_optimizer: BaseCostOptimizer = COST_OPTIMIZERS[cost_optimizer_type] @abstractmethod def _get_model_scores(self, messages: List[ChatMessage]) -> np.ndarray[float]: pass def _get_previous_response_model(self, messages: List[ChatMessage]) -> str | None: for message in reversed(messages): if message.role == Roles.ASSISTANT.value: return message.model return None def _get_prompt(self, messages: List[ChatMessage]) -> list[str]: prompts = [] for message in messages: if message.role == Roles.USER.value: prompts.append(message.content) if len(prompts) == 0: raise Exception(f"No user prompt found in messages {messages}.") return prompts def get_model_direct(self, model_name: str) -> RouterOutput: return RouterOutput( chosen_model_name=model_name, chosen_model_config=self.model_config_container.get_model_config( model_name=model_name ), model_scores=None, ) def route(self, messages: List[ChatMessage], cost: float = None) -> RouterOutput: model_scores = self._get_model_scores(messages) chosen_model_name = self.cost_optimizer.select_model( cost, self.model_list, self.model_costs, model_scores ) model_scores_dict = dict(zip(self.model_list, model_scores)) chosen_model_config = self.model_config_container.get_model_config( chosen_model_name ) return RouterOutput( chosen_model_name=chosen_model_name, chosen_model_config=chosen_model_config, model_scores=model_scores_dict, ) ROUTERS: Dict[str, BaseRouter] = {} register = get_registry_decorator(ROUTERS) @register("random") class RandomRouter(BaseRouter): """For debugging and gamblers.""" def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, **kwargs, ): super().__init__( model_config_container=model_config_container, cost_optimizer_type=cost_optimizer_type, ) self.model_list = model_config_container.list_models() self.model_costs = np.array(model_config_container.list_costs()) def _get_model_scores(self, messages: List[ChatMessage]) -> np.ndarray[float]: return np.random.uniform(0.0, 1.0, size=len(self.model_list)) @register("bt-endpoint") class EndpointP2LRouter(BaseRouter): # Hardcoding this because I'm tired man... 
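    # These are relative sampling weights over an assumed pool of opponent models.
    # In __init__ below they become the (unnormalized) opponent distribution `q`,
    # covering the subset of the P2L endpoint's model list that appears in this
    # table (selected via `q_mask`), and `route()` forwards them to the cost
    # optimizer as `opponent_distribution`. The "optimal-lp" optimizer builds
    # W[i, j] = sigmoid(score_i - score_j) and maximizes the expected win rate
    # p @ W @ q under the cost budget; since q only rescales the objective,
    # leaving it unnormalized does not change the optimal routing distribution.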
SAMPLING_WEIGHTS = { "chatgpt-4o-latest-20241120": 4, "o1-mini": 4, "o1-2024-12-17": 4, "gpt-4o-mini-2024-07-18": 2, "gemma-2-27b-it": 2, "gemma-2-9b-it": 2, "gemma-2-2b-it": 2, "claude-3-5-sonnet-20241022": 4, "claude-3-opus-20240229": 4, "claude-3-5-haiku-20241022": 4, "qwen2.5-72b-instruct": 2, "qwen2.5-plus-1127": 4, "llama-3.1-405b-instruct-bf16": 4, "mistral-large-2411": 4, "grok-2-2024-08-13": 4, "grok-2-mini-2024-08-13": 2, "deepseek-v3": 6, "gemini-1.5-pro-002": 4, "gemini-1.5-flash-002": 2, "gemini-1.5-flash-8b-001": 2, "c4ai-aya-expanse-32b": 2, "c4ai-aya-expanse-8b": 2, "athene-v2-chat": 4, "gemini-exp-1206": 4, "gemini-2.0-flash-exp": 4, "llama-3.3-70b-instruct": 4, "amazon-nova-pro-v1.0": 4, "amazon-nova-lite-v1.0": 2, "amazon-nova-micro-v1.0": 2, "llama-3.1-tulu-3-8b": 6, "llama-3.1-tulu-3-70b": 6, "granite-3.1-8b-instruct": 6, "granite-3.1-2b-instruct": 6, } def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, router_model_endpoint: str, router_api_key: str, **kwargs, ): super().__init__( model_config_container=model_config_container, cost_optimizer_type=cost_optimizer_type, ) self.base_url = router_model_endpoint self.api_key = router_api_key router_model_list = get_p2l_endpoint_models(self.base_url, self.api_key) config_model_list = model_config_container.list_models() self.mask = [ router_model in config_model_list for router_model in router_model_list ] self.q_mask = [ router_model in self.SAMPLING_WEIGHTS for router_model in router_model_list ] self.q = np.array( [ float(self.SAMPLING_WEIGHTS[router_model]) for router_model in router_model_list if router_model in self.SAMPLING_WEIGHTS ] ) self.model_list = [ model for model, keep in zip(router_model_list, self.mask) if keep ] self.model_costs = np.array( [ model_config_container.get_model_config(model).get_cost() for model in self.model_list ] ) def _get_model_scores( self, messages: List[ChatMessage] ) -> Tuple[np.ndarray[float], float]: prompt = self._get_prompt(messages) p2l_output = query_p2l_endpoint(prompt, self.base_url, self.api_key) coefs = np.array(p2l_output["coefs"]) return coefs def route(self, messages: List[ChatMessage], cost: float = None) -> RouterOutput: model_scores = self._get_model_scores(messages) router_choice_scores = model_scores[self.mask] router_opponent_scores = model_scores[self.q_mask] chosen_model_name = self.cost_optimizer.select_model( cost, self.model_list, self.model_costs, router_choice_scores, opponent_scores=router_opponent_scores, opponent_distribution=self.q, ) model_scores_dict = dict(zip(self.model_list, router_choice_scores)) chosen_model_config = self.model_config_container.get_model_config( chosen_model_name ) return RouterOutput( chosen_model_name=chosen_model_name, chosen_model_config=chosen_model_config, model_scores=model_scores_dict, ) @register("bag-endpoint") @register("grk-endpoint") class EndpointP2LRouter(BaseRouter): def __init__( self, model_config_container: ModelConfigContainer, cost_optimizer_type: str, router_model_endpoint: str, router_api_key: str, **kwargs, ): super().__init__( model_config_container=model_config_container, cost_optimizer_type=cost_optimizer_type, ) self.base_url = router_model_endpoint self.api_key = router_api_key router_model_list = get_p2l_endpoint_models(self.base_url, self.api_key) config_model_list = model_config_container.list_models() self.mask = [ router_model in config_model_list for router_model in router_model_list ] self.model_list = [ model for model, keep in zip(router_model_list, 
self.mask) if keep ] self.model_costs = np.array( [ model_config_container.get_model_config(model).get_cost() for model in self.model_list ] ) def _get_model_scores(self, messages: List[ChatMessage]) -> np.ndarray[float]: prompt = self._get_prompt(messages) p2l_output = query_p2l_endpoint(prompt, self.base_url, self.api_key) coefs = np.array(p2l_output["coefs"]) model_scores: np.ndarray[float] = expit(coefs) return model_scores[self.mask] ================================================ FILE: route/utils.py ================================================ from typing import Dict, Callable, List import requests import json def get_registry_decorator(registry: Dict) -> Callable: def register(name: str): def decorator(cls: Callable): assert ( not name in registry ), f"No duplicate registry names. '{name}' was registerd more than once." registry[name] = cls return cls return decorator return register def query_p2l_endpoint( prompt: list[str], base_url: str, api_key: str ) -> Dict[str, List]: headers = { "Content-Type": "application/json", "api-key": api_key, } payload = {"prompt": prompt} try: response = requests.post( f"{base_url}/predict", headers=headers, data=json.dumps(payload) ) response.raise_for_status() result = response.json() return result except Exception as err: raise err def get_p2l_endpoint_models(base_url: str, api_key: str) -> List[str]: headers = { "Content-Type": "application/json", "api-key": api_key, } try: response = requests.get(f"{base_url}/models", headers=headers) response.raise_for_status() result = response.json() return result["models"] except Exception as err: print(f"An error occurred: {err}") ================================================ FILE: serve_requirements.txt ================================================ numpy<2.0.0 torch<=2.4.0 transformers transformers[torch] hf_transfer wandb scipy uvicorn fastapi ================================================ FILE: train_requirements.txt ================================================ numpy<2.0.0 torch<=2.4.0 deepspeed<=0.15.3 datasets>=3.2.0 transformers transformers[torch] hf_transfer wandb scipy ================================================ FILE: training_configs/Llama3.1-8B-full-train.yaml ================================================ proj_name: Llama-3.1-8B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: meta-llama/Llama-3.1-8B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}" model_type: "llama" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true pad_token_if_none: <|finetune_right_pad_id|> cls_token_if_none: <|reserved_special_token_3|> ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.016-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.016-04302025 learning_rate: 8.0e-6 
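# Assuming the 8-GPU setup referenced in the gradient_accumulation_steps comment
# below, the effective global batch size for this run is
# batch_size * gradient_accumulation_steps * 8 = 5 * 13 * 8 = 520 examples per
# optimizer step.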
adam_epsilon: 1.0e-8 batch_size: 5 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.016 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 13 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.032-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.032-04302025-2 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 1 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.032 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 66 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or 
(messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.06-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.06-04302025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.06 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 17 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.112-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.112-04302025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.112 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 18 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + 
message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-chrono-eps-0.2-04302025.yaml ================================================ proj_name: Qwen2.5-1.5B-bag-chrono-eps-0.2-04302025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: naive_replay_buffer_eps_0.2 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 20 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-bag-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-bag-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct 
gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-full-train.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: 
deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-rk-full-train-half-batch.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-rk-full-train-half-batch learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-1.5B-rk-full-train.yaml ================================================ proj_name: Qwen2.5-1.5B-Instruct-rk-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' 
}}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-bag-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-bag-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-bag-full-train-02242025.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-bag-02242025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02242025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- 
'<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-freeze-test-part-2.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-freeze-test-part-2 learning_rate: 1.0e-06 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 8192 num_train_epochs: 1 train_data_path: p2el/tie_included_canonical_train_data_11092024 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: p2el/Qwen2.5-3B-Instruct-freeze-test gradient_accumulation_steps: 64 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params ================================================ FILE: training_configs/Qwen2.5-3B-freeze-test.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-freeze-test learning_rate: 1.13e-05 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 8192 num_train_epochs: 1 
train_data_path: p2el/tie_included_canonical_train_data_11092024 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 256 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params train_head_only: true ================================================ FILE: training_configs/Qwen2.5-3B-full-train-double-batch.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor 
%}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-full-train.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-rk-full-train-half-batch.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-rk-full-train learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- 
'\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-rk-full-train.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-rk-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-3B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-3B-training-bt_data_11092024 copy.yaml ================================================ proj_name: Qwen2.5-3B-Instruct-bt_data_11092024 learning_rate: 1.13e-05 adam_epsilon: 1.0e-08 batch_size: 2 max_length: 4096 num_train_epochs: 1 train_data_path: p2el/canonical_train_data_11092024 val_data_path: p2el/canonical_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: 'Qwen/Qwen2.5-3B-Instruct' gradient_accumulation_steps: 64 chat_template: "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' 
}}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n" model_type: "qwen2" head_type: "bt" loss_type: "bt" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 
\"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-02242025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-02242025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-02242025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-03132025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-03132025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-03132025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # 4 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n 
{{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bag-full-train-chrono.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bag-chrono learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-bag-data-chrono val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "bag" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-bt-full-train-02222025.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-bt-02222025 learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 2 max_length: 16384 num_train_epochs: 1 train_data_path: full-p2l-data-02222025 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: 
Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt-tie" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-full-train.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-full-train learning_rate: 4.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "bt" loss_type: "bt_tie" weighted_loss: false 
deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-rk-full-train-abs.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-rk-full-train-abs learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-rk-full-train-half-batch.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-rk-full-train-half-batch learning_rate: 8.0e-6 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 16 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n 
{{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/Qwen2.5-7B-rk-full-train.yaml ================================================ proj_name: Qwen2.5-7B-Instruct-rk-full-train learning_rate: 1.0e-5 adam_epsilon: 1.0e-8 batch_size: 4 max_length: 8192 num_train_epochs: 1 train_data_path: full-p2l-data val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-7B-Instruct gradient_accumulation_steps: 32 # drop to 32 since 8 gpus chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n" model_type: "qwen2" head_type: "rk" loss_type: "rk" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json init_type: reset_params load_train_data_from_disk: true ================================================ FILE: training_configs/debug.yaml ================================================ proj_name: debug-Qwen2.5-0.5B-Instruct-bt_data_11092024 learning_rate: 2.0e-06 batch_size: 4 max_length: 4096 num_train_epochs: 1 data_path: p2el/bt_data_11092024 output_dir: 'training_outputs' pretrain_model_name: 'Qwen/Qwen2.5-0.5B-Instruct' gradient_accumulation_steps: 32 chat_template: "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' 
}}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within XML tags:\\n\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n\\n\\nFor each function call, return a json object with function name and arguments within XML tags:\\n\\n{\\\"name\\\": , \\\"arguments\\\": }\\n<|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n" model_type: "qwen2" head_type: "bt" loss_type: "bt" weighted_loss: false deepspeed_config_path: deepspeed/zero1.json ================================================ FILE: training_configs/init_debug_qwen_1.5b_he.yaml ================================================ proj_name: he-Debug-Init-Qwen2.5-1.5B-Instruct learning_rate: 1.13e-05 adam_epsilon: 7.071068e-09 batch_size: 2 max_length: 4096 num_train_epochs: 1 train_data_path: p2el/canonical_bt_train_data_11092024 val_data_path: p2el/canonical_bt_val_data_11092024 output_dir: 'training_outputs' pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct gradient_accumulation_steps: 64 chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n 
{{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: he_unif

================================================
FILE: training_configs/init_debug_qwen_1.5b_reset_params.yaml
================================================
proj_name: reset_param-Debug-Init-Qwen2.5-1.5B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: reset_params

================================================
FILE: training_configs/init_debug_qwen_1.5b_xavier.yaml
================================================
proj_name: xavier-Debug-Init-Qwen2.5-1.5B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-1.5B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: xavier_unif

================================================
FILE: training_configs/init_debug_qwen_3b_he.yaml
================================================
proj_name: he-Debug-Init-Qwen2.5-3B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-3B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: he_unif

================================================
FILE: training_configs/init_debug_qwen_3b_reset_params.yaml
================================================
proj_name: reset_param-Debug-Init-Qwen2.5-3B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-3B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: reset_params

================================================
FILE: training_configs/init_debug_qwen_3b_xavier.yaml
================================================
proj_name: xavier-Debug-Init-Qwen2.5-3B-Instruct
learning_rate: 1.13e-05
adam_epsilon: 7.071068e-09
batch_size: 2
max_length: 4096
num_train_epochs: 1
train_data_path: p2el/canonical_bt_train_data_11092024
val_data_path: p2el/canonical_bt_val_data_11092024
output_dir: 'training_outputs'
pretrain_model_name: Qwen/Qwen2.5-3B-Instruct
gradient_accumulation_steps: 64
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "bt"
loss_type: "bt"
weighted_loss: false
deepspeed_config_path: deepspeed/zero1.json
init_type: xavier_unif

================================================
FILE: training_configs/qwen_1.5B_geom_test.yaml
================================================
proj_name: "Qwen2.5-1.5B-Instruct-Geom-Test"
learning_rate: 8.0e-06
adam_epsilon: 1.0e-08
batch_size: 4
max_length: 8192
num_train_epochs: 1
train_data_path: "/root/chrono_train_data"
val_data_path: "p2el/canonical_bt_val_data_11092024"
output_dir: "training_outputs"
pretrain_model_name: "Qwen/Qwen2.5-1.5B-Instruct"
gradient_accumulation_steps: 16
chat_template: "{%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n\\n' }}\n {{- message.content }}\n {{- '\\n' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n\n"
model_type: "qwen2"
head_type: "rk"
loss_type: "bag"
weighted_loss: false
deepspeed_config_path: "deepspeed/zero1.json"
init_type: "reset_params"
load_train_data_from_disk: true
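
These training configs all pin the same single-line Qwen2.5 chat template and differ mainly in base model size, `head_type`/`loss_type`, batch settings, and head initialization (`init_type`). As a minimal sketch of how such a config could be consumed, the snippet below (not part of this repository; the `format_prompt` helper and the config path are illustrative only) parses one YAML file, overrides a Hugging Face tokenizer's chat template with the pinned one, and renders a single-turn prompt:

```python
# Illustrative sketch only -- not the p2l trainer. Assumes `pyyaml` and
# `transformers` are installed (as in train_requirements.txt).
import yaml
from transformers import AutoTokenizer

# Hypothetical choice of config; any of the training_configs/*.yaml files
# shown above has the same keys.
with open("training_configs/qwen_1.5B_geom_test.yaml") as f:
    cfg = yaml.safe_load(f)

# Load the base model's tokenizer and override its chat template with the
# template pinned in the config, keeping formatting fixed across runs.
tokenizer = AutoTokenizer.from_pretrained(cfg["pretrain_model_name"])
tokenizer.chat_template = cfg["chat_template"]

def format_prompt(prompt: str) -> str:
    # Hypothetical helper: render a single-turn conversation the way a
    # trainer might before truncating to cfg["max_length"] tokens.
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

print(format_prompt("Write a haiku about leaderboards."))
```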