Repository: karpathy/llm-council
Branch: master
Commit: 92e1fccb1bdc
Files: 34
Total size: 66.9 KB
Directory structure:
gitextract_tdw_7erc/
├── .gitignore
├── .python-version
├── CLAUDE.md
├── README.md
├── backend/
│ ├── __init__.py
│ ├── config.py
│ ├── council.py
│ ├── main.py
│ ├── openrouter.py
│ └── storage.py
├── frontend/
│ ├── .gitignore
│ ├── README.md
│ ├── eslint.config.js
│ ├── index.html
│ ├── package.json
│ ├── src/
│ │ ├── App.css
│ │ ├── App.jsx
│ │ ├── api.js
│ │ ├── components/
│ │ │ ├── ChatInterface.css
│ │ │ ├── ChatInterface.jsx
│ │ │ ├── Sidebar.css
│ │ │ ├── Sidebar.jsx
│ │ │ ├── Stage1.css
│ │ │ ├── Stage1.jsx
│ │ │ ├── Stage2.css
│ │ │ ├── Stage2.jsx
│ │ │ ├── Stage3.css
│ │ │ └── Stage3.jsx
│ │ ├── index.css
│ │ └── main.jsx
│ └── vite.config.js
├── main.py
├── pyproject.toml
└── start.sh
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info
# Virtual environments
.venv
# Keys and secrets
.env
# Data files
data/
# Frontend
frontend/node_modules/
frontend/dist/
frontend/.vite/
================================================
FILE: .python-version
================================================
3.10
================================================
FILE: CLAUDE.md
================================================
# CLAUDE.md - Technical Notes for LLM Council
This file contains technical details, architectural decisions, and important implementation notes for future development sessions.
## Project Overview
LLM Council is a 3-stage deliberation system where multiple LLMs collaboratively answer user questions. The key innovation is anonymized peer review in Stage 2, preventing models from playing favorites.
## Architecture
### Backend Structure (`backend/`)
**`config.py`**
- Contains `COUNCIL_MODELS` (list of OpenRouter model identifiers)
- Contains `CHAIRMAN_MODEL` (model that synthesizes final answer)
- Uses environment variable `OPENROUTER_API_KEY` from `.env`
- Backend runs on **port 8001** (NOT 8000 - user had another app on 8000)
**`openrouter.py`**
- `query_model()`: Single async model query
- `query_models_parallel()`: Parallel queries using `asyncio.gather()`
- Returns dict with 'content' and optional 'reasoning_details'
- Graceful degradation: returns None on failure, continues with successful responses
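
The graceful-degradation behavior described above can be sketched without any network access. This is a minimal illustration, not the project's client: `fake_query_model`, `query_all`, and `successful_only` are hypothetical names, and the `broken/` prefix is only a way to simulate a failing provider.

```python
import asyncio
from typing import Dict, List, Optional


async def fake_query_model(model: str, prompt: str) -> Optional[dict]:
    """Hypothetical stand-in for a real API call; returns None on failure."""
    try:
        await asyncio.sleep(0)  # placeholder for network I/O
        if model.startswith("broken/"):
            raise RuntimeError("simulated provider outage")
        return {"content": f"{model} answers: {prompt}"}
    except Exception as exc:
        # Log and swallow the error so one failure cannot sink the batch.
        print(f"Error querying model {model}: {exc}")
        return None


async def query_all(models: List[str], prompt: str) -> Dict[str, Optional[dict]]:
    """Run every query concurrently and map each model to its result."""
    results = await asyncio.gather(*(fake_query_model(m, prompt) for m in models))
    return dict(zip(models, results))


def successful_only(responses: Dict[str, Optional[dict]]) -> Dict[str, dict]:
    """Drop failed models so later stages only see real responses."""
    return {m: r for m, r in responses.items() if r is not None}
```

Because each coroutine catches its own exception, `asyncio.gather()` never sees a failure and always returns one entry per model, in the same order as the input list.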
**`council.py`** - The Core Logic
- `stage1_collect_responses()`: Parallel queries to all council models
- `stage2_collect_rankings()`:
  - Anonymizes responses as "Response A, B, C, etc."
  - Creates `label_to_model` mapping for de-anonymization
  - Prompts models to evaluate and rank (with strict format requirements)
  - Returns tuple: (rankings_list, label_to_model_dict)
  - Each ranking includes both raw text and `parsed_ranking` list
- `stage3_synthesize_final()`: Chairman synthesizes from all responses + rankings
- `parse_ranking_from_text()`: Extracts "FINAL RANKING:" section, handles both numbered lists and plain format
- `calculate_aggregate_rankings()`: Computes average rank position across all peer evaluations
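
As a worked example of the average-rank computation, here is a small sketch that takes already-parsed rankings as input. The function name `average_positions` and the model names `model-x`, `model-y`, `model-z` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def average_positions(
    rankings: List[List[str]],
    label_to_model: Dict[str, str],
) -> List[dict]:
    """Average the 1-based position each model received across all rankings."""
    positions = defaultdict(list)
    for ranking in rankings:
        for position, label in enumerate(ranking, start=1):
            if label in label_to_model:  # ignore labels that were never assigned
                positions[label_to_model[label]].append(position)
    table = [
        {
            "model": model,
            "average_rank": round(sum(ranks) / len(ranks), 2),
            "rankings_count": len(ranks),
        }
        for model, ranks in positions.items()
    ]
    table.sort(key=lambda row: row["average_rank"])  # lower is better
    return table


label_to_model = {
    "Response A": "model-x",
    "Response B": "model-y",
    "Response C": "model-z",
}
rankings = [
    ["Response C", "Response A", "Response B"],
    ["Response A", "Response C", "Response B"],
    ["Response C", "Response B", "Response A"],
]
table = average_positions(rankings, label_to_model)
```

Here `model-z` is placed 1st, 2nd, and 1st, so its average is 4/3 ≈ 1.33 and it sorts first. `model-x` averages 2.0 and `model-y` averages 8/3 ≈ 2.67.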
**`storage.py`**
- JSON-based conversation storage in `data/conversations/`
- Each conversation: `{id, created_at, title, messages[]}`
- Assistant messages contain: `{role, stage1, stage2, stage3}`
- Note: metadata (label_to_model, aggregate_rankings) is NOT persisted to storage, only returned via API
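
To make the stored shape concrete, the sketch below builds one conversation and round-trips it through a JSON file in a temporary directory. All values are illustrative placeholders, and `round_trip` is a hypothetical helper. The field names follow the notes above, plus the `title` field that `storage.py` writes.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

# Illustrative values only; no real model output is shown here.
example_conversation: Dict[str, Any] = {
    "id": "example-id",
    "created_at": "2025-01-01T00:00:00",
    "title": "New Conversation",
    "messages": [
        {"role": "user", "content": "What is a monad?"},
        {
            "role": "assistant",
            "stage1": [{"model": "model-x", "response": "..."}],
            "stage2": [
                {
                    "model": "model-x",
                    "ranking": "...",
                    "parsed_ranking": ["Response A"],
                }
            ],
            "stage3": {"model": "model-x", "response": "..."},
        },
    ],
}


def round_trip(conversation: Dict[str, Any], data_dir: Path) -> Dict[str, Any]:
    """Write one conversation to <data_dir>/<id>.json and read it back."""
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"{conversation['id']}.json"
    path.write_text(json.dumps(conversation, indent=2))
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    loaded = round_trip(example_conversation, Path(tmp) / "conversations")
```

Note that the assistant message has no `metadata` key, which matches the point above that `label_to_model` and `aggregate_rankings` are not written to disk.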
**`main.py`**
- FastAPI app with CORS enabled for localhost:5173 and localhost:3000
- POST `/api/conversations/{id}/message` returns metadata in addition to stages
- Metadata includes: label_to_model mapping and aggregate_rankings
### Frontend Structure (`frontend/src/`)
**`App.jsx`**
- Main orchestration: manages conversations list and current conversation
- Handles message sending and metadata storage
- Important: metadata is stored in the UI state for display but not persisted to backend JSON
**`components/ChatInterface.jsx`**
- Multiline textarea (3 rows, resizable)
- Enter to send, Shift+Enter for new line
- User messages wrapped in markdown-content class for padding
**`components/Stage1.jsx`**
- Tab view of individual model responses
- ReactMarkdown rendering with markdown-content wrapper
**`components/Stage2.jsx`**
- **Critical Feature**: Tab view showing RAW evaluation text from each model
- De-anonymization happens CLIENT-SIDE for display (models receive anonymous labels)
- Shows "Extracted Ranking" below each evaluation so users can validate parsing
- Aggregate rankings shown with average position and vote count
- Explanatory text clarifies that boldface model names are for readability only
**`components/Stage3.jsx`**
- Final synthesized answer from chairman
- Green-tinted background (#f0fff0) to highlight conclusion
**Styling (`*.css`)**
- Light mode theme (not dark mode)
- Primary color: #4a90e2 (blue)
- Global markdown styling in `index.css` with `.markdown-content` class
- 12px padding on all markdown content to prevent cluttered appearance
## Key Design Decisions
### Stage 2 Prompt Format
The Stage 2 prompt is very specific to ensure parseable output:
```
1. Evaluate each response individually first
2. Provide "FINAL RANKING:" header
3. Numbered list format: "1. Response C", "2. Response A", etc.
4. No additional text after ranking section
```
This strict format allows reliable parsing while still getting thoughtful evaluations.
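
A simplified sketch of how such output can be parsed, including a fallback for replies that ignore the format. It follows the same idea as `parse_ranking_from_text()` but is not a copy of it; `parse_final_ranking` is a hypothetical name.

```python
import re
from typing import List


def parse_final_ranking(text: str) -> List[str]:
    """Return the response labels in ranked order, best first."""
    marker = "FINAL RANKING:"
    if marker in text:
        section = text.split(marker, 1)[1]
        # Preferred shape: "1. Response C", "2. Response A", ...
        numbered = re.findall(r"\d+\.\s*(Response [A-Z])", section)
        if numbered:
            return numbered
        # Header present but no numbered list: take labels in order.
        return re.findall(r"Response [A-Z]", section)
    # No header at all: fall back to every label mentioned, in order.
    return re.findall(r"Response [A-Z]", text)


well_formed = """Response A is detailed but misses an edge case.
Response B is accurate but shallow.
Response C is the most complete.

FINAL RANKING:
1. Response C
2. Response A
3. Response B"""

off_format = "I think Response B beats Response A."
```

The fallback is deliberately permissive, so its result may include labels mentioned only in passing. This is why the UI shows the extracted ranking next to the raw text.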
### De-anonymization Strategy
- Models receive: "Response A", "Response B", etc.
- Backend creates mapping: `{"Response A": "openai/gpt-5.1", ...}`
- Frontend displays model names in **bold** for readability
- Users see explanation that original evaluation used anonymous labels
- This prevents bias while maintaining transparency
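
The label mapping and the display-time substitution can be sketched as follows. The real substitution happens in the frontend; this Python version only illustrates the idea, and `build_label_map` and `deanonymize` are hypothetical names.

```python
import re
from typing import Dict, List


def build_label_map(models: List[str]) -> Dict[str, str]:
    """Assign "Response A", "Response B", ... to models in list order."""
    return {f"Response {chr(65 + i)}": model for i, model in enumerate(models)}


def deanonymize(text: str, label_to_model: Dict[str, str]) -> str:
    """Replace each known label with the bolded model name for display."""

    def swap(match: re.Match) -> str:
        label = match.group(0)
        model = label_to_model.get(label)
        # Unknown labels are left untouched rather than guessed at.
        return f"**{model}**" if model else label

    return re.sub(r"Response [A-Z]", swap, text)


label_map = build_label_map(["openai/gpt-5.1", "x-ai/grok-4"])
shown = deanonymize("FINAL RANKING:\n1. Response B\n2. Response A", label_map)
```

The models only ever see the anonymous text. The substitution is applied to a copy for display, so the stored evaluation stays exactly as the model wrote it.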
### Error Handling Philosophy
- Continue with successful responses if some models fail (graceful degradation)
- Never fail the entire request due to single model failure
- Log errors but don't expose to user unless all models fail
### UI/UX Transparency
- All raw outputs are inspectable via tabs
- Parsed rankings shown below raw text for validation
- Users can verify system's interpretation of model outputs
- This builds trust and allows debugging of edge cases
## Important Implementation Details
### Relative Imports
All backend modules use relative imports (e.g., `from .config import ...`) not absolute imports. This is critical for Python's module system to work correctly when running as `python -m backend.main`.
### Port Configuration
- Backend: 8001 (changed from 8000 to avoid conflict)
- Frontend: 5173 (Vite default)
- Update both `backend/main.py` and `frontend/src/api.js` if changing
### Markdown Rendering
All ReactMarkdown components must be wrapped in an element with the `markdown-content` class (e.g. `<div className="markdown-content">`) for proper spacing. This class is defined globally in `index.css`.
### Model Configuration
Models are hardcoded in `backend/config.py`. Chairman can be same or different from council members. The current default is Gemini as chairman per user preference.
## Common Gotchas
1. **Module Import Errors**: Always run backend as `python -m backend.main` from project root, not from backend directory
2. **CORS Issues**: Frontend must match allowed origins in `main.py` CORS middleware
3. **Ranking Parse Failures**: If models don't follow format, fallback regex extracts any "Response X" patterns in order
4. **Missing Metadata**: Metadata is ephemeral (not persisted), only available in API responses
## Future Enhancement Ideas
- Configurable council/chairman via UI instead of config file
- Streaming responses instead of batch loading
- Export conversations to markdown/PDF
- Model performance analytics over time
- Custom ranking criteria (not just accuracy/insight)
- Support for reasoning models (o1, etc.) with special handling
## Testing Notes
Use `test_openrouter.py` to verify API connectivity and test different model identifiers before adding to council. The script tests both streaming and non-streaming modes.
## Data Flow Summary
```
User Query
↓
Stage 1: Parallel queries → [individual responses]
↓
Stage 2: Anonymize → Parallel ranking queries → [evaluations + parsed rankings]
↓
Aggregate Rankings Calculation → [sorted by avg position]
↓
Stage 3: Chairman synthesis with full context
↓
Return: {stage1, stage2, stage3, metadata}
↓
Frontend: Display with tabs + validation UI
```
The entire flow is async/parallel where possible to minimize latency.
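
The flow above can be sketched end to end with stub stages. Nothing here calls a real model: `stage1`, `stage2`, `stage3`, and `run_council` are hypothetical stand-ins, and the aggregate-ranking step is left out to keep the example short.

```python
import asyncio
from typing import Any, Dict, List, Tuple


async def stage1(query: str, models: List[str]) -> List[Dict[str, str]]:
    """Stub: every model 'answers' immediately."""
    return [{"model": m, "response": f"{m}: answer to {query!r}"} for m in models]


async def stage2(
    stage1_results: List[Dict[str, str]],
) -> Tuple[List[Dict[str, Any]], Dict[str, str]]:
    """Stub: anonymize, then have every reviewer return the same ordering."""
    labels = [f"Response {chr(65 + i)}" for i in range(len(stage1_results))]
    label_to_model = {lbl: r["model"] for lbl, r in zip(labels, stage1_results)}
    rankings = [
        {"model": r["model"], "parsed_ranking": list(labels)}
        for r in stage1_results
    ]
    return rankings, label_to_model


async def stage3(chairman: str, stage1_results: list, rankings: list) -> Dict[str, str]:
    """Stub: the chairman summarizes what it was given."""
    return {
        "model": chairman,
        "response": f"synthesis of {len(stage1_results)} responses",
    }


async def run_council(query: str, models: List[str], chairman: str) -> Dict[str, Any]:
    s1 = await stage1(query, models)
    if not s1:
        # Nothing to rank or synthesize, so stop early with an error result.
        error = {"model": "error", "response": "All models failed to respond."}
        return {"stage1": [], "stage2": [], "stage3": error, "metadata": {}}
    s2, label_to_model = await stage2(s1)
    s3 = await stage3(chairman, s1, s2)
    return {
        "stage1": s1,
        "stage2": s2,
        "stage3": s3,
        "metadata": {"label_to_model": label_to_model},
    }
```

The stages run strictly in sequence because each one consumes the previous stage's output. The parallelism is inside stages 1 and 2, where all council models are queried at once.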
================================================
FILE: README.md
================================================
# LLM Council

The idea of this repo is that instead of asking a question to your favorite LLM provider (e.g. OpenAI GPT 5.1, Google Gemini 3.0 Pro, Anthropic Claude Sonnet 4.5, xAI Grok 4, etc.), you can group them into your "LLM Council". This repo is a simple, local web app that essentially looks like ChatGPT, except that it uses OpenRouter to send your query to multiple LLMs, then asks them to review and rank each other's work, and finally has a Chairman LLM produce the final response.
In a bit more detail, here is what happens when you submit a query:
1. **Stage 1: First opinions**. The user query is given to all LLMs individually, and the responses are collected. The individual responses are shown in a "tab view", so that the user can inspect them all one by one.
2. **Stage 2: Review**. Each individual LLM is given the full set of Stage 1 responses. Under the hood, the LLM identities are anonymized so that the LLM can't play favorites when judging the outputs. The LLM is asked to rank them on accuracy and insight.
3. **Stage 3: Final response**. The designated Chairman of the LLM Council takes all of the models' responses and compiles them into a single final answer that is presented to the user.
## Vibe Code Alert
This project was 99% vibe coded as a fun Saturday hack because I wanted to explore and evaluate a number of LLMs side by side in the process of [reading books together with LLMs](https://x.com/karpathy/status/1990577951671509438). It's nice and useful to see multiple responses side by side, and also the cross-opinions of all LLMs on each other's outputs. I'm not going to support it in any way, it's provided here as is for other people's inspiration and I don't intend to improve it. Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like.
## Setup
### 1. Install Dependencies
The project uses [uv](https://docs.astral.sh/uv/) for project management.
**Backend:**
```bash
uv sync
```
**Frontend:**
```bash
cd frontend
npm install
cd ..
```
### 2. Configure API Key
Create a `.env` file in the project root:
```bash
OPENROUTER_API_KEY=sk-or-v1-...
```
Get your API key at [openrouter.ai](https://openrouter.ai/). Make sure to purchase the credits you need, or sign up for automatic top up.
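
If requests fail with authorization errors, a quick check that the key is visible to the process can help. This is a sketch under two assumptions: it reads only the process environment, so `.env` must already be loaded or the variable exported, and `describe_api_key` is a hypothetical helper, not part of the repo.

```python
import os
from typing import Mapping


def describe_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Report whether OPENROUTER_API_KEY is present without printing the secret."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        return "missing"
    # Show only a short prefix so the full key never reaches logs.
    return f"set ({key[:8]}..., {len(key)} chars)"


if __name__ == "__main__":
    print(describe_api_key())
```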
### 3. Configure Models (Optional)
Edit `backend/config.py` to customize the council:
```python
COUNCIL_MODELS = [
"openai/gpt-5.1",
"google/gemini-3-pro-preview",
"anthropic/claude-sonnet-4.5",
"x-ai/grok-4",
]
CHAIRMAN_MODEL = "google/gemini-3-pro-preview"
```
## Running the Application
**Option 1: Use the start script**
```bash
./start.sh
```
**Option 2: Run manually**
Terminal 1 (Backend):
```bash
uv run python -m backend.main
```
Terminal 2 (Frontend):
```bash
cd frontend
npm run dev
```
Then open http://localhost:5173 in your browser.
## Tech Stack
- **Backend:** FastAPI (Python 3.10+), async httpx, OpenRouter API
- **Frontend:** React + Vite, react-markdown for rendering
- **Storage:** JSON files in `data/conversations/`
- **Package Management:** uv for Python, npm for JavaScript
================================================
FILE: backend/__init__.py
================================================
"""LLM Council backend package."""
================================================
FILE: backend/config.py
================================================
"""Configuration for the LLM Council."""
import os
from dotenv import load_dotenv
load_dotenv()
# OpenRouter API key
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
# Council members - list of OpenRouter model identifiers
COUNCIL_MODELS = [
"openai/gpt-5.1",
"google/gemini-3-pro-preview",
"anthropic/claude-sonnet-4.5",
"x-ai/grok-4",
]
# Chairman model - synthesizes final response
CHAIRMAN_MODEL = "google/gemini-3-pro-preview"
# OpenRouter API endpoint
OPENROUTER_API_URL = "https://openrouter.ai/api/v1/chat/completions"
# Data directory for conversation storage
DATA_DIR = "data/conversations"
================================================
FILE: backend/council.py
================================================
"""3-stage LLM Council orchestration."""
from typing import List, Dict, Any, Tuple
from .openrouter import query_models_parallel, query_model
from .config import COUNCIL_MODELS, CHAIRMAN_MODEL
async def stage1_collect_responses(user_query: str) -> List[Dict[str, Any]]:
    """
    Stage 1: Collect individual responses from all council models.
    Args:
        user_query: The user's question
    Returns:
        List of dicts with 'model' and 'response' keys
    """
    messages = [{"role": "user", "content": user_query}]
    # Query all models in parallel
    responses = await query_models_parallel(COUNCIL_MODELS, messages)
    # Format results
    stage1_results = []
    for model, response in responses.items():
        if response is not None:  # Only include successful responses
            stage1_results.append({
                "model": model,
                "response": response.get('content', '')
            })
    return stage1_results
async def stage2_collect_rankings(
user_query: str,
stage1_results: List[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], Dict[str, str]]:
"""
Stage 2: Each model ranks the anonymized responses.
Args:
user_query: The original user query
stage1_results: Results from Stage 1
Returns:
Tuple of (rankings list, label_to_model mapping)
"""
# Create anonymized labels for responses (Response A, Response B, etc.)
labels = [chr(65 + i) for i in range(len(stage1_results))] # A, B, C, ...
# Create mapping from label to model name
label_to_model = {
f"Response {label}": result['model']
for label, result in zip(labels, stage1_results)
}
# Build the ranking prompt
responses_text = "\n\n".join([
f"Response {label}:\n{result['response']}"
for label, result in zip(labels, stage1_results)
])
ranking_prompt = f"""You are evaluating different responses to the following question:
Question: {user_query}
Here are the responses from different models (anonymized):
{responses_text}
Your task:
1. First, evaluate each response individually. For each response, explain what it does well and what it does poorly.
2. Then, at the very end of your response, provide a final ranking.
IMPORTANT: Your final ranking MUST be formatted EXACTLY as follows:
- Start with the line "FINAL RANKING:" (all caps, with colon)
- Then list the responses from best to worst as a numbered list
- Each line should be: number, period, space, then ONLY the response label (e.g., "1. Response A")
- Do not add any other text or explanations in the ranking section
Example of the correct format for your ENTIRE response:
Response A provides good detail on X but misses Y...
Response B is accurate but lacks depth on Z...
Response C offers the most comprehensive answer...
FINAL RANKING:
1. Response C
2. Response A
3. Response B
Now provide your evaluation and ranking:"""
messages = [{"role": "user", "content": ranking_prompt}]
# Get rankings from all council models in parallel
responses = await query_models_parallel(COUNCIL_MODELS, messages)
# Format results
stage2_results = []
for model, response in responses.items():
if response is not None:
full_text = response.get('content', '')
parsed = parse_ranking_from_text(full_text)
stage2_results.append({
"model": model,
"ranking": full_text,
"parsed_ranking": parsed
})
return stage2_results, label_to_model
async def stage3_synthesize_final(
user_query: str,
stage1_results: List[Dict[str, Any]],
stage2_results: List[Dict[str, Any]]
) -> Dict[str, Any]:
"""
Stage 3: Chairman synthesizes final response.
Args:
user_query: The original user query
stage1_results: Individual model responses from Stage 1
stage2_results: Rankings from Stage 2
Returns:
Dict with 'model' and 'response' keys
"""
# Build comprehensive context for chairman
stage1_text = "\n\n".join([
f"Model: {result['model']}\nResponse: {result['response']}"
for result in stage1_results
])
stage2_text = "\n\n".join([
f"Model: {result['model']}\nRanking: {result['ranking']}"
for result in stage2_results
])
chairman_prompt = f"""You are the Chairman of an LLM Council. Multiple AI models have provided responses to a user's question, and then ranked each other's responses.
Original Question: {user_query}
STAGE 1 - Individual Responses:
{stage1_text}
STAGE 2 - Peer Rankings:
{stage2_text}
Your task as Chairman is to synthesize all of this information into a single, comprehensive, accurate answer to the user's original question. Consider:
- The individual responses and their insights
- The peer rankings and what they reveal about response quality
- Any patterns of agreement or disagreement
Provide a clear, well-reasoned final answer that represents the council's collective wisdom:"""
messages = [{"role": "user", "content": chairman_prompt}]
# Query the chairman model
response = await query_model(CHAIRMAN_MODEL, messages)
if response is None:
# Fallback if chairman fails
return {
"model": CHAIRMAN_MODEL,
"response": "Error: Unable to generate final synthesis."
}
return {
"model": CHAIRMAN_MODEL,
"response": response.get('content', '')
}
def parse_ranking_from_text(ranking_text: str) -> List[str]:
    """
    Parse the FINAL RANKING section from the model's response.
    Args:
        ranking_text: The full text response from the model
    Returns:
        List of response labels in ranked order
    """
    import re
    # Look for "FINAL RANKING:" section
    if "FINAL RANKING:" in ranking_text:
        # Extract everything after "FINAL RANKING:"
        parts = ranking_text.split("FINAL RANKING:")
        if len(parts) >= 2:
            ranking_section = parts[1]
            # Try to extract numbered list format (e.g., "1. Response A")
            # This pattern looks for: number, period, optional space, "Response X"
            numbered_matches = re.findall(r'\d+\.\s*Response [A-Z]', ranking_section)
            if numbered_matches:
                # Extract just the "Response X" part
                return [re.search(r'Response [A-Z]', m).group() for m in numbered_matches]
            # Fallback: Extract all "Response X" patterns in order
            matches = re.findall(r'Response [A-Z]', ranking_section)
            return matches
    # Fallback: try to find any "Response X" patterns in order
    matches = re.findall(r'Response [A-Z]', ranking_text)
    return matches
def calculate_aggregate_rankings(
    stage2_results: List[Dict[str, Any]],
    label_to_model: Dict[str, str]
) -> List[Dict[str, Any]]:
    """
    Calculate aggregate rankings across all models.
    Args:
        stage2_results: Rankings from each model
        label_to_model: Mapping from anonymous labels to model names
    Returns:
        List of dicts with model name and average rank, sorted best to worst
    """
    from collections import defaultdict
    # Track positions for each model
    model_positions = defaultdict(list)
    for ranking in stage2_results:
        ranking_text = ranking['ranking']
        # Parse the ranking from the structured format
        parsed_ranking = parse_ranking_from_text(ranking_text)
        for position, label in enumerate(parsed_ranking, start=1):
            if label in label_to_model:
                model_name = label_to_model[label]
                model_positions[model_name].append(position)
    # Calculate average position for each model
    aggregate = []
    for model, positions in model_positions.items():
        if positions:
            avg_rank = sum(positions) / len(positions)
            aggregate.append({
                "model": model,
                "average_rank": round(avg_rank, 2),
                "rankings_count": len(positions)
            })
    # Sort by average rank (lower is better)
    aggregate.sort(key=lambda x: x['average_rank'])
    return aggregate
async def generate_conversation_title(user_query: str) -> str:
"""
Generate a short title for a conversation based on the first user message.
Args:
user_query: The first user message
Returns:
A short title (3-5 words)
"""
title_prompt = f"""Generate a very short title (3-5 words maximum) that summarizes the following question.
The title should be concise and descriptive. Do not use quotes or punctuation in the title.
Question: {user_query}
Title:"""
messages = [{"role": "user", "content": title_prompt}]
# Use gemini-2.5-flash for title generation (fast and cheap)
response = await query_model("google/gemini-2.5-flash", messages, timeout=30.0)
if response is None:
# Fallback to a generic title
return "New Conversation"
title = response.get('content', 'New Conversation').strip()
# Clean up the title - remove quotes, limit length
title = title.strip('"\'')
# Truncate if too long
if len(title) > 50:
title = title[:47] + "..."
return title
async def run_full_council(user_query: str) -> Tuple[List, List, Dict, Dict]:
    """
    Run the complete 3-stage council process.
    Args:
        user_query: The user's question
    Returns:
        Tuple of (stage1_results, stage2_results, stage3_result, metadata)
    """
    # Stage 1: Collect individual responses
    stage1_results = await stage1_collect_responses(user_query)
    # If no models responded successfully, return error
    if not stage1_results:
        return [], [], {
            "model": "error",
            "response": "All models failed to respond. Please try again."
        }, {}
    # Stage 2: Collect rankings
    stage2_results, label_to_model = await stage2_collect_rankings(user_query, stage1_results)
    # Calculate aggregate rankings
    aggregate_rankings = calculate_aggregate_rankings(stage2_results, label_to_model)
    # Stage 3: Synthesize final answer
    stage3_result = await stage3_synthesize_final(
        user_query,
        stage1_results,
        stage2_results
    )
    # Prepare metadata
    metadata = {
        "label_to_model": label_to_model,
        "aggregate_rankings": aggregate_rankings
    }
    return stage1_results, stage2_results, stage3_result, metadata
================================================
FILE: backend/main.py
================================================
"""FastAPI backend for LLM Council."""
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import List, Dict, Any
import uuid
import json
import asyncio
from . import storage
from .council import run_full_council, generate_conversation_title, stage1_collect_responses, stage2_collect_rankings, stage3_synthesize_final, calculate_aggregate_rankings
app = FastAPI(title="LLM Council API")
# Enable CORS for local development
app.add_middleware(
CORSMiddleware,
allow_origins=["http://localhost:5173", "http://localhost:3000"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class CreateConversationRequest(BaseModel):
"""Request to create a new conversation."""
pass
class SendMessageRequest(BaseModel):
"""Request to send a message in a conversation."""
content: str
class ConversationMetadata(BaseModel):
"""Conversation metadata for list view."""
id: str
created_at: str
title: str
message_count: int
class Conversation(BaseModel):
"""Full conversation with all messages."""
id: str
created_at: str
title: str
messages: List[Dict[str, Any]]
@app.get("/")
async def root():
"""Health check endpoint."""
return {"status": "ok", "service": "LLM Council API"}
@app.get("/api/conversations", response_model=List[ConversationMetadata])
async def list_conversations():
"""List all conversations (metadata only)."""
return storage.list_conversations()
@app.post("/api/conversations", response_model=Conversation)
async def create_conversation(request: CreateConversationRequest):
"""Create a new conversation."""
conversation_id = str(uuid.uuid4())
conversation = storage.create_conversation(conversation_id)
return conversation
@app.get("/api/conversations/{conversation_id}", response_model=Conversation)
async def get_conversation(conversation_id: str):
"""Get a specific conversation with all its messages."""
conversation = storage.get_conversation(conversation_id)
if conversation is None:
raise HTTPException(status_code=404, detail="Conversation not found")
return conversation
@app.post("/api/conversations/{conversation_id}/message")
async def send_message(conversation_id: str, request: SendMessageRequest):
"""
Send a message and run the 3-stage council process.
Returns the complete response with all stages.
"""
# Check if conversation exists
conversation = storage.get_conversation(conversation_id)
if conversation is None:
raise HTTPException(status_code=404, detail="Conversation not found")
# Check if this is the first message
is_first_message = len(conversation["messages"]) == 0
# Add user message
storage.add_user_message(conversation_id, request.content)
# If this is the first message, generate a title
if is_first_message:
title = await generate_conversation_title(request.content)
storage.update_conversation_title(conversation_id, title)
# Run the 3-stage council process
stage1_results, stage2_results, stage3_result, metadata = await run_full_council(
request.content
)
# Add assistant message with all stages
storage.add_assistant_message(
conversation_id,
stage1_results,
stage2_results,
stage3_result
)
# Return the complete response with metadata
return {
"stage1": stage1_results,
"stage2": stage2_results,
"stage3": stage3_result,
"metadata": metadata
}
@app.post("/api/conversations/{conversation_id}/message/stream")
async def send_message_stream(conversation_id: str, request: SendMessageRequest):
"""
Send a message and stream the 3-stage council process.
Returns Server-Sent Events as each stage completes.
"""
# Check if conversation exists
conversation = storage.get_conversation(conversation_id)
if conversation is None:
raise HTTPException(status_code=404, detail="Conversation not found")
# Check if this is the first message
is_first_message = len(conversation["messages"]) == 0
async def event_generator():
try:
# Add user message
storage.add_user_message(conversation_id, request.content)
# Start title generation in parallel (don't await yet)
title_task = None
if is_first_message:
title_task = asyncio.create_task(generate_conversation_title(request.content))
# Stage 1: Collect responses
yield f"data: {json.dumps({'type': 'stage1_start'})}\n\n"
stage1_results = await stage1_collect_responses(request.content)
yield f"data: {json.dumps({'type': 'stage1_complete', 'data': stage1_results})}\n\n"
# Stage 2: Collect rankings
yield f"data: {json.dumps({'type': 'stage2_start'})}\n\n"
stage2_results, label_to_model = await stage2_collect_rankings(request.content, stage1_results)
aggregate_rankings = calculate_aggregate_rankings(stage2_results, label_to_model)
yield f"data: {json.dumps({'type': 'stage2_complete', 'data': stage2_results, 'metadata': {'label_to_model': label_to_model, 'aggregate_rankings': aggregate_rankings}})}\n\n"
# Stage 3: Synthesize final answer
yield f"data: {json.dumps({'type': 'stage3_start'})}\n\n"
stage3_result = await stage3_synthesize_final(request.content, stage1_results, stage2_results)
yield f"data: {json.dumps({'type': 'stage3_complete', 'data': stage3_result})}\n\n"
# Wait for title generation if it was started
if title_task:
title = await title_task
storage.update_conversation_title(conversation_id, title)
yield f"data: {json.dumps({'type': 'title_complete', 'data': {'title': title}})}\n\n"
# Save complete assistant message
storage.add_assistant_message(
conversation_id,
stage1_results,
stage2_results,
stage3_result
)
# Send completion event
yield f"data: {json.dumps({'type': 'complete'})}\n\n"
except Exception as e:
# Send error event
yield f"data: {json.dumps({'type': 'error', 'message': str(e)})}\n\n"
return StreamingResponse(
event_generator(),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
}
)
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8001)
================================================
FILE: backend/openrouter.py
================================================
"""OpenRouter API client for making LLM requests."""
import httpx
from typing import List, Dict, Any, Optional
from .config import OPENROUTER_API_KEY, OPENROUTER_API_URL
async def query_model(
    model: str,
    messages: List[Dict[str, str]],
    timeout: float = 120.0
) -> Optional[Dict[str, Any]]:
    """
    Query a single model via OpenRouter API.
    Args:
        model: OpenRouter model identifier (e.g., "openai/gpt-4o")
        messages: List of message dicts with 'role' and 'content'
        timeout: Request timeout in seconds
    Returns:
        Response dict with 'content' and optional 'reasoning_details', or None if failed
    """
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": messages,
    }
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.post(
                OPENROUTER_API_URL,
                headers=headers,
                json=payload
            )
            response.raise_for_status()
            data = response.json()
            message = data['choices'][0]['message']
            return {
                'content': message.get('content'),
                'reasoning_details': message.get('reasoning_details')
            }
    except Exception as e:
        print(f"Error querying model {model}: {e}")
        return None
async def query_models_parallel(
    models: List[str],
    messages: List[Dict[str, str]]
) -> Dict[str, Optional[Dict[str, Any]]]:
    """
    Query multiple models in parallel.
    Args:
        models: List of OpenRouter model identifiers
        messages: List of message dicts to send to each model
    Returns:
        Dict mapping model identifier to response dict (or None if failed)
    """
    import asyncio
    # Create tasks for all models
    tasks = [query_model(model, messages) for model in models]
    # Wait for all to complete
    responses = await asyncio.gather(*tasks)
    # Map models to their responses
    return {model: response for model, response in zip(models, responses)}
================================================
FILE: backend/storage.py
================================================
"""JSON-based storage for conversations."""
import json
import os
from datetime import datetime
from typing import List, Dict, Any, Optional
from pathlib import Path
from .config import DATA_DIR
def ensure_data_dir():
    """Ensure the data directory exists."""
    Path(DATA_DIR).mkdir(parents=True, exist_ok=True)


def get_conversation_path(conversation_id: str) -> str:
    """Get the file path for a conversation."""
    return os.path.join(DATA_DIR, f"{conversation_id}.json")


def create_conversation(conversation_id: str) -> Dict[str, Any]:
    """
    Create a new conversation.

    Args:
        conversation_id: Unique identifier for the conversation

    Returns:
        New conversation dict
    """
    ensure_data_dir()

    conversation = {
        "id": conversation_id,
        "created_at": datetime.utcnow().isoformat(),
        "title": "New Conversation",
        "messages": []
    }

    # Save to file
    path = get_conversation_path(conversation_id)
    with open(path, 'w') as f:
        json.dump(conversation, f, indent=2)

    return conversation


def get_conversation(conversation_id: str) -> Optional[Dict[str, Any]]:
    """
    Load a conversation from storage.

    Args:
        conversation_id: Unique identifier for the conversation

    Returns:
        Conversation dict or None if not found
    """
    path = get_conversation_path(conversation_id)

    if not os.path.exists(path):
        return None

    with open(path, 'r') as f:
        return json.load(f)


def save_conversation(conversation: Dict[str, Any]):
    """
    Save a conversation to storage.

    Args:
        conversation: Conversation dict to save
    """
    ensure_data_dir()

    path = get_conversation_path(conversation['id'])
    with open(path, 'w') as f:
        json.dump(conversation, f, indent=2)


def list_conversations() -> List[Dict[str, Any]]:
    """
    List all conversations (metadata only).

    Returns:
        List of conversation metadata dicts
    """
    ensure_data_dir()

    conversations = []
    for filename in os.listdir(DATA_DIR):
        if filename.endswith('.json'):
            path = os.path.join(DATA_DIR, filename)
            with open(path, 'r') as f:
                data = json.load(f)
                # Return metadata only
                conversations.append({
                    "id": data["id"],
                    "created_at": data["created_at"],
                    "title": data.get("title", "New Conversation"),
                    "message_count": len(data["messages"])
                })

    # Sort by creation time, newest first
    conversations.sort(key=lambda x: x["created_at"], reverse=True)

    return conversations


def add_user_message(conversation_id: str, content: str):
    """
    Add a user message to a conversation.

    Args:
        conversation_id: Conversation identifier
        content: User message content
    """
    conversation = get_conversation(conversation_id)
    if conversation is None:
        raise ValueError(f"Conversation {conversation_id} not found")

    conversation["messages"].append({
        "role": "user",
        "content": content
    })

    save_conversation(conversation)


def add_assistant_message(
    conversation_id: str,
    stage1: List[Dict[str, Any]],
    stage2: List[Dict[str, Any]],
    stage3: Dict[str, Any]
):
    """
    Add an assistant message with all 3 stages to a conversation.

    Args:
        conversation_id: Conversation identifier
        stage1: List of individual model responses
        stage2: List of model rankings
        stage3: Final synthesized response
    """
    conversation = get_conversation(conversation_id)
    if conversation is None:
        raise ValueError(f"Conversation {conversation_id} not found")

    conversation["messages"].append({
        "role": "assistant",
        "stage1": stage1,
        "stage2": stage2,
        "stage3": stage3
    })

    save_conversation(conversation)


def update_conversation_title(conversation_id: str, title: str):
    """
    Update the title of a conversation.

    Args:
        conversation_id: Conversation identifier
        title: New title for the conversation
    """
    conversation = get_conversation(conversation_id)
    if conversation is None:
        raise ValueError(f"Conversation {conversation_id} not found")

    conversation["title"] = title
    save_conversation(conversation)
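

# Illustrative sketch, not part of the original module. The save functions in
# this file write directly to the target path, so a write interrupted partway
# through could leave a truncated JSON file that later fails to load. One
# common alternative is to write to a temporary file in the same directory and
# then swap it into place with os.replace. The helper name
# `save_json_atomically` and the ".tmp" suffix are hypothetical choices made
# for this example; the suffix is chosen so that a leftover temp file would not
# match the '.json' filter used when listing conversations.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_json_atomically(path: str, payload: Dict[str, Any]) -> None:
    """Write payload as JSON to path via a temp file plus os.replace."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)

    # Create the temp file next to the target so the final rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
        # Swap the fully written file into place, overwriting any old copy.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the temp file and leave the old target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

# Under these assumptions, a save function could call
# save_json_atomically(get_conversation_path(conversation['id']), conversation)
# in place of the open/json.dump pair. Whether that trade-off is worthwhile for
# a small local app is a judgment call.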
================================================
FILE: frontend/.gitignore
================================================
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*
node_modules
dist
dist-ssr
*.local
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
================================================
FILE: frontend/README.md
================================================
# React + Vite
This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.
Currently, two official plugins are available:
- [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react) uses [Babel](https://babeljs.io/) (or [oxc](https://oxc.rs) when used in [rolldown-vite](https://vite.dev/guide/rolldown)) for Fast Refresh
- [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react-swc) uses [SWC](https://swc.rs/) for Fast Refresh
## React Compiler
The React Compiler is not enabled on this template because of its impact on dev and build performance. To add it, see [this documentation](https://react.dev/learn/react-compiler/installation).
## Expanding the ESLint configuration
If you are developing a production application, we recommend using TypeScript with type-aware lint rules enabled. Check out the [TS template](https://github.com/vitejs/vite/tree/main/packages/create-vite/template-react-ts) for information on how to integrate TypeScript and [`typescript-eslint`](https://typescript-eslint.io) in your project.
================================================
FILE: frontend/eslint.config.js
================================================
import js from '@eslint/js'
import globals from 'globals'
import reactHooks from 'eslint-plugin-react-hooks'
import reactRefresh from 'eslint-plugin-react-refresh'
import { defineConfig, globalIgnores } from 'eslint/config'

export default defineConfig([
  globalIgnores(['dist']),
  {
    files: ['**/*.{js,jsx}'],
    extends: [
      js.configs.recommended,
      reactHooks.configs.flat.recommended,
      reactRefresh.configs.vite,
    ],
    languageOptions: {
      ecmaVersion: 2020,
      globals: globals.browser,
      parserOptions: {
        ecmaVersion: 'latest',
        ecmaFeatures: { jsx: true },
        sourceType: 'module',
      },
    },
    rules: {
      'no-unused-vars': ['error', { varsIgnorePattern: '^[A-Z_]' }],
    },
  },
])
================================================
FILE: frontend/index.html
================================================
frontend
================================================
FILE: frontend/package.json
================================================
{
  "name": "frontend",
  "private": true,
  "version": "0.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "lint": "eslint .",
    "preview": "vite preview"
  },
  "dependencies": {
    "react": "^19.2.0",
    "react-dom": "^19.2.0",
    "react-markdown": "^10.1.0"
  },
  "devDependencies": {
    "@eslint/js": "^9.39.1",
    "@types/react": "^19.2.5",
    "@types/react-dom": "^19.2.3",
    "@vitejs/plugin-react": "^5.1.1",
    "eslint": "^9.39.1",
    "eslint-plugin-react-hooks": "^7.0.1",
    "eslint-plugin-react-refresh": "^0.4.24",
    "globals": "^16.5.0",
    "vite": "^7.2.4"
  }
}
================================================
FILE: frontend/src/App.css
================================================
* {
  box-sizing: border-box;
}

.app {
  display: flex;
  height: 100vh;
  width: 100vw;
  overflow: hidden;
  background: #ffffff;
  color: #333;
  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
    'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
    sans-serif;
}
================================================
FILE: frontend/src/App.jsx
================================================
import { useState, useEffect } from 'react';
import Sidebar from './components/Sidebar';
import ChatInterface from './components/ChatInterface';
import { api } from './api';
import './App.css';

function App() {
  const [conversations, setConversations] = useState([]);
  const [currentConversationId, setCurrentConversationId] = useState(null);
  const [currentConversation, setCurrentConversation] = useState(null);
  const [isLoading, setIsLoading] = useState(false);

  // Load conversations on mount
  useEffect(() => {
    loadConversations();
  }, []);

  // Load conversation details when selected
  useEffect(() => {
    if (currentConversationId) {
      loadConversation(currentConversationId);
    }
  }, [currentConversationId]);

  const loadConversations = async () => {
    try {
      const convs = await api.listConversations();
      setConversations(convs);
    } catch (error) {
      console.error('Failed to load conversations:', error);
    }
  };

  const loadConversation = async (id) => {
    try {
      const conv = await api.getConversation(id);
      setCurrentConversation(conv);
    } catch (error) {
      console.error('Failed to load conversation:', error);
    }
  };

  const handleNewConversation = async () => {
    try {
      const newConv = await api.createConversation();
      setConversations([
        { id: newConv.id, created_at: newConv.created_at, message_count: 0 },
        ...conversations,
      ]);
      setCurrentConversationId(newConv.id);
    } catch (error) {
      console.error('Failed to create conversation:', error);
    }
  };

  const handleSelectConversation = (id) => {
    setCurrentConversationId(id);
  };

  const handleSendMessage = async (content) => {
    if (!currentConversationId) return;

    setIsLoading(true);
    try {
      // Optimistically add user message to UI
      const userMessage = { role: 'user', content };
      setCurrentConversation((prev) => ({
        ...prev,
        messages: [...prev.messages, userMessage],
      }));

      // Create a partial assistant message that will be updated progressively
      const assistantMessage = {
        role: 'assistant',
        stage1: null,
        stage2: null,
        stage3: null,
        metadata: null,
        loading: {
          stage1: false,
          stage2: false,
          stage3: false,
        },
      };

      // Add the partial assistant message
      setCurrentConversation((prev) => ({
        ...prev,
        messages: [...prev.messages, assistantMessage],
      }));

      // Send message with streaming.
      // Each handler swaps in an updated copy of the last message instead of
      // mutating the object held in the previous state.
      await api.sendMessageStream(currentConversationId, content, (eventType, event) => {
        switch (eventType) {
          case 'stage1_start':
            setCurrentConversation((prev) => {
              const messages = [...prev.messages];
              const lastMsg = messages[messages.length - 1];
              messages[messages.length - 1] = {
                ...lastMsg,
                loading: { ...lastMsg.loading, stage1: true },
              };
              return { ...prev, messages };
            });
            break;

          case 'stage1_complete':
            setCurrentConversation((prev) => {
              const messages = [...prev.messages];
              const lastMsg = messages[messages.length - 1];
              messages[messages.length - 1] = {
                ...lastMsg,
                stage1: event.data,
                loading: { ...lastMsg.loading, stage1: false },
              };
              return { ...prev, messages };
            });
            break;

          case 'stage2_start':
            setCurrentConversation((prev) => {
              const messages = [...prev.messages];
              const lastMsg = messages[messages.length - 1];
              messages[messages.length - 1] = {
                ...lastMsg,
                loading: { ...lastMsg.loading, stage2: true },
              };
              return { ...prev, messages };
            });
            break;

          case 'stage2_complete':
            setCurrentConversation((prev) => {
              const messages = [...prev.messages];
              const lastMsg = messages[messages.length - 1];
              messages[messages.length - 1] = {
                ...lastMsg,
                stage2: event.data,
                metadata: event.metadata,
                loading: { ...lastMsg.loading, stage2: false },
              };
              return { ...prev, messages };
            });
            break;

          case 'stage3_start':
            setCurrentConversation((prev) => {
              const messages = [...prev.messages];
              const lastMsg = messages[messages.length - 1];
              messages[messages.length - 1] = {
                ...lastMsg,
                loading: { ...lastMsg.loading, stage3: true },
              };
              return { ...prev, messages };
            });
            break;

          case 'stage3_complete':
            setCurrentConversation((prev) => {
              const messages = [...prev.messages];
              const lastMsg = messages[messages.length - 1];
              messages[messages.length - 1] = {
                ...lastMsg,
                stage3: event.data,
                loading: { ...lastMsg.loading, stage3: false },
              };
              return { ...prev, messages };
            });
            break;

          case 'title_complete':
            // Reload conversations to get updated title
            loadConversations();
            break;

          case 'complete':
            // Stream complete, reload conversations list
            loadConversations();
            setIsLoading(false);
            break;

          case 'error':
            console.error('Stream error:', event.message);
            setIsLoading(false);
            break;

          default:
            console.log('Unknown event type:', eventType);
        }
      });
    } catch (error) {
      console.error('Failed to send message:', error);
      // Remove optimistic messages on error
      setCurrentConversation((prev) => ({
        ...prev,
        messages: prev.messages.slice(0, -2),
      }));
      setIsLoading(false);
    }
  };

return (
Each model evaluated all responses (anonymized as Response A, B, C, etc.) and provided rankings.
Below, model names are shown in bold for readability, but the original evaluation used anonymous labels.